From: "Dan Williams (nvidia)" <djbw@kernel.org>
To: Jonathan Cameron <jic23@kernel.org>,
"Bowman, Terry" <terry.bowman@amd.com>
Cc: dave@stgolabs.net, dave.jiang@intel.com,
alison.schofield@intel.com, djbw@kernel.org,
bhelgaas@google.com, shiju.jose@huawei.com,
ming.li@zohomail.com, Smita.KoralahalliChannabasappa@amd.com,
rrichter@amd.com, dan.carpenter@linaro.org,
PradeepVineshReddy.Kodamati@amd.com, lukas@wunner.de,
Benjamin.Cheatham@amd.com,
sathyanarayanan.kuppuswamy@linux.intel.com,
vishal.l.verma@intel.com, alucerop@amd.com,
ira.weiny@intel.com, corbet@lwn.net, rafael@kernel.org,
xueshuai@linux.alibaba.com, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-doc@vger.kernel.org,
Mauro Carvalho Chehab <mchehab@kernel.org>
Subject: Re: [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events
Date: Fri, 08 May 2026 20:49:17 -0700
Message-ID: <69feaebd471c3_1b86a100b@djbw-dev.notmuch>
In-Reply-To: <20260508150533.04e19cf9@jic23-huawei>
Jonathan Cameron wrote:
> On Thu, 7 May 2026 13:33:45 -0500
> "Bowman, Terry" <terry.bowman@amd.com> wrote:
[..]
> > > This concerns me (sorry I wasn't paying attention to the v16 thread).
> > > It is a userspace regression against code that is out in the wild and typically
> > > not updated in sync with the kernel.
> > >
> > > If you are suggesting breaking ras-daemon at the very least +CC the maintainer.
Sorry, that was not the intent; see below.
> > >
> > > To get to a unified tracepoint add a new one that does what you want, but
> > > maintain the existing ones as well. Userspace can then migrate and maybe
> > > in 5+ years time we can delete the non unified ones.
> > >
> > > No actually comments on the code, just left it all here for Mauro,
> > >
> > > Thanks,
> > >
> > > Jonathan
> > >
> >
> > Dan was clear about using a single set of CE and UE handlers for all CXL RAS
> > protocol errors. While I understand there may be concerns, please direct any
> > objections to Dan and clarify what changes are required to avoid this
> > repeatedly going back and forth.
> >
> > [1] https://lore.kernel.org/linux-cxl/69cb2d5ba3111_178904100b7@dwillia2-mobl4.notmuch/
>
> Sure - Dan's on this thread so I'm sure he'll see it sooner or later.
>
> Perhaps I'm missing something that makes this less critical than it appears.

No, it is breakage, and a thinko on my part in the advice I gave Terry on
the backwards compatibility rules for tracepoints. At the time I was only
tracking the data type and order of the payload, i.e. a string at the same
position. However, the name of the argument is ABI as well.

Something like this incremental fixup, I think, gets this back on track.
It keeps legacy ABI support for the "memdev" field in the payload. It
lets updated userspace incrementally learn about the "port" and "dport"
fields. It stops us from growing a new set of events just to update the
arguments. It enhances the CPER events to handle switch ports in
addition to endpoint ports.
The bulk of the change is passing @port and @dport to the CXL trace
events instead of a plain @dev.
-- >8 --
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index ff39985d363f..ed3a56966369 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -183,9 +183,10 @@ static inline struct device *dport_to_host(struct cxl_dport *dport)
#ifdef CONFIG_CXL_RAS
int cxl_ras_init(void);
void cxl_ras_exit(void);
-bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base);
-void cxl_handle_cor_ras(struct device *dev, u64 serial,
- void __iomem *ras_base);
+bool cxl_handle_ras(struct cxl_port *port, struct cxl_dport *dport, u64 serial,
+ void __iomem *ras_base);
+void cxl_handle_cor_ras(struct cxl_port *port, struct cxl_dport *dport,
+ u64 serial, void __iomem *ras_base);
void cxl_dport_map_rch_aer(struct cxl_dport *dport);
void cxl_disable_rch_root_ints(struct cxl_dport *dport);
void cxl_handle_rdport_errors(struct pci_dev *pdev);
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 6f3957b3c3af..3857d2fc279d 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -49,20 +49,24 @@
)
TRACE_EVENT(cxl_aer_uncorrectable_error,
- TP_PROTO(const struct device *dev, u32 status, u32 fe, u32 *hl,
- u64 serial),
- TP_ARGS(dev, status, fe, hl, serial),
+ TP_PROTO(struct cxl_port *port, struct cxl_dport *dport, u32 status,
+ u32 fe, u32 *hl, u64 serial),
+ TP_ARGS(port, dport, status, fe, hl, serial),
TP_STRUCT__entry(
- __string(device, dev_name(dev))
- __string(host, dev_name(dev->parent))
+ __string(memdev, cxl_trace_memdev_name(port))
+ __string(host, cxl_trace_host_name(port))
__field(u64, serial)
__field(u32, status)
__field(u32, first_error)
__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
+ __string(port, cxl_trace_port_name(port))
+ __string(dport, cxl_trace_dport_name(dport))
),
TP_fast_assign(
- __assign_str(device);
+ __assign_str(memdev);
__assign_str(host);
+ __assign_str(port);
+ __assign_str(dport);
__entry->serial = serial;
__entry->status = status;
__entry->first_error = fe;
@@ -72,8 +76,9 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
*/
memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
),
- TP_printk("device=%s host=%s serial=%lld status: '%s' first_error: '%s'",
- __get_str(device), __get_str(host), __entry->serial,
+ TP_printk("memdev=%s port=%s dport=%s host=%s serial=%lld status: '%s' first_error: '%s'",
+ __get_str(memdev), __get_str(port), __get_str(dport),
+ __get_str(host), __entry->serial,
show_uc_errs(__entry->status),
show_uc_errs(__entry->first_error)
)
@@ -98,22 +103,27 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
)
TRACE_EVENT(cxl_aer_correctable_error,
- TP_PROTO(const struct device *dev, u32 status, u64 serial),
- TP_ARGS(dev, status, serial),
+ TP_PROTO(struct cxl_port *port, struct cxl_dport *dport, u32 status, u64 serial),
+ TP_ARGS(port, dport, status, serial),
TP_STRUCT__entry(
- __string(device, dev_name(dev))
- __string(host, dev_name(dev->parent))
+ __string(memdev, cxl_trace_memdev_name(port))
+ __string(host, cxl_trace_host_name(port))
__field(u64, serial)
__field(u32, status)
+ __string(port, cxl_trace_port_name(port))
+ __string(dport, cxl_trace_dport_name(dport))
),
TP_fast_assign(
- __assign_str(device);
+ __assign_str(memdev);
+ __assign_str(port);
+ __assign_str(dport);
__assign_str(host);
__entry->serial = serial;
__entry->status = status;
),
- TP_printk("device=%s host=%s serial=%lld status: '%s'",
- __get_str(device), __get_str(host), __entry->serial,
+ TP_printk("memdev=%s port=%s dport=%s host=%s serial=%lld status: '%s'",
+ __get_str(memdev), __get_str(port), __get_str(dport),
+ __get_str(host), __entry->serial,
show_ce_errs(__entry->status)
)
);
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 776c50d1db51..83e161d48405 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -101,6 +101,12 @@ static inline bool is_cxl_endpoint(struct cxl_port *port)
return is_cxl_memdev(port->uport_dev);
}
+/* trace-event helpers */
+const char *cxl_trace_memdev_name(struct cxl_port *port);
+const char *cxl_trace_host_name(struct cxl_port *port);
+const char *cxl_trace_port_name(struct cxl_port *port);
+const char *cxl_trace_dport_name(struct cxl_dport *dport);
+
struct cxl_memdev *__devm_cxl_add_memdev(struct cxl_dev_state *cxlds,
const struct cxl_memdev_attach *attach);
struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds,
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index b45e2b539b5f..33e78f155916 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -8,16 +8,20 @@
#include <cxlpci.h>
#include "trace.h"
-static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev, u64 serial,
- struct cxl_ras_capability_regs *ras_cap)
+static void
+cxl_cper_trace_corr_prot_err(struct cxl_port *port, struct cxl_dport *dport,
+ u64 serial,
+ struct cxl_ras_capability_regs *ras_cap)
{
u32 status = ras_cap->cor_status & ~ras_cap->cor_mask;
- trace_cxl_aer_correctable_error(&pdev->dev, status, serial);
+ trace_cxl_aer_correctable_error(port, dport, status, serial);
}
-static void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev, u64 serial,
- struct cxl_ras_capability_regs *ras_cap)
+static void
+cxl_cper_trace_uncorr_prot_err(struct cxl_port *port, struct cxl_dport *dport,
+ u64 serial,
+ struct cxl_ras_capability_regs *ras_cap)
{
u32 status = ras_cap->uncor_status & ~ras_cap->uncor_mask;
u32 fe;
@@ -28,10 +32,12 @@ static void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev, u64 serial,
else
fe = status;
- trace_cxl_aer_uncorrectable_error(&pdev->dev, status, fe,
+ trace_cxl_aer_uncorrectable_error(port, dport, status, fe,
ras_cap->header_log, serial);
}
+static struct cxl_port *find_cxl_port_by_dev(struct device *dev, struct cxl_dport **dport);
+
void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
{
unsigned int devfn = PCI_DEVFN(data->prot_err.agent_addr.device,
@@ -40,19 +46,26 @@ void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
pci_get_domain_bus_and_slot(data->prot_err.agent_addr.segment,
data->prot_err.agent_addr.bus,
devfn);
+ struct cxl_dport *dport;
if (!pdev)
return;
- guard(device)(&pdev->dev);
- if (!pdev->dev.driver)
+ struct cxl_port *port __free(put_cxl_port) =
+ find_cxl_port_by_dev(&pdev->dev, &dport);
+
+ if (!port)
+ return;
+
+ guard(device)(&port->dev);
+ if (!port->dev.driver)
return;
if (data->severity == AER_CORRECTABLE)
- cxl_cper_trace_corr_prot_err(pdev, pci_get_dsn(pdev),
+ cxl_cper_trace_corr_prot_err(port, dport, pci_get_dsn(pdev),
&data->ras_cap);
else
- cxl_cper_trace_uncorr_prot_err(pdev, pci_get_dsn(pdev),
+ cxl_cper_trace_uncorr_prot_err(port, dport, pci_get_dsn(pdev),
&data->ras_cap);
}
EXPORT_SYMBOL_GPL(cxl_cper_handle_prot_err);
@@ -222,13 +235,12 @@ static void __iomem *to_ras_base(struct cxl_port *port, struct cxl_dport *dport)
static void cxl_do_recovery(struct pci_dev *pdev, struct cxl_port *port, struct cxl_dport *dport)
{
- struct device *dev = &pdev->dev;
bool ue;
if (pci_dev_is_disconnected(pdev))
panic("CXL cachemem error: device disconnected during UE recovery");
- ue = cxl_handle_ras(dev, pci_get_dsn(pdev),
+ ue = cxl_handle_ras(port, dport, pci_get_dsn(pdev),
to_ras_base(port, dport));
if (ue)
panic("CXL cachemem error.");
@@ -238,7 +250,8 @@ static void cxl_do_recovery(struct pci_dev *pdev, struct cxl_port *port, struct
pci_aer_clear_fatal_status(pdev);
}
-void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
+void cxl_handle_cor_ras(struct cxl_port *port, struct cxl_dport *dport,
+ u64 serial, void __iomem *ras_base)
{
void __iomem *addr;
u32 status;
@@ -250,7 +263,7 @@ void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
status = readl(addr);
if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
- trace_cxl_aer_correctable_error(dev, status, serial);
+ trace_cxl_aer_correctable_error(port, dport, status, serial);
}
}
@@ -275,7 +288,8 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
* Log the state of the RAS status registers and prepare them to log the
* next error status. Return 1 if reset needed.
*/
-bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
+bool cxl_handle_ras(struct cxl_port *port, struct cxl_dport *dport, u64 serial,
+ void __iomem *ras_base)
{
u32 hl[CXL_HEADERLOG_SIZE_U32];
void __iomem *addr;
@@ -302,7 +316,7 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
}
header_log_copy(ras_base, hl);
- trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
+ trace_cxl_aer_uncorrectable_error(port, dport, status, fe, hl, serial);
writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
return true;
@@ -358,7 +372,7 @@ static void cxl_handle_proto_error(struct pci_dev *pdev, struct cxl_port *port,
cxl_handle_rdport_errors(pdev);
if (severity == AER_CORRECTABLE) {
- cxl_handle_cor_ras(&pdev->dev, pci_get_dsn(pdev),
+ cxl_handle_cor_ras(port, dport, pci_get_dsn(pdev),
to_ras_base(port, dport));
pcie_clear_device_status(pdev);
} else {
diff --git a/drivers/cxl/core/ras_rch.c b/drivers/cxl/core/ras_rch.c
index cbd02cabefbc..1bcd3c491aaa 100644
--- a/drivers/cxl/core/ras_rch.c
+++ b/drivers/cxl/core/ras_rch.c
@@ -113,9 +113,8 @@ void cxl_handle_rdport_errors(struct pci_dev *pdev)
pci_print_aer(pdev, severity, &aer_regs);
if (severity == AER_CORRECTABLE)
- cxl_handle_cor_ras(&pdev->dev, pci_get_dsn(pdev),
+ cxl_handle_cor_ras(port, dport, pci_get_dsn(pdev),
dport->regs.ras);
else
- cxl_handle_ras(&pdev->dev, pci_get_dsn(pdev),
- dport->regs.ras);
+ cxl_handle_ras(port, dport, pci_get_dsn(pdev), dport->regs.ras);
}
diff --git a/drivers/cxl/core/trace.c b/drivers/cxl/core/trace.c
index 7f2a9dd0d0e3..df42d119c53d 100644
--- a/drivers/cxl/core/trace.c
+++ b/drivers/cxl/core/trace.c
@@ -2,7 +2,42 @@
/* Copyright(c) 2022 Intel Corporation. All rights reserved. */
#include <cxl.h>
+#include <cxlmem.h>
#include "core.h"
+const char *cxl_trace_memdev_name(struct cxl_port *port)
+{
+ if (is_cxl_endpoint(port)) {
+ struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
+
+ return dev_name(&cxlmd->dev);
+ }
+
+ return "";
+}
+
+const char *cxl_trace_host_name(struct cxl_port *port)
+{
+ if (is_cxl_endpoint(port)) {
+ struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
+
+ return dev_name(cxlmd->dev.parent);
+ }
+
+ return dev_name(port->uport_dev);
+}
+
+const char *cxl_trace_port_name(struct cxl_port *port)
+{
+ return dev_name(&port->dev);
+}
+
+const char *cxl_trace_dport_name(struct cxl_dport *dport)
+{
+ if (dport)
+ return dev_name(dport->dport_dev);
+ return "";
+}
+
#define CREATE_TRACE_POINTS
#include "trace.h"