Linux CXL
 help / color / mirror / Atom feed
From: Jonathan Cameron <jic23@kernel.org>
To: "Bowman, Terry" <terry.bowman@amd.com>
Cc: dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, djbw@kernel.org, bhelgaas@google.com,
	shiju.jose@huawei.com, ming.li@zohomail.com,
	Smita.KoralahalliChannabasappa@amd.com, rrichter@amd.com,
	dan.carpenter@linaro.org, PradeepVineshReddy.Kodamati@amd.com,
	lukas@wunner.de, Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	vishal.l.verma@intel.com, alucerop@amd.com, ira.weiny@intel.com,
	corbet@lwn.net, rafael@kernel.org, xueshuai@linux.alibaba.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-doc@vger.kernel.org,
	Mauro Carvalho Chehab <mchehab@kernel.org>
Subject: Re: [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events
Date: Fri, 8 May 2026 15:05:33 +0100	[thread overview]
Message-ID: <20260508150533.04e19cf9@jic23-huawei> (raw)
In-Reply-To: <8913c666-a343-4717-8ab2-0b8546d1bdfb@amd.com>

On Thu, 7 May 2026 13:33:45 -0500
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 5/7/2026 1:08 PM, Jonathan Cameron wrote:
> > [Some people who received this message don't often get email from jic23@kernel.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> > 
> > On Tue, 5 May 2026 12:30:20 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >   
> >> From: Dan Williams <djbw@kernel.org>  
> > 
> > +CC Mauro - rasdaemon related - see below.
> >   
> >>
> >> CXL protocol error logging uses two parallel sets of trace events. The
> >> cxl_port_aer_correctable_error() and cxl_port_aer_uncorrectable_error()
> >> events are used by CPER for CXL Port devices. The cxl_aer_correctable_error()
> >> and cxl_aer_uncorrectable_error() events are used for CXL Endpoints. Update
> >> the trace routines to use the latter for all CXL devices on both the CPER
> >> and native AER paths.
> >>
> >> Generalize cxl_aer_correctable_error()/cxl_aer_uncorrectable_error to
> >> take a struct device * and a u64 serial argument supplied by the caller.
> >> cxl_handle_ras() and cxl_handle_cor_ras() gain the new u64 serial parameter,
> >> sourced from pci_get_dsn().
> >>
> >> The CPER path keeps its existing Port-vs-Endpoint dispatch and passes the
> >> new arguments to the unified trace events. The CPER path will be folded
> >> together in a following patch.
> >>
> >> Remove the now-unused cxl_port_aer_correctable_error() and
> >> cxl_port_aer_uncorrectable_error().
> >>
> >> **WARNING: ABI BREAK**
> >> Rename the trace event field "memdev" to "device" so all CXL device types
> >> (Ports and Endpoints) can be reported under a common field name. Note this
> >> is an ABI break for userspace tools that key off the old "memdev" field.
> >> Specifically, rasdaemon's ras-cxl-handler.c looks up "memdev" and bails on
> >> NULL, so an unmodified rasdaemon will drop every CXL CE/UCE event once this
> >> kernel ships. A rasdaemon update is needed in a separate series.
> >>
> >> The need for the field rename was discussed in v16 review [1].  
> > 
> > This concerns me (sorry I wasn't paying attention to the v16 thread).
> > It is a userspace regression against code that is out in the wild and typically
> > not updated in sync with the kernel.
> > 
> > If you are suggesting breaking ras-daemon at the very least +CC the maintainer.
> > 
> > To get to a unified tracepoint add a new one that does what you want, but
> > maintain the existing ones as well.  Userspace can then migrate and maybe
> > in 5+ years time we can delete the non unified ones.
> > 
> > No actually comments on the code, just left it all here for Mauro,
> > 
> > Thanks,
> > 
> > Jonathan
> >   
> 
> Dan was clear about using a single set of CE and UE handlers for all CXL RAS 
> protocol errors. While I understand there may be concerns, please direct any 
> objections to Dan and clarify what changes are required to avoid this 
> repeatedly going back and forth.
> 
> [1] https://lore.kernel.org/linux-cxl/69cb2d5ba3111_178904100b7@dwillia2-mobl4.notmuch/

Sure - Dan's on this thread so I'm sure he'll see it sooner or later.

Perhaps I'm missing something that makes this less critical than it appears.

You can have a single set of handlers, but at the point of spitting the actual
tracepoints out we need to keep spitting the old ones (+ possibly a new unified
one if you want to one day get rid of the separation.)  Bit fiddly but seems
unlikely to be that bad.  e.g. put a wrapper where you currently have
trace_cxl_aer_uncorrectable_error() and have that omit the new and old (based
on device type) tracepoints.  Then when we eventually drop this after enough
years that we can be sure the new one is in use, the code cleanup is all in
one place.

Note this issue isn't a "maybe" thing - we are talking userspace ABI breakage
in an interface known to be in use in commonly used software that is not
typically updated in time with the kernel cadence.  There have been arguments
that some trace points are not 'stable' but that definitely isn't the case for
the RAS ones which are the main interface from kernel to userspace tooling.

In theory you could make such a change and maybe get away with it (on basis
a regression only exists if anyone notices) but you'd need ALL the distros
on board that ship rasdaemon + if you really don't want to end up reverting
you'd have to work closely with the hyperscalers who might decide to throw a
'regression + revert' request at the list which either means a scramble to
put in place what I describe above, or this series being reverted.  Note there
are downstream forks of rasdaemon to content with as well.

That pain just isn't worth it.

Mauro, any idea if any distros scan for RAS tracepoints for compatibility breakage?
They probably should like they do of ioctls and similar but no idea if anyone
actually does yet.  If they do we'd get the revert request pretty quickly...
If not we get to wait for some one to hit it in a functional test 
- thankfully RAS paths are definitely in those test sets but they tend to
run later and hence when a revert / fix is more painful.

Jonathan


  reply	other threads:[~2026-05-08 14:05 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-05-05 17:30 ` [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo Terry Bowman
2026-05-05 21:17   ` Dave Jiang
2026-05-07 17:53   ` Jonathan Cameron
2026-05-07 18:26     ` Bowman, Terry
2026-05-05 17:30 ` [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events Terry Bowman
2026-05-05 21:46   ` Dave Jiang
2026-05-07 18:08   ` Jonathan Cameron
2026-05-07 18:33     ` Bowman, Terry
2026-05-08 14:05       ` Jonathan Cameron [this message]
2026-05-09  3:49         ` Dan Williams (nvidia)
2026-05-05 17:30 ` [PATCH v17 03/11] cxl: Use common CPER handling for all CXL devices Terry Bowman
2026-05-05 22:02   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 04/11] cxl: Rename find_cxl_port() to find_cxl_port_by_dport() Terry Bowman
2026-05-05 22:06   ` Dave Jiang
2026-05-07 18:11     ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 05/11] cxl: Limit CXL-CPER kfifo registration functions scope Terry Bowman
2026-05-05 22:16   ` Dave Jiang
2026-05-07 18:14   ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow Terry Bowman
2026-05-07 18:22   ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 07/11] PCI/CXL: Add RCH support to CXL handlers Terry Bowman
2026-05-05 23:59   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 08/11] cxl: Remove Endpoint AER correctable handler Terry Bowman
2026-05-05 17:30 ` [PATCH v17 09/11] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
2026-05-06 17:43   ` Dave Jiang
2026-05-07 18:25     ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 10/11] PCI/CXL: Mask/Unmask CXL protocol errors Terry Bowman
2026-05-06 18:00   ` Dave Jiang
2026-05-07 18:29   ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 11/11] Documentation: cxl: Document CXL protocol error handling Terry Bowman
2026-05-06 18:34   ` Dave Jiang
2026-05-07 18:51   ` Jonathan Cameron

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260508150533.04e19cf9@jic23-huawei \
    --to=jic23@kernel.org \
    --cc=Benjamin.Cheatham@amd.com \
    --cc=PradeepVineshReddy.Kodamati@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=alucerop@amd.com \
    --cc=bhelgaas@google.com \
    --cc=corbet@lwn.net \
    --cc=dan.carpenter@linaro.org \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=djbw@kernel.org \
    --cc=ira.weiny@intel.com \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=mchehab@kernel.org \
    --cc=ming.li@zohomail.com \
    --cc=rafael@kernel.org \
    --cc=rrichter@amd.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=shiju.jose@huawei.com \
    --cc=terry.bowman@amd.com \
    --cc=vishal.l.verma@intel.com \
    --cc=xueshuai@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox