public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: "Bowman, Terry" <terry.bowman@amd.com>
Cc: <dave@stgolabs.net>, <dave.jiang@intel.com>,
	<alison.schofield@intel.com>, <dan.j.williams@intel.com>,
	<bhelgaas@google.com>, <shiju.jose@huawei.com>,
	<ming.li@zohomail.com>, <Smita.KoralahalliChannabasappa@amd.com>,
	<rrichter@amd.com>, <dan.carpenter@linaro.org>,
	<PradeepVineshReddy.Kodamati@amd.com>, <lukas@wunner.de>,
	<Benjamin.Cheatham@amd.com>,
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	<linux-cxl@vger.kernel.org>, <vishal.l.verma@intel.com>,
	<alucerop@amd.com>, <ira.weiny@intel.com>,
	<linux-kernel@vger.kernel.org>, <linux-pci@vger.kernel.org>
Subject: Re: [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow
Date: Thu, 5 Feb 2026 17:16:27 +0000	[thread overview]
Message-ID: <20260205171627.000013da@huawei.com> (raw)
In-Reply-To: <e296e48e-40d1-483b-9f49-78e8bd3b885f@amd.com>

On Tue, 3 Feb 2026 12:21:56 -0600
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 2/3/2026 9:40 AM, Jonathan Cameron wrote:
> > On Mon, 2 Feb 2026 20:52:40 -0600
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >   
> >> Introduce CXL Port protocol error handling callbacks to unify detection,
> >> logging, and recovery across CXL Ports and Endpoints, including RCH
> >> downstream ports. Establish a consistent flow for correctable and
> >> uncorrectable CXL protocol errors.
> >>
> >> Provide the solution by adding cxl_port_cor_error_detected() and
> >> cxl_port_error_detected() to handle correctable and uncorrectable handling
> >> through CXL RAS helpers, coordinating uncorrectable recovery in
> >> cxl_do_recovery(), and panicking when the handler returns PCI_ERS_RESULT_PANIC
> >> to preserve fatal cachemem behavior. Gate endpoint handling on the endpoint
> >> driver being bound to avoid processing errors on disabled devices.
> >>
> >> Centralize the RAS base lookup in cxl_get_ras_base(), selecting the
> >> downstream-port dport->regs.ras for Root/Downstream Ports and port->regs.ras
> >> for Upstream Ports/Endpoints.
> >>
> >> Export pcie_clear_device_status() and pci_aer_clear_fatal_status() to enable
> >> cxl_core to clear PCIe/AER state in these flows.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> >> Reviewed-by: Dave Jiang dave.jiang@intel.com  
> > 
> > Hi Terry,
> > 
> > A few comments inline.
> > 
> > Thanks,
> > 
> > Jonathan
> >   
> 
> Thanks for reviewing.
> 
> >>
> >> ---
> >>
> >> Changes in v14->v15:
> >> - Update commit message and title. Added Bjorn's ack.
> >> - Move CE and UCE handling logic here
> >>
> >> Changes in v13->v14:
> >> - Add Dave Jiang's review-by
> >> - Update commit message & headline (Bjorn)
> >> - Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
> >>   one line (Jonathan)
> >> - Remove cxl_walk_port() (Dan)
> >> - Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
> >>   sufficient (Dan)
> >> - Remove device_lock_if()
> >> - Combined CE and UCE here (Terry)
> >>
> >> Changes in v12->v13:
> >> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
> >>   patch (Terry)
> >> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
> >> - Remove check for dport->dport_dev (Dave)
> >> - Remove whitespace (Terry)
> >>
> >> Changes in v11->v12:
> >> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
> >>   pci_to_cxl_dev()
> >> - Change cxl_error_detected() -> cxl_cor_error_detected()
> >> - Remove NULL variable assignments
> >> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
> >>   port searches.
> >>
> >> Changes in v10->v11:
> >> - None
> >> ---
> >>  drivers/cxl/core/ras.c | 134 +++++++++++++++++++++++++++++++++++++++++
> >>  drivers/pci/pci.c      |   1 +
> >>  drivers/pci/pci.h      |   2 -
> >>  drivers/pci/pcie/aer.c |   1 +
> >>  include/linux/aer.h    |   2 +
> >>  include/linux/pci.h    |   2 +
> >>  6 files changed, 140 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> >> index a6c0bc6d7203..0216dafa6118 100644
> >> --- a/drivers/cxl/core/ras.c
> >> +++ b/drivers/cxl/core/ras.c
> >> @@ -218,6 +218,68 @@ static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> >>  	return NULL;
> >>  }
> >>  
> >> +static void __iomem *cxl_get_ras_base(struct device *dev)
> >> +{
> >> +	struct pci_dev *pdev = to_pci_dev(dev);
> >> +
> >> +	switch (pci_pcie_type(pdev)) {
> >> +	case PCI_EXP_TYPE_ROOT_PORT:
> >> +	case PCI_EXP_TYPE_DOWNSTREAM:
> >> +	{
> >> +		struct cxl_dport *dport;  
> > 
> > 		struct cxl_dport *dport = NULL;
> >   
> >> +		struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);  
> > 
> > as if this failed, dport is not written. Alternative is check port, not dport as port
> > will always be initialized whether or not failure occurs in find_cxl_port()
> >   
> 
> Ok.
> 
> >   
> >> +
> >> +		if (!dport) {
> >> +			pci_err(pdev, "Failed to find the CXL device");
> >> +			return NULL;
> >> +		}
> >> +		return dport->regs.ras;
> >> +	}
> >> +	case PCI_EXP_TYPE_UPSTREAM:
> >> +	case PCI_EXP_TYPE_ENDPOINT:
> >> +	{
> >> +		struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
> >> +
> >> +		if (!port) {
> >> +			pci_err(pdev, "Failed to find the CXL device");
> >> +			return NULL;
> >> +		}
> >> +		return port->regs.ras;
> >> +	}
> >> +	}
> >> +	dev_warn_once(dev, "Error: Unsupported device type (%#x)", pci_pcie_type(pdev));
> >> +	return NULL;
> >> +}  
> >   
> >>  void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> >>  {
> >>  	void __iomem *addr;
> >> @@ -288,6 +350,60 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> >>  	return true;
> >>  }
> >>  
> >> +static void cxl_port_cor_error_detected(struct device *dev)
> >> +{
> >> +	struct pci_dev *pdev = to_pci_dev(dev);
> >> +	struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> >> +
> >> +	if (is_cxl_endpoint(port)) {
> >> +		struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> >> +		struct cxl_dev_state *cxlds = cxlmd->cxlds;
> >> +
> >> +		guard(device)(&cxlmd->dev);  
> > 
> > Maybe add a comment on why this lock needs to be held and then why the dev->drvier
> > below needs to be true.
> >   
> This can be removed. This was added to ensure the EP's RAS registers remained accessible 
> during handling. This was when when the mapped RAS registers were owned by the CXL memory 
> device. This has changed such that the EP RAS registers are now owned by the EP Port. And, 
> the Endpoint Port is already locked in cxl_proto_err_work_fn() before calling this funcion.
> 
> >> +
> >> +		if (!dev->driver) {
> >> +			dev_warn(&pdev->dev,
> >> +				 "%s: memdev disabled, abort error handling\n",
> >> +				 dev_name(dev));  
> > 
> > Same question as below on why pdev->dev / dev_name(dev) here.
> > Maybe pci_warn() is more appropriate.
> >   
> 
> I believe the driver check can be removed but would like your input. The check 
> for the driver is another piece of code specifically for when the handler was accessing 
> the memdev's RAS registers. It was a last check to make certain the device is bound 
> to a driver before accessing. EP RAS is now owned by the Endpoint Port.

Sounds fine to me to drop this check if we don't need anything that
belongs to that driver any more.

Jonathan

  reply	other threads:[~2026-02-05 17:16 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-03  2:52 [PATCH v15 0/9] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-02-03  2:52 ` [PATCH v15 1/9] PCI/AER: Introduce AER-CXL Kfifo in new file, pcie/aer_cxl_vh.c Terry Bowman
2026-02-04  4:25   ` dan.j.williams
2026-02-03  2:52 ` [PATCH v15 2/9] cxl: Update CXL Endpoint tracing Terry Bowman
2026-02-04  4:29   ` dan.j.williams
2026-02-03  2:52 ` [PATCH v15 3/9] PCI/ERR: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
2026-02-03  2:52 ` [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error Terry Bowman
2026-02-03 15:26   ` Jonathan Cameron
2026-02-03 17:00     ` Bowman, Terry
2026-02-05 17:13       ` Jonathan Cameron
2026-02-04  4:46   ` dan.j.williams
2026-02-03  2:52 ` [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow Terry Bowman
2026-02-03 15:40   ` Jonathan Cameron
2026-02-03 18:21     ` Bowman, Terry
2026-02-05 17:16       ` Jonathan Cameron [this message]
2026-02-04  5:08   ` dan.j.williams
2026-02-04 17:11     ` Bowman, Terry
2026-02-04 21:22       ` dan.j.williams
2026-02-05 16:07         ` Bowman, Terry
2026-02-05 21:17           ` dan.j.williams
2026-02-03  2:52 ` [PATCH v15 6/9] cxl: Update error handlers to support CXL Port protocol errors Terry Bowman
2026-02-03 15:54   ` Jonathan Cameron
2026-02-03  2:52 ` [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
2026-02-03 16:18   ` Jonathan Cameron
2026-02-03 17:31   ` Dave Jiang
2026-02-03 18:35     ` Bowman, Terry
2026-02-03 18:49       ` Dave Jiang
2026-02-03 20:21         ` Dave Jiang
2026-02-03  2:52 ` [PATCH v15 8/9] cxl: Remove Endpoint AER correctable handler Terry Bowman
2026-02-03 16:27   ` Jonathan Cameron
2026-02-03  2:52 ` [PATCH v15 9/9] cxl: Enable CXL protocol error reporting Terry Bowman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260205171627.000013da@huawei.com \
    --to=jonathan.cameron@huawei.com \
    --cc=Benjamin.Cheatham@amd.com \
    --cc=PradeepVineshReddy.Kodamati@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=alucerop@amd.com \
    --cc=bhelgaas@google.com \
    --cc=dan.carpenter@linaro.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ira.weiny@intel.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=ming.li@zohomail.com \
    --cc=rrichter@amd.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=shiju.jose@huawei.com \
    --cc=terry.bowman@amd.com \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox