From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Terry Bowman <terry.bowman@amd.com>
Cc: <dave@stgolabs.net>, <dave.jiang@intel.com>,
	<alison.schofield@intel.com>, <dan.j.williams@intel.com>,
	<bhelgaas@google.com>, <shiju.jose@huawei.com>,
	<ming.li@zohomail.com>, <Smita.KoralahalliChannabasappa@amd.com>,
	<rrichter@amd.com>, <dan.carpenter@linaro.org>,
	<PradeepVineshReddy.Kodamati@amd.com>, <lukas@wunner.de>,
	<Benjamin.Cheatham@amd.com>,
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	<linux-cxl@vger.kernel.org>, <vishal.l.verma@intel.com>,
	<alucerop@amd.com>, <ira.weiny@intel.com>,
	<linux-kernel@vger.kernel.org>, <linux-pci@vger.kernel.org>
Subject: Re: [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler
Date: Tue, 3 Feb 2026 16:18:41 +0000
Message-ID: <20260203161841.000006a1@huawei.com>
In-Reply-To: <20260203025244.3093805-8-terry.bowman@amd.com>

On Mon, 2 Feb 2026 20:52:42 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL drivers now implement protocol RAS support. PCI protocol errors,
> however, continue to be reported via the AER capability and must still be
> handled by a PCI error recovery callback.
> 
> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
> new cxl_pci_error_detected() implementation that handles only uncorrectable
> PCI protocol errors reported through AER.
> 
> Introduce a helper named cxl_handle_aer() to handle and log the CXL
> device's AER error.
> 
> This cleanly separates CXL protocol error handling from PCI AER handling
> and ensures that each subsystem processes only the errors it is
> responsible for.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> 
> ---
> 
> Changes in v14->v15:
> - Title update (Terry)
> - Change cxl_pci_error_detected() to handle & log AER (Terry)
> - Update commit message (Terry)
> - Moved cxl_handle_ras()/cxl_handle_cor_ras() to earlier patch (Terry)
> 
> Changes in v13->v14:
> - Update commit headline (Bjorn)
> - Rename pci_error_detected()/pci_cor_error_detected() ->
>   cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
> - Remove now-invalid comment in cxl_error_detected() (Jonathan)
> - Split into separate patches for UCE and CE (Terry)
> 
> Changes in v12->v13:
> - Update commit message (Terry)
> - Updated all the implementation and commit message. (Terry)
> - Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
>   pdev (Dave Jiang)
> 
> Changes in v11->v12:
> - None
> 
> Changes in v10->v11:
> - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
> - cxl_error_detected() - Remove extra line (Shiju)
> - Changes moved to core/ras.c (Terry)
> - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
> - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
> - Move #include "pci.h" from cxl.h to core.h (Terry)
> - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
> ---
>  drivers/cxl/core/ras.c | 68 +++++++++++++++---------------------------
>  drivers/cxl/cxlpci.h   |  9 +++---
>  drivers/cxl/pci.c      |  6 ++--
>  3 files changed, 31 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 970ff3df442c..061e6aaec176 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -441,55 +441,35 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>  
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> -				    pci_channel_state_t state)
> +static bool cxl_handle_aer(struct pci_dev *pdev)
>  {
> -	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> -	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> -	struct device *dev = &cxlmd->dev;
> -	bool ue;
> -
> -	scoped_guard(device, dev) {
> -		if (!dev->driver) {
> -			dev_warn(&pdev->dev,
> -				 "%s: memdev disabled, abort error handling\n",
> -				 dev_name(dev));
> -			return PCI_ERS_RESULT_DISCONNECT;
> -		}
> +	struct aer_capability_regs aer;

I don't see a strong reason to use this structure given you just want two
of the registers and read into them one by one.
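
Roughly what I have in mind, as an untested user-space sketch rather than
kernel code. struct mock_pci_dev, mock_read_config_dword() and
mock_cxl_handle_aer() are hypothetical stand-ins for struct pci_dev,
pci_read_config_dword() and the helper in this patch. The status clearing
is left out, and the return expression is copied from the patch unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Register offsets within the AER extended capability. */
#define PCI_ERR_UNCOR_STATUS	0x04
#define PCI_ERR_UNCOR_MASK	0x08

/* Hypothetical stand-in for struct pci_dev: 4 KiB of fake config space. */
struct mock_pci_dev {
	uint16_t aer_cap;	/* offset of the AER capability, 0 if absent */
	uint8_t config[4096];
};

/* Hypothetical stand-in for pci_read_config_dword(). */
static int mock_read_config_dword(const struct mock_pci_dev *pdev, int where,
				  uint32_t *val)
{
	assert(where >= 0 && where + 4 <= (int)sizeof(pdev->config));
	memcpy(val, &pdev->config[where], sizeof(*val));
	return 0;
}

static bool mock_cxl_handle_aer(const struct mock_pci_dev *pdev)
{
	/* Two plain locals instead of a whole struct aer_capability_regs. */
	uint32_t status, mask;

	if (!pdev->aer_cap)
		return false;

	mock_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS,
			       &status);
	mock_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_UNCOR_MASK,
			       &mask);

	/* Same expression as the quoted patch, kept as-is. */
	return status & mask;
}
```

The only point here is the two u32 locals; everything else just mirrors
what the patch already does.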

> +	u32 aer_cap = pdev->aer_cap;
>  
> -		if (cxlds->rcd)
> -			cxl_handle_rdport_errors(cxlds);
> -		/*
> -		 * A frozen channel indicates an impending reset which is fatal to
> -		 * CXL.mem operation, and will likely crash the system. On the off
> -		 * chance the situation is recoverable dump the status of the RAS
> -		 * capability registers and bounce the active state of the memdev.
> -		 */
> -		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
> -				    cxlmd->endpoint->regs.ras);
> +	if (!aer_cap) {
> +		pr_warn_ratelimited("%s: AER capability isn't present\n",
> +				    pci_name(pdev));

These could use dev_warn_ratelimited(), or you could even add a wrapper
similar to pci_info_ratelimited().
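
A rough illustration of the wrapper shape, again as an untested user-space
mock. mock_dev_warn_ratelimited() stands in for dev_warn_ratelimited() and
does no actual rate limiting, mock_pci_warn_ratelimited() is a hypothetical
name for the wrapper, and struct mock_pdev and mock_has_aer_cap() exist only
for this example.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for struct device and struct pci_dev. */
struct mock_device {
	const char *name;
};

struct mock_pdev {
	struct mock_device dev;
	unsigned int aer_cap;	/* 0 if the AER capability is absent */
};

/* Counts emitted warnings so the behaviour can be checked. */
static int mock_warn_count;

/*
 * Stand-in for dev_warn_ratelimited(): prefixes the device name, which is
 * why the caller no longer needs "%s" plus pci_name(). No rate limiting is
 * modelled here. Uses the GNU ##__VA_ARGS__ extension, as the kernel does.
 */
#define mock_dev_warn_ratelimited(dev, fmt, ...)			\
	do {								\
		mock_warn_count++;					\
		fprintf(stderr, "%s: " fmt, (dev)->name, ##__VA_ARGS__); \
	} while (0)

/* Hypothetical wrapper, shaped like pci_info_ratelimited(). */
#define mock_pci_warn_ratelimited(pdev, fmt, ...)			\
	mock_dev_warn_ratelimited(&(pdev)->dev, fmt, ##__VA_ARGS__)

static bool mock_has_aer_cap(const struct mock_pdev *pdev)
{
	if (!pdev->aer_cap) {
		mock_pci_warn_ratelimited(pdev,
					  "AER capability isn't present\n");
		return false;
	}
	return true;
}
```

Either form drops the open-coded pci_name() in the format string, since the
device-aware print helpers add the device prefix themselves.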

> +		return false;
>  	}
>  
> -	switch (state) {
> -	case pci_channel_io_normal:
> -		if (ue) {
> -			device_release_driver(dev);
> -			return PCI_ERS_RESULT_NEED_RESET;
> -		}
> -		return PCI_ERS_RESULT_CAN_RECOVER;
> -	case pci_channel_io_frozen:
> -		dev_warn(&pdev->dev,
> -			 "%s: frozen state error detected, disable CXL.mem\n",
> -			 dev_name(dev));
> -		device_release_driver(dev);
> -		return PCI_ERS_RESULT_NEED_RESET;
> -	case pci_channel_io_perm_failure:
> -		dev_warn(&pdev->dev,
> -			 "failure state error detected, request disconnect\n");
> -		return PCI_ERS_RESULT_DISCONNECT;
> -	}
> -	return PCI_ERS_RESULT_NEED_RESET;
> +	pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_STATUS, &aer.uncor_status);
> +	pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK, &aer.uncor_mask);
> +
> +	/* The AER driver logged the error */
> +	pci_aer_clear_nonfatal_status(pdev);
> +	pci_aer_clear_fatal_status(pdev);
> +
> +	return (aer.uncor_status & aer.uncor_mask);
> +}

