public inbox for linux-pci@vger.kernel.org
 help / color / mirror / Atom feed
From: "Bowman, Terry" <terry.bowman@amd.com>
To: Alejandro Lucero Palau <alucerop@amd.com>,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, nifan.cxl@gmail.com,
	ming4.li@intel.com, dave@stgolabs.net,
	jonathan.cameron@huawei.com, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	dan.j.williams@intel.com, bhelgaas@google.com,
	mahesh@linux.ibm.com, ira.weiny@intel.com, oohall@gmail.com,
	Benjamin.Cheatham@amd.com, rrichter@amd.com,
	nathan.fontenot@amd.com, Smita.KoralahalliChannabasappa@amd.com,
	lukas@wunner.de, PradeepVineshReddy.Kodamati@amd.com
Subject: Re: [PATCH v4 07/15] PCI/AER: Add CXL PCIe Port Uncorrectable Error recovery in AER service driver
Date: Fri, 13 Dec 2024 09:07:18 -0600	[thread overview]
Message-ID: <d37e6345-30e9-4765-b2be-51e139738c33@amd.com> (raw)
In-Reply-To: <58b9bcce-ac76-e0d0-1a8d-a8d8c0b20b43@amd.com>




On 12/12/2024 3:28 AM, Alejandro Lucero Palau wrote:
> On 12/11/24 23:39, Terry Bowman wrote:
>> Existing recovery procedure for PCIe Uncorrectable Errors (UCE) does not
>> apply to CXL devices. Recovery can not be used for CXL devices because of
>> potential corruption on what can be system memory. Also, current PCIe UCE
>> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
>> does not begin at the RP/DSP but begins at the first downstream device.
>> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
>> CXL recovery is needed because of the different handling requirements
>>
>> Add a new function, cxl_do_recovery() using the following.
>>
>> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
>> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
>> will begin iteration at the RP or DSP rather than beginning at the
>> first downstream device.
>>
>> Add cxl_report_error_detected() as an analog to report_error_detected().
>> It will call pci_driver::cxl_err_handlers for each iterated downstream
>> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
>> indicating if there was a UCE error detected during handling.
>>
>> cxl_do_recovery() uses the status from cxl_report_error_detected() to
>> determine how to proceed. Non-fatal CXL UCE errors will be treated as
>> fatal. If a UCE was present during handling then cxl_do_recovery()
>> will kernel panic.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>   drivers/pci/pci.h      |  3 +++
>>   drivers/pci/pcie/aer.c |  4 ++++
>>   drivers/pci/pcie/err.c | 54 ++++++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 61 insertions(+)
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 14d00ce45bfa..5a67e41919d8 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -658,6 +658,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>   		pci_channel_state_t state,
>>   		pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
>>   
>> +/* CXL error reporting and handling */
>> +void cxl_do_recovery(struct pci_dev *dev);
>> +
>>   bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
>>   int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
>>   
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index c1eb939c1cca..861521872318 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1024,6 +1024,8 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>   			err_handler->error_detected(dev, pci_channel_io_normal);
>>   		else if (info->severity == AER_FATAL)
>>   			err_handler->error_detected(dev, pci_channel_io_frozen);
>> +
>> +		cxl_do_recovery(dev);
>>   	}
>>   out:
>>   	device_unlock(&dev->dev);
>> @@ -1048,6 +1050,8 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>   			pdrv->cxl_err_handler->cor_error_detected(dev);
>>   
>>   		pcie_clear_device_status(dev);
>> +	} else {
>> +		cxl_do_recovery(dev);
>>   	}
>>   }
>>   
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 31090770fffc..6f7cf5e0087f 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -276,3 +276,57 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>   
>>   	return status;
>>   }
>> +
>> +static void cxl_walk_bridge(struct pci_dev *bridge,
>> +			    int (*cb)(struct pci_dev *, void *),
>> +			    void *userdata)
>> +{
>> +	bool *status = userdata;
>> +
>> +	cb(bridge, status);
>> +	if (bridge->subordinate && !*status)
>
> I would prefer to use not a pointer for status as you are not changing 
> what it points to here, so first a cast then using just !status in the 
> conditional.
>

Hi Alejandro,

I agree. This can be significantly improved. I'll remove the condition on
'status' and instead introduce condition on cb() and pci_walk_bus() return value.
But, 'status' does need to be a pointer so the caller (cxl_do_recovery()) can
determine if to invoke panic.

>> +		pci_walk_bus(bridge->subordinate, cb, status);
>> +}
>> +
>> +static int cxl_report_error_detected(struct pci_dev *dev, void *data)
>> +{
>> +	struct pci_driver *pdrv = dev->driver;
>> +	bool *status = data;
>> +
>> +	device_lock(&dev->dev);
>> +	if (pdrv && pdrv->cxl_err_handler &&
>> +	    pdrv->cxl_err_handler->error_detected) {
>> +		const struct cxl_error_handlers *cxl_err_handler =
>> +			pdrv->cxl_err_handler;
>> +		*status |= cxl_err_handler->error_detected(dev);
>
> This implies status should not be a bool pointer as different bits can 
> be set by the returning value, but as the code seems to only care about 
> any bit implying an error and therefore error detected, I guess that is 
> fine. However, the next function calling this one is using an int ...
>
>
> Confusing to me. I would expect here not an OR but returning just when a 
> first error is detected, handling the lock properly, with the walk 
> function behind the scenes breaking the walk if the return is anything 
> other than zero.

The cxl_err_handler->error_detected() return value is a bool. But, The bitwise OR is not necessary. I'll refactor as part of mentioned above.

Thanks for the feedback.

-Terry

>
>
>> +	}
>> +	device_unlock(&dev->dev);
>> +	return *status;
>> +}
>> +
>> +void cxl_do_recovery(struct pci_dev *dev)
>> +{
>> +	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>> +	int type = pci_pcie_type(dev);
>> +	struct pci_dev *bridge;
>> +	int status;
>> +
>> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
>> +	    type == PCI_EXP_TYPE_DOWNSTREAM ||
>> +	    type == PCI_EXP_TYPE_UPSTREAM ||
>> +	    type == PCI_EXP_TYPE_ENDPOINT)
>> +		bridge = dev;
>> +	else
>> +		bridge = pci_upstream_bridge(dev);
>> +
>> +	cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
>> +	if (status)
>> +		panic("CXL cachemem error.");
>> +
>> +	if (host->native_aer || pcie_ports_native) {
>> +		pcie_clear_device_status(dev);
>> +		pci_aer_clear_nonfatal_status(dev);
>> +	}
>> +
>> +	pci_info(bridge, "CXL uncorrectable error.\n");
>> +}


  reply	other threads:[~2024-12-13 15:07 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
2024-12-11 23:39 ` [PATCH v4 01/15] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
2024-12-11 23:39 ` [PATCH v4 02/15] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support Terry Bowman
2024-12-11 23:39 ` [PATCH v4 03/15] cxl/pci: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
2024-12-11 23:39 ` [PATCH v4 04/15] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
2024-12-12  1:34   ` Li Ming
2024-12-12 19:59     ` Bowman, Terry
2024-12-14 13:34       ` Li Ming
2024-12-11 23:39 ` [PATCH v4 05/15] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver Terry Bowman
2024-12-11 23:39 ` [PATCH v4 06/15] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices Terry Bowman
2024-12-24 18:28   ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 07/15] PCI/AER: Add CXL PCIe Port Uncorrectable Error recovery in AER service driver Terry Bowman
2024-12-12  9:28   ` Alejandro Lucero Palau
2024-12-13 15:07     ` Bowman, Terry [this message]
2024-12-24 18:31   ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 08/15] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
2024-12-12 10:36   ` Alejandro Lucero Palau
2024-12-13 15:10     ` Bowman, Terry
2024-12-24 18:38   ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 09/15] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
2024-12-24 18:41   ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 10/15] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
2024-12-12 10:38   ` Alejandro Lucero Palau
2024-12-24 18:42   ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 11/15] cxl/pci: Change find_cxl_port() to non-static Terry Bowman
2024-12-11 23:39 ` [PATCH v4 12/15] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
2024-12-12  2:19   ` Li Ming
2024-12-24 18:43   ` Jonathan Cameron
2024-12-11 23:40 ` [PATCH v4 13/15] cxl/pci: Add trace logging " Terry Bowman
2024-12-12  9:46   ` Alejandro Lucero Palau
2024-12-24 18:46   ` Jonathan Cameron
2024-12-26 17:01     ` Bowman, Terry
2024-12-11 23:40 ` [PATCH v4 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
2024-12-12  2:31   ` Li Ming
2024-12-17 14:39     ` Bowman, Terry
2024-12-24 18:50   ` Jonathan Cameron
2024-12-26 17:07     ` Bowman, Terry
2025-01-07 11:32       ` Jonathan Cameron
2024-12-11 23:40 ` [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports Terry Bowman
2024-12-12  9:44   ` Alejandro Lucero Palau
2024-12-12 10:44     ` Alejandro Lucero Palau
2024-12-13 15:22       ` Bowman, Terry
2024-12-13 15:34     ` Bowman, Terry
2024-12-24 18:53   ` Jonathan Cameron
2024-12-26 17:19     ` Bowman, Terry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d37e6345-30e9-4765-b2be-51e139738c33@amd.com \
    --to=terry.bowman@amd.com \
    --cc=Benjamin.Cheatham@amd.com \
    --cc=PradeepVineshReddy.Kodamati@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=alucerop@amd.com \
    --cc=bhelgaas@google.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ira.weiny@intel.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=mahesh@linux.ibm.com \
    --cc=ming4.li@intel.com \
    --cc=nathan.fontenot@amd.com \
    --cc=nifan.cxl@gmail.com \
    --cc=oohall@gmail.com \
    --cc=rrichter@amd.com \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox