Re: [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver

public inbox for linux-cxl@vger.kernel.org
 help / color / mirror / Atom feed

From: "Bowman, Terry" <kibowman@amd.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>,
	Terry Bowman <Terry.Bowman@amd.com>
Cc: ming4.li@intel.com, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	dan.j.williams@intel.com, bhelgaas@google.com,
	mahesh@linux.ibm.com, oohall@gmail.com,
	Benjamin.Cheatham@amd.com, rrichter@amd.com,
	nathan.fontenot@amd.com, smita.koralahallichannabasappa@amd.com
Subject: Re: [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver
Date: Thu, 17 Oct 2024 11:21:36 -0500	[thread overview]
Message-ID: <c756f2e4-2fa8-413a-b6b8-2f3ca8ec27a5@amd.com> (raw)
In-Reply-To: <20241017144315.0000074c@Huawei.com>

Hi Jonathan,

On 10/17/2024 8:43 AM, Jonathan Cameron wrote:
> On Wed, 16 Oct 2024 13:07:37 -0500
> Terry Bowman <Terry.Bowman@amd.com> wrote:
> 
>> Hi Jonathan,
>>
>> On 10/16/24 11:54, Jonathan Cameron wrote:
>>> On Tue, 8 Oct 2024 17:16:49 -0500
>>> Terry Bowman <terry.bowman@amd.com> wrote:
>>>    
>>>> The current pcie_do_recovery() handles device recovery as result of
>>>> uncorrectable errors (UCE). But, CXL port devices require unique
>>>> recovery handling.
>>>>
>>>> Create a cxl_do_recovery() function parallel to pcie_do_recovery(). Add CXL
>>>> specific handling to the new recovery function.
>>>>
>>>> The CXL port UCE recovery must invoke the AER service driver's CXL port
>>>> UCE callback. This is different than the standard pcie_do_recovery()
>>>> recovery that calls the pci_driver::err_handler UCE handler instead.
>>>>
>>>> Treat all CXL PCIe port UCE errors as fatal and call kernel panic to
>>>> "recover" the error. A panic is called instead of attempting recovery
>>>> to avoid potential system corruption.
>>>>
>>>> The uncorrectable support added here will be used to complete CXL PCIe
>>>> port error handling in the future.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>
>>> Hi Terry,
>>>
>>> I'm a little bothered by the subtle difference in the bus walks
>>> in here vs the existing cases. If we need them, comments needed
>>> to explain why.
>>>    
>>
>> Yes, I will add more details in the commit message about "why".
>> I added explanation following your below comment.
>>
>>> If we are going to have separate handling, see if you can share
>>> a lot more of the code by factoring out common functions for
>>> the pci and cxl handling with callbacks to handle the differences.
>>>    
>>
>> Dan requested separate paths for the PCIe and CXL recovery. The intent,
>> as I understand, is to isolate the handling of PCIe and CXL protocol
>> errors. This is to create 2 different classes of protocol errors.
> Function call chain wise I'm reasonably convinced that might be a good
> idea.  But not code wise if it means we end up with more hard to review
> code.
> 
>>
>>> I've managed to get my head around this code a few times in the past
>>> (I think!) and really don't fancy having two subtle variants to
>>> consider next time we get a bug :( The RC_EC additions hurt my head.
>>>
>>> Jonathan
>>
>> Right, the UCE recovery logic is not straightforward. The code can  be
>> refactored to take advantage of reuse. I'm interested in your thoughts
>> after I have provided some responses here.
>>
>>>    
>>>>   static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>>> index 31090770fffc..de12f2eb19ef 100644
>>>> --- a/drivers/pci/pcie/err.c
>>>> +++ b/drivers/pci/pcie/err.c
>>>> @@ -86,6 +86,63 @@ static int report_error_detected(struct pci_dev *dev,
>>>>   	return 0;
>>>>   }
>>>>   
>>>> +static int cxl_report_error_detected(struct pci_dev *dev,
>>>> +				     pci_channel_state_t state,
>>>> +				     enum pci_ers_result *result)
>>>> +{
>>>> +	struct cxl_port_err_hndlrs *cxl_port_hndlrs;
>>>> +	struct pci_driver *pdrv;
>>>> +	pci_ers_result_t vote;
>>>> +
>>>> +	device_lock(&dev->dev);
>>>> +	cxl_port_hndlrs = find_cxl_port_hndlrs();
>>>
>>> Can we refactor to have a common function under this and report_error_detected()?
>>>    
>>
>> Sure, this can be refactored.
>>
>> The difference between cxl_report_error_detected() and report_error_detected() is the
>> handlers that are called.
>>
>> cxl_report_error_detected() calls the CXL driver's registered port error handler.
>>
>> report_error_recovery() calls the pcie_dev::err_handlers.
>>
>> Let me know if I should refactor for common code here?
> 
> It certainly makes sense to do that somewhere in here.  Just have light
> wrappers that provide callbacks so the bulk of the code is shared.
> 

Ok, Ill start on that. I have a v2 ready to-go without the reuse changes.
You want me to wait on sending v2 till it has reuse refactoring?

>>
>>
>>>> +	pdrv = dev->driver;
>>>> +	if (pci_dev_is_disconnected(dev)) {
>>>> +		vote = PCI_ERS_RESULT_DISCONNECT;
>>>> +	} else if (!pci_dev_set_io_state(dev, state)) {
>>>> +		pci_info(dev, "can't recover (state transition %u -> %u invalid)\n",
>>>> +			dev->error_state, state);
>>>> +		vote = PCI_ERS_RESULT_NONE;
>>>> +	} else if (!cxl_port_hndlrs || !cxl_port_hndlrs->error_detected) {
>>>> +		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>>>> +			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>>>> +			pci_info(dev, "can't recover (no error_detected callback)\n");
>>>> +		} else {
>>>> +			vote = PCI_ERS_RESULT_NONE;
>>>> +		}
>>>> +	} else {
>>>> +		vote = cxl_port_hndlrs->error_detected(dev, state);
>>>> +	}
>>>> +	pci_uevent_ers(dev, vote);
>>>> +	*result = merge_result(*result, vote);
>>>> +	device_unlock(&dev->dev);
>>>> +	return 0;
>>>> +}
>>>    
>>>>   static int pci_pm_runtime_get_sync(struct pci_dev *pdev, void *data)
>>>>   {
>>>>   	pm_runtime_get_sync(&pdev->dev);
>>>> @@ -188,6 +245,28 @@ static void pci_walk_bridge(struct pci_dev *bridge,
>>>>   		cb(bridge, userdata);
>>>>   }
>>>>   
>>>> +/**
>>>> + * cxl_walk_bridge - walk bridges potentially AER affected
>>>> + * @bridge:	bridge which may be a Port, an RCEC, or an RCiEP
>>>> + * @cb:		callback to be called for each device found
>>>> + * @userdata:	arbitrary pointer to be passed to callback
>>>> + *
>>>> + * If the device provided is a bridge, walk the subordinate bus, including
>>>> + * the device itself and any bridged devices on buses under this bus.  Call
>>>> + * the provided callback on each device found.
>>>> + *
>>>> + * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,
>>>> + * call the callback on the device itself.
>>> only call the callback on the device itself.
>>>
>>> (as you call it as stated above either way).
>>>    
>>
>> Thanks. I will update the function header to include "only".
>>
>>>> + */
>>>> +static void cxl_walk_bridge(struct pci_dev *bridge,
>>>> +			    int (*cb)(struct pci_dev *, void *),
>>>> +			    void *userdata)
>>>> +{
>>>> +	cb(bridge, userdata);
>>>> +	if (bridge->subordinate)
>>>> +		pci_walk_bus(bridge->subordinate, cb, userdata);
>>> The difference between this and pci_walk_bridge() is subtle and
>>> I'd like to avoid having both if we can.
>>>    
>>
>> The cxl_walk_bridge() was added because pci_walk_bridge() does not report
>> CXL errors as needed. If the erroring device is a bridge then pci_walk_bridge()
>> does not call report_error_detected() for the root port itself. If the bridge
>> is a CXL root port then the CXL port error handler is not called. This has 2
>> problems: 1. Error logging is not provided, 2. A result vote is not provided
>> by the root port's CXL port handler.
> 
> So what happens for PCIe errors on the root port?  How are they reported?
> What I'm failing to understand is why these should be different.
> Maybe there is something missing on the PCIe side though!
> That code plays a game with what bridge and I thought that was there to handle
> this case.
> 

PCIe errors (not CXL errors) on a root port will be processed as they are today.

An AER error is treated as a CXL error if *all* of the following are met:
- The AER error is not an internal error
    - Check is in AER's is_internal_error(info) function.
- The device is not a CXL device
    - Check is in AER's handles_cxl_errors() function.

Root port device PCIe error processing will not call the the pci_dev::err_handlers::error_detected().
because of the walk_bridge() implementation. The result vote to direct handling
is determined by downstream devices. This has probably been Ok until now because ports have been
fairly vanilla and standard until CXL.


>>
>>>> +}
>>>> +
>>>>   pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>   		pci_channel_state_t state,
>>>>   		pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
>>>> @@ -276,3 +355,74 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>   
>>>>   	return status;
>>>>   }
>>>> +
>>>> +pci_ers_result_t cxl_do_recovery(struct pci_dev *bridge,
>>>> +				 pci_channel_state_t state,
>>>> +				 pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
>>>> +{
>>>> +	struct pci_host_bridge *host = pci_find_host_bridge(bridge->bus);
>>>> +	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>>> +	int type = pci_pcie_type(bridge);
>>>> +
>>>> +	if ((type != PCI_EXP_TYPE_ROOT_PORT) &&
>>>> +	    (type != PCI_EXP_TYPE_RC_EC) &&
>>>> +	    (type != PCI_EXP_TYPE_DOWNSTREAM) &&
>>>> +	    (type != PCI_EXP_TYPE_UPSTREAM)) {
>>>> +		pci_dbg(bridge, "Unsupported device type (%x)\n", type);
>>>> +		return status;
>>>> +	}
>>>> +
>>>
>>> Would similar trick to in pcie_do_recovery work here for the upstream
>>> and downstream ports use pci_upstream_bridge() and for the others pass the dev into
>>> pci_walk_bridge()?
>>>    
>>
>> Yes, that would be a good starting point to begin reuse refactoring.
>> I'm interested in getting yours and others feedback on the separation of the
>> PCI and CXL protocol errors and how much separation is or not needed.
> 
> Separation may make sense (I'm still thinking about it) for separate passes
> through the topology and separate callbacks / handling when an error is seen.
> What I don't want to see is two horribly complex separate walking codes if
> we can possibly avoid it.  Long term to me that just means two sets of bugs
> and problem corners instead of one.
> 
> Jonathan
> 

I understand. I will look to make changes here for reuse.

Regards,
Terry

next prev parent reply	other threads:[~2024-10-17 16:21 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2024-10-08 22:16 ` [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver Terry Bowman
2024-10-22  1:53   ` Dan Williams
2024-10-22 13:50     ` Terry Bowman
2024-10-22 17:09       ` Dan Williams
2024-10-22 18:40         ` Terry Bowman
2024-10-22 23:43           ` Dan Williams
2024-10-24 15:20             ` Bowman, Terry
2024-10-24 19:10               ` Dan Williams
2024-10-08 22:16 ` [PATCH 02/15] cxl/aer/pci: Update is_internal_error() to be callable w/o CONFIG_PCIEAER_CXL Terry Bowman
2024-10-16 16:11   ` Jonathan Cameron
2024-10-22  2:17   ` Dan Williams
2024-10-22 13:54     ` Terry Bowman
2024-10-08 22:16 ` [PATCH 03/15] cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL PCIe ports Terry Bowman
2024-10-10 19:11   ` Bjorn Helgaas
2024-10-14 17:27     ` Terry Bowman
2024-10-08 22:16 ` [PATCH 04/15] cxl/aer/pci: Add CXL PCIe port correctable error support in AER service driver Terry Bowman
2024-10-16 16:22   ` Jonathan Cameron
2024-10-16 17:18     ` Terry Bowman
2024-10-16 17:29       ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 05/15] cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL PCIe port devices Terry Bowman
2024-10-16 16:28   ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 06/15] cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type Terry Bowman
2024-10-16 16:30   ` Jonathan Cameron
2024-10-16 17:31     ` Terry Bowman
2024-10-17 13:31       ` Jonathan Cameron
2024-10-17 14:50         ` Bowman, Terry
2024-10-08 22:16 ` [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver Terry Bowman
2024-10-16 16:54   ` Jonathan Cameron
2024-10-16 18:07     ` Terry Bowman
2024-10-17 13:43       ` Jonathan Cameron
2024-10-17 16:21         ` Bowman, Terry [this message]
2024-10-17 17:08           ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 08/15] cxl/pci: Change find_cxl_ports() to be non-static Terry Bowman
2024-10-08 22:16 ` [PATCH 09/15] cxl/pci: Map CXL PCIe downstream port RAS registers Terry Bowman
2024-10-16 17:14   ` Jonathan Cameron
2024-10-16 18:16     ` Terry Bowman
2024-10-17 13:50       ` Jonathan Cameron
2024-10-17 16:26         ` Bowman, Terry
2024-10-08 22:16 ` [PATCH 10/15] cxl/pci: Map CXL PCIe upstream " Terry Bowman
2024-10-08 22:16 ` [PATCH 11/15] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports Terry Bowman
2024-10-08 22:16 ` [PATCH 12/15] cxl/pci: Add error handler for CXL PCIe port RAS errors Terry Bowman
2024-10-17 13:57   ` Jonathan Cameron
2024-10-17 16:42     ` Bowman, Terry
2024-10-08 22:16 ` [PATCH 13/15] cxl/pci: Add trace logging " Terry Bowman
2024-10-17 14:04   ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 14/15] cxl/aer/pci: Export pci_aer_unmask_internal_errors() Terry Bowman
2024-10-16 17:22   ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 15/15] cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices Terry Bowman
2024-10-16 17:21   ` Jonathan Cameron
2024-10-16 17:24     ` Terry Bowman
2024-10-10 19:07 ` [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Bjorn Helgaas
2024-10-14 17:22   ` Terry Bowman
2024-10-14 17:29     ` Bjorn Helgaas
2024-10-14 17:33       ` Terry Bowman
2024-10-17 16:34 ` Fan Ni
2024-10-17 17:27   ` Bowman, Terry
2024-10-21 22:19     ` Fan Ni
2024-10-18 23:22 ` Bjorn Helgaas
2024-10-21 19:22   ` Terry Bowman
2024-10-22  1:43 ` Dan Williams
2024-10-22 13:29   ` Terry Bowman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c756f2e4-2fa8-413a-b6b8-2f3ca8ec27a5@amd.com \
    --to=kibowman@amd.com \
    --cc=Benjamin.Cheatham@amd.com \
    --cc=Jonathan.Cameron@Huawei.com \
    --cc=Terry.Bowman@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=bhelgaas@google.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=mahesh@linux.ibm.com \
    --cc=ming4.li@intel.com \
    --cc=nathan.fontenot@amd.com \
    --cc=oohall@gmail.com \
    --cc=rrichter@amd.com \
    --cc=smita.koralahallichannabasappa@amd.com \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox