From: "Bowman, Terry" <kibowman@amd.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>,
Terry Bowman <Terry.Bowman@amd.com>
Cc: ming4.li@intel.com, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
dave@stgolabs.net, dave.jiang@intel.com,
alison.schofield@intel.com, vishal.l.verma@intel.com,
dan.j.williams@intel.com, bhelgaas@google.com,
mahesh@linux.ibm.com, oohall@gmail.com,
Benjamin.Cheatham@amd.com, rrichter@amd.com,
nathan.fontenot@amd.com, smita.koralahallichannabasappa@amd.com
Subject: Re: [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver
Date: Thu, 17 Oct 2024 11:21:36 -0500 [thread overview]
Message-ID: <c756f2e4-2fa8-413a-b6b8-2f3ca8ec27a5@amd.com> (raw)
In-Reply-To: <20241017144315.0000074c@Huawei.com>
Hi Jonathan,
On 10/17/2024 8:43 AM, Jonathan Cameron wrote:
> On Wed, 16 Oct 2024 13:07:37 -0500
> Terry Bowman <Terry.Bowman@amd.com> wrote:
>
>> Hi Jonathan,
>>
>> On 10/16/24 11:54, Jonathan Cameron wrote:
>>> On Tue, 8 Oct 2024 17:16:49 -0500
>>> Terry Bowman <terry.bowman@amd.com> wrote:
>>>
>>>> The current pcie_do_recovery() handles device recovery as result of
>>>> uncorrectable errors (UCE). But, CXL port devices require unique
>>>> recovery handling.
>>>>
>>>> Create a cxl_do_recovery() function parallel to pcie_do_recovery(). Add CXL
>>>> specific handling to the new recovery function.
>>>>
>>>> The CXL port UCE recovery must invoke the AER service driver's CXL port
>>>> UCE callback. This is different than the standard pcie_do_recovery()
>>>> recovery that calls the pci_driver::err_handler UCE handler instead.
>>>>
>>>> Treat all CXL PCIe port UCE errors as fatal and call kernel panic to
>>>> "recover" the error. A panic is called instead of attempting recovery
>>>> to avoid potential system corruption.
>>>>
>>>> The uncorrectable support added here will be used to complete CXL PCIe
>>>> port error handling in the future.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>
>>> Hi Terry,
>>>
>>> I'm a little bothered by the subtle difference in the bus walks
>>> in here vs the existing cases. If we need them, comments needed
>>> to explain why.
>>>
>>
>> Yes, I will add more details in the commit message about "why".
>> I added explanation following your below comment.
>>
>>> If we are going to have separate handling, see if you can share
>>> a lot more of the code by factoring out common functions for
>>> the pci and cxl handling with callbacks to handle the differences.
>>>
>>
>> Dan requested separate paths for the PCIe and CXL recovery. The intent,
>> as I understand, is to isolate the handling of PCIe and CXL protocol
>> errors. This is to create 2 different classes of protocol errors.
> Function call chain wise I'm reasonably convinced that might be a good
> idea. But not code wise if it means we end up with more hard to review
> code.
>
>>
>>> I've managed to get my head around this code a few times in the past
>>> (I think!) and really don't fancy having two subtle variants to
>>> consider next time we get a bug :( The RC_EC additions hurt my head.
>>>
>>> Jonathan
>>
>> Right, the UCE recovery logic is not straightforward. The code can be
>> refactored to take advantage of reuse. I'm interested in your thoughts
>> after I have provided some responses here.
>>
>>>
>>>> static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>>> index 31090770fffc..de12f2eb19ef 100644
>>>> --- a/drivers/pci/pcie/err.c
>>>> +++ b/drivers/pci/pcie/err.c
>>>> @@ -86,6 +86,63 @@ static int report_error_detected(struct pci_dev *dev,
>>>> return 0;
>>>> }
>>>>
>>>> +static int cxl_report_error_detected(struct pci_dev *dev,
>>>> + pci_channel_state_t state,
>>>> + enum pci_ers_result *result)
>>>> +{
>>>> + struct cxl_port_err_hndlrs *cxl_port_hndlrs;
>>>> + struct pci_driver *pdrv;
>>>> + pci_ers_result_t vote;
>>>> +
>>>> + device_lock(&dev->dev);
>>>> + cxl_port_hndlrs = find_cxl_port_hndlrs();
>>>
>>> Can we refactor to have a common function under this and report_error_detected()?
>>>
>>
>> Sure, this can be refactored.
>>
>> The difference between cxl_report_error_detected() and report_error_detected() is the
>> handlers that are called.
>>
>> cxl_report_error_detected() calls the CXL driver's registered port error handler.
>>
>> report_error_recovery() calls the pcie_dev::err_handlers.
>>
>> Let me know if I should refactor for common code here?
>
> It certainly makes sense to do that somewhere in here. Just have light
> wrappers that provide callbacks so the bulk of the code is shared.
>
Ok, Ill start on that. I have a v2 ready to-go without the reuse changes.
You want me to wait on sending v2 till it has reuse refactoring?
>>
>>
>>>> + pdrv = dev->driver;
>>>> + if (pci_dev_is_disconnected(dev)) {
>>>> + vote = PCI_ERS_RESULT_DISCONNECT;
>>>> + } else if (!pci_dev_set_io_state(dev, state)) {
>>>> + pci_info(dev, "can't recover (state transition %u -> %u invalid)\n",
>>>> + dev->error_state, state);
>>>> + vote = PCI_ERS_RESULT_NONE;
>>>> + } else if (!cxl_port_hndlrs || !cxl_port_hndlrs->error_detected) {
>>>> + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>>>> + vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>>>> + pci_info(dev, "can't recover (no error_detected callback)\n");
>>>> + } else {
>>>> + vote = PCI_ERS_RESULT_NONE;
>>>> + }
>>>> + } else {
>>>> + vote = cxl_port_hndlrs->error_detected(dev, state);
>>>> + }
>>>> + pci_uevent_ers(dev, vote);
>>>> + *result = merge_result(*result, vote);
>>>> + device_unlock(&dev->dev);
>>>> + return 0;
>>>> +}
>>>
>>>> static int pci_pm_runtime_get_sync(struct pci_dev *pdev, void *data)
>>>> {
>>>> pm_runtime_get_sync(&pdev->dev);
>>>> @@ -188,6 +245,28 @@ static void pci_walk_bridge(struct pci_dev *bridge,
>>>> cb(bridge, userdata);
>>>> }
>>>>
>>>> +/**
>>>> + * cxl_walk_bridge - walk bridges potentially AER affected
>>>> + * @bridge: bridge which may be a Port, an RCEC, or an RCiEP
>>>> + * @cb: callback to be called for each device found
>>>> + * @userdata: arbitrary pointer to be passed to callback
>>>> + *
>>>> + * If the device provided is a bridge, walk the subordinate bus, including
>>>> + * the device itself and any bridged devices on buses under this bus. Call
>>>> + * the provided callback on each device found.
>>>> + *
>>>> + * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,
>>>> + * call the callback on the device itself.
>>> only call the callback on the device itself.
>>>
>>> (as you call it as stated above either way).
>>>
>>
>> Thanks. I will update the function header to include "only".
>>
>>>> + */
>>>> +static void cxl_walk_bridge(struct pci_dev *bridge,
>>>> + int (*cb)(struct pci_dev *, void *),
>>>> + void *userdata)
>>>> +{
>>>> + cb(bridge, userdata);
>>>> + if (bridge->subordinate)
>>>> + pci_walk_bus(bridge->subordinate, cb, userdata);
>>> The difference between this and pci_walk_bridge() is subtle and
>>> I'd like to avoid having both if we can.
>>>
>>
>> The cxl_walk_bridge() was added because pci_walk_bridge() does not report
>> CXL errors as needed. If the erroring device is a bridge then pci_walk_bridge()
>> does not call report_error_detected() for the root port itself. If the bridge
>> is a CXL root port then the CXL port error handler is not called. This has 2
>> problems: 1. Error logging is not provided, 2. A result vote is not provided
>> by the root port's CXL port handler.
>
> So what happens for PCIe errors on the root port? How are they reported?
> What I'm failing to understand is why these should be different.
> Maybe there is something missing on the PCIe side though!
> That code plays a game with what bridge and I thought that was there to handle
> this case.
>
PCIe errors (not CXL errors) on a root port will be processed as they are today.
An AER error is treated as a CXL error if *all* of the following are met:
- The AER error is not an internal error
- Check is in AER's is_internal_error(info) function.
- The device is not a CXL device
- Check is in AER's handles_cxl_errors() function.
Root port device PCIe error processing will not call the the pci_dev::err_handlers::error_detected().
because of the walk_bridge() implementation. The result vote to direct handling
is determined by downstream devices. This has probably been Ok until now because ports have been
fairly vanilla and standard until CXL.
>>
>>>> +}
>>>> +
>>>> pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>> pci_channel_state_t state,
>>>> pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
>>>> @@ -276,3 +355,74 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>
>>>> return status;
>>>> }
>>>> +
>>>> +pci_ers_result_t cxl_do_recovery(struct pci_dev *bridge,
>>>> + pci_channel_state_t state,
>>>> + pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
>>>> +{
>>>> + struct pci_host_bridge *host = pci_find_host_bridge(bridge->bus);
>>>> + pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>>> + int type = pci_pcie_type(bridge);
>>>> +
>>>> + if ((type != PCI_EXP_TYPE_ROOT_PORT) &&
>>>> + (type != PCI_EXP_TYPE_RC_EC) &&
>>>> + (type != PCI_EXP_TYPE_DOWNSTREAM) &&
>>>> + (type != PCI_EXP_TYPE_UPSTREAM)) {
>>>> + pci_dbg(bridge, "Unsupported device type (%x)\n", type);
>>>> + return status;
>>>> + }
>>>> +
>>>
>>> Would similar trick to in pcie_do_recovery work here for the upstream
>>> and downstream ports use pci_upstream_bridge() and for the others pass the dev into
>>> pci_walk_bridge()?
>>>
>>
>> Yes, that would be a good starting point to begin reuse refactoring.
>> I'm interested in getting yours and others feedback on the separation of the
>> PCI and CXL protocol errors and how much separation is or not needed.
>
> Separation may make sense (I'm still thinking about it) for separate passes
> through the topology and separate callbacks / handling when an error is seen.
> What I don't want to see is two horribly complex separate walking codes if
> we can possibly avoid it. Long term to me that just means two sets of bugs
> and problem corners instead of one.
>
> Jonathan
>
I understand. I will look to make changes here for reuse.
Regards,
Terry
next prev parent reply other threads:[~2024-10-17 16:21 UTC|newest]
Thread overview: 62+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-08 22:16 [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2024-10-08 22:16 ` [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver Terry Bowman
2024-10-22 1:53 ` Dan Williams
2024-10-22 13:50 ` Terry Bowman
2024-10-22 17:09 ` Dan Williams
2024-10-22 18:40 ` Terry Bowman
2024-10-22 23:43 ` Dan Williams
2024-10-24 15:20 ` Bowman, Terry
2024-10-24 19:10 ` Dan Williams
2024-10-08 22:16 ` [PATCH 02/15] cxl/aer/pci: Update is_internal_error() to be callable w/o CONFIG_PCIEAER_CXL Terry Bowman
2024-10-16 16:11 ` Jonathan Cameron
2024-10-22 2:17 ` Dan Williams
2024-10-22 13:54 ` Terry Bowman
2024-10-08 22:16 ` [PATCH 03/15] cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL PCIe ports Terry Bowman
2024-10-10 19:11 ` Bjorn Helgaas
2024-10-14 17:27 ` Terry Bowman
2024-10-08 22:16 ` [PATCH 04/15] cxl/aer/pci: Add CXL PCIe port correctable error support in AER service driver Terry Bowman
2024-10-16 16:22 ` Jonathan Cameron
2024-10-16 17:18 ` Terry Bowman
2024-10-16 17:29 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 05/15] cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL PCIe port devices Terry Bowman
2024-10-16 16:28 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 06/15] cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type Terry Bowman
2024-10-16 16:30 ` Jonathan Cameron
2024-10-16 17:31 ` Terry Bowman
2024-10-17 13:31 ` Jonathan Cameron
2024-10-17 14:50 ` Bowman, Terry
2024-10-08 22:16 ` [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver Terry Bowman
2024-10-16 16:54 ` Jonathan Cameron
2024-10-16 18:07 ` Terry Bowman
2024-10-17 13:43 ` Jonathan Cameron
2024-10-17 16:21 ` Bowman, Terry [this message]
2024-10-17 17:08 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 08/15] cxl/pci: Change find_cxl_ports() to be non-static Terry Bowman
2024-10-08 22:16 ` [PATCH 09/15] cxl/pci: Map CXL PCIe downstream port RAS registers Terry Bowman
2024-10-16 17:14 ` Jonathan Cameron
2024-10-16 18:16 ` Terry Bowman
2024-10-17 13:50 ` Jonathan Cameron
2024-10-17 16:26 ` Bowman, Terry
2024-10-08 22:16 ` [PATCH 10/15] cxl/pci: Map CXL PCIe upstream " Terry Bowman
2024-10-08 22:16 ` [PATCH 11/15] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports Terry Bowman
2024-10-08 22:16 ` [PATCH 12/15] cxl/pci: Add error handler for CXL PCIe port RAS errors Terry Bowman
2024-10-17 13:57 ` Jonathan Cameron
2024-10-17 16:42 ` Bowman, Terry
2024-10-08 22:16 ` [PATCH 13/15] cxl/pci: Add trace logging " Terry Bowman
2024-10-17 14:04 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 14/15] cxl/aer/pci: Export pci_aer_unmask_internal_errors() Terry Bowman
2024-10-16 17:22 ` Jonathan Cameron
2024-10-08 22:16 ` [PATCH 15/15] cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices Terry Bowman
2024-10-16 17:21 ` Jonathan Cameron
2024-10-16 17:24 ` Terry Bowman
2024-10-10 19:07 ` [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging Bjorn Helgaas
2024-10-14 17:22 ` Terry Bowman
2024-10-14 17:29 ` Bjorn Helgaas
2024-10-14 17:33 ` Terry Bowman
2024-10-17 16:34 ` Fan Ni
2024-10-17 17:27 ` Bowman, Terry
2024-10-21 22:19 ` Fan Ni
2024-10-18 23:22 ` Bjorn Helgaas
2024-10-21 19:22 ` Terry Bowman
2024-10-22 1:43 ` Dan Williams
2024-10-22 13:29 ` Terry Bowman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c756f2e4-2fa8-413a-b6b8-2f3ca8ec27a5@amd.com \
--to=kibowman@amd.com \
--cc=Benjamin.Cheatham@amd.com \
--cc=Jonathan.Cameron@Huawei.com \
--cc=Terry.Bowman@amd.com \
--cc=alison.schofield@intel.com \
--cc=bhelgaas@google.com \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=mahesh@linux.ibm.com \
--cc=ming4.li@intel.com \
--cc=nathan.fontenot@amd.com \
--cc=oohall@gmail.com \
--cc=rrichter@amd.com \
--cc=smita.koralahallichannabasappa@amd.com \
--cc=vishal.l.verma@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox