From: Bjorn Helgaas <helgaas@kernel.org>
To: Robert Richter <rrichter@amd.com>
Cc: alison.schofield@intel.com, dave.jiang@intel.com,
Terry Bowman <terry.bowman@amd.com>,
vishal.l.verma@intel.com, linuxppc-dev@lists.ozlabs.org,
linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org,
Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
bhelgaas@google.com, Oliver O'Halloran <oohall@gmail.com>,
Jonathan.Cameron@huawei.com, bwidawsk@kernel.org,
dan.j.williams@intel.com, ira.weiny@intel.com
Subject: Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler
Date: Fri, 14 Apr 2023 16:32:54 -0500
Message-ID: <20230414213254.GA219190@bhelgaas>
In-Reply-To: <ZDfbLF1ZYc3uIC19@rric.localdomain>

On Thu, Apr 13, 2023 at 01:40:52PM +0200, Robert Richter wrote:
> On 12.04.23 17:02:33, Bjorn Helgaas wrote:
> > On Tue, Apr 11, 2023 at 01:03:01PM -0500, Terry Bowman wrote:
> > > From: Robert Richter <rrichter@amd.com>
> ...
> Let's assume just a simple CXL RCH topology:
>
> PCI hierarchy:
>
>             -----------------
>             | ACPI0016      |-------------- Host bridge (CXL host)
>             | - CEDT        |             |
>  -----------| - RCRB base   |             |
>  |          -----------------             :
>  |                  |
>  |                  |
>  |          -------------------     ---------
>  |          | RCiEP           |.....| RCEC  | Endpoint (CXL dev)
>  |  --------| - BDF           |     | - BDF |
>  |  |       | - PCIe AER      |     ---------
>  |  |       | - CXL dvsec     |
>  |  |       |   (v2: reg loc) |
>  |  |       | - Comp regs     |
>  |  |       | - CXL RAS       |
>  |  |       -------------------
>  :  :
>
> CXL hierarchy:
>
>  :  :
>  :          ------------------            |
>  |          | CXL root port  |<------------
>  |          |                |
>  |--------->| - dport RCRB   |<------------
>  |          | - PCIe AER     |            |
>  |          | - Comp regs    |            |
>  |          | - CXL RAS      |            |
>  |          ------------------            |
>  |  :                                     |
>  |  |       ------------------            |
>  |  ------->| CXL endpoint   |-------------
>  |          |   (v1: RCRB)   |
>  ---------->| - uport RCRB   |
>             | - Comp regs    |
>             | - CXL RAS      |
>             ------------------
>
> Dport detected errors are reported using PCIe AER and CXL RAS caps in
> the dport's RCRB.
>
> Uport detected errors are reported using RCiEP's PCIe AER cap and
> either the uport's RCRB RAS cap or the RAS cap of the comp regs
> located using CXL DVSEC register locator.
>
> In all cases the RCEC is used with either the RCEC (dport errors) or
> the RCiEP (uport errors) error source id (BDF: bus, dev, func).

I'm mostly interested in the PCI entities involved because that's all
aer.c can deal with.  For the above, I think the PCI core only knows
about these:

  00:00.0 RCEC  with AER, RCEC EA includes 00:01.0
  00:01.0 RCiEP with AER

aer_irq() would handle AER interrupts from 00:00.0.

cxl_handle_error() would be called for 00:00.0 and would call
handle_error_source() for everything below it (only 00:01.0 here).

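To make the two IDs above concrete, here is a minimal userspace sketch of how a bus/dev/fn triple packs into a 16-bit error source ID. The toy_* names are made up for illustration; the bit layout (8-bit bus, 5-bit device, 3-bit function) is what the kernel's PCI_DEVFN() and PCI_DEVID() macros produce, but this is not kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical userspace helper: pack bus/dev/fn into a 16-bit ID
 * using the same layout as the kernel's PCI_DEVFN()/PCI_DEVID().
 */
static inline uint16_t toy_source_id(uint8_t bus, uint8_t dev, uint8_t fn)
{
	assert(dev < 32 && fn < 8);	/* 5-bit device, 3-bit function */
	return (uint16_t)((bus << 8) | (dev << 3) | fn);
}

/* Format an ID as "bb:dd.f"; returns what snprintf() returns. */
static inline int toy_format_bdf(uint16_t id, char *buf, size_t len)
{
	return snprintf(buf, len, "%02x:%02x.%x",
			(unsigned int)(id >> 8),
			(unsigned int)((id >> 3) & 0x1f),
			(unsigned int)(id & 0x7));
}
```

With this layout the RCEC at 00:00.0 has ID 0x0000 and the RCiEP at 00:01.0 has ID 0x0008.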
> > The current code uses pcie_walk_rcec() in this path, which basically
> > searches below a Root Port or RCEC for devices that have an AER error
> > status bit set, add them to the e_info[] list, and call
> > handle_error_source() for each one:
>
> For reference, this series adds support to handle RCH downstream
> port-detected errors as described in CXL 3.0, 12.2.1.1.
>
> This flow looks correct to me, see comments inline.

We seem to be on the same page here, so I'll trim it out.

> ...
> > So we insert cxl_handle_error() in handle_error_source(), where it
> > gets called for the RCEC, and then it uses pcie_walk_rcec() again to
> > forcibly call handle_error_source() for *every* device "below" the
> > RCEC (even though they don't have AER error status bits set).
>
> The CXL device contains the links to the dport's caps. Also, there can
> be multiple RCs with CXL devs connected to it. So we must search for
> all CXL devices now, determine the corresponding dport and inspect
> both, PCIe AER and CXL RAS caps.
>
> > Then handle_error_source() ultimately calls the CXL driver err_handler
> > entry points (.cor_error_detected(), .error_detected(), etc), which
> > can look at the CXL-specific error status in the CXL RAS or RCRB or
> > whatever.
>
> The AER driver (portdrv) does not have the knowledge of CXL internals.
> Thus the approach is to pass dport errors to the cxl_mem driver to
> handle it there in addition to cxl mem dev errors.
>
> > So this basically looks like a workaround for the fact that the AER
> > code only calls handle_error_source() when it finds AER error status,
> > and CXL doesn't *set* that AER error status. There's not that much
> > code here, but it seems like quite a bit of complexity in an area
> > that is already pretty complicated.

My main point here (correct me if I got this wrong) is that:

  - An RCEC generates an AER interrupt

  - find_source_device() searches all devices below the RCEC and
    builds a list of everything for which to call
    handle_error_source()

  - cxl_handle_error() *again* looks at all devices below the same
    RCEC and calls handle_error_source() for each one

So the main difference here is that the existing flow only calls
handle_error_source() when it finds an error logged in an AER status
register, while the new CXL flow calls handle_error_source() for
*every* device below the RCEC.

I think it's OK to do that, but the almost recursive structure and the
unusual reference counting make the overall AER flow much harder to
understand.

What if we changed is_error_source() to add every CXL.mem device it
finds to the e_info[] list, which I think could nicely encapsulate the
idea that "CXL devices have error state we don't know how to interpret
here"?  Would the existing loop in aer_process_err_devices() then do
what you need?

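A rough userspace model of that idea, only to show the shape of the flow. Every name here (toy_dev, toy_err_info, TOY_MAX_ERR_DEVICES and the toy_* functions) is hypothetical, and the real is_error_source() inspects AER registers rather than two booleans. The point is that one predicate decides list membership and one existing loop dispatches, with no second walk of the RCEC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERR_DEVICES 5	/* hypothetical cap on the list size */

struct toy_dev {
	uint16_t id;		/* packed bus/dev/fn */
	bool aer_status;	/* an AER error status bit is set */
	bool cxl_mem;		/* device is a CXL.mem device */
};

struct toy_err_info {
	const struct toy_dev *dev[TOY_MAX_ERR_DEVICES];
	size_t count;
};

/* Modeled rule: AER status set, OR a CXL.mem device we can't interpret. */
static bool toy_is_error_source(const struct toy_dev *d)
{
	return d->aer_status || d->cxl_mem;
}

/* Single walk over the devices below the RCEC, building the list. */
static size_t toy_find_sources(const struct toy_dev *devs, size_t n,
			       struct toy_err_info *info)
{
	info->count = 0;
	for (size_t i = 0; i < n; i++) {
		if (!toy_is_error_source(&devs[i]))
			continue;
		if (info->count == TOY_MAX_ERR_DEVICES)
			break;	/* list full; remaining devices are dropped */
		info->dev[info->count++] = &devs[i];
	}
	return info->count;
}

/* Stand-in for the existing dispatch loop: one handler call per entry. */
static size_t toy_process(const struct toy_err_info *info,
			  void (*handle)(const struct toy_dev *, void *),
			  void *ctx)
{
	assert(info->count <= TOY_MAX_ERR_DEVICES);
	for (size_t i = 0; i < info->count; i++)
		handle(info->dev[i], ctx);
	return info->count;
}

/* Sample handler: count how many listed devices are CXL.mem. */
static void toy_count_cxl(const struct toy_dev *d, void *ctx)
{
	if (d->cxl_mem)
		++*(size_t *)ctx;
}
```

In this model a CXL.mem device with no AER status set still lands on the list, so its handler gets a chance to look at the CXL-specific state. The fixed-size list is an assumption of the sketch; whether a bounded list is acceptable for this use is a separate question.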
> > Here's another idea: the ACPI GHES code (ghes_handle_aer()) basically
> > receives a packet of error status from firmware and queues it for
> > recovery via pcie_do_recovery(). What if you had a CXL module that
> > knew how to look for the CXL error status, package it up similarly,
> > and queue it via aer_recover_queue()?
>
> ...
> But first, RCEC error notifications (RCEC AER interrupts) must be sent
> to the CXL driver to look into the dport's RCRB.

Right.  I think that could be solved by having aer_irq() call or wake
a CXL interface that has been registered.  But maybe changing
is_error_source() would be simpler.

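For comparison, the registered-interface alternative might look roughly like this in a userspace model. All names are hypothetical, there is no locking, and a real version would need to deal with module lifetime and interrupt context; it only shows a single callback slot that the interrupt path invokes when something has been registered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: told which RCEC raised the interrupt. */
typedef void (*toy_cxl_notify_fn)(uint16_t rcec_id, void *ctx);

static toy_cxl_notify_fn toy_cxl_notify;
static void *toy_cxl_ctx;

/* Register the single CXL notifier; fails if one is already set. */
static int toy_register_cxl_notifier(toy_cxl_notify_fn fn, void *ctx)
{
	if (!fn || toy_cxl_notify)
		return -1;
	toy_cxl_notify = fn;
	toy_cxl_ctx = ctx;
	return 0;
}

static void toy_unregister_cxl_notifier(void)
{
	toy_cxl_notify = NULL;
	toy_cxl_ctx = NULL;
}

/* Stand-in for the interrupt path: returns 1 if a notifier was called. */
static int toy_aer_irq(uint16_t rcec_id)
{
	if (!toy_cxl_notify)
		return 0;
	toy_cxl_notify(rcec_id, toy_cxl_ctx);
	return 1;
}

/* Sample callback: remember the last RCEC that signalled. */
static void toy_record_rcec(uint16_t rcec_id, void *ctx)
{
	assert(ctx != NULL);
	*(uint16_t *)ctx = rcec_id;
}
```

This keeps the CXL knowledge entirely on the CXL side, at the cost of a new registration interface, which is the extra surface the is_error_source() change would avoid.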
Bjorn