Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler

linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Bjorn Helgaas <helgaas@kernel.org>
To: Robert Richter <rrichter@amd.com>
Cc: Terry Bowman <terry.bowman@amd.com>,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, bwidawsk@kernel.org,
	dan.j.williams@intel.com, dave.jiang@intel.com,
	Jonathan.Cameron@huawei.com, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, bhelgaas@google.com,
	linux-pci@vger.kernel.org,
	Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
	Oliver O'Halloran <oohall@gmail.com>,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler
Date: Tue, 28 Mar 2023 12:21:04 -0500	[thread overview]
Message-ID: <20230328172104.GA2897826@bhelgaas> (raw)
In-Reply-To: <ZCIPuPM+LZsOFIIZ@rric.localdomain>

[+cc linux-pci, more error handling folks; beginning of thread at
https://lore.kernel.org/all/20230323213808.398039-1-terry.bowman@amd.com/]

On Mon, Mar 27, 2023 at 11:51:39PM +0200, Robert Richter wrote:
> On 24.03.23 17:36:56, Bjorn Helgaas wrote:

> > > The CXL device driver is then responsible to
> > > enable error reporting in the RCEC's AER cap
> > 
> > I don't know exactly what you mean by "error reporting in the RCEC's
> > AER cap", but IIUC, for non-Root Port devices, generation of ERR_COR/
> > ERR_NONFATAL/ERR_FATAL messages is controlled by the Device Control
> > register and should already be enabled by pci_aer_init().
> > 
> > Maybe you mean setting AER mask/severity specifically for Internal
> > Errors?  I'm hoping to get as much of AER management as we can in the
> 
> Richt, this is implemented in patch #5 in function
> rcec_enable_aer_ints().

I think we should add a PCI core interface for this so we can enforce
the AER ownership question (all the crud like pcie_aer_is_native()) in
one place.

> > PCI core and out of drivers, so maybe we need a new PCI interface to
> > do that.
> > 
> > In any event, I assume this sort of configuration would be an
> > enumeration-time thing, while *this* patch is a run-time thing, so
> > maybe this information belongs with a different patch?
> 
> Do you mean once a Restricted CXL host (RCH) is detected, the internal
> errors should be enabled in the device mask, all this done during
> device enumeration? But wouldn't interrupts being enabled then before
> the CXL device is ready?

I'm not sure what you mean by "before the CXL device is ready."  What
makes a CXL device ready, and how do we know when it is ready?

pci_aer_init() turns on PCI_EXP_DEVCTL_CERE, PCI_EXP_DEVCTL_FERE, etc
as soon as we enumerate the device, before any driver claims the
device.  I'm wondering whether we can do this PCI_ERR_COR_INTERNAL and
PCI_ERR_UNC_INTN fiddling around the same time?

> > I haven't worked all the way through this, but I thought Sean Kelley's
> > and Qiuxu Zhuo's work was along the same line and might cover this,
> > e.g.,
> > 
> >   a175102b0a82 ("PCI/ERR: Recover from RCEC AER errors")
> >   579086225502 ("PCI/ERR: Recover from RCiEP AER errors")
> >   af113553d961 ("PCI/AER: Add pcie_walk_rcec() to RCEC AER handling")
> > 
> > But I guess maybe it's not quite the same case?
> 
> Actually, we use this code to handle errors that are reported to the
> RCEC and only implement here the CXL specifics. That is, checking if
> the RCEC receives something from a CXL downstream port and forwarding
> that to a CXL handler (this patch). The handler then checks the AER
> err cap in the RCRB of all CXL downstream ports associated to the RCEC
> (not visible in the PCI hierarchy), but discovered through the :00.0
> RCiEP (patch #5).

There are two calls to pcie_walk_rcec():

  1) The existing one in find_source_device()
  2) The one you add in handle_cxl_error()

Does the call in handle_cxl_error() look at devices that the existing
call in find_source_device() does not?  I'm trying to understand why
we need both calls.

> > > +static bool is_internal_error(struct aer_err_info *info)
> > > +{
> > > +	if (info->severity == AER_CORRECTABLE)
> > > +		return info->status & PCI_ERR_COR_INTERNAL;
> > > +
> > > +	return info->status & PCI_ERR_UNC_INTN;
> > > +}
> > > +
> > > +static void handle_cxl_error(struct pci_dev *dev, struct aer_err_info *info)
> > > +{
> > > +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> > > +	    is_internal_error(info))
> > 
> > What's unique about Internal Errors?  I'm trying to figure out why you
> > wouldn't do this for *all* CXL errors.
> 
> Per CXL specification downstream port errors are signaled using
> internal errors. 

Maybe a spec reference here to explain is_internal_error()?  Is the
point of the check to *exclude* non-internal errors?  Or is basically
documentation that there shouldn't ever *be* any non-internal errors?
I guess the latter wouldn't make sense because at this point we don't
know whether this is a CXL hierarchy.

> All other errors would be device specific, we cannot
> handle that in a generic CXL driver.

I'm missing the point here.  We don't have any device-specific error
handling in aer.c; it only connects the generic *reporting* mechanism
(AER log registers and Root Port interrupts) to the drivers that do
the device-specific things via err_handler hooks.  I assume we want a
similar model for CXL.

Bjorn

next      parent reply	other threads:[~2023-03-28 17:21 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <ZCIPuPM+LZsOFIIZ@rric.localdomain>
2023-03-28 17:21 ` Bjorn Helgaas [this message]
2023-03-29 15:59   ` [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler Robert Richter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230328172104.GA2897826@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=alison.schofield@intel.com \
    --cc=bhelgaas@google.com \
    --cc=bwidawsk@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=ira.weiny@intel.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mahesh@linux.ibm.com \
    --cc=oohall@gmail.com \
    --cc=rrichter@amd.com \
    --cc=terry.bowman@amd.com \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).