Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Bjorn Helgaas <helgaas@kernel.org>
To: Robert Richter <rrichter@amd.com>
Cc: Terry Bowman <terry.bowman@amd.com>,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, bwidawsk@kernel.org,
	dan.j.williams@intel.com, dave.jiang@intel.com,
	Jonathan.Cameron@huawei.com, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, bhelgaas@google.com,
	linux-pci@vger.kernel.org,
	Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
	Oliver O'Halloran <oohall@gmail.com>,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler
Date: Tue, 28 Mar 2023 12:21:04 -0500	[thread overview]
Message-ID: <20230328172104.GA2897826@bhelgaas> (raw)
In-Reply-To: <ZCIPuPM+LZsOFIIZ@rric.localdomain>

[+cc linux-pci, more error handling folks; beginning of thread at
https://lore.kernel.org/all/20230323213808.398039-1-terry.bowman@amd.com/]

On Mon, Mar 27, 2023 at 11:51:39PM +0200, Robert Richter wrote:
> On 24.03.23 17:36:56, Bjorn Helgaas wrote:

> > > The CXL device driver is then responsible to
> > > enable error reporting in the RCEC's AER cap
> > 
> > I don't know exactly what you mean by "error reporting in the RCEC's
> > AER cap", but IIUC, for non-Root Port devices, generation of ERR_COR/
> > ERR_NONFATAL/ERR_FATAL messages is controlled by the Device Control
> > register and should already be enabled by pci_aer_init().
> > 
> > Maybe you mean setting AER mask/severity specifically for Internal
> > Errors?  I'm hoping to get as much of AER management as we can in the
> 
> Richt, this is implemented in patch #5 in function
> rcec_enable_aer_ints().

I think we should add a PCI core interface for this so we can enforce
the AER ownership question (all the crud like pcie_aer_is_native()) in
one place.

> > PCI core and out of drivers, so maybe we need a new PCI interface to
> > do that.
> > 
> > In any event, I assume this sort of configuration would be an
> > enumeration-time thing, while *this* patch is a run-time thing, so
> > maybe this information belongs with a different patch?
> 
> Do you mean once a Restricted CXL host (RCH) is detected, the internal
> errors should be enabled in the device mask, all this done during
> device enumeration? But wouldn't interrupts being enabled then before
> the CXL device is ready?

I'm not sure what you mean by "before the CXL device is ready."  What
makes a CXL device ready, and how do we know when it is ready?

pci_aer_init() turns on PCI_EXP_DEVCTL_CERE, PCI_EXP_DEVCTL_FERE, etc
as soon as we enumerate the device, before any driver claims the
device.  I'm wondering whether we can do this PCI_ERR_COR_INTERNAL and
PCI_ERR_UNC_INTN fiddling around the same time?

> > I haven't worked all the way through this, but I thought Sean Kelley's
> > and Qiuxu Zhuo's work was along the same line and might cover this,
> > e.g.,
> > 
> >   a175102b0a82 ("PCI/ERR: Recover from RCEC AER errors")
> >   579086225502 ("PCI/ERR: Recover from RCiEP AER errors")
> >   af113553d961 ("PCI/AER: Add pcie_walk_rcec() to RCEC AER handling")
> > 
> > But I guess maybe it's not quite the same case?
> 
> Actually, we use this code to handle errors that are reported to the
> RCEC and only implement here the CXL specifics. That is, checking if
> the RCEC receives something from a CXL downstream port and forwarding
> that to a CXL handler (this patch). The handler then checks the AER
> err cap in the RCRB of all CXL downstream ports associated to the RCEC
> (not visible in the PCI hierarchy), but discovered through the :00.0
> RCiEP (patch #5).

There are two calls to pcie_walk_rcec():

  1) The existing one in find_source_device()
  2) The one you add in handle_cxl_error()

Does the call in handle_cxl_error() look at devices that the existing
call in find_source_device() does not?  I'm trying to understand why
we need both calls.

> > > +static bool is_internal_error(struct aer_err_info *info)
> > > +{
> > > +	if (info->severity == AER_CORRECTABLE)
> > > +		return info->status & PCI_ERR_COR_INTERNAL;
> > > +
> > > +	return info->status & PCI_ERR_UNC_INTN;
> > > +}
> > > +
> > > +static void handle_cxl_error(struct pci_dev *dev, struct aer_err_info *info)
> > > +{
> > > +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> > > +	    is_internal_error(info))
> > 
> > What's unique about Internal Errors?  I'm trying to figure out why you
> > wouldn't do this for *all* CXL errors.
> 
> Per CXL specification downstream port errors are signaled using
> internal errors. 

Maybe a spec reference here to explain is_internal_error()?  Is the
point of the check to *exclude* non-internal errors?  Or is basically
documentation that there shouldn't ever *be* any non-internal errors?
I guess the latter wouldn't make sense because at this point we don't
know whether this is a CXL hierarchy.

> All other errors would be device specific, we cannot
> handle that in a generic CXL driver.

I'm missing the point here.  We don't have any device-specific error
handling in aer.c; it only connects the generic *reporting* mechanism
(AER log registers and Root Port interrupts) to the drivers that do
the device-specific things via err_handler hooks.  I assume we want a
similar model for CXL.

Bjorn

WARNING: multiple messages have this Message-ID (diff)

From: Bjorn Helgaas <helgaas@kernel.org>
To: Robert Richter <rrichter@amd.com>
Cc: alison.schofield@intel.com, dave.jiang@intel.com,
	Terry Bowman <terry.bowman@amd.com>,
	vishal.l.verma@intel.com, linuxppc-dev@lists.ozlabs.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-cxl@vger.kernel.org,
	Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
	bhelgaas@google.com, Oliver O'Halloran <oohall@gmail.com>,
	Jonathan.Cameron@huawei.com, bwidawsk@kernel.org,
	dan.j.williams@intel.com, ira.weiny@intel.com
Subject: Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler
Date: Tue, 28 Mar 2023 12:21:04 -0500	[thread overview]
Message-ID: <20230328172104.GA2897826@bhelgaas> (raw)
In-Reply-To: <ZCIPuPM+LZsOFIIZ@rric.localdomain>

[+cc linux-pci, more error handling folks; beginning of thread at
https://lore.kernel.org/all/20230323213808.398039-1-terry.bowman@amd.com/]

On Mon, Mar 27, 2023 at 11:51:39PM +0200, Robert Richter wrote:
> On 24.03.23 17:36:56, Bjorn Helgaas wrote:

> > > The CXL device driver is then responsible to
> > > enable error reporting in the RCEC's AER cap
> > 
> > I don't know exactly what you mean by "error reporting in the RCEC's
> > AER cap", but IIUC, for non-Root Port devices, generation of ERR_COR/
> > ERR_NONFATAL/ERR_FATAL messages is controlled by the Device Control
> > register and should already be enabled by pci_aer_init().
> > 
> > Maybe you mean setting AER mask/severity specifically for Internal
> > Errors?  I'm hoping to get as much of AER management as we can in the
> 
> Richt, this is implemented in patch #5 in function
> rcec_enable_aer_ints().

I think we should add a PCI core interface for this so we can enforce
the AER ownership question (all the crud like pcie_aer_is_native()) in
one place.

> > PCI core and out of drivers, so maybe we need a new PCI interface to
> > do that.
> > 
> > In any event, I assume this sort of configuration would be an
> > enumeration-time thing, while *this* patch is a run-time thing, so
> > maybe this information belongs with a different patch?
> 
> Do you mean once a Restricted CXL host (RCH) is detected, the internal
> errors should be enabled in the device mask, all this done during
> device enumeration? But wouldn't interrupts being enabled then before
> the CXL device is ready?

I'm not sure what you mean by "before the CXL device is ready."  What
makes a CXL device ready, and how do we know when it is ready?

pci_aer_init() turns on PCI_EXP_DEVCTL_CERE, PCI_EXP_DEVCTL_FERE, etc
as soon as we enumerate the device, before any driver claims the
device.  I'm wondering whether we can do this PCI_ERR_COR_INTERNAL and
PCI_ERR_UNC_INTN fiddling around the same time?

> > I haven't worked all the way through this, but I thought Sean Kelley's
> > and Qiuxu Zhuo's work was along the same line and might cover this,
> > e.g.,
> > 
> >   a175102b0a82 ("PCI/ERR: Recover from RCEC AER errors")
> >   579086225502 ("PCI/ERR: Recover from RCiEP AER errors")
> >   af113553d961 ("PCI/AER: Add pcie_walk_rcec() to RCEC AER handling")
> > 
> > But I guess maybe it's not quite the same case?
> 
> Actually, we use this code to handle errors that are reported to the
> RCEC and only implement here the CXL specifics. That is, checking if
> the RCEC receives something from a CXL downstream port and forwarding
> that to a CXL handler (this patch). The handler then checks the AER
> err cap in the RCRB of all CXL downstream ports associated to the RCEC
> (not visible in the PCI hierarchy), but discovered through the :00.0
> RCiEP (patch #5).

There are two calls to pcie_walk_rcec():

  1) The existing one in find_source_device()
  2) The one you add in handle_cxl_error()

Does the call in handle_cxl_error() look at devices that the existing
call in find_source_device() does not?  I'm trying to understand why
we need both calls.

> > > +static bool is_internal_error(struct aer_err_info *info)
> > > +{
> > > +	if (info->severity == AER_CORRECTABLE)
> > > +		return info->status & PCI_ERR_COR_INTERNAL;
> > > +
> > > +	return info->status & PCI_ERR_UNC_INTN;
> > > +}
> > > +
> > > +static void handle_cxl_error(struct pci_dev *dev, struct aer_err_info *info)
> > > +{
> > > +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> > > +	    is_internal_error(info))
> > 
> > What's unique about Internal Errors?  I'm trying to figure out why you
> > wouldn't do this for *all* CXL errors.
> 
> Per CXL specification downstream port errors are signaled using
> internal errors. 

Maybe a spec reference here to explain is_internal_error()?  Is the
point of the check to *exclude* non-internal errors?  Or is basically
documentation that there shouldn't ever *be* any non-internal errors?
I guess the latter wouldn't make sense because at this point we don't
know whether this is a CXL hierarchy.

> All other errors would be device specific, we cannot
> handle that in a generic CXL driver.

I'm missing the point here.  We don't have any device-specific error
handling in aer.c; it only connects the generic *reporting* mechanism
(AER log registers and Root Port interrupts) to the drivers that do
the device-specific things via err_handler hooks.  I assume we want a
similar model for CXL.

Bjorn

next prev parent reply	other threads:[~2023-03-28 17:21 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-23 21:38 [PATCH v2 0/5] cxl/pci: Add support for RCH RAS error handling Terry Bowman
2023-03-23 21:38 ` [PATCH v2 1/5] cxl/pci: Add RCH downstream port AER and RAS register discovery Terry Bowman
2023-03-24  8:53   ` kernel test robot
2023-03-24 13:12     ` Terry Bowman
2023-03-23 21:38 ` [PATCH v2 2/5] efi/cper: Export cper_mem_err_unpack() for CXL logging Terry Bowman
2023-03-23 22:29   ` Terry Bowman
2023-03-23 21:38 ` [PATCH v2 3/5] pci/aer: Export cper_print_aer() for CXL driver logging Terry Bowman
2023-03-23 22:20   ` Terry Bowman
2023-03-23 22:26     ` Sathyanarayanan Kuppuswamy
2023-04-14 20:41       ` Terry Bowman
2023-03-24 21:41   ` Bjorn Helgaas
2023-03-24 21:52     ` Terry Bowman
2023-03-23 21:38 ` [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler Terry Bowman
2023-03-23 22:27   ` Terry Bowman
2023-03-24 22:36   ` Bjorn Helgaas
2023-03-27 21:51     ` Robert Richter
2023-03-28 17:21       ` Bjorn Helgaas [this message]
2023-03-28 17:21         ` Bjorn Helgaas
2023-03-29 15:59         ` Robert Richter
2023-03-29 15:59           ` Robert Richter
2023-03-28 13:41     ` Terry Bowman
2023-03-23 21:38 ` [PATCH v2 5/5] cxl/pci: Add RCH downstream port error logging Terry Bowman
2023-03-24  5:39   ` kernel test robot
2023-03-24  6:09   ` kernel test robot
2023-03-24  6:30   ` kernel test robot
2023-03-24 17:41     ` Terry Bowman
2023-03-27 23:21   ` Dave Jiang
2023-03-28 13:53     ` Terry Bowman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230328172104.GA2897826@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=alison.schofield@intel.com \
    --cc=bhelgaas@google.com \
    --cc=bwidawsk@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=ira.weiny@intel.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mahesh@linux.ibm.com \
    --cc=oohall@gmail.com \
    --cc=rrichter@amd.com \
    --cc=terry.bowman@amd.com \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.