Linux CXL
 help / color / mirror / Atom feed
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Jiang <dave.jiang@intel.com>, <linux-cxl@vger.kernel.org>,
	<dan.j.williams@intel.com>, <ira.weiny@intel.com>,
	<vishal.l.verma@intel.com>, <alison.schofield@intel.com>,
	Bjorn Helgaas <bhelgaas@google.com>, "Stefan Roese" <sr@denx.de>,
	Kuppuswamy Sathyanarayanan 
	<sathyanarayanan.kuppuswamy@linux.intel.com>
Subject: Re: [PATCH v5] cxl: add RAS status unmasking for CXL
Date: Thu, 5 Jan 2023 16:31:27 +0000	[thread overview]
Message-ID: <20230105163127.00005ae2@huawei.com> (raw)
In-Reply-To: <20221229172731.GA611562@bhelgaas>

On Thu, 29 Dec 2022 11:27:31 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> [+cc Stefan, Kuppuswamy]
> 
> On Sat, Dec 17, 2022 at 05:52:04PM +0000, Jonathan Cameron wrote:
> > On Fri, 16 Dec 2022 09:10:02 -0700
> > Dave Jiang <dave.jiang@intel.com> wrote:
> >   
> > > By default the CXL RAS mask registers bits are defaulted to 1's and
> > > suppress all error reporting. If the kernel has negotiated ownership
> > > of error handling for CXL then unmask the mask registers by writing 0s.  
> 
> > +CC Bjorn for a question on the AER control element of this.  
> 
> Timely question (unlike my response, sorry, I've been on vacation).
> 
> > I realized that adding this patch still only enables error because I
> > didn't check the PCIe spec when writing the QEMU emulation. I had
> > changed the value of  "Correctable Internal Error Mask" to default
> > to unmasked.  PCIe 6.0 says it defaults to masked.  For some reason
> > I thought these masks were impdef (should have checked ;)  
> 
> I assume you refer to the AER "Corrected Internal Error Mask" bit
> (PCIe r6.0, sec 7.8.4.6), which indeed defaults to 1b (masked) if the
> bit is implemented.

Spot on.  I keep confusing the correctable / corrected stuff in PCIe.
Made more confusing by the CXL stuff layered on top.

> 
> > I'll drop that change in QEMU, but upshot is that if we want the
> > correctable ones to be delivered, we'll need to clear that bit as
> > well as the ones you do in this patch.  
> 
> I don't think Corrected Internal Error Mask affects reporting of other
> Correctable Errors, e.g., Receiver Error, Bad TLP, Bad DLLP, etc.  As
> I read sec 6.2.10, those would not be classified as "internal" errors.

I should have been more detailed in this part. You are absolutely correct
for errors defined by the PCIe spec.

CXL handles it's Protocol errors by reporting them all as 'internal'
errors as defined by PCIe and then providing extra register to demultiplex
that into a range of different CXL protocol errors. A similar set of
mask /severity registers to those found in PCIe for PCIe defined protocol errors
are used for fine grained control.  Currently there are about 7
correctable and 7 uncorrectable errors in CXL r3.0
It was the CXL correctable errors that I meant to refer to with
the above comment.


> 
> > This got me thinking about why I never saw the same issue for UNC
> > errors.  Upshot, QEMU never sets the UNCOR mask so defaults to everything on
> > (it also doesn't allow anyone to write the register). Curiously it has
> > the code that uses the mask even though there was no means to set any
> > bits in it. 
> > I'll fix that (draft patch below) + we should update lspci to cover more of these AER
> > flags as it's a PITA debugging by reading the hex dumps!
> > 
> > Bjorn, is there any convention on drivers enabling these 'default' masked
> > AER errors?  Or is expectation that this is policy and userspace should
> > be dealing with it?  
> 
> I think AER configuration should generally be a system policy
> decision, not a driver-level choice (unless we need quirks to work
> around device defects, of course).

Makes sense, though we may need a second layer of system policy to
cover CXL errors - currently that's what this patch is pushing to the
driver.

> 
> We now have f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is
> native"), which turns on error reporting in Device Control for all
> devices at enumeration-time when the OS has control of AER.  But this
> is only the generic device-level control; it doesn't configure any
> *AER* registers.
> 
> I'm surprised to learn that the only writes to PCI_ERR_UNCOR_MASK are
> some mips and powerpc arch-specific code and a few individual drivers.
> It seems like maybe pci_aer_init() should do some more configuration
> of the AER mask and severity registers.

Sounds good.  Any thoughts on where to get the policy from?
Feels like an administrator thing rather than a kernel config one
to me, so maybe pci_aer_init() is too early or we'd benefit from
a nice easy per device interface to tweak a default?

Jonathan

> 
> Bjorn


  reply	other threads:[~2023-01-05 16:31 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-16 16:10 [PATCH v5] cxl: add RAS status unmasking for CXL Dave Jiang
2022-12-17 17:52 ` Jonathan Cameron
2022-12-17 18:05   ` Jonathan Cameron
2023-01-05 16:53     ` Dave Jiang
2023-01-06 11:22       ` Jonathan Cameron
2022-12-29 17:27   ` Bjorn Helgaas
2023-01-05 16:31     ` Jonathan Cameron [this message]
2023-01-05 16:54       ` Bjorn Helgaas
2023-01-06 11:26         ` Jonathan Cameron
2023-02-10  8:07           ` Ira Weiny
2023-02-10 12:29             ` Jonathan Cameron

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230105163127.00005ae2@huawei.com \
    --to=jonathan.cameron@huawei.com \
    --cc=alison.schofield@intel.com \
    --cc=bhelgaas@google.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=helgaas@kernel.org \
    --cc=ira.weiny@intel.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=sr@denx.de \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox