linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Sathyanarayanan Kuppuswamy  <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>
Subject: Re: [PATCH v2] PCI/EDR: Clear PCIe Device Status errors after EDR error recovery
Date: Thu, 30 Mar 2023 10:45:42 -0500	[thread overview]
Message-ID: <20230330154542.GA3147375@bhelgaas> (raw)
In-Reply-To: <b6fd4af3-18f7-0a7e-96e7-4ca3c4ada279@linux.intel.com>

On Wed, Mar 29, 2023 at 03:38:04PM -0700, Sathyanarayanan Kuppuswamy wrote:
> On 3/29/23 3:09 PM, Bjorn Helgaas wrote:
> > On Wed, Mar 15, 2023 at 04:54:49PM -0700, Kuppuswamy Sathyanarayanan wrote:
> >> Commit 068c29a248b6 ("PCI/ERR: Clear PCIe Device Status errors only if
> >> OS owns AER") adds support to clear error status in the Device Status
> >> Register(DEVSTA) only if OS owns the AER support. But this change
> >> breaks the requirement of the EDR feature which requires OS to cleanup
> >> the error registers even if firmware owns the control of AER support.

> > I assume we should have a Fixes: tag here, since this patch should be
> > backported to every kernel that contains 068c29a248b6.  Possibly even
> > a stable tag, although it's arguable whether it's "critical" per
> > Documentation/process/stable-kernel-rules.rst.
> 
> Yes. But this error is only reproducible in the EDR use case. So I
> am not sure whether it can be considered a critical fix. 

I don't know how widespread EDR implementation is.  What is the
user-visible issue without this fix?  "lspci" shows status bits set
even after recovery?  Subsequent EDR notifications cause us to report
errors that were previously reported and recovered?  Spurious EDR
notifications because of status bits that should have been cleared?
This kind of information would be useful in the commit log anyway.

Since the risk is low (the change only affects EDR processing) and the
the experience without this change might be poor (please clarify what
that experience is), I think I would be inclined to mark it for
stable.

> > It's a little weird to work around a change inside pcie_do_recovery()
> > by clearing it here, and that means we clear it twice in the AER
> > native case, but I don't see any simpler way to do this, so this seems
> > fine as the fix for the current issue.
> 
> In AER native case, edr_handle_event() will never be triggered. So it
> won't be cleared twice.

This sounds like a plausible assumption.  But is there actually spec
language that says EDR notification is not allowed in the AER native
case (when OS owns the AER Capability)?  I looked but didn't find
anything.

> Other way is to add a new parameter to pcie_do_recovery(..., edr) and use
> it to conditionally call pcie_clear_device_status(). But I think current
> way is less complex.

I agree.

> > Question though: in the AER native case, pcie_do_recovery() calls
> > both:
> > 
> >   pcie_clear_device_status() and
> >   pci_aer_clear_nonfatal_status()
> > 
> > In this patch, you only call pcie_clear_device_status().  Do you care
> > about pci_aer_clear_nonfatal_status(), too?
> 
> Yes, we care about it. Since we call dpc_process_error() in EDR handler,
> it will eventually clear error status via pci_aer_clear_nonfatal_status()
> and pci_aer_clear_fatal_status() within dpc_process_error().

dpc_process_error() calls pci_aer_clear_nonfatal_status() in *some*
(but not all) cases.  I didn't try to work out whether those match the
cases where pcie_do_recovery() called it before 068c29a248b6.  I guess
we can assume it's equivalent for now.

> > The overall design for clearing status has gotten pretty complicated
> > as we've added error handling methods (firmware-first, DPC, EDR), and
> > there are so many different places and cases that it's hard to be sure
> > we do them all correctly.
> > 
> > I don't really know how to clean this up, so I'm just attaching my
> > notes about the current state:
> 
> Good summary! I can see a lot of overlap in clearing
> PCI_ERR_UNCOR_STATUS and PCI_EXP_DEVSTA.

Actually I do have one idea: in the firmware-first case, firmware
collects all the status information, clears it, and then passes the
status on to the OS.  In this case we don't need to clear the status
registers in handle_error_source(), pcie_do_recovery(), etc.

So I think the OS *should* be able to do something similar by
collecting the status information and clearing it first, before
starting error handling.  This might let us collect the status
clearing together in one place and also converge the firmware-first
and native error handling paths.

Obviously that would be a major future project.

Bjorn

  reply	other threads:[~2023-03-30 15:46 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-15 23:54 [PATCH v2] PCI/EDR: Clear PCIe Device Status errors after EDR error recovery Kuppuswamy Sathyanarayanan
2023-03-24 17:53 ` Sathyanarayanan Kuppuswamy
2023-03-29 22:09 ` Bjorn Helgaas
2023-03-29 22:38   ` Sathyanarayanan Kuppuswamy
2023-03-30 15:45     ` Bjorn Helgaas [this message]
2023-03-31  6:46       ` Sathyanarayanan Kuppuswamy
2023-03-31 15:10         ` Bjorn Helgaas
2023-04-06 21:07 ` Bjorn Helgaas
2023-04-06 21:52   ` Sathyanarayanan Kuppuswamy
2023-04-06 22:21     ` Bjorn Helgaas
2023-04-06 22:46       ` Natu, Mahesh
2023-04-07  5:31       ` Sathyanarayanan Kuppuswamy
2023-04-07 16:46         ` Bjorn Helgaas
2023-04-07 22:19           ` Sathyanarayanan Kuppuswamy
2023-04-07 22:41             ` Bjorn Helgaas
2023-04-07 21:51 ` Bjorn Helgaas
2023-04-07 21:52   ` Bjorn Helgaas
2023-04-07 22:25     ` Sathyanarayanan Kuppuswamy
2023-04-07 22:21   ` Sathyanarayanan Kuppuswamy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230330154542.GA3147375@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=bhelgaas@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).