From: Lukas Wunner <lukas@wunner.de>
To: Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Subject: Re: [PATCH v1] PCI: pciehp: Make sure DPC trigger status is reset in PDC handler
Date: Fri, 16 Jun 2023 11:06:35 +0200 [thread overview]
Message-ID: <20230616090635.GA17565@wunner.de> (raw)
In-Reply-To: <713d71dc-c4a5-cd7b-2deb-343c244dd14d@linux.intel.com>
[cc += Smita]
On Thu, Jun 15, 2023 at 04:03:54PM -0700, Sathyanarayanan Kuppuswamy wrote:
> On 6/15/23 11:35 AM, Lukas Wunner wrote:
> > On Wed, Jun 14, 2023 at 11:25:59PM -0700, Kuppuswamy Sathyanarayanan wrote:
> > > During the EDR-based DPC recovery process, for devices with persistent
> > > issues, the firmware may choose not to handle the DPC error and leave
> > > the port in DPC triggered state. In such scenarios, if the user
> > > replaces the faulty device with a new one, the OS is expected to clear
> > > the DPC trigger status in the hotplug error handler to enable the new
> > > device enumeration.
[...]
> >
> > pciehp_unconfigure_device() seems like a more appropriate place to me.
> >
>
> I initially thought to add it there. Spec also recommends clearing it
> when removing the device. But I wasn't sure if pciehp_unconfigure_device()
> would be called only during device removal.
It is.
> > > More details about this issue can be found in PCIe
> > > firmware specification, r3.3, sec titled "DPC Event Handling"
> > > Implementation note.
> >
> > That Implementation Note contains a lot of text and a fairly complex
> > flow chart. If you could point to specific paragraphs or numbers in
> > the Implementation Note that would make life easier for a reviewer
> > to make the connection between your code and the spec.
>
> It is the text at the end of the flowchart. Copied it here for reference.
>
> For devices with persistent errors, a port may be kept in the DPC triggered
> state (disabled) to keep those devices from continuing to generate errors.
> For hot-plug slots, the errant device may be removed and replaced with a new
> device.
> If the DPC trigger state is not cleared, then the port above the newly
> inserted device will still be disabled and will be non-operational.
> Therefore, operating systems may need to modify their hot-plug interrupt
> handling code to clear DPC Trigger Status when a device is removed so that
> a subsequent insertion will succeed.
Please add that excerpt to the commit message.
> > This may run concurrently to dpc_reset_link(), so I'd expect that
> > you need some kind of serialization. What happens if pciehp clears
> > trigger status behind the DPC driver's back while it is handling an
> > error?
>
> Currently, we only call pci_dpc_reset_trigger() in PDC interrupt handler.
>
> Do you think there would be a race between error handler and PDC handler?
Yes I think so.
We need to differentiate between two cases:
(1) DPC handled by firmware, hotplug handled by OS:
In this case clearing DPC trigger status from pciehp device removal
code path seems reasonable. But it must be constrained to
!host_bridge->native_dpc.
(2) DPC handled by OS:
In this case clearing DPC trigger status from pciehp could race with
the dpc interrupt handler so must not be done. Instead, I recommend
clearing trigger status from the dpc interrupt handler. You should
see a Surprise Down error handled by the dpc interrupt handler.
Make sure DPC trigger status is *always* cleared in that case.
Note that Smita Koralahalli is currently working on something similar:
https://lore.kernel.org/linux-pci/20230418210526.36514-2-Smita.KoralahalliChannabasappa@amd.com/
(@Smita sorry for the delay, I'll get to your patches ASAP.)
I recommend splitting the two cases above into two commits, one for
firmware-handled DPC and one for OS-native DPC. IIUC, you only need
the former to address Dell's finding.
Thanks,
Lukas
next prev parent reply other threads:[~2023-06-16 9:08 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-15 6:25 [PATCH v1] PCI: pciehp: Make sure DPC trigger status is reset in PDC handler Kuppuswamy Sathyanarayanan
2023-06-15 8:18 ` kernel test robot
2023-06-15 18:35 ` Lukas Wunner
2023-06-15 23:03 ` Sathyanarayanan Kuppuswamy
2023-06-16 9:06 ` Lukas Wunner [this message]
2023-06-16 23:27 ` Sathyanarayanan Kuppuswamy
2023-06-17 8:31 ` Lukas Wunner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230616090635.GA17565@wunner.de \
--to=lukas@wunner.de \
--cc=Smita.KoralahalliChannabasappa@amd.com \
--cc=bhelgaas@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=sathyanarayanan.kuppuswamy@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.