From: Brian Norris <briannorris@chromium.org>
To: Lukas Wunner <lukas@wunner.de>
Cc: manivannan.sadhasivam@oss.qualcomm.com,
Bjorn Helgaas <bhelgaas@google.com>,
Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
Oliver O'Halloran <oohall@gmail.com>,
Will Deacon <will@kernel.org>,
Lorenzo Pieralisi <lpieralisi@kernel.org>,
Krzysztof Wilczynski <kwilczynski@kernel.org>,
Manivannan Sadhasivam <mani@kernel.org>,
Rob Herring <robh@kernel.org>, Heiko Stuebner <heiko@sntech.de>,
Philipp Zabel <p.zabel@pengutronix.de>,
linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
linuxppc-dev@lists.ozlabs.org,
linux-arm-kernel@lists.infradead.org,
linux-arm-msm@vger.kernel.org,
linux-rockchip@lists.infradead.org,
Niklas Cassel <cassel@kernel.org>,
Wilfred Mallawa <wilfred.mallawa@wdc.com>,
Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Subject: Re: [PATCH v6 2/4] PCI: host-common: Add link down handling for Root Ports
Date: Fri, 29 Aug 2025 16:58:24 -0700 [thread overview]
Message-ID: <aLI-oKWVJHFfst-i@google.com> (raw)
In-Reply-To: <aLFmSFe5iyYDrIjt@wunner.de>
Hi Lukas,
On Fri, Aug 29, 2025 at 10:35:20AM +0200, Lukas Wunner wrote:
> On Thu, Aug 28, 2025 at 01:25:12PM -0700, Brian Norris wrote:
> > On the flip side: it's not clear
> > PCI_ERS_RESULT_NEED_RESET+pci_channel_io_normal works as documented
> > either. An endpoint might think it's requesting a slot reset, but
> > pcie_do_recovery() will ignore that and skip reset_subordinates()
> > (pci_host_reset_root_port()).
> >
> > All in all, the docs sound like endpoints _should_ have control over
> > whether we exercise a full port/slot reset for all types of errors. But
> > in practice, we do not actually give it that control. i.e., your commit
> > message is correct, and the docs are not.
> >
> > I have half a mind to suggest the appended change, so the behavior
> > matches (some of) the docs a little better [1].
>
> A change similar to the one you're proposing is already queued on the
> pci/aer topic branch for v6.18:
>
> https://git.kernel.org/pci/pci/c/d0a2dee7d458
Wow, nice coincidence. It's a reminder I should work off the maintainer
/ -next branch, instead of just mainline...
> Here's the corresponding cover letter:
>
> https://lore.kernel.org/r/cover.1755008151.git.lukas@wunner.de
>
> There was a discussion why I didn't take the exact same approach you're
> proposing, but only a similar one:
>
> https://lore.kernel.org/r/aJ2uE6v46Zib30Jh@wunner.de
> https://lore.kernel.org/r/aKHWf3L0NCl_CET5@wunner.de
Wow, that's a ton of great background and explanation. Thanks!
> > Specifically, I'm trying to see what's supposed to happen with
> > PCI_ERS_RESULT_CAN_RECOVER. I see that for pci_channel_io_frozen, almost
> > all endpoint drivers return PCI_ERS_RESULT_NEED_RESET, but if drivers
> > actually return PCI_ERS_RESULT_CAN_RECOVER, it's unclear what should
> > happen.
> >
> > Today, we don't actually respect it; pcie_do_recovery() just calls
> > reset_subordinates() (pci_host_reset_root_port()) unconditionally. The
> > only thing that return code affects is whether we call
> > report_mmio_enabled() vs report_slot_reset() afterward. This seems odd.
>
> In the series queued on pci/aer, I've only allowed drivers to opt in
> to a reset on Non-Fatal Errors. I didn't dare also letting them opt
> out of a reset on Fatal Errors.
Right, I can see where the latter is risky. Frankly, while I have
endpoint drivers suggesting they should be able to do this, I'm not sure
that's a great idea. Or at least, I can see how it would potentially
break other clients, as you explain.
> These changes of behavior are always risky, so it seemed prudent to not
> introduce too many changes at once. There was no urgent need to also
> change behavior for Fatal Errors for the use case at hand (the xe graphics
> driver). I went through all drivers with pci_error_handlers to avoid
> breaking any of them. It's very tedious work, takes weeks. It would
> be necessary to do that again when changing behavior for Fatal Errors.
>
> pcieaer-howto.rst justifies the unconditional reset on Fatal Errors by
> saying that the link is unreliable and that a reset is thus required.
>
> On the other hand, pci-error-recovery.rst (which is a few months older
> than pcieaer-howto.rst) says in section "STEP 3: Link Reset":
> "This is a PCIe specific step and is done whenever a fatal error has been
> detected"
>
> I'm wondering if the authors of pcieaer-howto.rst took that at face value
> and thought they'd *have* to reset the link on Fatal Errors.
>
> Looking through the Fatal Errors in PCIe r7.0 sec 6.2.7, I think a reset
> is justified for some of them, but optional for others. Which leads me
> to believe that the AER driver should actually enforce a reset only for
> certain Fatal Errors, not all of them. So this seems like something
> worth revisiting in the future.
Hmm, possibly. I haven't looked so closely at the details on all Fatal
Errors, but I may have a look eventually.
> > All in all, the docs sound like endpoints _should_ have control over
> > whether we exercise a full port/slot reset for all types of errors. But
> > in practice, we do not actually give it that control. i.e., your commit
> > message is correct, and the docs are not.
>
> Indeed the documentation is no longer in sync with the code. I've just
> submitted a series to rectify that and cc'ed you:
>
> https://lore.kernel.org/r/cover.1756451884.git.lukas@wunner.de
Thanks! I'll try to take a pass at reviewing, but it may not be prompt.
Thanks again for all the info and work here.
Brian
WARNING: multiple messages have this Message-ID (diff)
From: Brian Norris <briannorris@chromium.org>
To: Lukas Wunner <lukas@wunner.de>
Cc: manivannan.sadhasivam@oss.qualcomm.com,
Bjorn Helgaas <bhelgaas@google.com>,
Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
Oliver O'Halloran <oohall@gmail.com>,
Will Deacon <will@kernel.org>,
Lorenzo Pieralisi <lpieralisi@kernel.org>,
Krzysztof Wilczynski <kwilczynski@kernel.org>,
Manivannan Sadhasivam <mani@kernel.org>,
Rob Herring <robh@kernel.org>, Heiko Stuebner <heiko@sntech.de>,
Philipp Zabel <p.zabel@pengutronix.de>,
linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
linuxppc-dev@lists.ozlabs.org,
linux-arm-kernel@lists.infradead.org,
linux-arm-msm@vger.kernel.org,
linux-rockchip@lists.infradead.org,
Niklas Cassel <cassel@kernel.org>,
Wilfred Mallawa <wilfred.mallawa@wdc.com>,
Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Subject: Re: [PATCH v6 2/4] PCI: host-common: Add link down handling for Root Ports
Date: Fri, 29 Aug 2025 16:58:24 -0700 [thread overview]
Message-ID: <aLI-oKWVJHFfst-i@google.com> (raw)
In-Reply-To: <aLFmSFe5iyYDrIjt@wunner.de>
Hi Lukas,
On Fri, Aug 29, 2025 at 10:35:20AM +0200, Lukas Wunner wrote:
> On Thu, Aug 28, 2025 at 01:25:12PM -0700, Brian Norris wrote:
> > On the flip side: it's not clear
> > PCI_ERS_RESULT_NEED_RESET+pci_channel_io_normal works as documented
> > either. An endpoint might think it's requesting a slot reset, but
> > pcie_do_recovery() will ignore that and skip reset_subordinates()
> > (pci_host_reset_root_port()).
> >
> > All in all, the docs sound like endpoints _should_ have control over
> > whether we exercise a full port/slot reset for all types of errors. But
> > in practice, we do not actually give it that control. i.e., your commit
> > message is correct, and the docs are not.
> >
> > I have half a mind to suggest the appended change, so the behavior
> > matches (some of) the docs a little better [1].
>
> A change similar to the one you're proposing is already queued on the
> pci/aer topic branch for v6.18:
>
> https://git.kernel.org/pci/pci/c/d0a2dee7d458
Wow, nice coincidence. It's a reminder I should work off the maintainer
/ -next branch, instead of just mainline...
> Here's the corresponding cover letter:
>
> https://lore.kernel.org/r/cover.1755008151.git.lukas@wunner.de
>
> There was a discussion why I didn't take the exact same approach you're
> proposing, but only a similar one:
>
> https://lore.kernel.org/r/aJ2uE6v46Zib30Jh@wunner.de
> https://lore.kernel.org/r/aKHWf3L0NCl_CET5@wunner.de
Wow, that's a ton of great background and explanation. Thanks!
> > Specifically, I'm trying to see what's supposed to happen with
> > PCI_ERS_RESULT_CAN_RECOVER. I see that for pci_channel_io_frozen, almost
> > all endpoint drivers return PCI_ERS_RESULT_NEED_RESET, but if drivers
> > actually return PCI_ERS_RESULT_CAN_RECOVER, it's unclear what should
> > happen.
> >
> > Today, we don't actually respect it; pcie_do_recovery() just calls
> > reset_subordinates() (pci_host_reset_root_port()) unconditionally. The
> > only thing that return code affects is whether we call
> > report_mmio_enabled() vs report_slot_reset() afterward. This seems odd.
>
> In the series queued on pci/aer, I've only allowed drivers to opt in
> to a reset on Non-Fatal Errors. I didn't dare also letting them opt
> out of a reset on Fatal Errors.
Right, I can see where the latter is risky. Frankly, while I have
endpoint drivers suggesting they should be able to do this, I'm not sure
that's a great idea. Or at least, I can see how it would potentially
break other clients, as you explain.
> These changes of behavior are always risky, so it seemed prudent to not
> introduce too many changes at once. There was no urgent need to also
> change behavior for Fatal Errors for the use case at hand (the xe graphics
> driver). I went through all drivers with pci_error_handlers to avoid
> breaking any of them. It's very tedious work, takes weeks. It would
> be necessary to do that again when changing behavior for Fatal Errors.
>
> pcieaer-howto.rst justifies the unconditional reset on Fatal Errors by
> saying that the link is unreliable and that a reset is thus required.
>
> On the other hand, pci-error-recovery.rst (which is a few months older
> than pcieaer-howto.rst) says in section "STEP 3: Link Reset":
> "This is a PCIe specific step and is done whenever a fatal error has been
> detected"
>
> I'm wondering if the authors of pcieaer-howto.rst took that at face value
> and thought they'd *have* to reset the link on Fatal Errors.
>
> Looking through the Fatal Errors in PCIe r7.0 sec 6.2.7, I think a reset
> is justified for some of them, but optional for others. Which leads me
> to believe that the AER driver should actually enforce a reset only for
> certain Fatal Errors, not all of them. So this seems like something
> worth revisiting in the future.
Hmm, possibly. I haven't looked so closely at the details on all Fatal
Errors, but I may have a look eventually.
> > All in all, the docs sound like endpoints _should_ have control over
> > whether we exercise a full port/slot reset for all types of errors. But
> > in practice, we do not actually give it that control. i.e., your commit
> > message is correct, and the docs are not.
>
> Indeed the documentation is no longer in sync with the code. I've just
> submitted a series to rectify that and cc'ed you:
>
> https://lore.kernel.org/r/cover.1756451884.git.lukas@wunner.de
Thanks! I'll try to take a pass at reviewing, but it may not be prompt.
Thanks again for all the info and work here.
Brian
_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-rockchip
next prev parent reply other threads:[~2025-08-30 0:52 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-15 14:21 [PATCH v6 0/4] PCI: Add support for resetting the Root Ports in a platform specific way Manivannan Sadhasivam
2025-07-15 14:21 ` Manivannan Sadhasivam via B4 Relay
2025-07-15 14:21 ` Manivannan Sadhasivam via B4 Relay
2025-07-15 14:21 ` [PATCH v6 1/4] PCI/ERR: " Manivannan Sadhasivam
2025-07-15 14:21 ` Manivannan Sadhasivam via B4 Relay
2025-07-15 14:21 ` Manivannan Sadhasivam via B4 Relay
2025-07-17 18:28 ` [PATCH v6 1/4] PCI/ERR: Add support for resetting the Root Ports in a platform specific wayy Frank Li
2025-07-17 18:28 ` Frank Li
2025-07-15 14:21 ` [PATCH v6 2/4] PCI: host-common: Add link down handling for Root Ports Manivannan Sadhasivam
2025-07-15 14:21 ` Manivannan Sadhasivam via B4 Relay
2025-07-15 14:21 ` Manivannan Sadhasivam via B4 Relay
2025-07-17 18:31 ` [PATCH v6 2/4] PCI: host-common: Add link down handling for Root Portsy Frank Li
2025-07-17 18:31 ` Frank Li
2025-08-28 20:25 ` [PATCH v6 2/4] PCI: host-common: Add link down handling for Root Ports Brian Norris
2025-08-28 20:25 ` Brian Norris
2025-08-29 8:35 ` Lukas Wunner
2025-08-29 23:58 ` Brian Norris [this message]
2025-08-29 23:58 ` Brian Norris
2025-07-15 14:21 ` [PATCH v6 3/4] PCI: qcom: Add support for resetting the Root Port due to link down event Manivannan Sadhasivam
2025-07-15 14:21 ` Manivannan Sadhasivam via B4 Relay
2025-07-15 14:21 ` Manivannan Sadhasivam via B4 Relay
2025-07-15 14:21 ` [PATCH v6 4/4] PCI: dw-rockchip: Add support to reset Root Port upon " Manivannan Sadhasivam
2025-07-15 14:21 ` Manivannan Sadhasivam via B4 Relay
2025-07-15 14:21 ` Manivannan Sadhasivam via B4 Relay
2025-07-18 3:58 ` [PATCH v6 0/4] PCI: Add support for resetting the Root Ports in a platform specific way Krishna Chaitanya Chundru
2025-07-18 3:58 ` Krishna Chaitanya Chundru
2025-07-18 10:28 ` Niklas Cassel
2025-07-18 10:28 ` Niklas Cassel
2025-07-18 10:39 ` Niklas Cassel
2025-07-18 10:39 ` Niklas Cassel
2025-07-24 5:30 ` Manivannan Sadhasivam
2025-07-24 5:30 ` Manivannan Sadhasivam
2025-08-15 9:07 ` Niklas Cassel
2025-08-15 9:07 ` Niklas Cassel
2025-08-29 16:14 ` Manivannan Sadhasivam
2025-08-29 16:14 ` Manivannan Sadhasivam
2025-09-04 14:03 ` Niklas Cassel
2025-09-04 14:03 ` Niklas Cassel
2026-03-10 13:37 ` Manivannan Sadhasivam
2026-03-10 13:37 ` Manivannan Sadhasivam
2026-03-11 10:59 ` Niklas Cassel
2026-03-11 10:59 ` Niklas Cassel
2026-03-11 14:19 ` Manivannan Sadhasivam
2026-03-11 14:19 ` Manivannan Sadhasivam
2025-07-24 9:28 ` Hongxing Zhu
2025-07-24 9:28 ` Hongxing Zhu
2025-08-28 20:01 ` Brian Norris
2025-08-28 20:01 ` Brian Norris
2025-08-29 13:56 ` Manivannan Sadhasivam
2025-08-29 13:56 ` Manivannan Sadhasivam
2025-09-23 13:06 ` David Bremner
2025-09-23 13:06 ` David Bremner
2025-09-23 13:46 ` Niklas Cassel
2025-09-23 13:46 ` Niklas Cassel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aLI-oKWVJHFfst-i@google.com \
--to=briannorris@chromium.org \
--cc=bhelgaas@google.com \
--cc=cassel@kernel.org \
--cc=heiko@sntech.de \
--cc=krishna.chundru@oss.qualcomm.com \
--cc=kwilczynski@kernel.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-arm-msm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=linux-rockchip@lists.infradead.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=lpieralisi@kernel.org \
--cc=lukas@wunner.de \
--cc=mahesh@linux.ibm.com \
--cc=mani@kernel.org \
--cc=manivannan.sadhasivam@oss.qualcomm.com \
--cc=oohall@gmail.com \
--cc=p.zabel@pengutronix.de \
--cc=robh@kernel.org \
--cc=wilfred.mallawa@wdc.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.