From: Bjorn Helgaas <helgaas@kernel.org>
To: "Maciej W. Rozycki" <macro@orcam.me.uk>
Cc: "Bjorn Helgaas" <bhelgaas@google.com>,
"Mahesh J Salgaonkar" <mahesh@linux.ibm.com>,
"Oliver O'Halloran" <oohall@gmail.com>,
"Michael Ellerman" <mpe@ellerman.id.au>,
"Nicholas Piggin" <npiggin@gmail.com>,
"Christophe Leroy" <christophe.leroy@csgroup.eu>,
"Saeed Mahameed" <saeedm@nvidia.com>,
"Leon Romanovsky" <leon@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
"Eric Dumazet" <edumazet@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Paolo Abeni" <pabeni@redhat.com>,
"Alex Williamson" <alex.williamson@redhat.com>,
"Lukas Wunner" <lukas@wunner.de>,
"Mika Westerberg" <mika.westerberg@linux.intel.com>,
"Stefan Roese" <sr@denx.de>, "Jim Wilson" <wilson@tuliptree.org>,
"David Abdurachmanov" <david.abdurachmanov@gmail.com>,
"Pali Rohár" <pali@kernel.org>,
linux-pci@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
linux-rdma@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v8 7/7] PCI: Work around PCIe link training failures
Date: Thu, 4 May 2023 17:20:48 -0500
Message-ID: <20230504222048.GA887151@bhelgaas>
In-Reply-To: <alpine.DEB.2.21.2304060116380.13659@angie.orcam.me.uk>
On Thu, Apr 06, 2023 at 01:21:31AM +0100, Maciej W. Rozycki wrote:
> Attempt to handle cases such as with a downstream port of the ASMedia
> ASM2824 PCIe switch where link training never completes and the link
> continues switching between speeds indefinitely with the data link layer
> never reaching the active state.
We're going to land this series this cycle, come hell or high water.

We talked about reusing pcie_retrain_link() earlier. IIRC that didn't
work: ASPM needs to use PCI_EXP_LNKSTA_LT because not all devices
support PCI_EXP_LNKSTA_DLLLA, and you need PCI_EXP_LNKSTA_DLLLA
because the erratum makes PCI_EXP_LNKSTA_LT flap.

What if we made pcie_retrain_link() reusable by making it:

  bool pcie_retrain_link(struct pci_dev *pdev, u16 link_status_bit)

so ASPM could use pcie_retrain_link(link->pdev, PCI_EXP_LNKSTA_LT) and
you could use pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA)?
Maybe do it in two steps?

  1) Move pcie_retrain_link() just after pcie_wait_for_link() and make
     it take link->pdev instead of link.

  2) Add the bit parameter.
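To illustrate why the status bit has to be a parameter, here is a minimal
userspace sketch, not kernel code. struct fake_port, fake_read_lnksta() and
fake_retrain_link() are hypothetical names that replay a scripted Link Status
register; the two bit values are the ones from include/uapi/linux/pci_regs.h.
The sketch assumes the helper waits for LT to clear or for DLLLA to set, and
it leaves out writing the Retrain Link bit and the real timeout handling.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_LNKSTA_LT	0x0800	/* Link Training */
#define PCI_EXP_LNKSTA_DLLLA	0x2000	/* Data Link Layer Link Active */

/* Hypothetical stand-in for a port: a scripted Link Status register. */
struct fake_port {
	const uint16_t *lnksta_seq;	/* values returned by successive reads */
	int len;			/* number of scripted values */
	int pos;			/* index of the next value to return */
};

/* Return the next scripted value; the last one repeats forever. */
static uint16_t fake_read_lnksta(struct fake_port *p)
{
	uint16_t v = p->lnksta_seq[p->pos];

	if (p->pos < p->len - 1)
		p->pos++;
	return v;
}

/*
 * Poll the selected status bit until it reads as "link is up":
 * LT has to be clear, DLLLA has to be set (assumption of this sketch).
 * Returns true on success, false once max_polls reads are used up.
 */
static bool fake_retrain_link(struct fake_port *p, uint16_t link_status_bit,
			      int max_polls)
{
	bool want_set = (link_status_bit == PCI_EXP_LNKSTA_DLLLA);

	for (int i = 0; i < max_polls; i++) {
		uint16_t lnksta = fake_read_lnksta(p);
		bool is_set = (lnksta & link_status_bit) != 0;

		if (is_set == want_set)
			return true;
	}
	return false;
}
```

With a register that only toggles LT and never sets DLLLA, polling
PCI_EXP_LNKSTA_LT reports success as soon as the bit happens to read clear,
while polling PCI_EXP_LNKSTA_DLLLA keeps failing, which is the distinction the
erratum case depends on.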
I'm OK with having pcie_retrain_link() in pci.c, but the surrounding
logic about restricting to 2.5GT/s, retraining, removing the
restriction, retraining again is stuff I'd rather have in quirks.c so
it doesn't clutter pci.c.
I think it'd be good if the pci_device_add() path made clear that this
is a workaround for a problem, e.g.,
  void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
  {
    ...
    if (pcie_link_failed(dev))
      pcie_fix_link_train(dev);
where pcie_fix_link_train() could live in quirks.c (with a stub when
CONFIG_PCI_QUIRKS isn't enabled). It *might* even be worth adding it
and the stub first because that's a trivial patch and wouldn't clutter
the probe.c git history with all the grotty details about ASM2824 and
this topology.
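The stub arrangement could look roughly like the following. This is only a
sketch of the usual #ifdef pattern: the bool return type follows the "return
bool" idea rather than the void prototype, and which header it would live in
is not settled anywhere in this thread.

```c
#include <stdbool.h>

struct pci_dev;	/* opaque here; the real definition is in <linux/pci.h> */

#ifdef CONFIG_PCI_QUIRKS
/* The real implementation would live in drivers/pci/quirks.c. */
bool pcie_fix_link_train(struct pci_dev *dev);
#else
/* Stub: with quirks compiled out, no retrain is ever attempted. */
static inline bool pcie_fix_link_train(struct pci_dev *dev)
{
	(void)dev;
	return false;
}
#endif
```

Callers such as pci_device_add() then need no #ifdef of their own, and the
compiler can drop the call entirely when CONFIG_PCI_QUIRKS is off.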
> +int pcie_downstream_link_retrain(struct pci_dev *dev)
> +{
> + static const struct pci_device_id ids[] = {
> + { PCI_VDEVICE(ASMEDIA, 0x2824) }, /* ASMedia ASM2824 */
> + {}
> + };
> + u16 lnksta, lnkctl2;
> +
> + if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
> + !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
> + return -1;
> +
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
> + pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
> + if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
> + PCI_EXP_LNKSTA_LBMS) {
You go to some trouble to make sure PCI_EXP_LNKSTA_LBMS is set, and I
can't remember what the reason is. If you make a preparatory patch
like this, it would give a place for that background, e.g.,
+bool pcie_link_failed(struct pci_dev *dev)
+{
+	u16 lnksta;
+
+	if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
+	    !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
+		return false;
+
+	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
+	if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
+	    PCI_EXP_LNKSTA_LBMS)
+		return true;
+
+	return false;
+}
If this is a generic thing and checking PCI_EXP_LNKSTA_LBMS makes
sense for everybody, it could go in pci.c; otherwise it could go in
quirks.c as well. I guess it's not *truly* generic anyway because it
only detects link training failures for devices that have LNKCTL2 and
link_active_reporting.
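For reference, the masked comparison in that test is true for exactly one
combination of the two bits: LBMS set and DLLLA clear. A standalone sketch of
just that predicate, operating on a raw Link Status value, is below;
lnksta_indicates_failure() is a hypothetical name, and the bit values are the
ones from include/uapi/linux/pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_LNKSTA_DLLLA	0x2000	/* Data Link Layer Link Active */
#define PCI_EXP_LNKSTA_LBMS	0x4000	/* Link Bandwidth Management Status */

/*
 * True only when LBMS is set and DLLLA is clear; every other bit in
 * the register is ignored by the mask.
 */
static bool lnksta_indicates_failure(uint16_t lnksta)
{
	const uint16_t mask = PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA;

	return (lnksta & mask) == PCI_EXP_LNKSTA_LBMS;
}
```

A link that is simply down (both bits clear) or up (DLLLA set, with or without
LBMS) does not match, so only the one combination triggers the workaround.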
> + unsigned long timeout;
> + u16 lnkctl;
> +
> + pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
> +
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> + lnkctl |= PCI_EXP_LNKCTL_RL;
> + lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> + lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
> + /*
> + * Due to an erratum in some devices the Retrain Link bit
> + * needs to be cleared again manually to allow the link
> + * training to succeed.
> + */
> + lnkctl &= ~PCI_EXP_LNKCTL_RL;
> + if (dev->clear_retrain_link)
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL,
> + lnkctl);
> +
> + timeout = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
> + do {
> + pcie_capability_read_word(dev, PCI_EXP_LNKSTA,
> + &lnksta);
> + if (lnksta & PCI_EXP_LNKSTA_DLLLA)
> + break;
> + usleep_range(10000, 20000);
> + } while (time_before(jiffies, timeout));
> +
> + if (!(lnksta & PCI_EXP_LNKSTA_DLLLA)) {
> + pci_info(dev, "retraining failed\n");
> + return -1;
> + }
> + }
> + if (IS_ENABLED(CONFIG_PCI_QUIRKS) && (lnksta & PCI_EXP_LNKSTA_DLLLA) &&
> + (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
> + pci_match_id(ids, dev)) {
> + u32 lnkcap;
> + u16 lnkctl;
> +
> + pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
> + pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> + lnkctl |= PCI_EXP_LNKCTL_RL;
> + lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> + lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
This starts a retrain; should we wait for training to complete?
> + }
If we put most of this into a pcie_fix_link_train() (separated from
detecting the *need* to fix something), could it be made to look
sort of like this? (I suppose you'd want to return bool and rename
it to something that reads naturally, e.g., "pcie_link_forcibly_retrained()",
"pcie_link_retrained()", etc.)
+void pcie_fix_link_train(struct pci_dev *dev)
+{
+	u16 lnkctl2;
+	u32 lnkcap;
+	bool linkup;
+
+	pci_info(dev, "attempting link retrain at 2.5GT/s\n");
+	pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
+	lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
+	lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
+	pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
+
+	linkup = pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA);
+	if (!linkup) {
+		pci_info(dev, "retraining failed\n");
+		return;
+	}
+
+	if (LNKCAP supports only 2.5GT/s)
+		return;
+
+	if (!pci_match_id(ids, dev))
+		return;
Your comment said "if we know this is *safe*"; I can't remember if
pci_match_id() is there to avoid a known problem?
+
+	pci_info(dev, "attempting link retrain at max supported rate\n");
+	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
+	lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
+	lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
+	pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
+
+	linkup = pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA);
+	if (!linkup)
+		pci_info(dev, "retraining failed\n");
+}
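The "LNKCAP supports only 2.5GT/s" placeholder and the Target Link Speed
updates above are plain bit-field operations. One possible way to spell them
out, as a userspace sketch on raw register values: lnkcap_only_2_5gt() and
lnkctl2_set_tls() are hypothetical helper names, and the field masks are the
ones from include/uapi/linux/pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_LNKCAP_SLS		0x0000000f /* Max Link Speed field */
#define PCI_EXP_LNKCAP_SLS_2_5GB	0x00000001 /* 2.5GT/s */
#define PCI_EXP_LNKCTL2_TLS		0x000f	   /* Target Link Speed field */
#define PCI_EXP_LNKCTL2_TLS_2_5GT	0x0001	   /* 2.5GT/s */

/* True if the port cannot go faster than 2.5GT/s anyway. */
static bool lnkcap_only_2_5gt(uint32_t lnkcap)
{
	return (lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB;
}

/* Replace only the Target Link Speed field, keeping all other bits. */
static uint16_t lnkctl2_set_tls(uint16_t lnkctl2, uint16_t tls)
{
	return (uint16_t)((lnkctl2 & ~PCI_EXP_LNKCTL2_TLS) |
			  (tls & PCI_EXP_LNKCTL2_TLS));
}
```

Clamping would then be lnkctl2_set_tls(lnkctl2, PCI_EXP_LNKCTL2_TLS_2_5GT), and
lifting the restriction lnkctl2_set_tls(lnkctl2, lnkcap & PCI_EXP_LNKCAP_SLS),
which mirrors what the quoted patch does with the two fields.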
> +
> + return 0;
> +}
> +
> +/* Same as above, but called for a downstream device. */
> +static int pcie_upstream_link_retrain(struct pci_dev *dev)
> +{
> + struct pci_dev *bridge;
> +
> + bridge = pci_upstream_bridge(dev);
> + if (bridge)
> + return pcie_downstream_link_retrain(bridge);
> + else
> + return -1;
> +}
> +
> static int pci_acs_enable;
>
> /**
> @@ -1148,8 +1274,8 @@ void pci_resume_bus(struct pci_bus *bus)
>
> static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
> {
> + int retrain = 0;
> int delay = 1;
> - u32 id;
>
> /*
> * After reset, the device should not silently discard config
> @@ -1163,21 +1289,37 @@ static int pci_dev_wait(struct pci_dev *
> * Command register instead of Vendor ID so we don't have to
> * contend with the CRS SV value.
> */
> - pci_read_config_dword(dev, PCI_COMMAND, &id);
> - while (PCI_POSSIBLE_ERROR(id)) {
> + for (;;) {
> + u32 id;
> +
> + pci_read_config_dword(dev, PCI_COMMAND, &id);
> + if (!PCI_POSSIBLE_ERROR(id)) {
> + if (delay > PCI_RESET_WAIT)
> + pci_info(dev, "ready %dms after %s\n",
> + delay - 1, reset_type);
> + break;
> + }
> +
> if (delay > timeout) {
> pci_warn(dev, "not ready %dms after %s; giving up\n",
> delay - 1, reset_type);
> return -ENOTTY;
> }
>
> - if (delay > PCI_RESET_WAIT)
> + if (delay > PCI_RESET_WAIT) {
> + if (!retrain) {
> + retrain = 1;
> + if (pcie_upstream_link_retrain(dev) == 0) {
> + delay = 1;
> + continue;
> + }
> + }
> pci_info(dev, "not ready %dms after %s; waiting\n",
> delay - 1, reset_type);
> + }
Thanks for fixing this in the reset path, too. Can we move this part
to a separate patch? It's related to the rest of the patch, but it
looks so much different that I think it would be easier to understand
by itself.
I think I might try to fold the pcie_upstream_link_retrain() directly
in here because the "upstream link retrain" in the function name
doesn't really make sense in PCIe terms.
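Stripped of the PCI details, the control flow the hunk adds to pci_dev_wait()
is "back off as before, try one retrain when the wait first exceeds
PCI_RESET_WAIT, and restart the backoff if that retrain succeeded". A
userspace model of just that flow follows. struct fake_dev, fake_retrain() and
fake_dev_wait() are hypothetical, nothing sleeps, and the doubling of the
delay mirrors what the unquoted remainder of the kernel loop does.

```c
#include <stdbool.h>

#define RESET_WAIT_MS	1000	/* mirrors PCI_RESET_WAIT in the kernel */

/* Hypothetical device model for the sketch. */
struct fake_dev {
	bool link_stuck;	/* device unreachable while this is set */
	bool retrain_works;	/* whether a retrain unsticks the link */
	int retrain_calls;	/* how many retrains were attempted */
};

/* Attempt a retrain; 0 on success, -1 on failure. */
static int fake_retrain(struct fake_dev *d)
{
	d->retrain_calls++;
	if (!d->retrain_works)
		return -1;
	d->link_stuck = false;
	return 0;
}

/* Returns 0 once the device answers, -1 if timeout_ms is exceeded. */
static int fake_dev_wait(struct fake_dev *d, int timeout_ms)
{
	bool tried_retrain = false;
	int delay = 1;

	for (;;) {
		if (!d->link_stuck)
			return 0;	/* device is ready */

		if (delay > timeout_ms)
			return -1;	/* give up */

		if (delay > RESET_WAIT_MS && !tried_retrain) {
			tried_retrain = true;	/* at most one attempt */
			if (fake_retrain(d) == 0) {
				delay = 1;	/* restart the backoff */
				continue;
			}
		}

		delay *= 2;	/* the kernel sleeps for 'delay' ms here */
	}
}
```

The one-shot flag is what keeps a device that never recovers from being
retrained on every iteration; it still times out exactly as it did before.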
Bjorn