* [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Matthew W Carlis @ 2025-07-02 5:24 UTC (permalink / raw)
To: linux-pci; +Cc: bhelgaas, ashishk, macro, msaggi, sconnor, Matthew W Carlis
The second iteration fixes a typo (a missing close parenthesis).
This series restricts the pcie_failed_link_retrain() quirk to the ASM2824.
Matthew W Carlis (1):
PCI: pcie_failed_link_retrain() return if dev is not ASM2824
drivers/pci/quirks.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--
2.46.0
* [PATCH v2 1/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Matthew W Carlis @ 2025-07-02 5:24 UTC (permalink / raw)
To: linux-pci; +Cc: bhelgaas, ashishk, macro, msaggi, sconnor, Matthew W Carlis
pcie_failed_link_retrain() was added due to a behavior observed under a
very specific set of circumstances, which are described in a comment above
the function. The "quirk" is supposed to force the link down to Gen1 in the
case where the LTSSM is stuck in a loop or failing to train, etc. The
problem is that this "quirk" is applied to any bridge & it can often write
the Gen1 TLS (Target Link Speed) when it should not, leaving the port in a
state that will result in a device linking up at Gen1 when it should not.
Incorrect action by pcie_failed_link_retrain() has been observed with a
variety of different NVMe drives using U.2 connectors & in multiple
different hardware designs: directly attached to the root port, and
downstream of a PCIe switch (Microchip/Broadcom), with different
generations of Intel CPU.
All of these systems were configured without power controller capability.
They were also all in compliance with the Async Hot-Plug Reference Model in
PCI Express® Base Specification Revision 6.0, Appendix I, for OS-controlled
DPC Hot-Plug.
The issue appears to be more likely to hit when using
OOB PD (out-of-band presence detect), but has also been observed without
OOB PD support ('DLL State Changed' or 'In-Band PD').
Powering off or power cycling the slot via an out-of-band power control
mechanism with OOB PD is extremely likely to hit this issue since the kernel
would see that slot presence is true. Physical hot-insertion is also extremely
likely to hit this issue with OOB PD with U.2 drives due to the timing
between presence assertion and the actual power-on/link-up of the NVMe
drive itself. When the device eventually does power up, the TLS would
have been left forced to Gen1. This is similarly true in the case of
power cycling or powering off the slot.
The exact circumstances under which this issue is hit in a system without
OOB PD haven't been fully understood, due to having fewer reproductions as
well as having reverted this patch for those configurations.
Signed-off-by: Matthew W Carlis <mattc@purestorage.com>
---
drivers/pci/quirks.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index d7f4ee634263..39bb0c025119 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -100,6 +100,8 @@ int pcie_failed_link_retrain(struct pci_dev *dev)
};
u16 lnksta, lnkctl2;
int ret = -ENOTTY;
+ if (!pci_match_id(ids, dev))
+ return ret;
if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
!pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
@@ -124,8 +126,7 @@ int pcie_failed_link_retrain(struct pci_dev *dev)
}
if ((lnksta & PCI_EXP_LNKSTA_DLLLA) &&
- (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
- pci_match_id(ids, dev)) {
+ (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT) {
u32 lnkcap;
pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
--
2.46.0
* Re: [PATCH v2 1/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Ilpo Järvinen @ 2025-07-03 12:11 UTC (permalink / raw)
To: Matthew W Carlis; +Cc: linux-pci, bhelgaas, ashishk, macro, msaggi, sconnor
On Tue, 1 Jul 2025, Matthew W Carlis wrote:
Hi Matthew,
I have a few simple style-related comments on the patch itself and some
questions about this scenario below.
The wording in the shortlog (in subject) sounds a bit clumsy to me,
perhaps change it to something like this:
PCI: Don't use Target Speed quirk if device is not ASM2824
> The pcie_failed_link_retrain() was added due to a behavior observed with
> a very specific set of circumstances which are in a comment above the
> function. The "quirk" is supposed to force the link down to Gen1 in the
> case where LTSSM is stuck in a loop or failing to train etc. The problem
> is that this "quirk" is applied to any bridge & it can often write the
> Gen1 TLS (Target Link Speed) when it should not. Leaving the port in
> a state that will result in a device linking up at Gen1 when it should not.
> Incorrect action by pcie_failed_link_retrain() has been observed with a
> variety of different NVMe drives using U.2 connectors & in multiple different
> hardware designs. Directly attached to the root port, downstream of a
> PCIe switch (Microchip/Broadcom) with different generations of Intel CPU.
> All of these systems were configured without power controller capability.
> They were also all in compliance with the Async Hot-Plug Reference model in
> PCI Express® Base Specification Revision 6.0 Appendix I. for OS controlled
> DPC Hot-Plug.
> The issue appears to be more likely to hit when using
> OOB PD (out-of-band presence detect), but has also been observed without
> OOB PD support ('DLL State Changed' or 'In-Band PD').
> Powering off or power cycling the slot via an out-of-band power control
> mechanism with OOB PD is extremely likely to hit since the kernel would
> see that slot presence is true. Physical Hot-insertion is also extremly
extremely
> likely to hit this issue with OOB PD with U.2 drives due to timing
> between presence assertion and the actual power-on/link-up of the NVMe
> drive itself. When the device eventually does power-up the TLS would
> have been left forced to Gen1. This is similarly true to the case of
> power cycling or powering off the slot.
> The exact circumstances under which this issue is hit in a system without
> OOB PD haven't been fully understood, due to having fewer reproductions
> as well as having reverted this patch for those configurations.
Paragraphs should be separated with empty lines and started without spaces
as indent.
This description did not answer the key question: why does
pcie_lbms_seen() return true in these cases, which is required for 2.5GT/s
to be set for the bridge? Is it a stale indication? Would LBMS get
cleared, but the quirk runs too soon to see that?
Is this mainly related to some artificial test that rapidly fires one event
after another (which is known to confuse the quirk)? ...I mean, you say
"extremely likely".
I suppose when the problem occurs and the bridge remains at 2.5GT/s, is it
possible to restore the higher speed using the pcie_cooling device
associated with the bridge / bwctrl? You can find the correct cooling
device with this:
grep -H . /sys/class/thermal/cooling_device*/type | grep PCIe_
...and then write to cur_state.
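For example (the device number here is hypothetical -- use whichever entry
the grep reports; cooling state 0 means no speed restriction):
echo 0 > /sys/class/thermal/cooling_device3/cur_state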
> Signed-off-by: Matthew W Carlis <mattc@purestorage.com>
> ---
> drivers/pci/quirks.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index d7f4ee634263..39bb0c025119 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -100,6 +100,8 @@ int pcie_failed_link_retrain(struct pci_dev *dev)
> };
> u16 lnksta, lnkctl2;
> int ret = -ENOTTY;
As per the coding style, please add an empty line after the local
variables.
> + if (!pci_match_id(ids, dev))
> + return ret;
--
i.
* [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Matthew W Carlis @ 2025-07-03 23:53 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: ashishk, bhelgaas, linux-pci, macro, mattc, msaggi, sconnor
On Thu, 3 Jul 2025, Ilpo Järvinen wrote:
> Is this mainly related to some artificial test that rapidly fires event
> after another (which is known to confuse the quirk)? ...I mean, you say
> "extremely likely".
I wouldn't describe the test as one that "rapidly fires" events, because we have
given conservative delays between injections (waiting for DLLA & being able to
perform IO to the nvme block device before potentially injecting again). In any
case the testing results are clearly worse when moving from a kernel that didn't
have the quirk to a kernel that does, which is a regression in my mind.
> I suppose when the problem occurs and the bridge remains at 2.5GT/s, is it
> possible to restore the higher speed using the pcie_cooling device
> associated with the bridge / bwctrl? You can find the correct cooling
> device with this:
Yes, the problem is when a device is forced to 2.5GT/s and it should not have
been. I did not test with the patches for CONFIG_PCIE_THERMAL because our drives
would not need thermal management by the kernel, but if I use "setpci" to
restore TLS & then write the link retrain bit, the link would arrive at the
maximum speed (Gen3/Gen4/Gen5 depending).
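For reference, the manual recovery I mean looks roughly like this (the BDF
and the target speed value are illustrative only):
# LNKCTL2: restore Target Link Speed (0x4 = 16GT/s here)
setpci -s 51:00.0 CAP_EXP+0x30.w=0x4:0xf
# LNKCTL: set the Retrain Link bit
setpci -s 51:00.0 CAP_EXP+0x10.w=0x20:0x20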
I have other vendor drives as well, but we design and build our own drives
with our own firmware & therefore are able to determine from firmware logging
in the drive when the link was most likely guided to 2.5GT/s by TLS. We are
also able to see the 2.5GT/s value in the TLS register when it happens. I have
less visibility into drives from other vendors in terms of ltssm transitions
without hooking up an analyzer.
* Re: [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Ilpo Järvinen @ 2025-07-04 10:20 UTC (permalink / raw)
To: Matthew W Carlis; +Cc: ashishk, bhelgaas, linux-pci, macro, msaggi, sconnor
On Thu, 3 Jul 2025, Matthew W Carlis wrote:
> On Thu, 3 Jul 2025, Ilpo Järvinen wrote:
> > Is this mainly related to some artificial test that rapidly fires event
> > after another (which is known to confuse the quirk)? ...I mean, you say
> > "extremely likely".
>
> I wouldn't describe the test as "rapidly fires" of events because we have given
> conservative delays between injections (waiting for DLLA & being able to perform
> IO to the nvme block device before potentially injecting again).
Okay, I asked this because I saw one other test which did hotplug add &
remove on millisecond timescales, which was way too fast for the hotplug
driver to keep up with (and thus it couldn't reset LBMS often enough).
The other question still stands though: why is LBMS not reset? Perhaps
DPC should clear LBMS in some places (that is, call pcie_reset_lbms()).
Have you considered that?
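Roughly something along these lines -- an untested sketch; the placement
and the dpc_do_reset_link() helper here are assumptions standing in for
the existing recovery logic in dpc.c:
static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
{
	pci_ers_result_t ret = dpc_do_reset_link(pdev);	/* hypothetical helper */

	/*
	 * Clear LNKSTA.LBMS (and bwctrl's bookkeeping) so the Target Speed
	 * quirk cannot act on an assertion left over from the containment.
	 */
	pcie_reset_lbms(pdev);
	return ret;
}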
(It sounds to me like you're having this occur in multiple scenarios, and I'm
having some trouble figuring out from your long descriptions what those
exactly are, so it's a bit challenging for me to suggest where it should be
done. But the surprise down case certainly seems like one where the LBMS
information must have become stale, so it should be reset, which would
prevent the quirk from setting 2.5GT/s.)
> In any case
> the testing results are clearly worse when moving from a kernel that didn't
> have the quirk to a kernel that does which is a regression in my mind.
>
> > I suppose when the problem occurs and the bridge remains at 2.5GT/s, is it
> > possible to restore the higher speed using the pcie_cooling device
> > associated with the bridge / bwctrl? You can find the correct cooling
> > device with this:
>
> Yes the problem is when a device is forced to 2.5GT/s and it should not have
> been. I did not test with the patches for CONFIG_PCIE_THERMAL because our drives
> would not need thermal management by the kernel,
Fine, but all it technically does is expose an interface to the bwctrl set
speed API; the pcie_thermal driver itself is agnostic to whether userspace
uses that for thermal management or some other purpose. There's no kernel-side
thermal management in that driver. It was just more natural to expose it
there than inside the PCI subsystem.
> but if I use "setpci" to
> restore TLS & then write the link retrain bit the link would arrive at the
> maximum speed (Gen3/Gen4/Gen5 depending).
>
> I have other vendor drives as well, but we design and build our own drives
> with our own firmware & therefore are able to determine from firmware logging
> in the drive when the link was most likely guided to 2.5GT/s by TLS. We are
> also able to see the 2.5GT/s value in the TLS register when it happens. I have
> less visibility into drives from other vendors in terms of ltssm transitions
> without hooking up an analyzer.
>
--
i.
* [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Matthew W Carlis @ 2025-07-08 22:49 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: ashishk, bhelgaas, linux-pci, macro, mattc, msaggi, sconnor
On Fri, 4 Jul 2025, Ilpo Järvinen wrote:
> The other question still stands though, why is LBMS is not reset? Perhaps
> DPC should clear LBMS in some places (that is, call pcie_reset_lbms()).
> Have you consider that?
Initially we started to observe this when physically removing and
reinserting devices in a kernel version with the quirk, but without the bandwidth
controller driver. I think there is a problem with any place where the link
would be expected to go down (dpc, hpc, etc) & then carrying forward LBMS
into the next time the link comes up. Should it not matter how long ago LBMS
was asserted before we invoke a TLS modification? It also looks like card
presence is enough for the kernel to believe the link should train & enter
the quirk function without ever having seen LNKSTA_DLLLA or LNKSTA_LT. I
wonder if it shouldn't have to see some kind of actual link activity as a
prereq to entering the quirk.
> (It sounds to me like you're having this occur in multiple scenarios, and I'm
> having some trouble figuring out from your long descriptions what those
> exactly are, so it's a bit challenging for me to suggest where it should be
> done. But the surprise down case certainly seems like one where the LBMS
> information must have become stale, so it should be reset, which would
> prevent the quirk from setting 2.5GT/s.)
Something I found recently that was interesting - when I power off
a slot (triggering DPC via SDES) the LBMS becomes set on Intel Root Ports,
but in another server with a PCIe switch LBMS does not become set on the
switch DSP if I perform the same action. I don't have any explanation for
this difference other than "vendor specific" behavior.
One thing that honestly doesn't make any sense to me is the ID list in the
quirk. If the link comes up after forcing to Gen1 then it would only restore
TLS if the device is the ASMedia switch, but also ignoring what device is
detected downstream. If we allow ASMedia to restore the speed for any downstream
device when we only saw the initial issue with the Pericom switch then why
do we exclude Intel Root Ports or AMD Root Ports or any other bridge from the
list which did not have any issues reported.
* Re: [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Ilpo Järvinen @ 2025-07-09 9:45 UTC (permalink / raw)
To: Matthew W Carlis; +Cc: ashishk, bhelgaas, linux-pci, macro, msaggi, sconnor
On Tue, 8 Jul 2025, Matthew W Carlis wrote:
> On Fri, 4 Jul 2025, Ilpo Järvinen wrote:
> > The other question still stands though, why is LBMS is not reset? Perhaps
> > DPC should clear LBMS in some places (that is, call pcie_reset_lbms()).
> > Have you consider that?
>
> Initially we started to observe this when physically removing and
> reinserting devices in a kernel version with the quirk, but without the bandwidth
> controller driver. I think there is a problem with any place where the link
> would be expected to go down (dpc, hpc, etc) & then carrying forward LBMS
> into the next time the link comes up.
Are you saying there's still a problem in hpc? Since the introduction of
bwctrl, remove_board() in pciehp has had pcie_reset_lbms() (or its
equivalent).
As I already mentioned, for DPC I agree, it likely should reset LBMS
somewhere.
We also clear LBMS after retraining so as not to retain it beyond the
completion of the retraining.
What other things are included in that "etc"?
> Should it not matter how long ago LBMS
> was asserted before we invoke a TLS modification?
To some extent, yes, which is why we call pcie_reset_lbms() in a few
places.
> It also looks like card
> presence is enough for the kernel to believe the link should train & enter
> the quirk function without ever having seen LNKSTA_DLLLA or LNKSTA_LT.
Without LBMS that won't do anything in the quirk (except try to raise the
Link Speed if it's the particular device on the whitelist).
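For reference, the quirk's decision boils down to roughly this -- a
paraphrase rather than the exact quirks.c code, with "lbms_seen" standing
in for its LBMS check:
	if (lbms_seen && !(lnksta & PCI_EXP_LNKSTA_DLLLA)) {
		/* link stuck retraining: clamp TLS to 2.5GT/s and retrain */
	}
	if ((lnksta & PCI_EXP_LNKSTA_DLLLA) &&
	    (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
	    pci_match_id(ids, dev)) {
		/* link up but still clamped on a listed device: unclamp */
	}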
> I wonder if it shouldn't have to see some kind of actual link activity
> as a prereq to entering the quirk.
How would you observe that "link activity"? Doesn't LBMS itself imply
"link activity" occurred?
Any good suggestions how to realize that check more precisely to
differentiate if there was some link activity or not?
> > (It sounds to me like you're having this occur in multiple scenarios, and I'm
> > having some trouble figuring out from your long descriptions what those
> > exactly are, so it's a bit challenging for me to suggest where it should be
> > done. But the surprise down case certainly seems like one where the LBMS
> > information must have become stale, so it should be reset, which would
> > prevent the quirk from setting 2.5GT/s.)
>
> Something I found recently that was interesting - when I power off
> a slot (triggering DPC via SDES) the LBMS becomes set on Intel Root Ports,
> but in another server with a PCIe switch LBMS does not become set on the
> switch DSP if I perform the same action. I don't have any explanation for
> this difference other than "vendor specific" behavior.
If you'd try this on different generations of Intel RP, you'd likely see
variations there too; that's my experience from testing bwctrl.
E.g., on some platforms, I see LBMS asserted twice from a single retraining
(after a TLS change): once while still having LT=1 and again after LT=0.
(I don't have an explanation for that behavior.)
> One thing that honestly doesn't make any sense to me is the ID list in the
> quirk. If the link comes up after forcing to Gen1 then it would only restore
> TLS if the device is the ASMedia switch, but also ignoring what device is
> detected downstream. If we allow ASMedia to restore the speed for any downstream
> device when we only saw the initial issue with the Pericom switch then why
> do we exclude Intel Root Ports or AMD Root Ports or any other bridge from the
> list which did not have any issues reported.
I think it's because the restore has been tested on that device
(whitelist).
Your reasoning is based on the assumption that the TLS quirk setting the Link
Speed to 2.5GT/s is part of "normal" operation. My view is that those
triggerings are caused by not clearing stale LBMS in the right places. If
LBMS is not wrongly kept, the quirk is a no-op on all but that ID-listed
device.
--
i.
* [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Matthew W Carlis @ 2025-07-09 18:52 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: ashishk, bhelgaas, linux-pci, macro, mattc, msaggi, sconnor
On Wed, 9 Jul 2025, Ilpo Järvinen wrote:
> Are you saying there's still a problem in hpc? Since the introduction of
> bwctrl, remove_board() in pciehp has had pcie_reset_lbms() (or it's
> equivalent).
I think my concern with hpc or the current mechanism in general is that the
condition is basically binary. Across a large fleet I expect to see momentary
issues. For example a device might start to link up, have an issue & then
try to link up again and from there be working correctly. However if that
were to trigger an LBMS it might result in the quirk forcing the link to Gen1.
For example if the quirk first guided the link to Gen1 & then if the device
linked up at Gen1 it tried to guide it to Gen2 & then if it linked up at Gen2
it continued towards the maximum speed falling back down when it found the
device not able to achieve a certain higher speed that would be more ideal.
Or perhaps starting at the second highest speed & working its way down.
It's quite a large fall in performance for a device to go from Gen4/5 to Gen1,
whereas the ASMedia/Pericom combination was only capable of Gen2 as a pair.
If the SI is marginal for Gen4/5 I would tend to think the device has a fairly
high chance of being able to run at the next lower speed for example.
Actually I also wonder, in the case of the ASMedia & Pericom combination,
would we just see an LBMS interrupt every time the device loops between
speeds? Maybe the quirk should have been invoked by bwctrl.c when a certain
rate of LBMS assertions is detected instead? Is it better to give a device
a few chances or to catch it right away on the first issue? (some value
judgements here)
> As I already mentioned, for DPC I agree, it likely should reset LBMS
> somewhere.
...
> If you'd try this on different generations of Intel RP, you'd likely see
> variations there too, that's my experience when testing bwctrl.
Yes, agreed about DPC, especially given that there are likely vendor/device-specific
variations in assertions of the bit. There is another patch that came into the
DPC driver which suppresses surprise down error reporting which I would like to
challenge/remove. My feeling is that the DPC driver should clear LBMS in all cases
before clearing DPC status.
>> Should it not matter how long ago LBMS
>> was asserted before we invoke a TLS modification?
>
> To some extent, yes, which is why we call pcie_reset_lbms() in a few
> places.
Maybe there should even be a config option or sysfs file to disable the quirk
because it kind of takes control away from users in some ways, i.e. it doesn't
obviously interact well with callers of setpci etc.
>> I wonder if it shouldn't have to see some kind of actual link activity
>> as a prereq to entering the quirk.
>
> How would you observe that "link activity"? Doesn't LBMS itself imply
> "link activity" occurred?
I was thinking literally not entering the quirk function unless the kernel
had witnessed LNKSTA_DLLLA or LNKSTA_LT in the last second.
Does this preclude us from declaring a device as "broken" as done by the quirk
without having seen DLLA within 1s after DLLSC Event?
* PCI Express Base Revision - 6.7.3.3 Data Link Layer State Changed Events
"Software must allow 1 second after the Data Link Layer Link Active bit reads 1b
before it is permitted to determine that a hot plugged device which fails to return
a Successful Completion for a Valid Configuration Request is a broken device."
> > One thing that honestly doesn't make any sense to me is the ID list in the
> > quirk. If the link comes up after forcing to Gen1 then it would only restore
> > TLS if the device is the ASMedia switch, but also ignoring what device is
> > detected downstream. If we allow ASMedia to restore the speed for any downstream
> > device when we only saw the initial issue with the Pericom switch then why
> > do we exclude Intel Root Ports or AMD Root Ports or any other bridge from the
> > list which did not have any issues reported.
>
> I think it's because the restore has been tested on that device
> (whitelist).
>
> Your reasoning is based on assumption that TLS quirk setting Link Speed
> to 2.5GT/s is part of "normal" operation. My view is that those
> triggerings are caused by not clearing stale LBMS in the right places. If
> LBMS is not wrongly kept, the quirk is no-op on all but that ID listed
> device.
I'm making a slightly different assumption which is "something is working
until proven otherwise". We only know that the restore works on the ASMedia
when the downstream device is the Pericom switch. In fact we only know
it works for very specific layout & configuration of these two devices.
It seems wrong in my mind to be more restrictive on devices that don't have
a reported issue, but then be less restrictive on the devices that had an
out-of-spec interaction in the first place. Until reported, we don't know
how many devices might see LBMS get set during the course of linking up, but
then still arrive at the maximum speed.
* [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Matthew W Carlis @ 2025-07-09 20:27 UTC (permalink / raw)
To: mattc; +Cc: ashishk, bhelgaas, ilpo.jarvinen, linux-pci, macro, msaggi,
sconnor
Changing the subject a little here... Minimally, we should do something
along the lines of this patch in kernel releases that do not
have bwctrl.c but do have the quirk. I'm happy to continue
discussing what to do in the presence of bwctrl.c, but the behavior
of the quirk with hot-plug is utterly broken without bwctrl.c,
based on my testing.
* Re: [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Ilpo Järvinen @ 2025-07-11 13:46 UTC (permalink / raw)
To: Matthew W Carlis; +Cc: ashishk, bhelgaas, linux-pci, macro, msaggi, sconnor
On Wed, 9 Jul 2025, Matthew W Carlis wrote:
> On Wed, 9 Jul 2025, Ilpo Järvinen wrote:
> > Are you saying there's still a problem in hpc? Since the introduction of
> > bwctrl, remove_board() in pciehp has had pcie_reset_lbms() (or it's
> > equivalent).
>
> I think my concern with hpc or the current mechanism in general is that the
> condition is basically binary. Across a large fleet I expect to see momentary
> issues. For example a device might start to link up, have an issue & then
> try to link up again and from there be working correctly. However if that
> were to trigger an LBMS it might result in the quirk forcing the link to Gen1.
>
> For example if the quirk first guided the link to Gen1 & then if the device
> linked up at Gen1 it tried to guide it to Gen2 & then if it linked up at Gen2
> it continued towards the maximum speed falling back down when it found the
> device not able to achieve a certain higher speed that would be more ideal.
> Or perhaps starting at the second highest speed & working its way down.
> Its quite a large fall in performance for a device to go from Gen4/5 to Gen1
> whereas the ASMedia/Pericom combination was only capable of Gen2 as a pair.
> If the SI is marginal for Gen4/5 I would tend to think the device has a fairly
> high chance of being able to run at the next lower speed for example.
This is possible, but it also comes at a non-trivial latency cost; Link
Retraining is not very cheap.
Here, you seem to be suggesting the TLS quirk might be useful for other
devices too besides the one on the ID list. Is that the case? (I'm asking
this because it contradicts the patch you're submitting.)
I don't know if other speeds are generally useful; intuition tells me they
might be. However, I've no way to gather numbers as I don't have the luxury
of a large fleet of machines with PCIe devices to observe/measure. Perhaps
you have some insight into this beyond just hypothesizing?
> Actually I also wonder in the case of the ASMedia & Pericom combination
> would we just see a LBMS interrupt every time the device loop between
> speeds? Maybe the quirk should have been invoked by bwctrl.c when a certain
> rate of LBMS assertions is detected instead? Is it better to give a device
> a few chances or to catch it right away on the first issue? (some value
> judgements here)
I investigated this as it came up while bwctrl was under review and found
various issues and challenges related to the quirk. The main problem
is that bwctrl being a portdrv service means it probes quite late, so it's
not available very early, and the quirk runs mainly during that time.
It might be possible to delay bringing up a faulty device to
work around that; however, that's not the end of the challenges.
One would need to build a state machine to make such decisions, as we don't
want to keep repeating it if the link is just broken. I lacked a way to
test this in a meaningful way so I just gave up and left it as future work.
But yes, it might be a workable solution that nobody has written yet. If you
want to implement this, I'm certainly not against it. (I might even
consider writing one myself, but that certainly isn't going to be a high
priority item for me, and the current level of detail is not concrete enough
to be realized at the code level.)
> > As I already mentioned, for DPC I agree, it likely should reset LBMS
> > somewhere.
> ...
> > If you'd try this on different generations of Intel RP, you'd likely see
> > variations there too, that's my experience when testing bwctrl.
>
> Yes agree about DPC especially given that there is likely vendor/device specific
> variations in assertions of the bit. There is another patch that came into the
> DPC driver which suppresses surprise down error reporting which I would like to
> challenge/remove. My feeling is that the DPC driver should clear LBMS in all cases
> before clearing DPC status.
I suggest you make a patch to that effect.
> >> Should it not matter how long ago LBMS
> >> was asserted before we invoke a TLS modification?
> >
> > To some extent, yes, which is why we call pcie_reset_lbms() in a few
> > places.
>
> Maybe there should even be a config or sysfs file to disable the quirk because
> it kind of takes control away from users in some ways. i.e - doesn't obviously
> interact well with callers of setpci etc.
There's PCI_QUIRKS, but that's probably not fine-grained enough to be
useful in practice as it takes away all quirks.
> >> I wonder if it shouldn't have to see some kind of actual link activity
> >> as a prereq to entering the quirk.
> >
> > How would you observe that "link activity"? Doesn't LBMS itself imply
> > "link activity" occurred?
>
> I was thinking literally not entering the quirk function unless the kernel
> had witnessed LNKSTA_DLLLA or LNKSTA_LT in the last second.
How can we track that condition? There's nothing that tracks DLLLA or LT,
and we can't get an interrupt out of them either (AFAIK). So while it is
perhaps nice on a conceptual level, it would require polling those bits,
which doesn't look reasonable from an implementation point of view.
Also, I'm not convinced it would help your cases where you have
short-term, intermittent failures during bring-up.
> Does this preclude us from declaring a device as "broken" as done by the quirk
> without having seen DLLA within 1s after DLLSC Event?
> * PCI Express Base Revision - 6.7.3.3 Data Link Layer State Changed Events
> "Software must allow 1 second after the Data Link Layer Link Active bit reads 1b
> before it is permitted to determine that a hot plugged device which fails to return
> a Successful Completion for a Valid Configuration Request is a broken device."
If you think there is a problem related to spec compliance here (there
well might be), patches are definitely welcome. I'm not sure from where
the quirk is called in this scenario and where/how the quirk logic
invocation can be delayed (unfortunately I won't have time to look at it any
time soon either).
> > > One thing that honestly doesn't make any sense to me is the ID list in the
> > > quirk. If the link comes up after forcing to Gen1 then it would only restore
> > > TLS if the device is the ASMedia switch, but also ignoring what device is
> > > detected downstream. If we allow ASMedia to restore the speed for any downstream
> > > device when we only saw the initial issue with the Pericom switch then why
> > > do we exclude Intel Root Ports or AMD Root Ports or any other bridge from the
> > > list which did not have any issues reported.
> >
> > I think it's because the restore has been tested on that device
> > (whitelist).
> >
> > Your reasoning is based on assumption that TLS quirk setting Link Speed
> > to 2.5GT/s is part of "normal" operation. My view is that those
> > triggerings are caused by not clearing stale LBMS in the right places. If
> > LBMS is not wrongly kept, the quirk is no-op on all but that ID listed
> > device.
>
> I'm making a slightly different assumption which is "something is working
> until proven otherwise". We only know that the restore works on the ASMedia
> when the downstream device is the Pericom switch. In fact we only know
> it works for very specific layout & configuration of these two devices.
> It seems wrong in my mind to be more restrictive on devices that don't have
> a reported issue from, but then be less restrictive on the devices that had an
> out of spec interaction in the first place. Until reported we don't know
> how many devices might see LBMS get set during the course of linking up, but
> then still arrive at the maximum speed.
I wonder if you could give my bwctrl tracing patch (below) a spin in some
case where such a problem shows up, as it could show what DLLLA/LT are
when the LNKSTA register is read from bwctrl's irq handler. I'm planning to
submit it eventually, but the placement of the tracing code has not been
agreed yet with the other person submitting hotplug tracing.
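Once applied, the new event can be enabled and watched like any other
tracepoint (assuming tracefs is mounted at /sys/kernel/tracing):
echo 1 > /sys/kernel/tracing/events/pci/pci_link_event/enable
cat /sys/kernel/tracing/trace_pipe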
--
From e5d7bc850028a82823c2cbb822c3ba5edaa623c1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ilpo=20J=C3=A4rvinen?= <ilpo.jarvinen@linux.intel.com>
Date: Mon, 9 Jun 2025 20:29:29 +0300
Subject: [PATCH 1/1] PCI/bwctrl: Add trace event to BW notifications
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Frequent changes in the Link Speed or Width may indicate a PCIe
link-layer problem. The PCIe BW controller listens to BW notifications, i.e.,
whenever LBMS (Link Bandwidth Management Status) and/or LABS (Link
Autonomous Bandwidth Status) is asserted to indicate the Link Speed
and/or Width was changed (PCIe spec. r6.2, sec. 7.5.3.7 & 7.5.3.8).
To help troubleshoot link-related problems, add a trace event for LBMS
and LABS assertions.
I was (privately) asked to expose LBMS count for this purpose while
bwctrl was under review. Lukas Wunner suggested, however, to use
traceevent instead to expose finer-grained details of the LBMS
assertions (namely, the timing of the assertions and link status
details).
Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
drivers/pci/Makefile | 3 ++-
drivers/pci/pci-trace.c | 9 +++++++
drivers/pci/pci.h | 1 -
drivers/pci/pcie/bwctrl.c | 13 ++++++++++
include/linux/pci.h | 1 +
include/trace/events/pci.h | 49 ++++++++++++++++++++++++++++++++++++++
6 files changed, 74 insertions(+), 2 deletions(-)
create mode 100644 drivers/pci/pci-trace.c
create mode 100644 include/trace/events/pci.h
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 67647f1880fb..49bd51b995cd 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -5,7 +5,8 @@
obj-$(CONFIG_PCI) += access.o bus.o probe.o host-bridge.o \
remove.o pci.o pci-driver.o search.o \
rom.o setup-res.o irq.o vpd.o \
- setup-bus.o vc.o mmap.o devres.o
+ setup-bus.o vc.o mmap.o devres.o \
+ pci-trace.o
obj-$(CONFIG_PCI) += msi/
obj-$(CONFIG_PCI) += pcie/
diff --git a/drivers/pci/pci-trace.c b/drivers/pci/pci-trace.c
new file mode 100644
index 000000000000..99af6466447f
--- /dev/null
+++ b/drivers/pci/pci-trace.c
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI trace functions
+ *
+ * Copyright (C) 2025 Intel Corporation
+ */
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/pci.h>
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 12215ee72afb..8f1fffcda364 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -452,7 +452,6 @@ static inline int pcie_dev_speed_mbps(enum pci_bus_speed speed)
}
u8 pcie_get_supported_speeds(struct pci_dev *dev);
-const char *pci_speed_string(enum pci_bus_speed speed);
void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
void pcie_report_downtraining(struct pci_dev *dev);
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index 36f939f23d34..7fb4e00f1e7a 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -20,6 +20,7 @@
#define dev_fmt(fmt) "bwctrl: " fmt
#include <linux/atomic.h>
+#include <linux/bitfield.h>
#include <linux/bitops.h>
#include <linux/bits.h>
#include <linux/cleanup.h>
@@ -32,6 +33,8 @@
#include <linux/slab.h>
#include <linux/types.h>
+#include <trace/events/pci.h>
+
#include "../pci.h"
#include "portdrv.h"
@@ -208,6 +211,11 @@ static void pcie_bwnotif_disable(struct pci_dev *port)
PCI_EXP_LNKCTL_LBMIE | PCI_EXP_LNKCTL_LABIE);
}
+#define PCI_EXP_LNKSTA_LINK_STATUS_MASK (PCI_EXP_LNKSTA_LBMS | \
+ PCI_EXP_LNKSTA_LABS | \
+ PCI_EXP_LNKSTA_LT | \
+ PCI_EXP_LNKSTA_DLLLA)
+
static irqreturn_t pcie_bwnotif_irq(int irq, void *context)
{
struct pcie_device *srv = context;
@@ -236,6 +244,11 @@ static irqreturn_t pcie_bwnotif_irq(int irq, void *context)
*/
pcie_update_link_speed(port->subordinate);
+ trace_pci_link_event(port,
+ link_status & PCI_EXP_LNKSTA_LINK_STATUS_MASK,
+ pcie_link_speed[link_status & PCI_EXP_LNKSTA_CLS],
+ FIELD_GET(PCI_EXP_LNKSTA_NLW, link_status));
+
return IRQ_HANDLED;
}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 05e68f35f392..8346121c035d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -305,6 +305,7 @@ enum pci_bus_speed {
PCI_SPEED_UNKNOWN = 0xff,
};
+const char *pci_speed_string(enum pci_bus_speed speed);
enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
diff --git a/include/trace/events/pci.h b/include/trace/events/pci.h
new file mode 100644
index 000000000000..c7187022cba5
--- /dev/null
+++ b/include/trace/events/pci.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2025 Intel Corporation
+ */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pci
+
+#if !defined(_TRACE_PCI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PCI_H
+
+#include <linux/pci.h>
+#include <linux/tracepoint.h>
+
+#define LNKSTA_FLAGS \
+ { PCI_EXP_LNKSTA_LT, "LT"}, \
+ { PCI_EXP_LNKSTA_DLLLA, "DLLLA"}, \
+ { PCI_EXP_LNKSTA_LBMS, "LBMS"}, \
+ { PCI_EXP_LNKSTA_LABS, "LABS"}
+
+TRACE_EVENT(pci_link_event,
+ TP_PROTO(struct pci_dev *dev, u16 link_status,
+ enum pci_bus_speed link_speed, u8 link_width),
+ TP_ARGS(dev, link_status, link_speed, link_width),
+
+ TP_STRUCT__entry(
+ __string( name, pci_name(dev))
+ __field( u16, link_status)
+ __field( enum pci_bus_speed, link_speed)
+ __field( u8, link_width)
+ ),
+
+ TP_fast_assign(
+ __assign_str(name);
+ __entry->link_status = link_status;
+ __entry->link_speed = link_speed;
+ __entry->link_width = link_width;
+ ),
+
+ TP_printk("%s %s x%u st=%s",
+ __get_str(name), pci_speed_string(__entry->link_speed),
+ __entry->link_width,
+ __print_flags((unsigned long)__entry->link_status, "|",
+ LNKSTA_FLAGS))
+);
+
+#endif /* _TRACE_PCI_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
--
2.39.5
* Re: [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Maciej W. Rozycki @ 2025-07-16 13:01 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: Matthew W Carlis, ashishk, Bjorn Helgaas, linux-pci, msaggi,
sconnor
On Wed, 9 Jul 2025, Ilpo Järvinen wrote:
> > I wonder if it shouldn't have to see some kind of actual link activity
> > as a prereq to entering the quirk.
>
> How would you observe that "link activity"? Doesn't LBMS itself imply
> "link activity" occurred?
It does, although in this case it shouldn't have been set in the first
place, because after reset the link never comes up (i.e. goes into the
Link Active state) and only keeps flipping between training and not
training, as indicated by the LT bit. FAOD with the affected link the
LBMS bit doesn't ever retrigger once cleared while the link is in its
broken state.
Once the speed has been clamped and link retrained it goes up right away
(i.e. into the Link Active state) and remains steady up, also once the
speed has been unclamped.
I made a test once and left the system up for half a year or so. The
LBMS bit was set once, a couple of days after system reset. I cleared it
by hand and it never retriggered for the rest of the experiment, so this
single occasion must have been a glitch and not a link quality issue.
During that half a year the system and the link in question were both
used heavily in remote GNU toolchain verification over a network interface
placed downstream the problematic link. Traffic included NFS and SSH.
No issues ever triggered, so I must conclude the link training issue is
specific to speed negotiation, likely at the protocol level, rather than
at the physical layer.
Last year I tried to make an alternative setup using a PCIe switch option
card using the same ASMedia device. The card has turned out not to work
at all (the switch reporting in the configuration space, but all the
downstream links permanently down) owing to the host leaving the Vaux
line disconnected in the slot, which is a conforming configuration. I was
told by the option card manufacturer this is an erratum in the ASMedia
switch device and the workaround is to drive Vaux. I think this just
tells you what the quality of these devices is. Sigh.
Anyway, I chose to rework the card and tracked down a suitable miniature
SMD switch to mount onto the PCB so as to let me select whether to drive
ASMedia device's Vaux input from the Vaux or a regular 3.3V slot position,
but owing to other commitments I've never got to completing this effort,
as it requires a couple of hours of precise manual work at the workshop.
I'll get back to it sometime and report the results.
> Any good suggestions how to realize that check more precisely to
> differentiate if there was some link activity or not?
The LT bit is an obvious candidate and also how I wrote a corresponding
quirk in U-boot. A problem however is that while in U-boot it's fine to poll
the LT bit busy-looping for a second or so, it's absolutely not in Linux,
where we have the rest of the OS running. Sampling at random intervals
isn't going to help as we could well miss the active state.
FWIW it's all documented with the description of the quirk.
> > One thing that honestly doesn't make any sense to me is the ID list in the
> > quirk. If the link comes up after forcing to Gen1 then it would only restore
> > TLS if the device is the ASMedia switch, but also ignoring what device is
> > detected downstream. If we allow ASMedia to restore the speed for any downstream
> > device when we only saw the initial issue with the Pericom switch then why
> > do we exclude Intel Root Ports or AMD Root Ports or any other bridge from the
> > list which did not have any issues reported.
>
> I think it's because the restore has been tested on that device
> (whitelist).
Correct, the idea has been to err on the side of caution. The ASMedia
device seems to cope well with this unclamping, so it's been listed, and
so should any other device that has been confirmed to work.
Matching the downstream and the upstream device both at a time instead,
once this quirk has triggered and succeeded, seems to make no sense: if
the device downstream turns out affected, then it matches the behaviour
observed, so it should be enough to have the upstream device checked. I
did want to run it at full speed anyway.
OTOH matching the downstream device likely makes sense if the quirk has
been bypassed, such as when the link speed had been already clamped by the
firmware. In this case we do not really know if the clamping has been
triggered by this erratum or something else, so such a check would be
justified. I don't think it's going to matter for the problems discussed
though.
Apologies for the irregular replies, lots on my head right now and I had
to write this all down properly.
Maciej
* [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Matthew W Carlis @ 2025-07-23 19:13 UTC (permalink / raw)
To: macro; +Cc: ashishk, bhelgaas, ilpo.jarvinen, linux-pci, mattc, msaggi,
sconnor
On Wed, 16 Jul 2025, Maciej W. Rozycki wrote:
> I made a test once and left the system up for half a year or so. The
> LBMS bit was set once, a couple of days after system reset. I cleared it
> by hand and it never retriggered for the rest of the experiment, so this
> single occasion must have been a glitch and not a link quality issue.
I guess you also did not observe an unusual number of correctable errors? I wonder
if AER was under OS control & the relevant errors were unmasked (RxErr, BadTLP,
BadDLLP, Replay Rollover, Replay Timeout). If the link transitions from L0 to
the Recovery state due to excessive LCRC failures, I have seen it return at the same
speed many times. I won't be able to say what the LBMS behavior is in that case
for some time unless I get lucky & find one in our internal test pool.
* Re: [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Maciej W. Rozycki @ 2025-08-01 16:04 UTC (permalink / raw)
To: Matthew W Carlis
Cc: ashishk, Bjorn Helgaas, Ilpo Järvinen, linux-pci, msaggi,
sconnor
On Wed, 23 Jul 2025, Matthew W Carlis wrote:
> > I made a test once and left the system up for half a year or so. The
> > LBMS bit was set once, a couple of days after system reset. I cleared it
> > by hand and it never retriggered for the rest of the experiment, so this
> > single occasion must have been a glitch and not a link quality issue.
>
> I guess you also did not observe unusual number of correctable errors? I wonder
> if AER was under OS control & the relevant errors were unmasked (RxErr, BadTLP,
> BadDLLP, Replay Rollover, Replay Timeout). If the link transitions from L0 to
> Recovery state due to excessive LCRC failures I have seen it return at the same
> speed many times. I won't be able to say what the LBMS behavior is in that case
> for some time unless I get lucky & find one in our internal test pool.
The system has been up for a while now:
$ uptime
17:06:25 up 44 days, 15:43, 2 users, load average: 0.00, 0.00, 0.00
$
but how would I gather such error information? All I can see is:
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
and:
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP+ BadDLLP+ Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
with the upstream device and somewhat different:
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
and:
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
with the downstream device. In particular I find the all-zeros header log
peculiar, especially given that nonzero contents are reported across other
ports, but it's the only port in the system to report BadTLP+ or BadDLLP+.
I've cleared the two status bits (and BWMgmt+) by hand and BadDLLP+ has
retriggered after a couple of seconds (and then again), so your reasoning
seems to go in the right direction. I'll try to experiment some more and
report back, but it'll take a while as I'm away from all equipment but my
laptop right now. I'd like to understand which exact pieces of hardware
cause this problem. I'll also see if I can chase another make of a PCIe
switch for the upstream device.
Also it makes me feel the workaround can indeed be of general use and it
is just speed unclamping that needs to be robust enough not to interfere
with good equipment. In a perfect world we'd only have good hardware, but
in the world we have instead I feel like doing the best we can rather than
giving up.
Thanks for your input!
Maciej
* [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824
From: Matthew W Carlis @ 2025-08-15 0:35 UTC (permalink / raw)
To: mattc; +Cc: ashishk, bhelgaas, ilpo.jarvinen, linux-pci, macro, msaggi,
sconnor
Sorry for the delayed response here.
On Fri, 1 Aug 2025, Maciej W. Rozycki wrote:
> CESta: RxErr- BadTLP+ BadDLLP+ Rollover- Timeout- AdvNonFatalErr-
The information you sent is somewhat incomplete. I guess you probably won't be
able to get any of the LTSSM state information unless one of the devices has an
ltssm log you can dump, but I doubt either of them do.
When I see that BadTLP and BadDLLP are still set it makes me suspect that
the hierarchy isn't configured correctly in order for those errors to go
to the root port. Or perhaps they're just being reported to the BIOS &
ignored or not cleared.
> but how would I gather such error information?
Let's try to figure out what is in control of AER & how/whether the hierarchy
is configured to send errors all the way to the root port. First we have to look
at the "_OSC" related kernel logging & the adjacent root port. In the example here
from an Intel system we can see the OS took control over AER (and other things)
from the BIOS. We can infer this was for the Bus 4f root port since it's logged
just after, afaik. The negotiation happens on a per-root-port basis, so we need
to make sure it's the root port in the hierarchy of the devices we're interested
in. I've seen some BIOSes retain AER control over PCIe ports on the PCH.
Example dmesg from during boot:
acpi PNP0A08:04: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
acpi PNP0A08:04: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
acpi PNP0A08:04: FADT indicates ASPM is unsupported, using BIOS configuration
PCI host bridge to bus 0000:4f
We would want to look at lspci for the root port, the ASMedia USP, the ASMedia
DSP and the USP of the Pericom switch (when able). I don't have any nested
switch configurations, but I think I can generalize it a little. Maybe this is
a correct configuration (using BDFs from a system I have as a starting point).
+-[0000:4f]-+-00.0 Intel Corporation ...
| +-...
| +-01.0-[50-57]--+-00.0-[51-57]--+-00.0-[52-53]
RP: 4f:01.0
USP (asmedia): 50:00.0
DSP (asmedia): 51:00.0
USP (pericom): 52:00.0
The root port can tell us if PCIe errors are going to the BIOS. If any of
ErrCorrectable, ErrNon-Fatal, or ErrFatal are set in RootCtl then those
error types would most likely go to the BIOS even if the OS thinks it took
control. Someone will have to correct me if I'm wrong about ARM. If you send
the full lspci -vvv of the root port and the USP/DSP/USP combo I could figure out
what's going on.
lspci -vvv -s 4f:01.0
4f:01.0 PCI bridge: Intel Corporation Device 352a (rev 04) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
...
...
Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #25, Speed 32GT/s, Width x8, ASPM not supported
ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s, Width x8
TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
Slot #0, PowerLimit 75W; Interlock- NoCompl-
RootCap: CRSVisible+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+