linux-pci.vger.kernel.org archive mirror
* [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction
@ 2024-05-16 20:47 Smita Koralahalli
  2024-06-17 20:09 ` Bjorn Helgaas
  0 siblings, 1 reply; 11+ messages in thread
From: Smita Koralahalli @ 2024-05-16 20:47 UTC (permalink / raw)
  To: linux-pci, linux-kernel
  Cc: Bjorn Helgaas, Mahesh J Salgaonkar, Lukas Wunner, Yazen Ghannam,
	Ilpo Jarvinen, Bowman Terry, Hagan Billy, Simon Guinot,
	Maciej W . Rozycki, Kuppuswamy Sathyanarayanan

Clear Link Bandwidth Management Status (LBMS), if set, on a hot-remove
event.

A hot-remove event could result in a target link speed reduction if
LBMS is set, due to a delay in the Presence Detect State Change (PDSC)
happening after the Data Link Layer State Change (DLLSC) event.

In reality, PDSC and DLLSC events rarely arrive simultaneously. The
PDSC can sometimes be so late that the slot has already been powered
down by the DLLSC event alone. The delayed PDSC may then falsely be
interpreted as an interrupt raised to turn the slot on. This spurious
attempt to power the slot on without a link forces the kernel, if LBMS
is set, to retrain the link at a lower speed to re-establish it,
thereby bringing down the link speed [2].

According to PCIe r6.2 sec 7.5.3.8 [1], LBMS cannot be set for an
unconnected link; if set, it indicates that there is actually a device
down an inactive link. However, hardware could have already set LBMS
while the device was still connected to the port, i.e. when the state
was DL_Up or DL_Active. Some hardware may even have attempted a retrain
by going into Recovery mode just before transitioning to DL_Down.

Thus the stale LBMS bit is never cleared and might force software to
drop the link speed when there is no link [2].

Dmesg before:
	pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
	pcieport 0000:20:01.1: pciehp: Slot(59): Card present
	pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
	pcieport 0000:20:01.1: retraining failed
	pcieport 0000:20:01.1: pciehp: Slot(59): No link

Dmesg after:
	pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
	pcieport 0000:20:01.1: pciehp: Slot(59): Card present
	pcieport 0000:20:01.1: pciehp: Slot(59): No link

[1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
    https://members.pcisig.com/wg/PCI-SIG/document/20590
[2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")

Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
Link to v1:
https://lore.kernel.org/all/20240424033339.250385-1-Smita.KoralahalliChannabasappa@amd.com/

v2:
	Cleared LBMS unconditionally. (Ilpo)
	Added Fixes Tag. (Lukas)
---
 drivers/pci/hotplug/pciehp_pci.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index ad12515a4a12..dae73a8932ef 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -134,4 +134,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
 	}
 
 	pci_unlock_rescan_remove();
+
+	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
+				   PCI_EXP_LNKSTA_LBMS);
 }
-- 
2.17.1



* Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction
  2024-05-16 20:47 [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction Smita Koralahalli
@ 2024-06-17 20:09 ` Bjorn Helgaas
  2024-06-17 22:51   ` Smita Koralahalli
  0 siblings, 1 reply; 11+ messages in thread
From: Bjorn Helgaas @ 2024-06-17 20:09 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: linux-pci, linux-kernel, Mahesh J Salgaonkar, Lukas Wunner,
	Yazen Ghannam, Ilpo Jarvinen, Bowman Terry, Hagan Billy,
	Simon Guinot, Maciej W . Rozycki, Kuppuswamy Sathyanarayanan

On Thu, May 16, 2024 at 08:47:48PM +0000, Smita Koralahalli wrote:
> Clear Link Bandwidth Management Status (LBMS), if set, on a hot-remove
> event.
> 
> A hot-remove event could result in a target link speed reduction if
> LBMS is set, due to a delay in the Presence Detect State Change (PDSC)
> happening after the Data Link Layer State Change (DLLSC) event.
> 
> In reality, PDSC and DLLSC events rarely arrive simultaneously. The
> PDSC can sometimes be so late that the slot has already been powered
> down by the DLLSC event alone. The delayed PDSC may then falsely be
> interpreted as an interrupt raised to turn the slot on. This spurious
> attempt to power the slot on without a link forces the kernel, if LBMS
> is set, to retrain the link at a lower speed to re-establish it,
> thereby bringing down the link speed [2].

Not sure we need PDSC and DLLSC details to justify clearing LBMS if it
has no meaning for an empty slot?

> According to PCIe r6.2 sec 7.5.3.8 [1], LBMS cannot be set for an
> unconnected link; if set, it indicates that there is actually a device
> down an inactive link.

I see that r6.2 added an implementation note about DLLSC, but I'm not
a hardware person and can't follow the implication about a device
present down an inactive link.

I guess it must be related to the fact that LBMS indicates either
completion of link retraining or a change in link speed or width
(which would imply presence of a downstream device).  But in both
cases I assume the link would be active.

But IIUC LBMS is set by hardware but never cleared by hardware, so if
we remove a device and power off the slot, it doesn't seem like LBMS
could be telling us anything useful (what could we do in response to
LBMS when the slot is empty?), so it makes sense to me to clear it.

It seems like pciehp_unconfigure_device() does sort of PCI core and
driver-related things and possibly could be something shared by all
hotplug drivers, while remove_board() does things more specific to the
hotplug model (pciehp, shpchp, etc).

From that perspective, clearing LBMS might fit better in
remove_board().  In that case, I wonder whether it should be done
after turning off slot power?  This patch clears it *before* turning
off the power, so I wonder if hardware could possibly set it again
before the poweroff?

> However, hardware could have already set LBMS while the device was
> still connected to the port, i.e. when the state was DL_Up or
> DL_Active. Some hardware may even have attempted a retrain by going
> into Recovery mode just before transitioning to DL_Down.
> 
> Thus the stale LBMS bit is never cleared and might force software to
> drop the link speed when there is no link [2].
> 
> Dmesg before:
> 	pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> 	pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> 	pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
> 	pcieport 0000:20:01.1: retraining failed
> 	pcieport 0000:20:01.1: pciehp: Slot(59): No link
> 
> Dmesg after:
> 	pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> 	pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> 	pcieport 0000:20:01.1: pciehp: Slot(59): No link

I'm a little confused about the problem being solved here.  Obviously
the message is extraneous.  I guess the slot is empty, so retraining
is meaningless and will always fail.  Maybe avoiding it avoids a
delay?  Is the benefit that we get rid of the message and a delay?

> [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
>     https://members.pcisig.com/wg/PCI-SIG/document/20590
> [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
> 
> Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")

Lukas asked about this; did you confirm that it is related?  Asking
because the Fixes tag may cause this to be backported along with
a89c82249c37.

> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
> Link to v1:
> https://lore.kernel.org/all/20240424033339.250385-1-Smita.KoralahalliChannabasappa@amd.com/
> 
> v2:
> 	Cleared LBMS unconditionally. (Ilpo)
> 	Added Fixes Tag. (Lukas)
> ---
>  drivers/pci/hotplug/pciehp_pci.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
> index ad12515a4a12..dae73a8932ef 100644
> --- a/drivers/pci/hotplug/pciehp_pci.c
> +++ b/drivers/pci/hotplug/pciehp_pci.c
> @@ -134,4 +134,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
>  	}
>  
>  	pci_unlock_rescan_remove();
> +
> +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
> +				   PCI_EXP_LNKSTA_LBMS);
>  }
> -- 
> 2.17.1
> 


* Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction
  2024-06-17 20:09 ` Bjorn Helgaas
@ 2024-06-17 22:51   ` Smita Koralahalli
  2024-06-17 23:18     ` Bjorn Helgaas
  0 siblings, 1 reply; 11+ messages in thread
From: Smita Koralahalli @ 2024-06-17 22:51 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, linux-kernel, Mahesh J Salgaonkar, Lukas Wunner,
	Yazen Ghannam, Ilpo Jarvinen, Bowman Terry, Hagan Billy,
	Simon Guinot, Maciej W . Rozycki, Kuppuswamy Sathyanarayanan

Hi Bjorn,

On 6/17/2024 1:09 PM, Bjorn Helgaas wrote:
> On Thu, May 16, 2024 at 08:47:48PM +0000, Smita Koralahalli wrote:
>> Clear Link Bandwidth Management Status (LBMS), if set, on a
>> hot-remove event.
>>
>> A hot-remove event could result in a target link speed reduction if
>> LBMS is set, due to a delay in the Presence Detect State Change
>> (PDSC) happening after the Data Link Layer State Change (DLLSC)
>> event.
>>
>> In reality, PDSC and DLLSC events rarely arrive simultaneously. The
>> PDSC can sometimes be so late that the slot has already been powered
>> down by the DLLSC event alone. The delayed PDSC may then falsely be
>> interpreted as an interrupt raised to turn the slot on. This spurious
>> attempt to power the slot on without a link forces the kernel, if
>> LBMS is set, to retrain the link at a lower speed to re-establish it,
>> thereby bringing down the link speed [2].
> 
> Not sure we need PDSC and DLLSC details to justify clearing LBMS if it
> has no meaning for an empty slot?

I'm trying to also answer your below question here..

> I guess the slot is empty, so retraining
> is meaningless and will always fail.  Maybe avoiding it avoids a
> delay?  Is the benefit that we get rid of the message and a delay?

The benefit of this patch is to "avoid link speed drops" on a
hot-remove event if LBMS is set and DLLLA is clear. But I'm not trying
to solve delay issues here.

I included the PDSC and DLLSC details as they are the cause for link 
speed drops on a remove event. On an empty slot, DLLLA is cleared and 
LBMS may or may not be set. And, we see cases of link speed drops here, 
if PDSC happens on an empty slot.

We know for a fact that the slot becomes empty if either of the events,
PDSC or DLLSC, occurs. Also, neither of them waits for the other before
bringing down the device and marking the slot as "empty". That is the
reason I was also thinking of waiting on both events, PDSC and DLLSC,
to bring down the device, as I mentioned in my comments in v1. We could
eliminate link speed drops by taking this approach as well. But then we
would have to address cases where PDSC is hardwired to zero.

On our AMD systems, we expect to see both events on a hot-remove event.
But mostly we see DLLSC happening first, which does the job of marking
the slot empty. Now, if the PDSC event is delayed long enough that it
occurs after the slot becomes empty, the kernel misinterprets PDSC as
the signal to re-initialize the slot, and this is the sequence of steps
the kernel takes:

pciehp_handle_presence_or_link_change()
   pciehp_enable_slot()
     __pciehp_enable_slot()
       board_added()
           pciehp_check_link_status()
             pcie_wait_for_link()
               pcie_wait_for_link_delay()
                 pcie_failed_link_retrain()

While doing so, it hits the case of DLLLA clear and LBMS set, and
brings down the link speed.
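
For reference, the check in pcie_failed_link_retrain()
(drivers/pci/quirks.c) that triggers the downgrade looks roughly like
the condensed fragment below (device-specific gating and error paths
left out); with a stale LBMS and the link down (DLLLA clear), the
Target Link Speed is clamped to 2.5GT/s and the link retrained:

	u16 lnksta, lnkctl2;

	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
	if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
	    PCI_EXP_LNKSTA_LBMS) {
		pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");

		/* clamp the Target Link Speed to 2.5GT/s ... */
		pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
		lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
		pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);

		/* ... and retrain; on an empty slot this always fails */
		if (pcie_retrain_link(dev, false)) {
			pci_info(dev, "retraining failed\n");
			return false;
		}
	}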

The issue of PDSC and DLLSC never occurring simultaneously was known
before, and it didn't break anything functionally, as the kernel would
just exit with the message "No link" at pciehp_check_link_status().

However, commit a89c82249c37 ("PCI: Work around PCIe link training
failures") introduced link speed downgrades to re-establish links if
LBMS is set and DLLLA is clear. This caused devices that previously
operated at higher speeds to run at 2.5GT/s after being plugged back in
following a hot-remove.

> 
>> According to PCIe r6.2 sec 7.5.3.8 [1], LBMS cannot be set for an
>> unconnected link; if set, it indicates that there is actually a
>> device down an inactive link.
> 
> I see that r6.2 added an implementation note about DLLSC, but I'm not
> a hardware person and can't follow the implication about a device
> present down an inactive link.
> 
> I guess it must be related to the fact that LBMS indicates either
> completion of link retraining or a change in link speed or width
> (which would imply presence of a downstream device).  But in both
> cases I assume the link would be active.
> 
> But IIUC LBMS is set by hardware but never cleared by hardware, so if
> we remove a device and power off the slot, it doesn't seem like LBMS
> could be telling us anything useful (what could we do in response to
> LBMS when the slot is empty?), so it makes sense to me to clear it.
> 
> It seems like pciehp_unconfigure_device() does sort of PCI core and
> driver-related things and possibly could be something shared by all
> hotplug drivers, while remove_board() does things more specific to the
> hotplug model (pciehp, shpchp, etc).
> 
>  From that perspective, clearing LBMS might fit better in
> remove_board().  In that case, I wonder whether it should be done
> after turning off slot power?  This patch clears it *before* turning
> off the power, so I wonder if hardware could possibly set it again
> before the poweroff?

Yeah, after talking to HW people I realized that HW could possibly set
LBMS at any time while slot power is on. Will change it to clear LBMS
in remove_board().

> 
>> However, hardware could have already set LBMS while the device was
>> still connected to the port, i.e. when the state was DL_Up or
>> DL_Active. Some hardware may even have attempted a retrain by going
>> into Recovery mode just before transitioning to DL_Down.
>>
>> Thus the stale LBMS bit is never cleared and might force software to
>> drop the link speed when there is no link [2].
>>
>> Dmesg before:
>> 	pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>> 	pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>> 	pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
>> 	pcieport 0000:20:01.1: retraining failed
>> 	pcieport 0000:20:01.1: pciehp: Slot(59): No link
>>
>> Dmesg after:
>> 	pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>> 	pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>> 	pcieport 0000:20:01.1: pciehp: Slot(59): No link
> 
> I'm a little confused about the problem being solved here.  Obviously
> the message is extraneous.  I guess the slot is empty, so retraining
> is meaningless and will always fail.  Maybe avoiding it avoids a
> delay?  Is the benefit that we get rid of the message and a delay?
> 
>> [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
>>      https://members.pcisig.com/wg/PCI-SIG/document/20590
>> [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
>>
>> Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
> 
> Lukas asked about this; did you confirm that it is related?  Asking
> because the Fixes tag may cause this to be backported along with
> a89c82249c37.

Yeah, without commit a89c82249c37 we won't see link speed drops.

Thanks,
Smita

> 
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> ---
>> Link to v1:
>> https://lore.kernel.org/all/20240424033339.250385-1-Smita.KoralahalliChannabasappa@amd.com/
>>
>> v2:
>> 	Cleared LBMS unconditionally. (Ilpo)
>> 	Added Fixes Tag. (Lukas)
>> ---
>>   drivers/pci/hotplug/pciehp_pci.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
>> index ad12515a4a12..dae73a8932ef 100644
>> --- a/drivers/pci/hotplug/pciehp_pci.c
>> +++ b/drivers/pci/hotplug/pciehp_pci.c
>> @@ -134,4 +134,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
>>   	}
>>   
>>   	pci_unlock_rescan_remove();
>> +
>> +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>> +				   PCI_EXP_LNKSTA_LBMS);
>>   }
>> -- 
>> 2.17.1
>>


* Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction
  2024-06-17 22:51   ` Smita Koralahalli
@ 2024-06-17 23:18     ` Bjorn Helgaas
  2024-06-18 18:51       ` Smita Koralahalli
  0 siblings, 1 reply; 11+ messages in thread
From: Bjorn Helgaas @ 2024-06-17 23:18 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: linux-pci, linux-kernel, Mahesh J Salgaonkar, Lukas Wunner,
	Yazen Ghannam, Ilpo Jarvinen, Bowman Terry, Hagan Billy,
	Simon Guinot, Maciej W . Rozycki, Kuppuswamy Sathyanarayanan

On Mon, Jun 17, 2024 at 03:51:57PM -0700, Smita Koralahalli wrote:
> Hi Bjorn,
> 
> On 6/17/2024 1:09 PM, Bjorn Helgaas wrote:
> > On Thu, May 16, 2024 at 08:47:48PM +0000, Smita Koralahalli wrote:
> > > Clear Link Bandwidth Management Status (LBMS), if set, on a
> > > hot-remove event.
> > > 
> > > A hot-remove event could result in a target link speed reduction
> > > if LBMS is set, due to a delay in the Presence Detect State Change
> > > (PDSC) happening after the Data Link Layer State Change (DLLSC)
> > > event.
> > > 
> > > In reality, PDSC and DLLSC events rarely arrive simultaneously.
> > > The PDSC can sometimes be so late that the slot has already been
> > > powered down by the DLLSC event alone. The delayed PDSC may then
> > > falsely be interpreted as an interrupt raised to turn the slot on.
> > > This spurious attempt to power the slot on without a link forces
> > > the kernel, if LBMS is set, to retrain the link at a lower speed
> > > to re-establish it, thereby bringing down the link speed [2].
> > 
> > Not sure we need PDSC and DLLSC details to justify clearing LBMS if it
> > has no meaning for an empty slot?
> 
> I'm trying to also answer your below question here..
> 
> > I guess the slot is empty, so retraining
> > is meaningless and will always fail.  Maybe avoiding it avoids a
> > delay?  Is the benefit that we get rid of the message and a delay?
> 
> The benefit of this patch is to "avoid link speed drops" on a
> hot-remove event if LBMS is set and DLLLA is clear. But I'm not
> trying to solve delay issues here.
> 
> I included the PDSC and DLLSC details as they are the cause for link speed
> drops on a remove event. On an empty slot, DLLLA is cleared and LBMS may or
> may not be set. And, we see cases of link speed drops here, if PDSC happens
> on an empty slot.
> 
> We know for a fact that the slot becomes empty if either of the
> events, PDSC or DLLSC, occurs. Also, neither of them waits for the
> other before bringing down the device and marking the slot as
> "empty". That is the reason I was also thinking of waiting on both
> events, PDSC and DLLSC, to bring down the device, as I mentioned in
> my comments in v1. We could eliminate link speed drops by taking this
> approach as well. But then we would have to address cases where PDSC
> is hardwired to zero.
> 
> On our AMD systems, we expect to see both events on a hot-remove
> event. But mostly we see DLLSC happening first, which does the job of
> marking the slot empty. Now, if the PDSC event is delayed long enough
> that it occurs after the slot becomes empty, the kernel misinterprets
> PDSC as the signal to re-initialize the slot, and this is the
> sequence of steps the kernel takes:
> 
> pciehp_handle_presence_or_link_change()
>   pciehp_enable_slot()
>     __pciehp_enable_slot()
>       board_added()
>           pciehp_check_link_status()
>             pcie_wait_for_link()
>               pcie_wait_for_link_delay()
>                 pcie_failed_link_retrain()
> 
> While doing so, it hits the case of DLLLA clear and LBMS set, and
> brings down the link speed.

So I guess LBMS currently remains set after a device has been removed,
so the slot is empty, and later when a device is hot-added, *that*
device sees a lower-than-expected link speed?

> The issue of PDSC and DLLSC never occurring simultaneously was known
> before, and it didn't break anything functionally, as the kernel
> would just exit with the message "No link" at
> pciehp_check_link_status().
> 
> However, commit a89c82249c37 ("PCI: Work around PCIe link training
> failures") introduced link speed downgrades to re-establish links if
> LBMS is set and DLLLA is clear. This caused devices that previously
> operated at higher speeds to run at 2.5GT/s after being plugged back
> in following a hot-remove.
> 
> > > According to PCIe r6.2 sec 7.5.3.8 [1], LBMS cannot be set for an
> > > unconnected link; if set, it indicates that there is actually a
> > > device down an inactive link.
> > 
> > I see that r6.2 added an implementation note about DLLSC, but I'm not
> > a hardware person and can't follow the implication about a device
> > present down an inactive link.
> > 
> > I guess it must be related to the fact that LBMS indicates either
> > completion of link retraining or a change in link speed or width
> > (which would imply presence of a downstream device).  But in both
> > cases I assume the link would be active.
> > 
> > But IIUC LBMS is set by hardware but never cleared by hardware, so if
> > we remove a device and power off the slot, it doesn't seem like LBMS
> > could be telling us anything useful (what could we do in response to
> > LBMS when the slot is empty?), so it makes sense to me to clear it.
> > 
> > It seems like pciehp_unconfigure_device() does sort of PCI core and
> > driver-related things and possibly could be something shared by all
> > hotplug drivers, while remove_board() does things more specific to the
> > hotplug model (pciehp, shpchp, etc).
> > 
> >  From that perspective, clearing LBMS might fit better in
> > remove_board().  In that case, I wonder whether it should be done
> > after turning off slot power?  This patch clears it *before* turning
> > off the power, so I wonder if hardware could possibly set it again
> > before the poweroff?
> 
> Yeah, after talking to HW people I realized that HW could possibly
> set LBMS at any time while slot power is on. Will change it to clear
> LBMS in remove_board().
> 
> > > However, hardware could have already set LBMS while the device
> > > was still connected to the port, i.e. when the state was DL_Up or
> > > DL_Active. Some hardware may even have attempted a retrain by
> > > going into Recovery mode just before transitioning to DL_Down.
> > > 
> > > Thus the stale LBMS bit is never cleared and might force software
> > > to drop the link speed when there is no link [2].
> > > 
> > > Dmesg before:
> > > 	pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> > > 	pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> > > 	pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
> > > 	pcieport 0000:20:01.1: retraining failed
> > > 	pcieport 0000:20:01.1: pciehp: Slot(59): No link
> > > 
> > > Dmesg after:
> > > 	pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> > > 	pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> > > 	pcieport 0000:20:01.1: pciehp: Slot(59): No link
> > 
> > I'm a little confused about the problem being solved here.  Obviously
> > the message is extraneous.  I guess the slot is empty, so retraining
> > is meaningless and will always fail.  Maybe avoiding it avoids a
> > delay?  Is the benefit that we get rid of the message and a delay?
> > 
> > > [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
> > >      https://members.pcisig.com/wg/PCI-SIG/document/20590
> > > [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
> > > 
> > > Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
> > 
> > Lukas asked about this; did you confirm that it is related?  Asking
> > because the Fixes tag may cause this to be backported along with
> > a89c82249c37.
> 
> Yeah, without this patch we won't see link speed drops.
> 
> Thanks,
> Smita
> 
> > 
> > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> > > ---
> > > Link to v1:
> > > https://lore.kernel.org/all/20240424033339.250385-1-Smita.KoralahalliChannabasappa@amd.com/
> > > 
> > > v2:
> > > 	Cleared LBMS unconditionally. (Ilpo)
> > > 	Added Fixes Tag. (Lukas)
> > > ---
> > >   drivers/pci/hotplug/pciehp_pci.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
> > > index ad12515a4a12..dae73a8932ef 100644
> > > --- a/drivers/pci/hotplug/pciehp_pci.c
> > > +++ b/drivers/pci/hotplug/pciehp_pci.c
> > > @@ -134,4 +134,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
> > >   	}
> > >   	pci_unlock_rescan_remove();
> > > +
> > > +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
> > > +				   PCI_EXP_LNKSTA_LBMS);
> > >   }
> > > -- 
> > > 2.17.1
> > > 


* Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction
  2024-06-17 23:18     ` Bjorn Helgaas
@ 2024-06-18 18:51       ` Smita Koralahalli
  2024-06-18 21:23         ` Smita Koralahalli
  0 siblings, 1 reply; 11+ messages in thread
From: Smita Koralahalli @ 2024-06-18 18:51 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, linux-kernel, Mahesh J Salgaonkar, Lukas Wunner,
	Yazen Ghannam, Ilpo Jarvinen, Bowman Terry, Hagan Billy,
	Simon Guinot, Maciej W . Rozycki, Kuppuswamy Sathyanarayanan

On 6/17/2024 4:18 PM, Bjorn Helgaas wrote:
> On Mon, Jun 17, 2024 at 03:51:57PM -0700, Smita Koralahalli wrote:
>> Hi Bjorn,
>>
>> On 6/17/2024 1:09 PM, Bjorn Helgaas wrote:
>>> On Thu, May 16, 2024 at 08:47:48PM +0000, Smita Koralahalli wrote:
>>>> Clear Link Bandwidth Management Status (LBMS), if set, on a
>>>> hot-remove event.
>>>>
>>>> A hot-remove event could result in a target link speed reduction
>>>> if LBMS is set, due to a delay in the Presence Detect State Change
>>>> (PDSC) happening after the Data Link Layer State Change (DLLSC)
>>>> event.
>>>>
>>>> In reality, PDSC and DLLSC events rarely arrive simultaneously.
>>>> The PDSC can sometimes be so late that the slot has already been
>>>> powered down by the DLLSC event alone. The delayed PDSC may then
>>>> falsely be interpreted as an interrupt raised to turn the slot on.
>>>> This spurious attempt to power the slot on without a link forces
>>>> the kernel, if LBMS is set, to retrain the link at a lower speed
>>>> to re-establish it, thereby bringing down the link speed [2].
>>>
>>> Not sure we need PDSC and DLLSC details to justify clearing LBMS if it
>>> has no meaning for an empty slot?
>>
>> I'm trying to also answer your below question here..
>>
>>> I guess the slot is empty, so retraining
>>> is meaningless and will always fail.  Maybe avoiding it avoids a
>>> delay?  Is the benefit that we get rid of the message and a delay?
>>
>> The benefit of this patch is to "avoid link speed drops" on a
>> hot-remove event if LBMS is set and DLLLA is clear. But I'm not
>> trying to solve delay issues here.
>>
>> I included the PDSC and DLLSC details as they are the cause for link speed
>> drops on a remove event. On an empty slot, DLLLA is cleared and LBMS may or
>> may not be set. And, we see cases of link speed drops here, if PDSC happens
>> on an empty slot.
>>
>> We know for a fact that the slot becomes empty if either of the
>> events, PDSC or DLLSC, occurs. Also, neither of them waits for the
>> other before bringing down the device and marking the slot as
>> "empty". That is the reason I was also thinking of waiting on both
>> events, PDSC and DLLSC, to bring down the device, as I mentioned in
>> my comments in v1. We could eliminate link speed drops by taking
>> this approach as well. But then we would have to address cases where
>> PDSC is hardwired to zero.
>>
>> On our AMD systems, we expect to see both events on a hot-remove
>> event. But mostly we see DLLSC happening first, which does the job
>> of marking the slot empty. Now, if the PDSC event is delayed long
>> enough that it occurs after the slot becomes empty, the kernel
>> misinterprets PDSC as the signal to re-initialize the slot, and this
>> is the sequence of steps the kernel takes:
>>
>> pciehp_handle_presence_or_link_change()
>>    pciehp_enable_slot()
>>      __pciehp_enable_slot()
>>        board_added()
>>            pciehp_check_link_status()
>>              pcie_wait_for_link()
>>                pcie_wait_for_link_delay()
>>                  pcie_failed_link_retrain()
>>
>> While doing so, it hits the case of DLLLA clear and LBMS set, and
>> brings down the link speed.
> 
> So I guess LBMS currently remains set after a device has been
> removed, so the slot is empty, and later when a device is hot-added,
> *that* device sees a lower-than-expected link speed?

Yes, when PDSC is fired after the slot is empty (i.e. when DLLLA is clear).

Thanks
Smita

> 
>> The issue of PDSC and DLLSC never occurring simultaneously was known
>> before, and it didn't break anything functionally, as the kernel
>> would just exit with the message "No link" at
>> pciehp_check_link_status().
>>
>> However, commit a89c82249c37 ("PCI: Work around PCIe link training
>> failures") introduced link speed downgrades to re-establish links if
>> LBMS is set and DLLLA is clear. This caused devices that previously
>> operated at higher speeds to run at 2.5GT/s after being plugged back
>> in following a hot-remove.
>>
>>>> According to PCIe r6.2 sec 7.5.3.8 [1], LBMS cannot be set for an
>>>> unconnected link; if set, it indicates that there is actually a
>>>> device down an inactive link.
>>>
>>> I see that r6.2 added an implementation note about DLLSC, but I'm not
>>> a hardware person and can't follow the implication about a device
>>> present down an inactive link.
>>>
>>> I guess it must be related to the fact that LBMS indicates either
>>> completion of link retraining or a change in link speed or width
>>> (which would imply presence of a downstream device).  But in both
>>> cases I assume the link would be active.
>>>
>>> But IIUC LBMS is set by hardware but never cleared by hardware, so if
>>> we remove a device and power off the slot, it doesn't seem like LBMS
>>> could be telling us anything useful (what could we do in response to
>>> LBMS when the slot is empty?), so it makes sense to me to clear it.
>>>
>>> It seems like pciehp_unconfigure_device() does sort of PCI core and
>>> driver-related things and possibly could be something shared by all
>>> hotplug drivers, while remove_board() does things more specific to the
>>> hotplug model (pciehp, shpchp, etc).
>>>
>>>   From that perspective, clearing LBMS might fit better in
>>> remove_board().  In that case, I wonder whether it should be done
>>> after turning off slot power?  This patch clears it *before* turning
>>> off the power, so I wonder if hardware could possibly set it again
>>> before the poweroff?
>>
>> Yeah, after talking to HW people I realized that HW could possibly
>> set LBMS at any time while slot power is on. Will change it to clear
>> LBMS in remove_board().
>>
>>>> However, hardware could have already set LBMS while the device
>>>> was still connected to the port, i.e. when the state was DL_Up or
>>>> DL_Active. Some hardware may even have attempted a retrain by
>>>> going into Recovery mode just before transitioning to DL_Down.
>>>>
>>>> Thus the stale LBMS bit is never cleared and might force software
>>>> to drop the link speed when there is no link [2].
>>>>
>>>> Dmesg before:
>>>> 	pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>>>> 	pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>>>> 	pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
>>>> 	pcieport 0000:20:01.1: retraining failed
>>>> 	pcieport 0000:20:01.1: pciehp: Slot(59): No link
>>>>
>>>> Dmesg after:
>>>> 	pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>>>> 	pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>>>> 	pcieport 0000:20:01.1: pciehp: Slot(59): No link
>>>
>>> I'm a little confused about the problem being solved here.  Obviously
>>> the message is extraneous.  I guess the slot is empty, so retraining
>>> is meaningless and will always fail.  Maybe avoiding it avoids a
>>> delay?  Is the benefit that we get rid of the message and a delay?
>>>
>>>> [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
>>>>       https://members.pcisig.com/wg/PCI-SIG/document/20590
>>>> [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
>>>>
>>>> Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
>>>
>>> Lukas asked about this; did you confirm that it is related?  Asking
>>> because the Fixes tag may cause this to be backported along with
>>> a89c82249c37.
>>
>> Yeah, without this patch we won't see link speed drops.
>>
>> Thanks,
>> Smita
>>
>>>
>>>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>>>> ---
>>>> Link to v1:
>>>> https://lore.kernel.org/all/20240424033339.250385-1-Smita.KoralahalliChannabasappa@amd.com/
>>>>
>>>> v2:
>>>> 	Cleared LBMS unconditionally. (Ilpo)
>>>> 	Added Fixes Tag. (Lukas)
>>>> ---
>>>>    drivers/pci/hotplug/pciehp_pci.c | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
>>>> index ad12515a4a12..dae73a8932ef 100644
>>>> --- a/drivers/pci/hotplug/pciehp_pci.c
>>>> +++ b/drivers/pci/hotplug/pciehp_pci.c
>>>> @@ -134,4 +134,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
>>>>    	}
>>>>    	pci_unlock_rescan_remove();
>>>> +
>>>> +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>>>> +				   PCI_EXP_LNKSTA_LBMS);
>>>>    }
>>>> -- 
>>>> 2.17.1
>>>>


* Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction
  2024-06-18 18:51       ` Smita Koralahalli
@ 2024-06-18 21:23         ` Smita Koralahalli
  2024-06-19  7:47           ` Lukas Wunner
  0 siblings, 1 reply; 11+ messages in thread
From: Smita Koralahalli @ 2024-06-18 21:23 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, linux-kernel, Mahesh J Salgaonkar, Lukas Wunner,
	Yazen Ghannam, Ilpo Jarvinen, Bowman Terry, Hagan Billy,
	Simon Guinot, Maciej W . Rozycki, Kuppuswamy Sathyanarayanan

On 6/18/2024 11:51 AM, Smita Koralahalli wrote:

[snip]

>>>> But IIUC LBMS is set by hardware but never cleared by hardware, so if
>>>> we remove a device and power off the slot, it doesn't seem like LBMS
>>>> could be telling us anything useful (what could we do in response to
>>>> LBMS when the slot is empty?), so it makes sense to me to clear it.
>>>>
>>>> It seems like pciehp_unconfigure_device() does sort of PCI core and
>>>> driver-related things and possibly could be something shared by all
>>>> hotplug drivers, while remove_board() does things more specific to the
>>>> hotplug model (pciehp, shpchp, etc).
>>>>
>>>>   From that perspective, clearing LBMS might fit better in
>>>> remove_board().  In that case, I wonder whether it should be done
>>>> after turning off slot power? This patch clears it *before* turning
>>>> off the power, so I wonder if hardware could possibly set it again
>>>> before the poweroff?

While clearing LBMS in remove_board() here:

if (POWER_CTRL(ctrl)) {
	pciehp_power_off_slot(ctrl);
+	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
+				   PCI_EXP_LNKSTA_LBMS);

	/*
	 * After turning power off, we must wait for at least 1 second
	 * before taking any action that relies on power having been
	 * removed from the slot/adapter.
	 */
	msleep(1000);

	/* Ignore link or presence changes caused by power off */
	atomic_and(~(PCI_EXP_SLTSTA_DLLSC | PCI_EXP_SLTSTA_PDC),
		   &ctrl->pending_events);
}

This can happen too, right? I.e. just after the slot poweroff and 
before the LBMS clearing, PDC/PDSC could be fired. Then 
pciehp_handle_presence_or_link_change() would hit the "OFF_STATE" case 
and proceed with pciehp_enable_slot() ... pcie_failed_link_retrain(), 
and ultimately the link speed drops.

So, I added the clearing just before turning off the slot power. Let 
me know if I'm thinking about it right.

Thanks
Smita
>>>
>>> Yeah, after talking to HW people I realized that HW could possibly
>>> set LBMS at any time while slot power is on. Will change it to
>>> clear LBMS in remove_board().
>>>

[snip]


* Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction
  2024-06-18 21:23         ` Smita Koralahalli
@ 2024-06-19  7:47           ` Lukas Wunner
  2024-06-25 20:20             ` Smita Koralahalli
  0 siblings, 1 reply; 11+ messages in thread
From: Lukas Wunner @ 2024-06-19  7:47 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: Bjorn Helgaas, linux-pci, linux-kernel, Mahesh J Salgaonkar,
	Yazen Ghannam, Ilpo Jarvinen, Bowman Terry, Hagan Billy,
	Simon Guinot, Maciej W . Rozycki, Kuppuswamy Sathyanarayanan

On Tue, Jun 18, 2024 at 02:23:21PM -0700, Smita Koralahalli wrote:
> On 6/18/2024 11:51 AM, Smita Koralahalli wrote:
> > > > > But IIUC LBMS is set by hardware but never cleared by hardware, so if
> > > > > we remove a device and power off the slot, it doesn't seem like LBMS
> > > > > could be telling us anything useful (what could we do in response to
> > > > > LBMS when the slot is empty?), so it makes sense to me to clear it.
> > > > > 
> > > > > It seems like pciehp_unconfigure_device() does sort of PCI core and
> > > > > driver-related things and possibly could be something shared by all
> > > > > hotplug drivers, while remove_board() does things more specific to the
> > > > > hotplug model (pciehp, shpchp, etc).
> > > > > 
> > > > > From that perspective, clearing LBMS might fit better in
> > > > > remove_board(). In that case, I wonder whether it should be done
> > > > > after turning off slot power? This patch clears it *before* turning
> > > > > off the power, so I wonder if hardware could possibly set it again
> > > > > before the poweroff?
> 
> While clearing LBMS in remove_board() here:
> 
> if (POWER_CTRL(ctrl)) {
> 	pciehp_power_off_slot(ctrl);
> +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
> +				   PCI_EXP_LNKSTA_LBMS);
> 
> 	/*
> 	 * After turning power off, we must wait for at least 1 second
> 	 * before taking any action that relies on power having been
> 	 * removed from the slot/adapter.
> 	 */
> 	msleep(1000);
> 
> 	/* Ignore link or presence changes caused by power off */
> 	atomic_and(~(PCI_EXP_SLTSTA_DLLSC | PCI_EXP_SLTSTA_PDC),
> 		   &ctrl->pending_events);
> }
> 
> This can happen too, right? I.e. just after the slot poweroff and
> before the LBMS clearing, PDC/PDSC could be fired. Then
> pciehp_handle_presence_or_link_change() would hit the "OFF_STATE"
> case and proceed with pciehp_enable_slot() ...
> pcie_failed_link_retrain(), and ultimately the link speed drops.
> 
> So, I added the clearing just before turning off the slot power. Let
> me know if I'm thinking about it right.

This was added by 3943af9d01e9 ("PCI: pciehp: Ignore Link State Changes
after powering off a slot").  You can try reproducing it by writing "0"
to the slot's "power" file in sysfs, but your hardware needs to support
slot power.

Basically the idea is that after waiting for 1 sec, chances are very low
that any DLLSC or PDSC events caused by removing slot power may still
occur.

Arguably the same applies to LBMS changes, so I'd recommend to likewise
clear stale LBMS after the msleep(1000).

pciehp_ctrl.c only contains the state machine and higher-level logic of
the hotplug controller and all the actual register accesses are in helpers
in pciehp_hpc.c.  So if you want to do it picture-perfectly, add a helper
in pciehp_hpc.c to clear LBMS and call that from remove_board().
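
A minimal sketch of that (the helper name pciehp_clear_lbms() is made
up here for illustration):

	/* pciehp_hpc.c */
	void pciehp_clear_lbms(struct controller *ctrl)
	{
		/* LBMS is RW1C: writing 1 clears the stale bit */
		pcie_capability_write_word(ctrl->pcie->port,
					   PCI_EXP_LNKSTA,
					   PCI_EXP_LNKSTA_LBMS);
	}

remove_board() would then call pciehp_clear_lbms(ctrl) after the
msleep(1000).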

That all being said, I'm wondering how this plays together with Ilpo's
bandwidth control driver?

https://lore.kernel.org/all/20240516093222.1684-1-ilpo.jarvinen@linux.intel.com/

IIUC, the bandwidth control driver will be in charge of handling LBMS
changes.  So clearing LBMS behind the bandwidth control driver's back
might be problematic.  Ilpo?

Also, since you've confirmed that this issue is fallout from
a89c82249c37 ("PCI: Work around PCIe link training failures"),
I'm wondering if the logic introduced by that commit can be
changed so that the quirk is applied more narrowly, i.e. *not*
applied to unaffected hardware, such as AMD's hotplug ports.
That would avoid the need to undo the effect of the quirk and
work around the downtraining you're seeing.

Maciej, any ideas?

Thanks,

Lukas


* Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction
  2024-06-19  7:47           ` Lukas Wunner
@ 2024-06-25 20:20             ` Smita Koralahalli
  2024-07-09 10:52               ` Ilpo Järvinen
  0 siblings, 1 reply; 11+ messages in thread
From: Smita Koralahalli @ 2024-06-25 20:20 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Bjorn Helgaas, linux-pci, linux-kernel, Mahesh J Salgaonkar,
	Yazen Ghannam, Ilpo Jarvinen, Bowman Terry, Hagan Billy,
	Simon Guinot, Maciej W . Rozycki, Kuppuswamy Sathyanarayanan

Hi all,

Sorry for the delay here. Took some time to find a system to run 
experiments. Comments inline.

On 6/19/2024 12:47 AM, Lukas Wunner wrote:
> On Tue, Jun 18, 2024 at 02:23:21PM -0700, Smita Koralahalli wrote:
>> On 6/18/2024 11:51 AM, Smita Koralahalli wrote:
>>>>>> But IIUC LBMS is set by hardware but never cleared by hardware, so if
>>>>>> we remove a device and power off the slot, it doesn't seem like LBMS
>>>>>> could be telling us anything useful (what could we do in response to
>>>>>> LBMS when the slot is empty?), so it makes sense to me to clear it.
>>>>>>
>>>>>> It seems like pciehp_unconfigure_device() does sort of PCI core and
>>>>>> driver-related things and possibly could be something shared by all
>>>>>> hotplug drivers, while remove_board() does things more specific to the
>>>>>> hotplug model (pciehp, shpchp, etc).
>>>>>>
>>>>>>  From that perspective, clearing LBMS might fit better in
>>>>>> remove_board(). In that case, I wonder whether it should be done
>>>>>> after turning off slot power? This patch clears it *before* turning
>>>>>> off the power, so I wonder if hardware could possibly set it again
>>>>>> before the poweroff?
>>
>> While clearing LBMS in remove_board() here:
>>
>> if (POWER_CTRL(ctrl)) {
>> 	pciehp_power_off_slot(ctrl);
>> +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>> +				   PCI_EXP_LNKSTA_LBMS);
>>
>> 	/*
>> 	 * After turning power off, we must wait for at least 1 second
>> 	 * before taking any action that relies on power having been
>> 	 * removed from the slot/adapter.
>> 	 */
>> 	msleep(1000);
>>
>> 	/* Ignore link or presence changes caused by power off */
>> 	atomic_and(~(PCI_EXP_SLTSTA_DLLSC | PCI_EXP_SLTSTA_PDC),
>> 		   &ctrl->pending_events);
>> }
>>
>> This can happen too, right? I.e. just after the slot poweroff and
>> before the LBMS clearing, PDC/PDSC could be fired. Then
>> pciehp_handle_presence_or_link_change() would hit the "OFF_STATE"
>> case and proceed with pciehp_enable_slot() ...
>> pcie_failed_link_retrain(), and ultimately the link speed drops.
>>
>> So, I added the clearing just before turning off the slot power. Let
>> me know if I'm thinking about it right.

I guess I should have experimented before putting this comment out.

After talking to the HW/FW teams, I understood that none of our CRBs 
support a power controller for NVMe devices, which means the "Power 
Controller Present" bit in Slot_Cap is always false. That's what makes 
it a "surprise removal." If the OS was notified beforehand and there 
was a power controller attached, the OS would turn off the power with 
SLOT_CNTL. That's an "orderly" removal. So essentially, the entire 
block under "if (POWER_CTRL(ctrl))" will never be executed on a 
surprise removal for us.

There could be board designs other than ours, with power controllers 
for NVMe devices, that I'm not aware of.
> 
> This was added by 3943af9d01e9 ("PCI: pciehp: Ignore Link State Changes
> after powering off a slot").  You can try reproducing it by writing "0"
> to the slot's "power" file in sysfs, but your hardware needs to support
> slot power.
> 
> Basically the idea is that after waiting for 1 sec, chances are very low
> that any DLLSC or PDSC events caused by removing slot power may still
> occur.

The PDSC events occurring in our case aren't caused by removing slot 
power. They should/will always happen on a surprise removal along with 
DLLSC for us. But this PDSC is delayed and happens after DLLSC is 
handled and ctrl->state = OFF_STATE in pciehp_disable_slot(). So the 
PDSC is mistaken for a request to enable the slot in 
pciehp_enable_slot() inside pciehp_handle_presence_or_link_change().
> 
> Arguably the same applies to LBMS changes, so I'd recommend to likewise
> clear stale LBMS after the msleep(1000).
> 
> pciehp_ctrl.c only contains the state machine and higher-level logic of
> the hotplug controller and all the actual register accesses are in helpers
> in pciehp_hpc.c.  So if you want to do it picture-perfectly, add a helper
> in pciehp_hpc.c to clear LBMS and call that from remove_board().
> 
> That all being said, I'm wondering how this plays together with Ilpo's
> bandwidth control driver?
> 
> https://lore.kernel.org/all/20240516093222.1684-1-ilpo.jarvinen@linux.intel.com/

I have yet to do a thorough reading of Ilpo's bandwidth control driver. 
Ilpo, please correct me if I misspeak, as I don't have a thorough 
understanding yet.

Ilpo's bandwidth controller also checks whether the LBMS count is 
greater than zero to bring down link speeds if CONFIG_PCIE_BWCTRL is 
true. If false, it follows the default path of checking the LBMS bit 
in the Link Status register. So if CONFIG_PCIE_BWCTRL is disabled by 
default, we continue to see link speed drops. Even if BWCTRL is 
enabled, the LBMS count is incremented to 1 in pcie_bwnotif_enable(), 
so pcie_lbms_seen() might likely return true, thereby bringing down 
the speed here as well if DLLLA is clear?

> 
> IIUC, the bandwidth control driver will be in charge of handling LBMS
> changes.  So clearing LBMS behind the bandwidth control driver's back
> might be problematic.  Ilpo?
> 
> Also, since you've confirmed that this issue is fallout from
> a89c82249c37 ("PCI: Work around PCIe link training failures"),
> I'm wondering if the logic introduced by that commit can be
> changed so that the quirk is applied more narrowly, i.e. *not*
> applied to unaffected hardware, such as AMD's hotplug ports.
> That would avoid the need to undo the effect of the quirk and
> work around the downtraining you're seeing.
> 
> Maciej, any ideas?

Yeah, I'm okay with going down that path as well. Any ideas would be 
helpful here.

Thanks
Smita
> 
> Thanks,
> 
> Lukas
> 


* Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction
  2024-06-25 20:20             ` Smita Koralahalli
@ 2024-07-09 10:52               ` Ilpo Järvinen
  2024-07-17 21:14                 ` Smita Koralahalli
  0 siblings, 1 reply; 11+ messages in thread
From: Ilpo Järvinen @ 2024-07-09 10:52 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: Lukas Wunner, Bjorn Helgaas, linux-pci, LKML, Mahesh J Salgaonkar,
	Yazen Ghannam, Bowman Terry, Hagan Billy, Simon Guinot,
	Maciej W . Rozycki, Kuppuswamy Sathyanarayanan

On Tue, 25 Jun 2024, Smita Koralahalli wrote:

> Sorry for the delay here. Took some time to find a system to run experiments.
> Comments inline.
>
> On 6/19/2024 12:47 AM, Lukas Wunner wrote:
> > On Tue, Jun 18, 2024 at 02:23:21PM -0700, Smita Koralahalli wrote:
> > > On 6/18/2024 11:51 AM, Smita Koralahalli wrote:
> > > > > > > But IIUC LBMS is set by hardware but never cleared by
> > > > > > > hardware, so if we remove a device and power off the slot,
> > > > > > > it doesn't seem like LBMS could be telling us anything
> > > > > > > useful (what could we do in response to LBMS when the slot
> > > > > > > is empty?), so it makes sense to me to clear it.
> > > > > > > 
> > > > > > > It seems like pciehp_unconfigure_device() does sort of PCI
> > > > > > > core and driver-related things and possibly could be
> > > > > > > something shared by all hotplug drivers, while
> > > > > > > remove_board() does things more specific to the hotplug
> > > > > > > model (pciehp, shpchp, etc).
> > > > > > > 
> > > > > > >  From that perspective, clearing LBMS might fit better in
> > > > > > > remove_board(). In that case, I wonder whether it should be
> > > > > > > done after turning off slot power? This patch clears it
> > > > > > > *before* turning off the power, so I wonder if hardware
> > > > > > > could possibly set it again before the poweroff?
> > > 
> > > While clearing LBMS in remove_board() here:
> > > 
> > > if (POWER_CTRL(ctrl)) {
> > > 	pciehp_power_off_slot(ctrl);
> > > +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
> > > +				   PCI_EXP_LNKSTA_LBMS);
> > > 
> > > 	/*
> > > 	 * After turning power off, we must wait for at least 1 second
> > > 	 * before taking any action that relies on power having been
> > > 	 * removed from the slot/adapter.
> > > 	 */
> > > 	msleep(1000);
> > > 
> > > 	/* Ignore link or presence changes caused by power off */
> > > 	atomic_and(~(PCI_EXP_SLTSTA_DLLSC | PCI_EXP_SLTSTA_PDC),
> > > 		   &ctrl->pending_events);
> > > }
> > > 
> > > This can happen too, right? I.e. just after the slot poweroff and
> > > before the LBMS clearing, PDC/PDSC could be fired. Then
> > > pciehp_handle_presence_or_link_change() would hit the "OFF_STATE"
> > > case and proceed with pciehp_enable_slot() ...
> > > pcie_failed_link_retrain(), and ultimately the link speed drops.
> > > 
> > > So, I added the clearing just before turning off the slot power.
> > > Let me know if I'm thinking about it right.
> 
> I guess I should have experimented before putting this comment out.
> 
> After talking to the HW/FW teams, I understood that none of our CRBs
> support a power controller for NVMe devices, which means the "Power
> Controller Present" bit in Slot_Cap is always false. That's what
> makes it a "surprise removal." If the OS was notified beforehand and
> there was a power controller attached, the OS would turn off the
> power with SLOT_CNTL. That's an "orderly" removal. So essentially,
> the entire block under "if (POWER_CTRL(ctrl))" will never be executed
> on a surprise removal for us.
> 
> There could be board designs other than ours, with power controllers
> for NVMe devices, that I'm not aware of.
> > 
> > This was added by 3943af9d01e9 ("PCI: pciehp: Ignore Link State Changes
> > after powering off a slot").  You can try reproducing it by writing "0"
> > to the slot's "power" file in sysfs, but your hardware needs to support
> > slot power.
> > 
> > Basically the idea is that after waiting for 1 sec, chances are very low
> > that any DLLSC or PDSC events caused by removing slot power may still
> > occur.
> 
> The PDSC events occurring in our case aren't caused by removing slot
> power. They should/will always happen on a surprise removal along
> with DLLSC for us. But this PDSC is delayed and happens after DLLSC
> is handled and ctrl->state = OFF_STATE in pciehp_disable_slot(). So
> the PDSC is mistaken for a request to enable the slot in
> pciehp_enable_slot() inside pciehp_handle_presence_or_link_change().
> > 
> > Arguably the same applies to LBMS changes, so I'd recommend to likewise
> > clear stale LBMS after the msleep(1000).
> > 
> > pciehp_ctrl.c only contains the state machine and higher-level logic of
> > the hotplug controller and all the actual register accesses are in helpers
> > in pciehp_hpc.c.  So if you want to do it picture-perfectly, add a helper
> > in pciehp_hpc.c to clear LBMS and call that from remove_board().
> > 
> > That all being said, I'm wondering how this plays together with Ilpo's
> > bandwidth control driver?
> > 
> > https://lore.kernel.org/all/20240516093222.1684-1-ilpo.jarvinen@linux.intel.com/
> 
> I have yet to do a thorough reading of Ilpo's bandwidth control
> driver. Ilpo, please correct me if I misspeak, as I don't have a
> thorough understanding yet.
> 
> Ilpo's bandwidth controller also checks whether the LBMS count is
> greater than zero to bring down link speeds if CONFIG_PCIE_BWCTRL is
> true. If false, it follows the default path of checking the LBMS bit
> in the Link Status register. So if CONFIG_PCIE_BWCTRL is disabled by
> default, we continue to see link speed drops. Even if BWCTRL is
> enabled, the LBMS count is incremented to 1 in pcie_bwnotif_enable(),
> so pcie_lbms_seen() might likely return true, thereby bringing down
> the speed here as well if DLLLA is clear?

I did add code to clear the LBMS count in pciehp_unconfigure_device() in 
part thanks to this patch of yours. Do you think it wouldn't work?

But I agree there would still be a problem if BWCTRL is not enabled. I 
already have to keep part of it enabled due to the Target Speed quirk
and now this is another case where just having it always on would be
beneficial.

> > IIUC, the bandwidth control driver will be in charge of handling LBMS
> > changes.  So clearing LBMS behind the bandwidth control driver's back
> > might be problematic.  Ilpo?

Yes, BW controller will take control of LBMS and other code should not 
touch it directly (and LBMS will be kept cleared by the BW controller). 
However, in this case I'll just need to adapt the code to replace the 
LBMS clearing with resetting the LBMS count (if this patch is accepted 
before BW controller), the resetting is already there anyway.
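
To illustrate (the helper name is hypothetical here; the actual
interface is whatever the bwctrl series ends up exporting), the raw
RW1C write in remove_board() would then become something like:

	/* reset the bwctrl-maintained LBMS assertion counter instead
	 * of writing PCI_EXP_LNKSTA_LBMS directly
	 */
	pcie_reset_lbms_count(ctrl->pcie->port);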

> > Also, since you've confirmed that this issue is fallout from
> > a89c82249c37 ("PCI: Work around PCIe link training failures"),
> > I'm wondering if the logic introduced by that commit can be
> > changed so that the quirk is applied more narrowly, i.e. *not*
> > applied to unaffected hardware, such as AMD's hotplug ports.
> > That would avoid the need to undo the effect of the quirk and
> > work around the downtraining you're seeing.
> > 
> > Maciej, any ideas?
> 
> Yeah, I'm okay with going down that path as well. Any ideas would be
> helpful here.

One thing I don't like in the Target Speed quirk is that it leaves the 
Link Speed at the lower value if the quirk fails to bring the link up; 
the quirk could restore the original Link Speed on failure to avoid 
these problems. I even suggested that earlier. However, the downside of 
restoring the original Link Speed is that it requires triggering yet 
another retraining (perhaps we could avoid waiting for its completion, 
though, since we expect it to fail).
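
A rough sketch of that idea, assuming the original LNKCTL2 value was
saved in a local "oldlnkctl2" before the clamp (made-up variable name):

	if (pcie_retrain_link(dev, false)) {
		/* Quirk failed: restore the original Target Link Speed
		 * and trigger one more retraining; per the above, a
		 * variant that does not wait for completion could be
		 * used here since this retrain is expected to fail too.
		 */
		pcie_capability_write_word(dev, PCI_EXP_LNKCTL2,
					   oldlnkctl2);
		pcie_retrain_link(dev, false);
		return false;
	}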

It might be possible to eventually trigger the Target Speed quirk from the 
BW controller but it would require writing some state machine so that the 
quirk is not repeatedly attempted. It seemed to complicate things too much 
to add such a state machine at this point.

-- 
 i.



* Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction
  2024-07-09 10:52               ` Ilpo Järvinen
@ 2024-07-17 21:14                 ` Smita Koralahalli
  2024-07-18  7:47                   ` Ilpo Järvinen
  0 siblings, 1 reply; 11+ messages in thread
From: Smita Koralahalli @ 2024-07-17 21:14 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Lukas Wunner, Bjorn Helgaas, linux-pci, LKML, Mahesh J Salgaonkar,
	Yazen Ghannam, Bowman Terry, Hagan Billy, Simon Guinot,
	Maciej W . Rozycki, Kuppuswamy Sathyanarayanan

On 7/9/2024 3:52 AM, Ilpo Järvinen wrote:
> On Tue, 25 Jun 2024, Smita Koralahalli wrote:
> 
>> Sorry for the delay here. Took some time to find a system to run experiments.
>> Comments inline.
>>
>> On 6/19/2024 12:47 AM, Lukas Wunner wrote:
>>> On Tue, Jun 18, 2024 at 02:23:21PM -0700, Smita Koralahalli wrote:
>>>> On 6/18/2024 11:51 AM, Smita Koralahalli wrote:
>>>>>>>> But IIUC LBMS is set by hardware but never cleared by hardware,
>>>>>>>> so if we remove a device and power off the slot, it doesn't seem
>>>>>>>> like LBMS could be telling us anything useful (what could we do
>>>>>>>> in response to LBMS when the slot is empty?), so it makes sense
>>>>>>>> to me to clear it.
>>>>>>>>
>>>>>>>> It seems like pciehp_unconfigure_device() does sort of PCI core
>>>>>>>> and driver-related things and possibly could be something shared
>>>>>>>> by all hotplug drivers, while remove_board() does things more
>>>>>>>> specific to the hotplug model (pciehp, shpchp, etc).
>>>>>>>>
>>>>>>>>   From that perspective, clearing LBMS might fit better in
>>>>>>>> remove_board(). In that case, I wonder whether it should be done
>>>>>>>> after turning off slot power? This patch clears it *before*
>>>>>>>> turning off the power, so I wonder if hardware could possibly set
>>>>>>>> it again before the poweroff?
>>>>
>>>> While clearing LBMS in remove_board() here:
>>>>
>>>> if (POWER_CTRL(ctrl)) {
>>>> 	pciehp_power_off_slot(ctrl);
>>>> +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>>>> 				   PCI_EXP_LNKSTA_LBMS);
>>>>
>>>> 	/*
>>>> 	 * After turning power off, we must wait for at least 1 second
>>>> 	 * before taking any action that relies on power having been
>>>> 	 * removed from the slot/adapter.
>>>> 	 */
>>>> 	msleep(1000);
>>>>
>>>> 	/* Ignore link or presence changes caused by power off */
>>>> 	atomic_and(~(PCI_EXP_SLTSTA_DLLSC | PCI_EXP_SLTSTA_PDC),
>>>> 		   &ctrl->pending_events);
>>>> }
>>>>
>>>> This can happen too, right? I.e., just after the slot poweroff and
>>>> before the LBMS clearing, the PDC/PDSC could be fired. Then
>>>> pciehp_handle_presence_or_link_change() would hit the "OFF_STATE" case
>>>> and proceed with pciehp_enable_slot() ... pcie_failed_link_retrain(),
>>>> and ultimately the link speed drops.
>>>>
>>>> So, I added the clearing just before turning off the slot. Let me know
>>>> if I'm thinking about it right.
>>
>> I guess I should have experimented before putting this comment out.
>>
>> After talking to the HW/FW teams, I understood that none of our CRBs support
>> a power controller for NVMe devices, which means the "Power Controller
>> Present" bit in Slot_Cap is always false. That's what makes it a "surprise
>> removal." If the OS was notified beforehand and there was a power controller
>> attached, the OS would turn off the power with SLOT_CNTL. That's an "orderly"
>> removal. So essentially, the entire "if (POWER_CTRL(ctrl))" block will never
>> be executed for a surprise removal on our systems.
>>
>> There could be board designs other than ours, with power controllers for
>> NVMe devices, that I'm not aware of.
>>>
>>> This was added by 3943af9d01e9 ("PCI: pciehp: Ignore Link State Changes
>>> after powering off a slot").  You can try reproducing it by writing "0"
>>> to the slot's "power" file in sysfs, but your hardware needs to support
>>> slot power.
>>>
>>> Basically the idea is that after waiting for 1 sec, chances are very low
>>> that any DLLSC or PDSC events caused by removing slot power may still
>>> occur.
>>
>> The PDSC events in our case aren't caused by removing slot power. They
>> should/will always happen on a surprise removal along with DLLSC for us. But
>> the PDSC has been delayed and happens after DLLSC is handled and ctrl->state
>> = OFF_STATE is set in pciehp_disable_slot(). So the PDSC is mistaken as a
>> request to enable the slot in pciehp_enable_slot() inside
>> pciehp_handle_presence_or_link_change().
>>>
>>> Arguably the same applies to LBMS changes, so I'd recommend to likewise
>>> clear stale LBMS after the msleep(1000).
>>>
>>> pciehp_ctrl.c only contains the state machine and higher-level logic of
>>> the hotplug controller and all the actual register accesses are in helpers
>>> in pciehp_hpc.c.  So if you want to do it picture-perfectly, add a helper
>>> in pciehp_hpc.c to clear LBMS and call that from remove_board().
>>>
>>> That all being said, I'm wondering how this plays together with Ilpo's
>>> bandwidth control driver?
>>>
>>> https://lore.kernel.org/all/20240516093222.1684-1-ilpo.jarvinen@linux.intel.com/
>>
>> I have yet to do a thorough reading of Ilpo's bandwidth control driver. Ilpo,
>> please correct me if I misspeak something, as I don't have a thorough
>> understanding.
>>
>> Ilpo's bandwidth controller also checks for the LBMS count being greater than
>> zero to bring down link speeds if CONFIG_PCIE_BWCTRL is true. If false, it
>> follows the default path of checking the LBMS bit in the Link Status register.
>> So if CONFIG_PCIE_BWCTRL is disabled by default, we continue to see link speed
>> drops. Even if BWCTRL is enabled, the LBMS count is incremented to 1 in
>> pcie_bwnotif_enable(), so pcie_lbms_seen() might return true, thereby bringing
>> down speeds here as well if DLLLA is clear?
> 
> I did add code to clear the LBMS count in pciehp_unconfigure_device() in
> part thanks to this patch of yours. Do you think it wouldn't work?

It works with BWCTRL enabled. My only concern is whether to keep the
clearing in pciehp_unconfigure_device() rather than doing it inside the
POWER_CTRL(ctrl) block in remove_board(), as per the suggestions above.
> 
> But I agree there would still be a problem if BWCTRL is not enabled. I
> already have to keep part of it enabled due to the Target Speed quirk,
> and now this is another case where just having it always on would be
> beneficial.

Correct, it should always be on to avoid the problem.

I would like to have this "LBMS clearing fix" accepted sooner, since it's
breaking things on our systems! :)

Thanks
Smita.

> 
>>> IIUC, the bandwidth control driver will be in charge of handling LBMS
>>> changes.  So clearing LBMS behind the bandwidth control driver's back
>>> might be problematic.  Ilpo?
> 
> Yes, the BW controller will take control of LBMS and other code should not
> touch it directly (LBMS will be kept cleared by the BW controller).
> However, in this case I'll just need to adapt the code to replace the
> LBMS clearing with resetting the LBMS count (if this patch is accepted
> before the BW controller); the resetting is already there anyway.
> 
>>> Also, since you've confirmed that this issue is fallout from
>>> a89c82249c37 ("PCI: Work around PCIe link training failures"),
>>> I'm wondering if the logic introduced by that commit can be
>>> changed so that the quirk is applied more narrowly, i.e. *not*
>>> applied to unaffected hardware, such as AMD's hotplug ports.
>>> That would avoid the need to undo the effect of the quirk and
>>> work around the downtraining you're seeing.
>>>
>>> Maciej, any ideas?
>>
>> Yeah, I'm okay with going down that path as well. Any ideas would be helpful
>> here.
> 
> One thing I don't like in the Target Speed quirk is that it leaves the
> Link Speed at the lower value if the quirk fails to bring the link up;
> the quirk could restore the original Link Speed on failure to avoid these
> problems. I even suggested that earlier. The downside of restoring the
> original Link Speed, however, is that it requires triggering yet another
> retraining (perhaps we could avoid waiting for its completion, though,
> since we expect it to fail).
> 
> It might be possible to eventually trigger the Target Speed quirk from the
> BW controller but it would require writing some state machine so that the
> quirk is not repeatedly attempted. It seemed to complicate things too much
> to add such a state machine at this point.
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction
  2024-07-17 21:14                 ` Smita Koralahalli
@ 2024-07-18  7:47                   ` Ilpo Järvinen
  0 siblings, 0 replies; 11+ messages in thread
From: Ilpo Järvinen @ 2024-07-18  7:47 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: Lukas Wunner, Bjorn Helgaas, linux-pci, LKML, Mahesh J Salgaonkar,
	Yazen Ghannam, Bowman Terry, Hagan Billy, Simon Guinot,
	Maciej W . Rozycki, Kuppuswamy Sathyanarayanan

[-- Attachment #1: Type: text/plain, Size: 7832 bytes --]

On Wed, 17 Jul 2024, Smita Koralahalli wrote:

> On 7/9/2024 3:52 AM, Ilpo Järvinen wrote:
> > On Tue, 25 Jun 2024, Smita Koralahalli wrote:
> > 
> > > Sorry for the delay here. Took some time to find a system to run
> > > experiments.
> > > Comments inline.
> > > 
> > > On 6/19/2024 12:47 AM, Lukas Wunner wrote:
> > > > On Tue, Jun 18, 2024 at 02:23:21PM -0700, Smita Koralahalli wrote:
> > > > > On 6/18/2024 11:51 AM, Smita Koralahalli wrote:
> > > > > > > > > But IIUC LBMS is set by hardware but never cleared by
> > > > > > > > > hardware, so if we remove a device and power off the
> > > > > > > > > slot, it doesn't seem like LBMS could be telling us
> > > > > > > > > anything useful (what could we do in response to LBMS
> > > > > > > > > when the slot is empty?), so it makes sense to me to
> > > > > > > > > clear it.
> > > > > > > > >
> > > > > > > > > It seems like pciehp_unconfigure_device() does sort of
> > > > > > > > > PCI core and driver-related things and possibly could be
> > > > > > > > > something shared by all hotplug drivers, while
> > > > > > > > > remove_board() does things more specific to the hotplug
> > > > > > > > > model (pciehp, shpchp, etc).
> > > > > > > > >
> > > > > > > > >   From that perspective, clearing LBMS might fit better
> > > > > > > > > in remove_board(). In that case, I wonder whether it
> > > > > > > > > should be done after turning off slot power? This patch
> > > > > > > > > clears it *before* turning off the power, so I wonder if
> > > > > > > > > hardware could possibly set it again before the poweroff?
> > > > > 
> > > > > While clearing LBMS in remove_board() here:
> > > > > 
> > > > > if (POWER_CTRL(ctrl)) {
> > > > > 	pciehp_power_off_slot(ctrl);
> > > > > +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
> > > > > 				   PCI_EXP_LNKSTA_LBMS);
> > > > > 
> > > > > 	/*
> > > > > 	 * After turning power off, we must wait for at least 1 second
> > > > > 	 * before taking any action that relies on power having been
> > > > > 	 * removed from the slot/adapter.
> > > > > 	 */
> > > > > 	msleep(1000);
> > > > > 
> > > > > 	/* Ignore link or presence changes caused by power off */
> > > > > 	atomic_and(~(PCI_EXP_SLTSTA_DLLSC | PCI_EXP_SLTSTA_PDC),
> > > > > 		   &ctrl->pending_events);
> > > > > }
> > > > > 
> > > > > This can happen too, right? I.e., just after the slot poweroff and
> > > > > before the LBMS clearing, the PDC/PDSC could be fired. Then
> > > > > pciehp_handle_presence_or_link_change() would hit the "OFF_STATE" case
> > > > > and proceed with pciehp_enable_slot() ... pcie_failed_link_retrain(),
> > > > > and ultimately the link speed drops.
> > > > > 
> > > > > So, I added the clearing just before turning off the slot. Let me know
> > > > > if I'm thinking about it right.
> > > 
> > > I guess I should have experimented before putting this comment out.
> > > 
> > > After talking to the HW/FW teams, I understood that none of our CRBs
> > > support a power controller for NVMe devices, which means the "Power
> > > Controller Present" bit in Slot_Cap is always false. That's what makes it
> > > a "surprise removal." If the OS was notified beforehand and there was a
> > > power controller attached, the OS would turn off the power with SLOT_CNTL.
> > > That's an "orderly" removal. So essentially, the entire
> > > "if (POWER_CTRL(ctrl))" block will never be executed for a surprise
> > > removal on our systems.
> > > 
> > > There could be board designs other than ours, with power controllers for
> > > NVMe devices, that I'm not aware of.
> > > > 
> > > > This was added by 3943af9d01e9 ("PCI: pciehp: Ignore Link State Changes
> > > > after powering off a slot").  You can try reproducing it by writing "0"
> > > > to the slot's "power" file in sysfs, but your hardware needs to support
> > > > slot power.
> > > > 
> > > > Basically the idea is that after waiting for 1 sec, chances are very low
> > > > that any DLLSC or PDSC events caused by removing slot power may still
> > > > occur.
> > > 
> > > The PDSC events in our case aren't caused by removing slot power. They
> > > should/will always happen on a surprise removal along with DLLSC for us.
> > > But the PDSC has been delayed and happens after DLLSC is handled and
> > > ctrl->state = OFF_STATE is set in pciehp_disable_slot(). So the PDSC is
> > > mistaken as a request to enable the slot in pciehp_enable_slot() inside
> > > pciehp_handle_presence_or_link_change().
> > > > 
> > > > Arguably the same applies to LBMS changes, so I'd recommend to likewise
> > > > clear stale LBMS after the msleep(1000).
> > > > 
> > > > pciehp_ctrl.c only contains the state machine and higher-level logic of
> > > > the hotplug controller and all the actual register accesses are in
> > > > helpers in pciehp_hpc.c.  So if you want to do it picture-perfectly,
> > > > add a helper in pciehp_hpc.c to clear LBMS and call that from
> > > > remove_board().
> > > > 
> > > > That all being said, I'm wondering how this plays together with Ilpo's
> > > > bandwidth control driver?
> > > > 
> > > > https://lore.kernel.org/all/20240516093222.1684-1-ilpo.jarvinen@linux.intel.com/
> > > 
> > > I have yet to do a thorough reading of Ilpo's bandwidth control driver.
> > > Ilpo, please correct me if I misspeak something, as I don't have a
> > > thorough understanding.
> > > 
> > > Ilpo's bandwidth controller also checks for the LBMS count being greater
> > > than zero to bring down link speeds if CONFIG_PCIE_BWCTRL is true. If
> > > false, it follows the default path of checking the LBMS bit in the Link
> > > Status register. So if CONFIG_PCIE_BWCTRL is disabled by default, we
> > > continue to see link speed drops. Even if BWCTRL is enabled, the LBMS
> > > count is incremented to 1 in pcie_bwnotif_enable(), so pcie_lbms_seen()
> > > might return true, thereby bringing down speeds here as well if DLLLA is
> > > clear?
> > 
> > I did add code to clear the LBMS count in pciehp_unconfigure_device() in
> > part thanks to this patch of yours. Do you think it wouldn't work?
> 
> It works with BWCTRL enabled. My only concern is whether to keep the
> clearing in pciehp_unconfigure_device() rather than doing it inside the
> POWER_CTRL(ctrl) block in remove_board(), as per the suggestions above.

I added that LBMS count reset based on one of the first versions of your
patch, which had the LBMS clearing in pciehp_unconfigure_device(), before
this discussion went further. The point anyway is that the bwctrl change
would replace the line you put in to clear LBMS, wherever that line ends
up being placed.

> > But I agree there would still be a problem if BWCTRL is not enabled. I
> > already have to keep part of it enabled due to the Target Speed quirk,
> > and now this is another case where just having it always on would be
> > beneficial.
> 
> Correct, it should always be on to avoid the problem.
>
> I would like to have this "LBMS clearing fix" accepted sooner, since it's
> breaking things on our systems! :)

Sure, I've no objection to that. In fact, it's what I prefer myself too 
because it makes things easier for stable folks. :-)

I'd also like to see those Target Speed quirk fixes go in before
bwctrl (to the extent that I based my series on top of them), but it seems
Maciej might not be updating them, so perhaps I'll have to take a look at
Bjorn's comments on that series and see what can be done with those patches.

-- 
 i.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-07-18  7:47 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-16 20:47 [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction Smita Koralahalli
2024-06-17 20:09 ` Bjorn Helgaas
2024-06-17 22:51   ` Smita Koralahalli
2024-06-17 23:18     ` Bjorn Helgaas
2024-06-18 18:51       ` Smita Koralahalli
2024-06-18 21:23         ` Smita Koralahalli
2024-06-19  7:47           ` Lukas Wunner
2024-06-25 20:20             ` Smita Koralahalli
2024-07-09 10:52               ` Ilpo Järvinen
2024-07-17 21:14                 ` Smita Koralahalli
2024-07-18  7:47                   ` Ilpo Järvinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).