public inbox for linux-kernel@vger.kernel.org
* Race b/w pciehp and FirmwareFirst DPC->EDR flow - Reg
@ 2023-11-10 13:00 Vidya Sagar
  2023-11-10 17:01 ` Vidya Sagar
  0 siblings, 1 reply; 5+ messages in thread
From: Vidya Sagar @ 2023-11-10 13:00 UTC (permalink / raw)
  To: Bjorn Helgaas, Lorenzo Pieralisi, Lukas Wunner,
	Sathyanarayanan Kuppuswamy, kbusch@kernel.org
  Cc: Vikram Sethi, Krishna Thota, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org

Hi folks,
Here are the platform details.
- System with a Firmware First approach for both AER and DPC
- CPERs are sent to the OS for the AER events
- DPC is notified to the OS through EDR mechanism
- System doesn't support in-band Presence Detect (PD) and supports only out-of-band (OOB) PD, where writing to a private register sets the PD state
- The hardware is designed such that the PD state gets cleared whenever the link goes down (even without a write to the private register)

In this system, when the root port experiences DPC for any valid reason (say, a software-triggered DPC for now), I'm observing the following flow, which causes a race.
1. Since DPC brings the link down, a DLLSC event occurs and an MSI is sent to the OS, so the PCIe hotplug (pciehp) ISR is invoked.
2. The ISR then reads the Slot Status register to see which events caused the MSI.
3. It sees both the DLLSC and PDC bits set.
4. PDC is set because of the aforementioned hardware design: every link down clears the PD state, and since that counts as a change in the PD state, PDC gets set as well.
5. The pciehp ISR hands off to the pciehp IST (interrupt thread/bottom half), which waits for DPC/EDR recovery to complete (because DLLSC is one of the events).
6. In parallel with the pciehp ISR/IST, a DPC interrupt is raised to the firmware, which then sends an EDR event to the OS. Before doing so, the firmware also sets the PD state to '1', as the endpoint device is still present.
7. The firmware's write to the private register to set the PD state raises a second interrupt, and the pciehp ISR examines the events again; this time only PDC is set, not DLLSC. The ISR wakes the IST, but since an IST run is already in progress, nothing much happens at this point.
8. Once DPC completes and the link comes back up, the DPC framework asks the endpoint drivers to handle it by calling the 'pci_error_handlers' callbacks registered by the endpoint device drivers.
9. The pciehp IST (for the very first MSI) resumes after receiving completion from the DPC framework indicating that DPC recovery was successful.
10. Since the PDC (Presence Detect Changed) bit was also set for the first interrupt, the IST attempts to remove the devices (as part of pciehp_handle_presence_or_link_change())

At this point, there is a race between the device driver that is trying to work with the device (through its pci_error_handlers callbacks) and the IST that is trying to remove the device.
To be fair to pciehp_handle_presence_or_link_change(), after removing the devices it checks whether the link is up / PD is '1' and rescans if the device is still present. But unfortunately, the IST deadlocks (with the device driver) while removing the devices and never gets to that next step.

The rationale, given as a comment in pciehp_handle_presence_or_link_change(), for removing the devices first (without checking PD/link-up) and adding them back only after checking link-up/PD is that, since the PD state changed, the new link-up need not be with the same old device; it could be with a different device. That justifies removing the existing devices and then scanning for new ones, but it causes a deadlock with the already-initiated DPC recovery process.

I see two ways to avoid the deadlock in this scenario.
1. The pciehp IST looks at both DLLSC and PDC, so why should the DPC->EDR flow look only at DLLSC and not at 'PD Change'? If the DPC->EDR flow also checked 'PD Change' (along with DLLSC) and declared that DPC recovery can't happen (as the device could have changed), returning a DPC recovery failure, then the device driver's pci_error_handlers callbacks would never be called and there would be no deadlock.
2. Look into a common lock so that the pciehp IST and the device driver's pci_error_handlers callbacks don't race.

Let me know your comments on this.

Thanks,
Vidya Sagar

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Race b/w pciehp and FirmwareFirst DPC->EDR flow - Reg
  2023-11-10 13:00 Race b/w pciehp and FirmwareFirst DPC->EDR flow - Reg Vidya Sagar
@ 2023-11-10 17:01 ` Vidya Sagar
  2023-11-27 14:43   ` Vidya Sagar
  2023-11-27 15:26   ` Lukas Wunner
  0 siblings, 2 replies; 5+ messages in thread
From: Vidya Sagar @ 2023-11-10 17:01 UTC (permalink / raw)
  To: Bjorn Helgaas, Lorenzo Pieralisi, Lukas Wunner,
	Sathyanarayanan Kuppuswamy, kbusch@kernel.org
  Cc: Vikram Sethi, Krishna Thota, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, sagar.tv

There was a typo. Corrected it.

s/If DPC->EDR flow checks DPC also/If DPC->EDR flow checks 'PD Change' also

On 11/10/2023 6:30 PM, Vidya Sagar wrote:
> Hi folks,
> Here are the platform details.
> - System with a Firmware First approach for both AER and DPC
> - CPERs are sent to the OS for the AER events
> - DPC is notified to the OS through EDR mechanism
> - System doesn't have support for in-band PD and supports only OOB PD where writing to a private register would set the PD state
> - System has this design where PD state gets cleared whenever there is a link down (without even writing to the private register)
> 
> In this system, if the root port has to experience a DPC because of any right reasons (say SW trigger of DPC for time being), I'm observing the following flow which is causing a race.
> 1. Since DPC brings the link down, there is a DLLSC and an MSI is sent to the OS hence PCIe HP ISR is invoked.
> 2. ISR then takes stock of the Slot_Status register to see what all events caused the MSI generation.
> 3. It sees both DLLSC and PDC bits getting set.
> 4. PDC is set because of the aforementioned hardware design where for every link down, PD state gets cleared and since this is considered as a change in the PD state, PDC also gets set.
> 5. PCIe HP ISR flow transitions to the PCIe HP IST (interrupt thread/bottom half) and waits for the DPC_EDR to complete (because DLLSC is one of the events)
> 6. Parallel to the PCIe HP ISR/IST, DPC interrupt is raised to the firmware and that would then send an EDR event to the OS. Firmware also sets the PD state to '1' before that, as the endpoint device is still available.
> 7. Firmware programming of the private register to set the PD state raises second interrupt and PCIe HP ISR takes stock of the events and this time it is only the PDC and not DLLSC. ISR sends a wake to the IST, but since there is an IST in progress already, nothing much happens at this point.
> 8. Once the DPC is completed and link comes back up again, DPC framework asks the endpoint drivers to handle it by calling the 'pci_error_handlers' callbacks registered by the endpoint device drivers.
> 9. The PCIe HP IST (of the very first MSI) resumes after receiving the completion from the DPC framework saying that DPC recovery is successful.
> 10. Since PDC (Presence Detect Change) bit is also set for the first interrupt, IST attempts to remove the devices (as part of pciehp_handle_presence_or_link_change())
> 
> At this point, there is a race between the device driver that is trying to work with the device (through pci_error_handlers callback) and the IST that is trying to remove the device.
> To be fair to pciehp_handle_presence_or_link_change(), after removing the devices, it checks for the link-up/PD being '1' and scans the devices again if the device is still available. But unfortunately, IST is deadlocked (with the device driver) while removing the devices itself and won't go to the next step.
> 
> The rationale given in the form of a comment in pciehp_handle_presence_or_link_change() for removing the devices first (without checking PD/link-up) and adding them back after checking link-up/PD is that, since there is a change in PD, the new link-up need not be with the same old device rather it could be with a different device. So, it justifies in removing the existing devices and then scanning for new ones. But this is causing deadlock with the already initiated DPC recovery process.
> 
> I see two ways to avoid the deadlock in this scenario.
> 1. When PCIe HP IST is looking at both DLLSC and PDC, why should DPC->EDR flow look at only DLLSC and not DPC? If DPC->EDR flow checks 'PD Change' also (along with DLL) and declares that the DPC recovery can't happen (as there could be a change of device) hence returning DPC recovery failure, then, the device driver's pci_error_handlers callbacks won't be called and there won't be any deadlock.
> 2. Check for the possibility of a common lock so that PCIe HP IST and device driver's pci_error_handlers callbacks don't race.
> 
> Let me know your comments on this.
> 
> Thanks,
> Vidya Sagar


* Re: Race b/w pciehp and FirmwareFirst DPC->EDR flow - Reg
  2023-11-10 17:01 ` Vidya Sagar
@ 2023-11-27 14:43   ` Vidya Sagar
  2023-11-27 17:05     ` Kuppuswamy Sathyanarayanan
  2023-11-27 15:26   ` Lukas Wunner
  1 sibling, 1 reply; 5+ messages in thread
From: Vidya Sagar @ 2023-11-27 14:43 UTC (permalink / raw)
  To: Bjorn Helgaas, Lorenzo Pieralisi, Lukas Wunner,
	Sathyanarayanan Kuppuswamy, kbusch@kernel.org
  Cc: Vikram Sethi, Krishna Thota, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, sagar.tv

Hi Bjorn and others,
Could you please share your thoughts on this?

Thanks,
Vidya Sagar

On 11/10/2023 10:31 PM, Vidya Sagar wrote:
> There was a typo. Corrected it.
> 
> s/If DPC->EDR flow checks DPC also/If DPC->EDR flow checks 'PD Change' also
> 
> On 11/10/2023 6:30 PM, Vidya Sagar wrote:
>> Hi folks,
>> Here are the platform details.
>> - System with a Firmware First approach for both AER and DPC
>> - CPERs are sent to the OS for the AER events
>> - DPC is notified to the OS through EDR mechanism
>> - System doesn't have support for in-band PD and supports only OOB PD 
>> where writing to a private register would set the PD state
>> - System has this design where PD state gets cleared whenever there is 
>> a link down (without even writing to the private register)
>>
>> In this system, if the root port has to experience a DPC because of 
>> any right reasons (say SW trigger of DPC for time being), I'm 
>> observing the following flow which is causing a race.
>> 1. Since DPC brings the link down, there is a DLLSC and an MSI is sent 
>> to the OS hence PCIe HP ISR is invoked.
>> 2. ISR then takes stock of the Slot_Status register to see what all 
>> events caused the MSI generation.
>> 3. It sees both DLLSC and PDC bits getting set.
>> 4. PDC is set because of the aforementioned hardware design where for 
>> every link down, PD state gets cleared and since this is considered as 
>> a change in the PD state, PDC also gets set.
>> 5. PCIe HP ISR flow transitions to the PCIe HP IST (interrupt 
>> thread/bottom half) and waits for the DPC_EDR to complete (because 
>> DLLSC is one of the events)
>> 6. Parallel to the PCIe HP ISR/IST, DPC interrupt is raised to the 
>> firmware and that would then send an EDR event to the OS. Firmware 
>> also sets the PD state to '1' before that, as the endpoint device is 
>> still available.
>> 7. Firmware programming of the private register to set the PD state 
>> raises second interrupt and PCIe HP ISR takes stock of the events and 
>> this time it is only the PDC and not DLLSC. ISR sends a wake to the 
>> IST, but since there is an IST in progress already, nothing much 
>> happens at this point.
>> 8. Once the DPC is completed and link comes back up again, DPC 
>> framework asks the endpoint drivers to handle it by calling the 
>> 'pci_error_handlers' callbacks registered by the endpoint device 
>> drivers.
>> 9. The PCIe HP IST (of the very first MSI) resumes after receiving the 
>> completion from the DPC framework saying that DPC recovery is successful.
>> 10. Since PDC (Presence Detect Change) bit is also set for the first 
>> interrupt, IST attempts to remove the devices (as part of 
>> pciehp_handle_presence_or_link_change())
>>
>> At this point, there is a race between the device driver that is 
>> trying to work with the device (through pci_error_handlers callback) 
>> and the IST that is trying to remove the device.
>> To be fair to pciehp_handle_presence_or_link_change(), after removing 
>> the devices, it checks for the link-up/PD being '1' and scans the 
>> devices again if the device is still available. But unfortunately, IST 
>> is deadlocked (with the device driver) while removing the devices 
>> itself and won't go to the next step.
>>
>> The rationale given in the form of a comment in 
>> pciehp_handle_presence_or_link_change() for removing the devices first 
>> (without checking PD/link-up) and adding them back after checking 
>> link-up/PD is that, since there is a change in PD, the new link-up 
>> need not be with the same old device rather it could be with a 
>> different device. So, it justifies in removing the existing devices 
>> and then scanning for new ones. But this is causing deadlock with the 
>> already initiated DPC recovery process.
>>
>> I see two ways to avoid the deadlock in this scenario.
>> 1. When PCIe HP IST is looking at both DLLSC and PDC, why should 
>> DPC->EDR flow look at only DLLSC and not DPC? If DPC->EDR flow checks 
>> 'PD Change' also (along with DLL) and declares that the DPC recovery 
>> can't happen (as there could be a change of device) hence returning 
>> DPC recovery failure, then, the device driver's pci_error_handlers 
>> callbacks won't be called and there won't be any deadlock.
>> 2. Check for the possibility of a common lock so that PCIe HP IST and 
>> device driver's pci_error_handlers callbacks don't race.
>>
>> Let me know your comments on this.
>>
>> Thanks,
>> Vidya Sagar


* Re: Race b/w pciehp and FirmwareFirst DPC->EDR flow - Reg
  2023-11-10 17:01 ` Vidya Sagar
  2023-11-27 14:43   ` Vidya Sagar
@ 2023-11-27 15:26   ` Lukas Wunner
  1 sibling, 0 replies; 5+ messages in thread
From: Lukas Wunner @ 2023-11-27 15:26 UTC (permalink / raw)
  To: Vidya Sagar
  Cc: Bjorn Helgaas, Lorenzo Pieralisi, Sathyanarayanan Kuppuswamy,
	kbusch@kernel.org, Vikram Sethi, Krishna Thota,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, sagar.tv

Hi Vidya,

sorry for the delay, still catching up on e-mails after Plumbers...

On Fri, Nov 10, 2023 at 10:31:55PM +0530, Vidya Sagar wrote:
> > - System doesn't have support for in-band PD and supports only OOB PD
> >   where writing to a private register would set the PD state

We already have an inband_presence_disabled flag in struct controller
which is set if the In-Band PD Disable Supported bit in the Slot
Capabilities 2 Register is set.  The flag may also be set through the
inband_presence_disabled_dmi_table[].

Currently the only place where the flag makes a difference is on
slot bringup:  pciehp_check_link_status() doesn't wait for the
Presence Detect Status bit to become set.

I'm wondering if we need to also disregard PDC events if In-Band PD
is disabled.  Not sure if the behavior you're seeing is caused by a
quirk of the hardware or is expected if In-Band PD is disabled.
Probably the former.  A code change would generally only be acceptable
in the latter case, though, I think.


> > 10. Since PDC (Presence Detect Change) bit is also set for the first
> >     interrupt, IST attempts to remove the devices (as part of
> >     pciehp_handle_presence_or_link_change())
> > 
> > At this point, there is a race between the device driver that is
> > trying to work with the device (through pci_error_handlers callback)
> > and the IST that is trying to remove the device.
> > To be fair to pciehp_handle_presence_or_link_change(), after removing
> > the devices, it checks for the link-up/PD being '1' and scans the
> > devices again if the device is still available. But unfortunately,
> > IST is deadlocked (with the device driver) while removing the devices
> > itself and won't go to the next step.

Could you provide stacktraces of the two deadlocked tasks?
Right now I don't quite understand why they're deadlocked.

Are you getting hung task messages in dmesg?
They should include stacktraces.

Also, which kernel version are we talking about?

Thanks,

Lukas


* Re: Race b/w pciehp and FirmwareFirst DPC->EDR flow - Reg
  2023-11-27 14:43   ` Vidya Sagar
@ 2023-11-27 17:05     ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 5+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2023-11-27 17:05 UTC (permalink / raw)
  To: Vidya Sagar, Bjorn Helgaas, Lorenzo Pieralisi, Lukas Wunner,
	kbusch@kernel.org
  Cc: Vikram Sethi, Krishna Thota, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, sagar.tv



On 11/27/2023 6:43 AM, Vidya Sagar wrote:
> Hi Bjorn and others,
> Could you please share your thoughts on this?
> 
> Thanks,
> Vidya Sagar
> 
> On 11/10/2023 10:31 PM, Vidya Sagar wrote:
>> There was a typo. Corrected it.
>>
>> s/If DPC->EDR flow checks DPC also/If DPC->EDR flow checks 'PD Change' also
>>
>> On 11/10/2023 6:30 PM, Vidya Sagar wrote:
>>> Hi folks,
>>> Here are the platform details.
>>> - System with a Firmware First approach for both AER and DPC
>>> - CPERs are sent to the OS for the AER events
>>> - DPC is notified to the OS through EDR mechanism
>>> - System doesn't have support for in-band PD and supports only OOB PD where writing to a private register would set the PD state
>>> - System has this design where PD state gets cleared whenever there is a link down (without even writing to the private register)
>>>

Is it normal for the PD state to be cleared when the link goes down? Is this a bug in the hardware?

>>> In this system, if the root port has to experience a DPC because of any right reasons (say SW trigger of DPC for time being), I'm observing the following flow which is causing a race.
>>> 1. Since DPC brings the link down, there is a DLLSC and an MSI is sent to the OS hence PCIe HP ISR is invoked.
>>> 2. ISR then takes stock of the Slot_Status register to see what all events caused the MSI generation.
>>> 3. It sees both DLLSC and PDC bits getting set.
>>> 4. PDC is set because of the aforementioned hardware design where for every link down, PD state gets cleared and since this is considered as a change in the PD state, PDC also gets set.
>>> 5. PCIe HP ISR flow transitions to the PCIe HP IST (interrupt thread/bottom half) and waits for the DPC_EDR to complete (because DLLSC is one of the events)
>>> 6. Parallel to the PCIe HP ISR/IST, DPC interrupt is raised to the firmware and that would then send an EDR event to the OS. Firmware also sets the PD state to '1' before that, as the endpoint device is still available.
>>> 7. Firmware programming of the private register to set the PD state raises second interrupt and PCIe HP ISR takes stock of the events and this time it is only the PDC and not DLLSC. ISR sends a wake to the IST, but since there is an IST in progress already, nothing much happens at this point.
>>> 8. Once the DPC is completed and link comes back up again, DPC framework asks the endpoint drivers to handle it by calling the 'pci_error_handlers' callbacks registered by the endpoint device drivers.
>>> 9. The PCIe HP IST (of the very first MSI) resumes after receiving the completion from the DPC framework saying that DPC recovery is successful.
>>> 10. Since PDC (Presence Detect Change) bit is also set for the first interrupt, IST attempts to remove the devices (as part of pciehp_handle_presence_or_link_change())
>>>
>>> At this point, there is a race between the device driver that is trying to work with the device (through pci_error_handlers callback) and the IST that is trying to remove the device.

IIUC, this step is after DPC recovery, which means the device is up and error_handlers are not used anymore, right? Why do you say there is a race between IST and error handler?

>>> To be fair to pciehp_handle_presence_or_link_change(), after removing the devices, it checks for the link-up/PD being '1' and scans the devices again if the device is still available. But unfortunately, IST is deadlocked (with the device driver) while removing the devices itself and won't go to the next step.
>>>
>>> The rationale given in the form of a comment in pciehp_handle_presence_or_link_change() for removing the devices first (without checking PD/link-up) and adding them back after checking link-up/PD is that, since there is a change in PD, the new link-up need not be with the same old device rather it could be with a different device. So, it justifies in removing the existing devices and then scanning for new ones. But this is causing deadlock with the already initiated DPC recovery process.
>>>
>>> I see two ways to avoid the deadlock in this scenario.
>>> 1. When PCIe HP IST is looking at both DLLSC and PDC, why should DPC->EDR flow look at only DLLSC and not DPC? If DPC->EDR flow checks 'PD Change' also (along with DLL) and declares that the DPC recovery can't happen (as there could be a change of device) hence returning DPC recovery failure, then, the device driver's pci_error_handlers callbacks won't be called and there won't be any deadlock.
>>> 2. Check for the possibility of a common lock so that PCIe HP IST and device driver's pci_error_handlers callbacks don't race.
>>>
>>> Let me know your comments on this.
>>>
>>> Thanks,
>>> Vidya Sagar

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

