* Question: Clearing error bits in the root port post enumeration @ 2023-10-31 12:26 Vidya Sagar 2023-11-03 18:20 ` Bjorn Helgaas 0 siblings, 1 reply; 5+ messages in thread From: Vidya Sagar @ 2023-10-31 12:26 UTC (permalink / raw) To: Bjorn Helgaas, Lorenzo Pieralisi Cc: Vikram Sethi, Thierry Reding, Jonathan Hunter, Krishna Thota, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org Hi folks, I would like to know your comments on the following scenario where we are observing the root port logging errors because of the enumeration flow being followed. DUT information: - Has a root port and an endpoint connected to it - Uses ECAM mechanism to access the configuration space - Booted through ACPI flow - Has a Firmware-First approach for handling the errors - System is configured to treat Unsupported Requests as AdvisoryNon-Fatal errors As we all know, when a configuration read request comes in for a device number that is not implemented, a UR would be returned as per the PCIe spec. As part of the enumeration flow on DUT, when the kernel reads offset 0x0 of B:D:F=0:0:0, the root port responds with its valid Vendor-ID and Device-ID values. But, when B:D:F=0:1:0 is probed, since there is no device present there, the root port responds with an Unsupported Request and simultaneously logs the same in the Device Status register (i.e. bit-3). Because of it, there is a UR logged in the Device Status register of the RP by the time enumeration is complete. In the case of AER capability natively owned by the kernel, the AER driver's init call would clear all such pending bits. Since we are going with the Firmware-First approach, and the system is configured to treat Unsupported Requests as AdvisoryNon-Fatal errors, only a correctable error interrupt can be raised to the Firmware which takes care of clearing the corresponding status registers. The firmware can't know about the UnsupReq bit being set as the interrupt it received is for a correctable error hence it clears only bits related to correctable error. All these events leave a freshly booted system with the following bits set. Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR- (MAbort) DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- (UnsupReq) UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol- (UnsupReq) Since the reason for UR is well understood at this point, I would like to weigh in on the idea of clearing the aforementioned bits in the root port once the enumeration is done particularly to cater to the configurations where Firmware-First approach is in place. Please let me know your comments on this approach. Thanks, Vidya Sagar ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Question: Clearing error bits in the root port post enumeration 2023-10-31 12:26 Question: Clearing error bits in the root port post enumeration Vidya Sagar @ 2023-11-03 18:20 ` Bjorn Helgaas 2023-11-07 3:14 ` Vidya Sagar 0 siblings, 1 reply; 5+ messages in thread From: Bjorn Helgaas @ 2023-11-03 18:20 UTC (permalink / raw) To: Vidya Sagar Cc: Lorenzo Pieralisi, Vikram Sethi, Thierry Reding, Jonathan Hunter, Krishna Thota, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Oct 31, 2023 at 12:26:31PM +0000, Vidya Sagar wrote: > Hi folks, > > I would like to know your comments on the following scenario where > we are observing the root port logging errors because of the > enumeration flow being followed. > > DUT information: > - Has a root port and an endpoint connected to it > - Uses ECAM mechanism to access the configuration space > - Booted through ACPI flow > - Has a Firmware-First approach for handling the errors > - System is configured to treat Unsupported Requests as > AdvisoryNon-Fatal errors > > As we all know, when a configuration read request comes in for a > device number that is not implemented, a UR would be returned as per > the PCIe spec. > > As part of the enumeration flow on DUT, when the kernel reads offset > 0x0 of B:D:F=0:0:0, the root port responds with its valid Vendor-ID > and Device-ID values. But, when B:D:F=0:1:0 is probed, since there > is no device present there, the root port responds with an > Unsupported Request and simultaneously logs the same in the Device > Status register (i.e. bit-3). Because of it, there is a UR logged > in the Device Status register of the RP by the time enumeration is > complete. > > In the case of AER capability natively owned by the kernel, the AER > driver's init call would clear all such pending bits. > > Since we are going with the Firmware-First approach, and the system > is configured to treat Unsupported Requests as AdvisoryNon-Fatal > errors, only a correctable error interrupt can be raised to the > Firmware which takes care of clearing the corresponding status > registers. The firmware can't know about the UnsupReq bit being set > as the interrupt it received is for a correctable error hence it > clears only bits related to correctable error. > > All these events leave a freshly booted system with the following > bits set. > > Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR- (MAbort) > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- (UnsupReq) > UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol- (UnsupReq) > > Since the reason for UR is well understood at this point, I would > like to weigh in on the idea of clearing the aforementioned bits in > the root port once the enumeration is done particularly to cater to > the configurations where Firmware-First approach is in place. > Please let me know your comments on this approach. I think Secondary status (PCI_SEC_STATUS) is always owned by the OS and is not affected by _OSC negotiation, right? Linux does basically nothing with that today, but I think it *could* clear the "Received Master Abort" bit. I'm not very familiar with Advisory Non-Fatal errors. I'm curious about the UESta situation: why can't firmware know about UnsupReq being set? I assume PCI_ERR_COR_ADV_NFAT is the Correctable Error Status bit the firmware *does* see and clear. But isn't the whole point of Advisory Non-Fatal errors that an error that is logged as an Uncorrectable Error and that normally would be signaled with ERR_NONFATAL is signaled with ERR_COR instead? So doesn't PCI_ERR_COR_ADV_NFAT being set imply that some PCI_ERR_UNCOR_STATUS must be set as well? If so, I would think firmware *could* figure that out and clear the PCI_ERR_UNCOR_STATUS bit. Bjorn ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Question: Clearing error bits in the root port post enumeration 2023-11-03 18:20 ` Bjorn Helgaas @ 2023-11-07 3:14 ` Vidya Sagar 2023-11-07 15:29 ` Bjorn Helgaas 0 siblings, 1 reply; 5+ messages in thread From: Vidya Sagar @ 2023-11-07 3:14 UTC (permalink / raw) To: Bjorn Helgaas Cc: Lorenzo Pieralisi, Vikram Sethi, Thierry Reding, Jonathan Hunter, Krishna Thota, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org On 11/3/2023 11:50 PM, Bjorn Helgaas wrote: > External email: Use caution opening links or attachments > > > On Tue, Oct 31, 2023 at 12:26:31PM +0000, Vidya Sagar wrote: >> Hi folks, >> >> I would like to know your comments on the following scenario where >> we are observing the root port logging errors because of the >> enumeration flow being followed. >> >> DUT information: >> - Has a root port and an endpoint connected to it >> - Uses ECAM mechanism to access the configuration space >> - Booted through ACPI flow >> - Has a Firmware-First approach for handling the errors >> - System is configured to treat Unsupported Requests as >> AdvisoryNon-Fatal errors >> >> As we all know, when a configuration read request comes in for a >> device number that is not implemented, a UR would be returned as per >> the PCIe spec. >> >> As part of the enumeration flow on DUT, when the kernel reads offset >> 0x0 of B:D:F=0:0:0, the root port responds with its valid Vendor-ID >> and Device-ID values. But, when B:D:F=0:1:0 is probed, since there >> is no device present there, the root port responds with an >> Unsupported Request and simultaneously logs the same in the Device >> Status register (i.e. bit-3). Because of it, there is a UR logged >> in the Device Status register of the RP by the time enumeration is >> complete. >> >> In the case of AER capability natively owned by the kernel, the AER >> driver's init call would clear all such pending bits. >> >> Since we are going with the Firmware-First approach, and the system >> is configured to treat Unsupported Requests as AdvisoryNon-Fatal >> errors, only a correctable error interrupt can be raised to the >> Firmware which takes care of clearing the corresponding status >> registers. The firmware can't know about the UnsupReq bit being set >> as the interrupt it received is for a correctable error hence it >> clears only bits related to correctable error. >> >> All these events leave a freshly booted system with the following >> bits set. >> >> Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR- (MAbort) >> DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- (UnsupReq) >> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol- (UnsupReq) >> >> Since the reason for UR is well understood at this point, I would >> like to weigh in on the idea of clearing the aforementioned bits in >> the root port once the enumeration is done particularly to cater to >> the configurations where Firmware-First approach is in place. >> Please let me know your comments on this approach. > > I think Secondary status (PCI_SEC_STATUS) is always owned by the OS > and is not affected by _OSC negotiation, right? Linux does basically > nothing with that today, but I think it *could* clear the "Received > Master Abort" bit. Yes. PCI_SEC_STATUS is always owned by the OS and _OSC negotiation doesn't really affect that. > > I'm not very familiar with Advisory Non-Fatal errors. I'm curious > about the UESta situation: why can't firmware know about UnsupReq > being set? I assume PCI_ERR_COR_ADV_NFAT is the Correctable Error > Status bit the firmware *does* see and clear. Yes, PCI_ERR_COR_ADV_NFAT is indeed cleared by the firmware. > > But isn't the whole point of Advisory Non-Fatal errors that an error > that is logged as an Uncorrectable Error and that normally would be > signaled with ERR_NONFATAL is signaled with ERR_COR instead? So > doesn't PCI_ERR_COR_ADV_NFAT being set imply that some > PCI_ERR_UNCOR_STATUS must be set as well? If so, I would think > firmware *could* figure that out and clear the PCI_ERR_UNCOR_STATUS > bit. So, are you suggesting that let the firmware only clear the PCI_ERR_UNCOR_STATUS also? if so, then, I can even make the firmware clear the PCI_SEC_STATUS also thereby leaving the firmware responsible for clearing all the error bits. Does that sound ok? > > Bjorn ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Question: Clearing error bits in the root port post enumeration 2023-11-07 3:14 ` Vidya Sagar @ 2023-11-07 15:29 ` Bjorn Helgaas 2023-11-08 7:17 ` Vidya Sagar 0 siblings, 1 reply; 5+ messages in thread From: Bjorn Helgaas @ 2023-11-07 15:29 UTC (permalink / raw) To: Vidya Sagar Cc: Lorenzo Pieralisi, Vikram Sethi, Thierry Reding, Jonathan Hunter, Krishna Thota, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Nov 07, 2023 at 08:44:53AM +0530, Vidya Sagar wrote: > On 11/3/2023 11:50 PM, Bjorn Helgaas wrote: > > On Tue, Oct 31, 2023 at 12:26:31PM +0000, Vidya Sagar wrote: > > > Hi folks, > > > > > > I would like to know your comments on the following scenario where > > > we are observing the root port logging errors because of the > > > enumeration flow being followed. > > > > > > DUT information: > > > - Has a root port and an endpoint connected to it > > > - Uses ECAM mechanism to access the configuration space > > > - Booted through ACPI flow > > > - Has a Firmware-First approach for handling the errors > > > - System is configured to treat Unsupported Requests as > > > AdvisoryNon-Fatal errors > > > > > > As we all know, when a configuration read request comes in for a > > > device number that is not implemented, a UR would be returned as per > > > the PCIe spec. > > > > > > As part of the enumeration flow on DUT, when the kernel reads offset > > > 0x0 of B:D:F=0:0:0, the root port responds with its valid Vendor-ID > > > and Device-ID values. But, when B:D:F=0:1:0 is probed, since there > > > is no device present there, the root port responds with an > > > Unsupported Request and simultaneously logs the same in the Device > > > Status register (i.e. bit-3). Because of it, there is a UR logged > > > in the Device Status register of the RP by the time enumeration is > > > complete. > > > > > > In the case of AER capability natively owned by the kernel, the AER > > > driver's init call would clear all such pending bits. > > > > > > Since we are going with the Firmware-First approach, and the system > > > is configured to treat Unsupported Requests as AdvisoryNon-Fatal > > > errors, only a correctable error interrupt can be raised to the > > > Firmware which takes care of clearing the corresponding status > > > registers. The firmware can't know about the UnsupReq bit being set > > > as the interrupt it received is for a correctable error hence it > > > clears only bits related to correctable error. > > > > > > All these events leave a freshly booted system with the following > > > bits set. > > > > > > Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR- (MAbort) > > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- (UnsupReq) > > > UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol- (UnsupReq) > > > > > > Since the reason for UR is well understood at this point, I would > > > like to weigh in on the idea of clearing the aforementioned bits in > > > the root port once the enumeration is done particularly to cater to > > > the configurations where Firmware-First approach is in place. > > > Please let me know your comments on this approach. > > > > I think Secondary status (PCI_SEC_STATUS) is always owned by the OS > > and is not affected by _OSC negotiation, right? Linux does basically > > nothing with that today, but I think it *could* clear the "Received > > Master Abort" bit. > > Yes. PCI_SEC_STATUS is always owned by the OS and _OSC negotiation doesn't > really affect that. > > > I'm not very familiar with Advisory Non-Fatal errors. I'm curious > > about the UESta situation: why can't firmware know about UnsupReq > > being set? I assume PCI_ERR_COR_ADV_NFAT is the Correctable Error > > Status bit the firmware *does* see and clear. > > Yes, PCI_ERR_COR_ADV_NFAT is indeed cleared by the firmware. > > > > But isn't the whole point of Advisory Non-Fatal errors that an error > > that is logged as an Uncorrectable Error and that normally would be > > signaled with ERR_NONFATAL is signaled with ERR_COR instead? So > > doesn't PCI_ERR_COR_ADV_NFAT being set imply that some > > PCI_ERR_UNCOR_STATUS must be set as well? If so, I would think > > firmware *could* figure that out and clear the PCI_ERR_UNCOR_STATUS > > bit. > > So, are you suggesting that let the firmware only clear the > PCI_ERR_UNCOR_STATUS also? In this firmware-first scenario, I'm assuming the platform retained ownership of the AER capability, so I would think firmware certainly should be allowed to clear PCI_ERR_UNCOR_STATUS. > if so, then, I can even make the firmware clear the PCI_SEC_STATUS > also thereby leaving the firmware responsible for clearing all the > error bits. Does that sound ok? It doesn't sound quite right to me for firmware to clear PCI_SEC_STATUS because it doesn't own that register. I suspect we would probably see the "Received Master Abort" bit set after enumeration even on Conventional PCI systems, so I doubt this is anything specific to PCIe or AER, and maybe Linux should clear it after enumerating devices below the bridge. Bjorn ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Question: Clearing error bits in the root port post enumeration 2023-11-07 15:29 ` Bjorn Helgaas @ 2023-11-08 7:17 ` Vidya Sagar 0 siblings, 0 replies; 5+ messages in thread From: Vidya Sagar @ 2023-11-08 7:17 UTC (permalink / raw) To: Bjorn Helgaas Cc: Lorenzo Pieralisi, Vikram Sethi, Thierry Reding, Jonathan Hunter, Krishna Thota, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org On 11/7/2023 8:59 PM, Bjorn Helgaas wrote: > External email: Use caution opening links or attachments > > > On Tue, Nov 07, 2023 at 08:44:53AM +0530, Vidya Sagar wrote: >> On 11/3/2023 11:50 PM, Bjorn Helgaas wrote: >>> On Tue, Oct 31, 2023 at 12:26:31PM +0000, Vidya Sagar wrote: >>>> Hi folks, >>>> >>>> I would like to know your comments on the following scenario where >>>> we are observing the root port logging errors because of the >>>> enumeration flow being followed. >>>> >>>> DUT information: >>>> - Has a root port and an endpoint connected to it >>>> - Uses ECAM mechanism to access the configuration space >>>> - Booted through ACPI flow >>>> - Has a Firmware-First approach for handling the errors >>>> - System is configured to treat Unsupported Requests as >>>> AdvisoryNon-Fatal errors >>>> >>>> As we all know, when a configuration read request comes in for a >>>> device number that is not implemented, a UR would be returned as per >>>> the PCIe spec. >>>> >>>> As part of the enumeration flow on DUT, when the kernel reads offset >>>> 0x0 of B:D:F=0:0:0, the root port responds with its valid Vendor-ID >>>> and Device-ID values. But, when B:D:F=0:1:0 is probed, since there >>>> is no device present there, the root port responds with an >>>> Unsupported Request and simultaneously logs the same in the Device >>>> Status register (i.e. bit-3). Because of it, there is a UR logged >>>> in the Device Status register of the RP by the time enumeration is >>>> complete. >>>> >>>> In the case of AER capability natively owned by the kernel, the AER >>>> driver's init call would clear all such pending bits. >>>> >>>> Since we are going with the Firmware-First approach, and the system >>>> is configured to treat Unsupported Requests as AdvisoryNon-Fatal >>>> errors, only a correctable error interrupt can be raised to the >>>> Firmware which takes care of clearing the corresponding status >>>> registers. The firmware can't know about the UnsupReq bit being set >>>> as the interrupt it received is for a correctable error hence it >>>> clears only bits related to correctable error. >>>> >>>> All these events leave a freshly booted system with the following >>>> bits set. >>>> >>>> Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR- (MAbort) >>>> DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- (UnsupReq) >>>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol- (UnsupReq) >>>> >>>> Since the reason for UR is well understood at this point, I would >>>> like to weigh in on the idea of clearing the aforementioned bits in >>>> the root port once the enumeration is done particularly to cater to >>>> the configurations where Firmware-First approach is in place. >>>> Please let me know your comments on this approach. >>> >>> I think Secondary status (PCI_SEC_STATUS) is always owned by the OS >>> and is not affected by _OSC negotiation, right? Linux does basically >>> nothing with that today, but I think it *could* clear the "Received >>> Master Abort" bit. >> >> Yes. PCI_SEC_STATUS is always owned by the OS and _OSC negotiation doesn't >> really affect that. >> >>> I'm not very familiar with Advisory Non-Fatal errors. I'm curious >>> about the UESta situation: why can't firmware know about UnsupReq >>> being set? I assume PCI_ERR_COR_ADV_NFAT is the Correctable Error >>> Status bit the firmware *does* see and clear. >> >> Yes, PCI_ERR_COR_ADV_NFAT is indeed cleared by the firmware. >>> >>> But isn't the whole point of Advisory Non-Fatal errors that an error >>> that is logged as an Uncorrectable Error and that normally would be >>> signaled with ERR_NONFATAL is signaled with ERR_COR instead? So >>> doesn't PCI_ERR_COR_ADV_NFAT being set imply that some >>> PCI_ERR_UNCOR_STATUS must be set as well? If so, I would think >>> firmware *could* figure that out and clear the PCI_ERR_UNCOR_STATUS >>> bit. >> >> So, are you suggesting that let the firmware only clear the >> PCI_ERR_UNCOR_STATUS also? > > In this firmware-first scenario, I'm assuming the platform retained > ownership of the AER capability, so I would think firmware certainly > should be allowed to clear PCI_ERR_UNCOR_STATUS. > >> if so, then, I can even make the firmware clear the PCI_SEC_STATUS >> also thereby leaving the firmware responsible for clearing all the >> error bits. Does that sound ok? > > It doesn't sound quite right to me for firmware to clear > PCI_SEC_STATUS because it doesn't own that register. I suspect we > would probably see the "Received Master Abort" bit set after > enumeration even on Conventional PCI systems, so I doubt this is > anything specific to PCIe or AER, and maybe Linux should clear it > after enumerating devices below the bridge. Agree. I'll push a patch to clear PCI_SEC_STATUS bit. Thanks for your inputs Bjorn. > > Bjorn ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-11-08 7:17 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-10-31 12:26 Question: Clearing error bits in the root port post enumeration Vidya Sagar 2023-11-03 18:20 ` Bjorn Helgaas 2023-11-07 3:14 ` Vidya Sagar 2023-11-07 15:29 ` Bjorn Helgaas 2023-11-08 7:17 ` Vidya Sagar
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).