sa8540p-ride crash when all PCI buses are disabled

* sa8540p-ride crash when all PCI buses are disabled
@ 2023-08-14 22:36 Radu Rendec
  2023-08-15 10:54 ` Bryan O'Donoghue
  0 siblings, 1 reply; 6+ messages in thread
From: Radu Rendec @ 2023-08-14 22:36 UTC (permalink / raw)
  To: linux-arm-msm

Hello everyone,

I'm consistently getting a system crash followed by a ramdump on
sa8540p-ride (sc8280xp) when icc_sync_state() goes all the way through
(count == providers_count).

Context: all PCIe buses are disabled due to [1]. Previously, due to
some local kernel misconfiguration, icc_sync_state() never really did
anything (because count was always less than providers_count).
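The gating described above can be sketched as a toy Python model. This is not the kernel code: the class and attribute names are made up for illustration, and it assumes the core counts one sync_state call per provider and acts only once that count reaches the number of registered providers.

```python
class ToyIccCore:
    """Toy stand-in for the interconnect core's sync_state bookkeeping."""

    def __init__(self, providers_count):
        self.providers_count = providers_count  # providers registered so far
        self.count = 0                          # sync_state calls seen so far
        self.synced = False

    def sync_state(self):
        """Return True only on the call that completes synchronisation."""
        self.count += 1
        if self.count < self.providers_count:
            return False  # still waiting for other providers; nothing is done
        self.synced = True
        return True


core = ToyIccCore(providers_count=3)
results = [core.sync_state() for _ in range(3)]
print(results)
```

If one provider never probes (for example because its driver is not built), the count never reaches providers_count and the final step is never taken, which would match the earlier misconfigured behaviour.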

I was able to isolate the problem to the qns_pcie_gem_noc icc node.
What happens is that both avg_bw and peak_bw for this node end up as 0
after aggregate_requests() gets called. The request list associated
with the node is empty.

For testing purposes, I modified icc_sync_state() to skip calling
aggregate_requests() and subsequently p->set(n, n) for that particular
node only. With that change in place, the system no longer crashes.
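A minimal sketch of the experiment above, as a toy Python model rather than the actual patch. It assumes sum-of-averages / max-of-peaks aggregation, which is what the framework's standard icc_std_aggregate() helper does; the Qualcomm provider's own aggregate callback may differ in detail. The dictionary layout, resync() and BOOT_BW are hypothetical names for this illustration.

```python
def aggregate(requests):
    """Standard-style aggregation: sum the averages, take the largest peak."""
    avg = sum(r["avg"] for r in requests)
    peak = max((r["peak"] for r in requests), default=0)
    return avg, peak


def resync(nodes, skip=()):
    """Recompute each node's bandwidth from its requests, except skipped nodes."""
    for name, node in nodes.items():
        if name in skip:
            continue  # keep whatever bandwidth the node had before
        node["avg"], node["peak"] = aggregate(node["requests"])


BOOT_BW = 2**31 - 1  # placeholder for "bandwidth left at its boot-time maximum"
nodes = {
    "qns_pcie_gem_noc": {"avg": BOOT_BW, "peak": BOOT_BW, "requests": []},
    "xm_pcie3_0": {"avg": BOOT_BW, "peak": BOOT_BW, "requests": []},
}
resync(nodes, skip={"qns_pcie_gem_noc"})
print(nodes["qns_pcie_gem_noc"]["peak"], nodes["xm_pcie3_0"]["peak"])
```

With an empty request list a node aggregates to 0/0, so only the skipped node keeps its earlier bandwidth.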

Surprisingly, none of the icc nodes that link to qns_pcie_gem_noc (e.g.
xm_pcie3_0, xm_pcie3_1, etc.) has any associated request and so they
all have 0 bandwidth after aggregate_requests() gets called, but that
doesn't seem to be a problem and the system is stable. This makes me
think there is a missing link somewhere, and something doesn't claim
any bandwidth on qns_pcie_gem_noc when it should. And it's probably
none of the xm_pcie3_* nodes, since setting their bandwidth to 0 seems
to be fine.

For what it's worth, when pcie2a is not disabled, xm_pcie3_2a ends up
with avg_bw=0kBps and peak_bw=1970000kBps, which is also reflected in
qns_pcie_gem_noc. Both of these nodes get a request from 1c20000.pcie:

# cat /sys/kernel/debug/interconnect/interconnect_summary

 node                                  tag          avg         peak
--------------------------------------------------------------------
...
xm_pcie3_2a                                           0      1970000
  1c20000.pcie                           0            0      1970000
...
qns_pcie_gem_noc                                      0      1970000
  1c20000.pcie                           0            0      1970000
...
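
To check which nodes have no consumer behind them, the summary can be parsed with a few lines of Python. This is only a sketch: it assumes the layout shown in the excerpt above (node rows start in column 0 with name, avg, peak; consumer rows are indented with name, tag, avg, peak) and that names contain no spaces. parse_summary() and SAMPLE are names invented for this example.

```python
SAMPLE = """\
 node                                  tag          avg         peak
--------------------------------------------------------------------
xm_pcie3_2a                                           0      1970000
  1c20000.pcie                           0            0      1970000
qns_pcie_gem_noc                                      0      1970000
  1c20000.pcie                           0            0      1970000
"""


def parse_summary(text):
    """Map each node name to its bandwidth and the consumers voting on it."""
    nodes = {}
    current = None
    for line in text.splitlines()[2:]:  # skip the two header lines
        fields = line.split()
        if not fields or fields == ["..."]:
            continue  # blank line or elided part of the listing
        if not line.startswith(" "):  # node row: name, avg, peak
            current = fields[0]
            nodes[current] = {
                "avg": int(fields[1]),
                "peak": int(fields[2]),
                "consumers": [],
            }
        elif current is not None:  # consumer row: name, tag, avg, peak
            nodes[current]["consumers"].append(fields[0])
    return nodes


parsed = parse_summary(SAMPLE)
unvoted = [name for name, info in parsed.items() if not info["consumers"]]
print(parsed["qns_pcie_gem_noc"], unvoted)
```

On a live system the text would come from /sys/kernel/debug/interconnect/interconnect_summary instead of SAMPLE.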

Any thoughts or suggestions would be highly appreciated. Thanks!

Best regards,
Radu Rendec

[1] https://lore.kernel.org/linux-arm-msm/pmodcoakbs25z2a7mlo5gpuz63zluh35vbgb5itn6k5aqhjnny@jvphbpvahtse/

* Re: sa8540p-ride crash when all PCI buses are disabled
  2023-08-14 22:36 sa8540p-ride crash when all PCI buses are disabled Radu Rendec
@ 2023-08-15 10:54 ` Bryan O'Donoghue
  2023-08-16 16:25   ` Radu Rendec
  0 siblings, 1 reply; 6+ messages in thread
From: Bryan O'Donoghue @ 2023-08-15 10:54 UTC (permalink / raw)
  To: Radu Rendec, linux-arm-msm

On 14/08/2023 23:36, Radu Rendec wrote:
> Hello everyone,
> 
> I'm consistently getting a system crash followed by a ramdump on
> sa8540p-ride (sc8280xp) when icc_sync_state() goes all the way through
> (count == providers_count).
> 
> Context: all PCIe buses are disabled due to [1]. Previously, due to
> some local kernel misconfiguration, icc_sync_state() never really did
> anything (because count was always less than providers_count).
> 
> I was able to isolate the problem to the qns_pcie_gem_noc icc node.
> What happens is that both avg_bw and peak_bw for this node end up as 0
> after aggregate_requests() gets called. The request list associated
> with the node is empty.

If all PCIe buses are disabled, then of course the bandwidth requests 
should say zero, the clocks should be disabled and any associated 
regulators should be off.

> For testing purposes, I modified icc_sync_state() to skip calling
> aggregate_requests() and subsequently p->set(n, n) for that particular
> node only. With that change in place, the system no longer crashes.

So what's happening is that a bus master in the system - perhaps not the
application processor - is most likely issuing a transaction to a
register that is not clocked/powered.

Have you considered that one of the downstream devices might be causing
a PCIe bus transaction?

If you physically remove devices from the PCIe bus - if you can
physically remove them - does this error still occur?

> Surprisingly, none of the icc nodes that link to qns_pcie_gem_noc (e.g.
> xm_pcie3_0, xm_pcie3_1, etc.) has any associated request and so they
> all have 0 bandwidth after aggregate_requests() gets called, but that
> doesn't seem to be a problem and the system is stable. This makes me
> think there is a missing link somewhere, and something doesn't claim
> any bandwidth on qns_pcie_gem_noc when it should. And it's probably
> none of the xm_pcie3_* nodes, since setting their bandwidth to 0 seems
> to be fine.

Yes, so if you assume that the AP/kernel side has the right references,
counts and votes, then consider that another bus master - a thing that
can initiate a read or a write - might be misbehaving.

Assuming there is no misbehaving Arm core, or, say, a piece of cDSP or
aDSP code that wants to do something on the PCIe bus, might the culprit
be whatever you have connected to the bus?

Could something be driving the WAKE# signal and then transacting?

But also keep in mind, depending on what you are doing with this system:
if you have a bit of firmware in one of the DSP cores, does that
firmware have scope to talk to any devices on the PCIe bus?

I'd guess another firmware is unlikely, but a downstream device
asserting WAKE# when you have the PCIe nodes disabled would presumably
be bad.

Try looking for an upstream transaction from a device.

---
bod

* Re: sa8540p-ride crash when all PCI buses are disabled
  2023-08-15 10:54 ` Bryan O'Donoghue
@ 2023-08-16 16:25   ` Radu Rendec
  2023-08-16 17:16     ` Manivannan Sadhasivam
  0 siblings, 1 reply; 6+ messages in thread
From: Radu Rendec @ 2023-08-16 16:25 UTC (permalink / raw)
  To: Bryan O'Donoghue, linux-arm-msm

On Tue, 2023-08-15 at 11:54 +0100, Bryan O'Donoghue wrote:
> On 14/08/2023 23:36, Radu Rendec wrote:
> > I'm consistently getting a system crash followed by a ramdump on
> > sa8540p-ride (sc8280xp) when icc_sync_state() goes all the way through
> > (count == providers_count).
> > 
> > Context: all PCIe buses are disabled due to [1]. Previously, due to
> > some local kernel misconfiguration, icc_sync_state() never really did
> > anything (because count was always less than providers_count).
> > 
> > I was able to isolate the problem to the qns_pcie_gem_noc icc node.
> > What happens is that both avg_bw and peak_bw for this node end up as 0
> > after aggregate_requests() gets called. The request list associated
> > with the node is empty.
> 
> If all PCIe buses are disabled, then of course the bandwidth requests
> should say zero, the clocks should be disabled and any associated 
> regulators should be off.
> 
> > For testing purposes, I modified icc_sync_state() to skip calling
> > aggregate_requests() and subsequently p->set(n, n) for that particular
> > node only. With that change in place, the system no longer crashes.
> 
> So what's happening is that a bus master in the system - perhaps not the 
> application processor is issuing a transaction to a register most likely 
> that is not clocked/powered.

Yes, that was my assumption as well. But I didn't think it could be
something other than the AP. That is an interesting perspective.

My first thought was to analyze the ramdump and hopefully find some
clues there. But unfortunately that doesn't seem to be an option with
the tools that I have.

> Have you considered that one of the downstream devices might be causing 
> a PCIe bus transaction ?

No, I haven't considered that. If that's the case, it will probably be
even harder to debug.

> If you physically remove - can you physically remove - devices from the 
> PCIe bus does this error still occur ?

This is a standard QDrive 3 reference board, so I think this is not an
option. Taking those things apart is very difficult, and I think all
peripherals are soldered onto the board anyway.

> > Surprisingly, none of the icc nodes that link to qns_pcie_gem_noc (e.g.
> > xm_pcie3_0, xm_pcie3_1, etc.) has any associated request and so they
> > all have 0 bandwidth after aggregate_requests() gets called, but that
> > doesn't seem to be a problem and the system is stable. This makes me
> > think there is a missing link somewhere, and something doesn't claim
> > any bandwidth on qns_pcie_gem_noc when it should. And it's probably
> > none of the xm_pcie3_* nodes, since setting their bandwidth to 0 seems
> > to be fine.
> 
> Yes so if you assume that the AP/kernel side has the right references, 
> counts, votes then consider another bus master - a thing that can 
> initiate a read or a write might be misbehaving.

There is one thing I wasn't aware of when I wrote the previous email.
As it turns out, bandwidth/clock control is done at the bcm level, not
at the icc node level. It looks like there is a single bcm called PCI0,
and it's linked to the qns_pcie_gem_noc node. The xm_pcie3_* icc nodes
are not linked to any bcm.

This means that *all* PCIe buses are shut down when qns_pcie_gem_noc is
disabled due to zero bandwidth. I was under the (wrong) impression
that, since all xm_pcie3_* nodes had no requests, each corresponding
PCIe bus would be shut down separately, leaving only qns_pcie_gem_noc
active (with my test change in place).
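
The relationship described above can be pictured with a small Python toy. It is a deliberate simplification: it treats a bcm as simply on or off, whereas real bcm votes carry bandwidth values. The BCM_NODES mapping and bcm_enabled() are hypothetical names for this sketch, and the mapping only restates what is said above (PCI0 is linked to qns_pcie_gem_noc, the xm_pcie3_* nodes are linked to no bcm).

```python
# Hypothetical mapping for illustration: which icc nodes feed which bcm.
BCM_NODES = {"PCI0": ["qns_pcie_gem_noc"]}


def bcm_enabled(bcm, node_bw):
    """A bcm stays on only while one of its linked nodes has nonzero bandwidth."""
    return any(avg or peak for avg, peak in (node_bw[n] for n in BCM_NODES[bcm]))


# (avg, peak) per node; xm_pcie3_0 is not linked to any bcm in this model.
all_zero = {"qns_pcie_gem_noc": (0, 0), "xm_pcie3_0": (0, 0)}
gem_noc_kept = {"qns_pcie_gem_noc": (0, 1970000), "xm_pcie3_0": (0, 0)}
only_master = {"qns_pcie_gem_noc": (0, 0), "xm_pcie3_0": (0, 1970000)}

print(bcm_enabled("PCI0", all_zero))
print(bcm_enabled("PCI0", gem_noc_kept))
print(bcm_enabled("PCI0", only_master))
```

In this model the bandwidth of the xm_pcie3_* nodes has no effect on PCI0 at all; only qns_pcie_gem_noc decides whether it stays on.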

> Assuming there is no misbehaving arm core - say a cDSP or aDSP piece of 
> code that wants to do something on the PCIe bus, might the culprit be
> whatever you have connected to the bus ?
> 
> Could something be driving the #WAKE signal and then transacting ?
> 
> But also keep in mind depending on what you are doing with this system 
> if you have a bit of firmware in one of the DSP cores - does that 
> firmware have scope to talk to any devices on the PCIe bus ?

As I mentioned above, this is a standard QDrive 3 reference board.
Furthermore, I don't explicitly do anything with the DSPs. I just boot
a fairly recent upstream kernel (6.5-rc1) with a standard rootfs. The
boot firmware is whatever Qualcomm provides by default for these
systems. So, unless the boot firmware loads anything into the DSPs
behind my back (which I doubt), the DSPs should not even be running.

What is more likely though is that the boot firmware initializes a
bunch of PCIe devices and leaves them on.

> I'd guess another firmware is unlikely but, a downstream device doing a 
> #WAKE when you have the PCIe nodes disabled would presumably be bad..
> 
> Try looking for an upstream transaction from a device..

Yes, that makes sense. Do you have any suggestion on how to do that
without using any specialized hardware (such as JTAG pod or PCIe bus
analyzer)?

Thanks for all the input and suggestions!

--
Radu

* Re: sa8540p-ride crash when all PCI buses are disabled
  2023-08-16 16:25   ` Radu Rendec
@ 2023-08-16 17:16     ` Manivannan Sadhasivam
  2023-08-16 17:56       ` Andrew Halaney
  0 siblings, 1 reply; 6+ messages in thread
From: Manivannan Sadhasivam @ 2023-08-16 17:16 UTC (permalink / raw)
  To: Radu Rendec; +Cc: Bryan O'Donoghue, linux-arm-msm

On Wed, Aug 16, 2023 at 12:25:50PM -0400, Radu Rendec wrote:
> On Tue, 2023-08-15 at 11:54 +0100, Bryan O'Donoghue wrote:
> > On 14/08/2023 23:36, Radu Rendec wrote:
> > > I'm consistently getting a system crash followed by a ramdump on
> > > sa8540p-ride (sc8280xp) when icc_sync_state() goes all the way through
> > > (count == providers_count).
> > > 
> > > Context: all PCIe buses are disabled due to [1]. Previously, due to
> > > some local kernel misconfiguration, icc_sync_state() never really did
> > > anything (because count was always less than providers_count).
> > > 
> > > I was able to isolate the problem to the qns_pcie_gem_noc icc node.
> > > What happens is that both avg_bw and peak_bw for this node end up as 0
> > > after aggregate_requests() gets called. The request list associated
> > > with the node is empty.
> > 
> > If all PCIe buses are disabled, then of course the bandwidth requests
> > should say zero, the clocks should be disabled and any associated 
> > regulators should be off.
> > 
> > > For testing purposes, I modified icc_sync_state() to skip calling
> > > aggregate_requests() and subsequently p->set(n, n) for that particular
> > > node only. With that change in place, the system no longer crashes.
> > 
> > So what's happening is that a bus master in the system - perhaps not the 
> > application processor is issuing a transaction to a register most likely 
> > that is not clocked/powered.
> 
> Yes, that was my assumption as well. But I didn't think it could be
> something other than the AP. That is an interesting perspective.
> 
> My first thought was to analyze the ramdump and hopefully find some
> clues there. But unfortunately that doesn't seem to be an option with
> the tools that I have.
> 
> > Have you considered that one of the downstream devices might be causing 
> > a PCIe bus transaction ?
> 
> No, I haven't considered that. If that's the case, it will probably be
> even harder to debug.
> 

If the PCIe controller node is disabled in devicetree, then none of the devices
would be enumerated. In that case, they cannot initiate any transactions on
their own.

Qcom observed a similar crash with PCIe SMMU when the PCIe controllers were not
enabled in devicetree [1]. Since Qcom was going to enable the PCIe controllers
eventually, I concluded that the issue would go away once they did.

But looking at your issue, I think the transaction is triggered by PCIe SMMU as
observed earlier. Since there are no active votes on the path after
icc_sync_state(), it ends up in a crash.

But did you disable all PCIe instances or just pcie2a? The revert patch you
pointed to only applies to pcie2a. But if you are disabling all PCIe instances,
then I do not see a point in enabling PCIe SMMU as well. Could you try disabling
the pcie_smmu node and check?

- Mani

[1] https://lore.kernel.org/linux-arm-msm/20230609054141.18938-3-quic_ppareek@quicinc.com/

-- 
மணிவண்ணன் சதாசிவம்

* Re: sa8540p-ride crash when all PCI buses are disabled
  2023-08-16 17:16     ` Manivannan Sadhasivam
@ 2023-08-16 17:56       ` Andrew Halaney
  2023-08-18 16:44         ` Manivannan Sadhasivam
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Halaney @ 2023-08-16 17:56 UTC (permalink / raw)
  To: Manivannan Sadhasivam; +Cc: Radu Rendec, Bryan O'Donoghue, linux-arm-msm

On Wed, Aug 16, 2023 at 10:46:01PM +0530, Manivannan Sadhasivam wrote:
> On Wed, Aug 16, 2023 at 12:25:50PM -0400, Radu Rendec wrote:
> > On Tue, 2023-08-15 at 11:54 +0100, Bryan O'Donoghue wrote:
> > > On 14/08/2023 23:36, Radu Rendec wrote:
> > > > I'm consistently getting a system crash followed by a ramdump on
> > > > sa8540p-ride (sc8280xp) when icc_sync_state() goes all the way through
> > > > (count == providers_count).
> > > > 
> > > > Context: all PCIe buses are disabled due to [1]. Previously, due to
> > > > some local kernel misconfiguration, icc_sync_state() never really did
> > > > anything (because count was always less than providers_count).
> > > > 
> > > > I was able to isolate the problem to the qns_pcie_gem_noc icc node.
> > > > What happens is that both avg_bw and peak_bw for this node end up as 0
> > > > after aggregate_requests() gets called. The request list associated
> > > > with the node is empty.
> > > 
> > > If all PCIe buses are disabled, then of course the bandwidth requests
> > > should say zero, the clocks should be disabled and any associated 
> > > regulators should be off.
> > > 
> > > > For testing purposes, I modified icc_sync_state() to skip calling
> > > > aggregate_requests() and subsequently p->set(n, n) for that particular
> > > > node only. With that change in place, the system no longer crashes.
> > > 
> > > So what's happening is that a bus master in the system - perhaps not the 
> > > application processor is issuing a transaction to a register most likely 
> > > that is not clocked/powered.
> > 
> > Yes, that was my assumption as well. But I didn't think it could be
> > something other than the AP. That is an interesting perspective.
> > 
> > My first thought was to analyze the ramdump and hopefully find some
> > clues there. But unfortunately that doesn't seem to be an option with
> > the tools that I have.
> > 
> > > Have you considered that one of the downstream devices might be causing 
> > > a PCIe bus transaction ?
> > 
> > No, I haven't considered that. If that's the case, it will probably be
> > even harder to debug.
> > 
> 
> If the PCIe controller node is disabled in devicetree, then none of the devices
> would be enumerated. In that case, they cannot initiate any transactions on
> their own.
> 
> Qcom observed a similar crash with PCIe SMMU when the PCIe controllers were not
> enabled in devicetree [1]. Since Qcom was going to enable PCIe controllers
> eventually, I concluded that the issue will be gone once they do it.
> 
> But looking at your issue, I think the transaction is triggered by PCIe SMMU as
> observed earlier. Since there are no active votes on the path after
> icc_sync_state(), it ends up in a crash.
> 
> But did you disable all PCIe instances or just pcie2a? The revert patch you
> pointed only applies to pcie2a. But if you are disabling all PCIe instances,
> then I do not see a point in enabling PCIe SMMU as well. Could you try disabling
> the pcie_smmu node and check?

I think this is a good hunch, but do note that this is discussing
sa8540p-ride, not sa8775p-ride. The former has no PCIe SMMU described
(although I believe there may be one on the sc8280xp family, just put in
"bypass mode" (excuse my SMMU ignorance!) by firmware, and not described
to Linux for any variant of that platform).


* Re: sa8540p-ride crash when all PCI buses are disabled
  2023-08-16 17:56       ` Andrew Halaney
@ 2023-08-18 16:44         ` Manivannan Sadhasivam
  0 siblings, 0 replies; 6+ messages in thread
From: Manivannan Sadhasivam @ 2023-08-18 16:44 UTC (permalink / raw)
  To: Andrew Halaney; +Cc: Radu Rendec, Bryan O'Donoghue, linux-arm-msm

On Wed, Aug 16, 2023 at 12:56:58PM -0500, Andrew Halaney wrote:
> On Wed, Aug 16, 2023 at 10:46:01PM +0530, Manivannan Sadhasivam wrote:
> > On Wed, Aug 16, 2023 at 12:25:50PM -0400, Radu Rendec wrote:
> > > On Tue, 2023-08-15 at 11:54 +0100, Bryan O'Donoghue wrote:
> > > > On 14/08/2023 23:36, Radu Rendec wrote:
> > > > > I'm consistently getting a system crash followed by a ramdump on
> > > > > sa8540p-ride (sc8280xp) when icc_sync_state() goes all the way through
> > > > > (count == providers_count).
> > > > > 
> > > > > Context: all PCIe buses are disabled due to [1]. Previously, due to
> > > > > some local kernel misconfiguration, icc_sync_state() never really did
> > > > > anything (because count was always less than providers_count).
> > > > > 
> > > > > I was able to isolate the problem to the qns_pcie_gem_noc icc node.
> > > > > What happens is that both avg_bw and peak_bw for this node end up as 0
> > > > > after aggregate_requests() gets called. The request list associated
> > > > > with the node is empty.
> > > > 
> > > > If all PCIe buses are disabled, then of course the bandwidth requests
> > > > should say zero, the clocks should be disabled and any associated 
> > > > regulators should be off.
> > > > 
> > > > > For testing purposes, I modified icc_sync_state() to skip calling
> > > > > aggregate_requests() and subsequently p->set(n, n) for that particular
> > > > > node only. With that change in place, the system no longer crashes.
> > > > 
> > > > So what's happening is that a bus master in the system - perhaps not the 
> > > > application processor is issuing a transaction to a register most likely 
> > > > that is not clocked/powered.
> > > 
> > > Yes, that was my assumption as well. But I didn't think it could be
> > > something other than the AP. That is an interesting perspective.
> > > 
> > > My first thought was to analyze the ramdump and hopefully find some
> > > clues there. But unfortunately that doesn't seem to be an option with
> > > the tools that I have.
> > > 
> > > > Have you considered that one of the downstream devices might be causing 
> > > > a PCIe bus transaction ?
> > > 
> > > No, I haven't considered that. If that's the case, it will probably be
> > > even harder to debug.
> > > 
> > 
> > If the PCIe controller node is disabled in devicetree, then none of the devices
> > would be enumerated. In that case, they cannot initiate any transactions on
> > their own.
> > 
> > Qcom observed a similar crash with PCIe SMMU when the PCIe controllers were not
> > enabled in devicetree [1]. Since Qcom was going to enable PCIe controllers
> > eventually, I concluded that the issue will be gone once they do it.
> > 
> > But looking at your issue, I think the transaction is triggered by PCIe SMMU as
> > observed earlier. Since there are no active votes on the path after
> > icc_sync_state(), it ends up in a crash.
> > 
> > But did you disable all PCIe instances or just pcie2a? The revert patch you
> > pointed only applies to pcie2a. But if you are disabling all PCIe instances,
> > then I do not see a point in enabling PCIe SMMU as well. Could you try disabling
> > the pcie_smmu node and check?
> 
> I think this is a good hunch, but do note that this is discussing
> sa8540p-ride, not sa8775p-ride. The former has no PCIe SMMU described
> (although I believe there maybe one on the sc8280xp family, just in
> "bypass mode" (excuse my SMMU ignorance!) by firmware, and not described
> to Linux for any variant of that platform).
> 

Ah... I was answering from the sa8775p-ride perspective, sorry! Yes, on
sa8540p-ride, PCIe SMMU is configured in bypass mode by the bootloader.

In that case I think you need to parse the ramdump to see who is causing the
crash. Qcom folks should be able to help you with that.

- Mani

-- 
மணிவண்ணன் சதாசிவம்
