Re: sa8540p-ride crash when all PCI buses are disabled

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
To: Andrew Halaney <ahalaney@redhat.com>
Cc: Radu Rendec <rrendec@redhat.com>,
	Bryan O'Donoghue <bryan.odonoghue@linaro.org>,
	linux-arm-msm@vger.kernel.org
Subject: Re: sa8540p-ride crash when all PCI buses are disabled
Date: Fri, 18 Aug 2023 22:14:18 +0530	[thread overview]
Message-ID: <20230818164418.GA81989@thinkpad> (raw)
In-Reply-To: <liyxirdfxcxrnji2o6p63cocgl7qllyih7q47zdft7tc5jni2j@h6akjddfu6o3>

On Wed, Aug 16, 2023 at 12:56:58PM -0500, Andrew Halaney wrote:
> On Wed, Aug 16, 2023 at 10:46:01PM +0530, Manivannan Sadhasivam wrote:
> > On Wed, Aug 16, 2023 at 12:25:50PM -0400, Radu Rendec wrote:
> > > On Tue, 2023-08-15 at 11:54 +0100, Bryan O'Donoghue wrote:
> > > > On 14/08/2023 23:36, Radu Rendec wrote:
> > > > > I'm consistently getting a system crash followed by a ramdump on
> > > > > sa8540p-ride (sc8280xp) when icc_sync_state() goes all the way through
> > > > > (count == providers_count).
> > > > > 
> > > > > Context: all PCIe buses are disabled due to [1]. Previously, due to
> > > > > some local kernel misconfiguration, icc_sync_state() never really did
> > > > > anything (because count was always less than providers_count).
> > > > > 
> > > > > I was able to isolate the problem to the qns_pcie_gem_noc icc node.
> > > > > What happens is that both avg_bw and peak_bw for this node end up as 0
> > > > > after aggregate_requests() gets called. The request list associated
> > > > > with the node is empty.
> > > > 
> > > > If all PCIe buses are disabled, then of course the bandwidth requests
> > > > should say zero, the clocks should be disabled and any associated 
> > > > regulators should be off.
> > > > 
> > > > > For testing purposes, I modified icc_sync_state() to skip calling
> > > > > aggregate_requests() and subsequently p->set(n, n) for that particular
> > > > > node only. With that change in place, the system no longer crashes.
> > > > 
> > > > So what's happening is that a bus master in the system - perhaps not the 
> > > > application processor is issuing a transaction to a register most likely 
> > > > that is not clocked/powered.
> > > 
> > > Yes, that was my assumption as well. But I didn't think it could be
> > > something other than the AP. That is an interesting perspective.
> > > 
> > > My first thought was to analyze the ramdump and hopefully find some
> > > clues there. But unfortunately that doesn't seem to be an option with
> > > the tools that I have.
> > > 
> > > > Have you considered that one of the downstream devices might be causing 
> > > > a PCIe bus transaction ?
> > > 
> > > No, I haven't considered that. If that's the case, it will probably be
> > > even harder to debug.
> > > 
> > 
> > If the PCIe controller node is disabled in devicetree, then none of the devices
> > would be enumerated. In that case, they cannot initiate any transactions on
> > their own.
> > 
> > Qcom observed a similar crash with PCIe SMMU when the PCIe controllers were not
> > enabled in devicetree [1]. Since Qcom was going to enable PCIe controllers
> > eventually, I concluded that the issue will be gone once they do it.
> > 
> > But looking at your issue, I think the transaction is triggered by PCIe SMMU as
> > observed earlier. Since there are no active votes on the path after
> > icc_sync_state(), it ends up in a crash.
> > 
> > But did you disable all PCIe instances or just pcie2a? The revert patch you
> > pointed only applies to pcie2a. But if you are disabling all PCIe instances,
> > then I do not see a point in enabling PCIe SMMU as well. Could you try disabling
> > the pcie_smmu node and check?
> 
> I think this is a good hunch, but do note that this is discussing
> sa8540p-ride, not sa8775p-ride. The former has no PCIe SMMU described
> (although I believe there maybe one on the sc8280xp family, just in
> "bypass mode" (excuse my SMMU ignorance!) by firmware, and not described
> to Linux for any variant of that platform).
> 

Ah... I was answering from sa8775p-ride perspective, sorry! Yes, on
sa8540p-ride, PCIe SMMU is configured in bypass mode by bootloader.

In that case I think you need to parse the ramdump to see who is causing the
crash. Qcom folks should be able to help you with that.

- Mani

> > 
> > - Mani
> > 
> > [1] https://lore.kernel.org/linux-arm-msm/20230609054141.18938-3-quic_ppareek@quicinc.com/
> > 
> > > > If you physically remove - can you physically remove - devices from the 
> > > > PCIe bus does this error still occur ?
> > > 
> > > This is a standard QDrive 3 reference board, so I think this is not an
> > > option. Taking those things apart is very difficult, and I think all
> > > peripherals are soldered onto the board anyway.
> > > 
> > > > > Surprisingly, none of the icc nodes that link to qns_pcie_gem_noc (e.g.
> > > > > xm_pcie3_0, xm_pcie3_1, etc.) has any associated request and so they
> > > > > all have 0 bandwidth after aggregate_requests() gets called, but that
> > > > > doesn't seem to be a problem and the system is stable. This makes me
> > > > > think there is a missing link somewhere, and something doesn't claim
> > > > > any bandwidth on qns_pcie_gem_noc when it should. And it's probably
> > > > > none of the xm_pcie3_* nodes, since setting their bandwidth to 0 seems
> > > > > to be fine.
> > > > 
> > > > Yes so if you assume that the AP/kernel side has the right references, 
> > > > counts, votes then consider another bus master - a thing that can 
> > > > initiate a read or a write might be misbehaving.
> > > 
> > > There is one thing I wasn't aware of when I wrote the previous email.
> > > As it turns out, bandwidth/clock control is done at the bcm level, not
> > > at the icc node level. It looks like there is a single bcm called PCI0,
> > > and it's linked to the qns_pcie_gem_noc node. The xm_pcie3_* icc nodes
> > > are not linked to any bcm.
> > > 
> > > This means that *all* PCIe buses are shut down when qns_pcie_gem_noc is
> > > disabled due to zero bandwidth. I was under the (wrong) impression
> > > that, since all xm_pcie3_* nodes had no requests, each corresponding
> > > PCIe bus would be shut down separately, leaving only qns_pcie_gem_noc
> > > active (with my test change in place).
> > > 
> > > > Assuming there is no misbehaving arm core - say a cDSP or aDSP piece of 
> > > > code that wants to do something on the PCIe bus, might the culprit be
> > > > whatever you have connected to the bus ?
> > > > 
> > > > Could something be driving the #WAKE signal and then transacting ?
> > > > 
> > > > But also keep in mind depending on what you are doing with this system 
> > > > if you have a bit of firmware in one of the DSP cores - does that 
> > > > firmware have scope to talk to any devices on the PCIe bus ?
> > > 
> > > As I mentioned above, this is a standard QDrive 3 reference board.
> > > Furthermore, I don't explicitly do anything with the DSPs. I just boot
> > > a fairly recent upstream kernel (6.5-rc1) with a standard rootfs. The
> > > boot firmware is whatever Qualcomm provides by default for these
> > > systems. So, unless the boot firmware loads anything into the DSPs
> > > behind my back (which I doubt), the DSPs should not even be running.
> > > 
> > > What is more likely though is that the boot firmware initializes a
> > > bunch of PCIe devices and leaves them on.
> > > 
> > > > I'd guess another firmware is unlikely but, a downstream device doing a 
> > > > #WAKE when you have the PCIe nodes disabled would presumably be bad..
> > > > 
> > > > Try looking for an upstream transaction from a device..
> > > 
> > > Yes, that makes sense. Do you have any suggestion on how to do that
> > > without using any specialized hardware (such as JTAG pod or PCIe bus
> > > analyzer)?
> > > 
> > > Thanks for all the input and suggestions!
> > > 
> > > --
> > > Radu
> > > 
> > 
> > -- 
> > மணிவண்ணன் சதாசிவம்
> > 
> 

-- 
மணிவண்ணன் சதாசிவம்

     prev parent reply	other threads:[~2023-08-18 16:45 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-14 22:36 sa8540p-ride crash when all PCI buses are disabled Radu Rendec
2023-08-15 10:54 ` Bryan O'Donoghue
2023-08-16 16:25   ` Radu Rendec
2023-08-16 17:16     ` Manivannan Sadhasivam
2023-08-16 17:56       ` Andrew Halaney
2023-08-18 16:44         ` Manivannan Sadhasivam [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230818164418.GA81989@thinkpad \
    --to=manivannan.sadhasivam@linaro.org \
    --cc=ahalaney@redhat.com \
    --cc=bryan.odonoghue@linaro.org \
    --cc=linux-arm-msm@vger.kernel.org \
    --cc=rrendec@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.