* Re: [PATCH v1] drivers: pci: introduce configurable delay for Rockchip PCIe bus scan
       [not found]     ` <CAMdYzYqV72=pQa-U3a2N7MZ2ChBNL74QrxHQLbMZJxiftTK9sA@mail.gmail.com>
@ 2023-05-15 11:04       ` Vincenzo Palazzo
  2023-05-15 16:51       ` Bjorn Helgaas
  2023-07-12 15:42       ` Vincenzo Palazzo
  2 siblings, 0 replies; 5+ messages in thread
From: Vincenzo Palazzo @ 2023-05-15 11:04 UTC
To: Peter Geis, Bjorn Helgaas
Cc: kw, heiko, robh, linux-pci, shawn.lin, linux-kernel, lgirdwood, linux-rockchip, broonie, bhelgaas, linux-kernel-mentees, lpieralisi, linux-arm-kernel, Dan Johansen, Catalin Marinas, Will Deacon, Robin Murphy

> >
> > There *is* a way for a PCIe device to say "I need more time". It does
> > this by responding to that Vendor ID config read with Request Retry
> > Status (RRS, aka CRS in older specs), which means "I'm not ready yet,
> > but I will be ready in the future." Adding a delay would definitely
> > make a difference here, so my guess is this is what's happening.
> >
> > Most root complexes return ~0 data to the CPU when a config read
> > terminates with UR or RRS. It sounds like rockchip does this for UR
> > but possibly not for RRS.
> >
> > There is a "RRS Software Visibility" feature, which is supposed to
> > turn the RRS into a special value (Vendor ID == 0x0001), but per [1],
> > rockchip doesn't support it (lspci calls it "CRSVisible").
> >
> > But the CPU load instruction corresponding to the config read has to
> > complete by reading *something* or else be aborted. It sounds like
> > it's aborted in this case. I don't know the arm64 details, but if we
> > could catch that abort and determine that it was an RRS and not a UR,
> > maybe we could fabricate the magic RRS 0x0001 value.
> >
> > imx6q_pcie_abort_handler() does something like that, although I think
> > it's for arm32, not arm64. But obviously we already catch the abort
> > enough to dump the register state and panic, so maybe there's a way to
> > extend that?
>
> Perhaps a hook mechanism that allows drivers to register with the
> serror handler and offer to handle specific errors before the generic
> code causes the system panic?

This sounds to me like a good general solution that would also help
handle future HW like this one.

So this is a Concept Ack from me.

Cheers!

Vincent.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
* Re: [PATCH v1] drivers: pci: introduce configurable delay for Rockchip PCIe bus scan
       [not found]     ` <CAMdYzYqV72=pQa-U3a2N7MZ2ChBNL74QrxHQLbMZJxiftTK9sA@mail.gmail.com>
  2023-05-15 11:04       ` [PATCH v1] drivers: pci: introduce configurable delay for Rockchip PCIe bus scan Vincenzo Palazzo
@ 2023-05-15 16:51       ` Bjorn Helgaas
  2023-05-15 20:52         ` Peter Geis
  2023-07-12 15:42       ` Vincenzo Palazzo
  2 siblings, 1 reply; 5+ messages in thread
From: Bjorn Helgaas @ 2023-05-15 16:51 UTC
To: Peter Geis
Cc: robh, heiko, Will Deacon, kw, linux-pci, shawn.lin, linux-kernel, lgirdwood, linux-rockchip, broonie, Catalin Marinas, bhelgaas, Robin Murphy, linux-kernel-mentees, lpieralisi, linux-arm-kernel, Dan Johansen

On Sat, May 13, 2023 at 07:40:12AM -0400, Peter Geis wrote:
> On Fri, May 12, 2023 at 9:24 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > [+cc ARM64 folks, in case you have abort handling tips; thread at:
> > https://lore.kernel.org/r/20230509153912.515218-1-vincenzopalazzodev@gmail.com]
> >
> > Pine64 RockPro64 panics while enumerating some PCIe devices. Adding a
> > delay avoids the panic. My theory is a PCIe Request Retry Status to a
> > Vendor ID config read causes an abort that we don't handle.
> >
> > > On Tue, May 09, 2023 at 05:39:12PM +0200, Vincenzo Palazzo wrote:
> > >> ...
> > >> [ 1.229856] SError Interrupt on CPU4, code 0xbf000002 -- SError
> > >> [ 1.229860] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.9.9-2.0-MANJARO-ARM
> > >> #1
> > >> [ 1.229862] Hardware name: Pine64 RockPro64 v2.1 (DT)
> > >> [ 1.229864] pstate: 60000085 (nZCv daIf -PAN -UAO BTYPE=--)
> > >> [ 1.229866] pc : rockchip_pcie_rd_conf+0xb4/0x270
> > >> [ 1.229868] lr : rockchip_pcie_rd_conf+0x1b4/0x270
> > >> ...
> > >> [ 1.229939] Kernel panic - not syncing: Asynchronous SError Interrupt
> > >> ...
> > >> [ 1.229955] nmi_panic+0x8c/0x90
> > >> [ 1.229956] arm64_serror_panic+0x78/0x84
> > >> [ 1.229958] do_serror+0x15c/0x160
> > >> [ 1.229960] el1_error+0x84/0x100
> > >> [ 1.229962] rockchip_pcie_rd_conf+0xb4/0x270
> > >> [ 1.229964] pci_bus_read_config_dword+0x6c/0xd0
> > >> [ 1.229966] pci_bus_generic_read_dev_vendor_id+0x34/0x1b0
> > >> [ 1.229968] pci_scan_single_device+0xa4/0x144
> >
> > On Fri, May 12, 2023 at 12:46:21PM +0200, Vincenzo Palazzo wrote:
> > > ... Is there any way to tell the kernel "hey we need some more time
> > > here"?
> >
> > We enumerate PCI devices by trying to read the Vendor ID of every
> > possible device address (see pci_scan_slot()). On PCIe, if a device
> > doesn't exist at that address, the Vendor ID config read will be
> > terminated with Unsupported Request (UR) status. This is normal
> > and happens every time we enumerate devices.
> >
> > The crash doesn't happen every time we enumerate, so I don't think
> > this UR is the problem. Also, if it *were* the problem, adding a
> > delay would not make any difference.
>
> Is this behavior different if there is a switch device forwarding on
> the UR? On rk3399 switches are completely non-functional because of
> the panic, which is observed in the output of the dmesg in [2] with
> the hack patch enabled. Considering what you just described it looks
> like the forwarded UR for each non-existent device behind the switch
> is causing an serror.

I don't know exactly what the panic looks like, but I wouldn't expect
UR handling to be different when there's a switch.

pcie-rockchip-host.c does handle devices on the root bus (00)
differently than others because rockchip_pcie_valid_device() knows
that device 00:00 is the only device on the root bus. That part makes
sense because 00:00 is built into the SoC.

I'm a little suspicious of the fact that rockchip_pcie_valid_device()
also enforces that bus 01 can only have a single device on it. No
other *_pcie_valid_device() implementations enforce that. It's true
that traditional PCIe devices can only implement device 00, but ARI
relaxes that by reusing the Device Number as extended Function Number
bits.

> > There *is* a way for a PCIe device to say "I need more time". It does
> > this by responding to that Vendor ID config read with Request Retry
> > Status (RRS, aka CRS in older specs), which means "I'm not ready yet,
> > but I will be ready in the future." Adding a delay would definitely
> > make a difference here, so my guess is this is what's happening.
> >
> > Most root complexes return ~0 data to the CPU when a config read
> > terminates with UR or RRS. It sounds like rockchip does this for UR
> > but possibly not for RRS.
> >
> > There is a "RRS Software Visibility" feature, which is supposed to
> > turn the RRS into a special value (Vendor ID == 0x0001), but per [1],
> > rockchip doesn't support it (lspci calls it "CRSVisible").
> >
> > But the CPU load instruction corresponding to the config read has to
> > complete by reading *something* or else be aborted. It sounds like
> > it's aborted in this case. I don't know the arm64 details, but if we
> > could catch that abort and determine that it was an RRS and not a UR,
> > maybe we could fabricate the magic RRS 0x0001 value.
> >
> > imx6q_pcie_abort_handler() does something like that, although I think
> > it's for arm32, not arm64. But obviously we already catch the abort
> > enough to dump the register state and panic, so maybe there's a way to
> > extend that?
>
> Perhaps a hook mechanism that allows drivers to register with the
> serror handler and offer to handle specific errors before the generic
> code causes the system panic?
>
> Very Respectfully,
> Peter Geis
>
> [2] https://lore.kernel.org/linux-pci/CAMdYzYqn3L7x-vc+_K6jG0EVTiPGbz8pQ-N1Q1mRbcVXE822Yg@mail.gmail.com/
>
> >
> > Bjorn
> >
> > [1] https://lore.kernel.org/linux-pci/CAMdYzYpOFAVq30N+O2gOxXiRtpoHpakFg3LKq3TEZq4S6Y0y0g@mail.gmail.com/
> _______________________________________________
> Linux-kernel-mentees mailing list
> Linux-kernel-mentees@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees
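To make the three possible outcomes of a Vendor ID config read easier to see, here is a minimal user-space sketch in C. The enum and classify_vendor_id_read() are names invented for this illustration and are not kernel API; the only values taken from the discussion above are the ~0 data returned for UR (or for RRS without Software Visibility) and the special Vendor ID 0x0001 used when RRS Software Visibility is in effect.

```c
#include <stdint.h>

/* Hypothetical outcome labels for a Vendor ID config read (not kernel API). */
enum vid_read_outcome {
	VID_NO_DEVICE,	/* all ones: UR, or RRS without Software Visibility */
	VID_RETRY,	/* 0x0001: RRS made visible to software, try again later */
	VID_PRESENT,	/* anything else: a device answered with its Vendor ID */
};

/* Classify the dword read from config offset 0; Vendor ID is the low 16 bits. */
static enum vid_read_outcome classify_vendor_id_read(uint32_t dword)
{
	uint16_t vendor = (uint16_t)(dword & 0xffff);

	if (vendor == 0xffff)
		return VID_NO_DEVICE;
	if (vendor == 0x0001)
		return VID_RETRY;
	return VID_PRESENT;
}
```

On the hardware discussed in this thread the read apparently never gets as far as returning a value in the RRS case, which is why fabricating the 0x0001 value from an abort handler is floated as an idea.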
* Re: [PATCH v1] drivers: pci: introduce configurable delay for Rockchip PCIe bus scan
  2023-05-15 16:51       ` Bjorn Helgaas
@ 2023-05-15 20:52         ` Peter Geis
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Geis @ 2023-05-15 20:52 UTC
To: Bjorn Helgaas
Cc: robh, heiko, Will Deacon, kw, linux-pci, shawn.lin, linux-kernel, lgirdwood, linux-rockchip, broonie, Catalin Marinas, bhelgaas, Robin Murphy, linux-kernel-mentees, lpieralisi, linux-arm-kernel, Dan Johansen

On Mon, May 15, 2023 at 12:51 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Sat, May 13, 2023 at 07:40:12AM -0400, Peter Geis wrote:
> > On Fri, May 12, 2023 at 9:24 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > [+cc ARM64 folks, in case you have abort handling tips; thread at:
> > > https://lore.kernel.org/r/20230509153912.515218-1-vincenzopalazzodev@gmail.com]
> > >
> > > Pine64 RockPro64 panics while enumerating some PCIe devices. Adding a
> > > delay avoids the panic. My theory is a PCIe Request Retry Status to a
> > > Vendor ID config read causes an abort that we don't handle.
> > >
> > > > On Tue, May 09, 2023 at 05:39:12PM +0200, Vincenzo Palazzo wrote:
> > > >> ...
> > > >> [ 1.229856] SError Interrupt on CPU4, code 0xbf000002 -- SError
> > > >> [ 1.229860] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.9.9-2.0-MANJARO-ARM
> > > >> #1
> > > >> [ 1.229862] Hardware name: Pine64 RockPro64 v2.1 (DT)
> > > >> [ 1.229864] pstate: 60000085 (nZCv daIf -PAN -UAO BTYPE=--)
> > > >> [ 1.229866] pc : rockchip_pcie_rd_conf+0xb4/0x270
> > > >> [ 1.229868] lr : rockchip_pcie_rd_conf+0x1b4/0x270
> > > >> ...
> > > >> [ 1.229939] Kernel panic - not syncing: Asynchronous SError Interrupt
> > > >> ...
> > > >> [ 1.229955] nmi_panic+0x8c/0x90
> > > >> [ 1.229956] arm64_serror_panic+0x78/0x84
> > > >> [ 1.229958] do_serror+0x15c/0x160
> > > >> [ 1.229960] el1_error+0x84/0x100
> > > >> [ 1.229962] rockchip_pcie_rd_conf+0xb4/0x270
> > > >> [ 1.229964] pci_bus_read_config_dword+0x6c/0xd0
> > > >> [ 1.229966] pci_bus_generic_read_dev_vendor_id+0x34/0x1b0
> > > >> [ 1.229968] pci_scan_single_device+0xa4/0x144
> > >
> > > On Fri, May 12, 2023 at 12:46:21PM +0200, Vincenzo Palazzo wrote:
> > > > ... Is there any way to tell the kernel "hey we need some more time
> > > > here"?
> > >
> > > We enumerate PCI devices by trying to read the Vendor ID of every
> > > possible device address (see pci_scan_slot()). On PCIe, if a device
> > > doesn't exist at that address, the Vendor ID config read will be
> > > terminated with Unsupported Request (UR) status. This is normal
> > > and happens every time we enumerate devices.
> > >
> > > The crash doesn't happen every time we enumerate, so I don't think
> > > this UR is the problem. Also, if it *were* the problem, adding a
> > > delay would not make any difference.
> >
> > Is this behavior different if there is a switch device forwarding on
> > the UR? On rk3399 switches are completely non-functional because of
> > the panic, which is observed in the output of the dmesg in [2] with
> > the hack patch enabled. Considering what you just described it looks
> > like the forwarded UR for each non-existent device behind the switch
> > is causing an serror.
>
> I don't know exactly what the panic looks like, but I wouldn't expect
> UR handling to be different when there's a switch.
>
> pcie-rockchip-host.c does handle devices on the root bus (00)
> differently than others because rockchip_pcie_valid_device() knows
> that device 00:00 is the only device on the root bus. That part makes
> sense because 00:00 is built into the SoC.
>
> I'm a little suspicious of the fact that rockchip_pcie_valid_device()
> also enforces that bus 01 can only have a single device on it. No
> other *_pcie_valid_device() implementations enforce that. It's true
> that traditional PCIe devices can only implement device 00, but ARI
> relaxes that by reusing the Device Number as extended Function Number
> bits.

Bjorn, great catch, thank you!

I suspect you're actually onto the core of the problem. Looking
through various other drivers that implement *_pcie_valid_device(),
they all appear to apply similar restrictions on scanning for devices.
The drivers are all similar enough that I am starting to suspect they
are all running some version of the same buggy IP. Then I came across
advk_pcie_pio_is_running() in pci-aardvark.c, which describes our
issue pretty much spot on, including the exact same SError.
Interestingly enough, they made a TF-A patch [3] to catch and handle
the error without ever passing it to the kernel. Another limitation
they added is ensuring that reads are not attempted while the link is
down. pci-aardvark.c also implements limitations on Configuration
Request Retry Status (CRS).

It has given me ideas for solving the problem.

Very Respectfully,
Peter Geis

[3] https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/commit/?id=3c7dcdac5c50

>
> > > There *is* a way for a PCIe device to say "I need more time". It does
> > > this by responding to that Vendor ID config read with Request Retry
> > > Status (RRS, aka CRS in older specs), which means "I'm not ready yet,
> > > but I will be ready in the future." Adding a delay would definitely
> > > make a difference here, so my guess is this is what's happening.
> > >
> > > Most root complexes return ~0 data to the CPU when a config read
> > > terminates with UR or RRS. It sounds like rockchip does this for UR
> > > but possibly not for RRS.
> > >
> > > There is a "RRS Software Visibility" feature, which is supposed to
> > > turn the RRS into a special value (Vendor ID == 0x0001), but per [1],
> > > rockchip doesn't support it (lspci calls it "CRSVisible").
> > >
> > > But the CPU load instruction corresponding to the config read has to
> > > complete by reading *something* or else be aborted. It sounds like
> > > it's aborted in this case. I don't know the arm64 details, but if we
> > > could catch that abort and determine that it was an RRS and not a UR,
> > > maybe we could fabricate the magic RRS 0x0001 value.
> > >
> > > imx6q_pcie_abort_handler() does something like that, although I think
> > > it's for arm32, not arm64. But obviously we already catch the abort
> > > enough to dump the register state and panic, so maybe there's a way to
> > > extend that?
> >
> > Perhaps a hook mechanism that allows drivers to register with the
> > serror handler and offer to handle specific errors before the generic
> > code causes the system panic?
> >
> > Very Respectfully,
> > Peter Geis
> >
> > [2] https://lore.kernel.org/linux-pci/CAMdYzYqn3L7x-vc+_K6jG0EVTiPGbz8pQ-N1Q1mRbcVXE822Yg@mail.gmail.com/
> >
> > >
> > > Bjorn
> > >
> > > [1] https://lore.kernel.org/linux-pci/CAMdYzYpOFAVq30N+O2gOxXiRtpoHpakFg3LKq3TEZq4S6Y0y0g@mail.gmail.com/
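The scanning restriction discussed in this message can be pictured with a simplified, self-contained model in C. This is not the code in pcie-rockchip-host.c: model_valid_device() and the plain bus/device integers are assumptions made for illustration, and the sketch further assumes that the bus directly below the root port is numbered 01, as in Bjorn's description.

```c
#include <stdbool.h>

/*
 * Simplified model of the restriction described above (hypothetical name):
 * only device 00 may be probed on the root bus (00) and on the bus directly
 * below it (assumed here to be 01); any device number is allowed deeper in
 * the hierarchy.
 */
static bool model_valid_device(unsigned int bus, unsigned int dev)
{
	if (bus == 0 || bus == 1)
		return dev == 0;

	return true;
}
```

Under this model a config read to 01:01 is never issued at all, which is the behaviour Bjorn notes would conflict with ARI, where the Device Number bits are reused as extended Function Number bits.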
* Re: [PATCH v1] drivers: pci: introduce configurable delay for Rockchip PCIe bus scan
       [not found]     ` <CAMdYzYqV72=pQa-U3a2N7MZ2ChBNL74QrxHQLbMZJxiftTK9sA@mail.gmail.com>
  2023-05-15 11:04       ` [PATCH v1] drivers: pci: introduce configurable delay for Rockchip PCIe bus scan Vincenzo Palazzo
  2023-05-15 16:51       ` Bjorn Helgaas
@ 2023-07-12 15:42       ` Vincenzo Palazzo
  2 siblings, 0 replies; 5+ messages in thread
From: Vincenzo Palazzo @ 2023-07-12 15:42 UTC
To: Peter Geis, Bjorn Helgaas
Cc: kw, heiko, robh, linux-pci, shawn.lin, linux-kernel, lgirdwood, linux-rockchip, broonie, bhelgaas, linux-kernel-mentees, lpieralisi, linux-arm-kernel, Dan Johansen, Catalin Marinas, Will Deacon, Robin Murphy, skhan

Hi all,

> Perhaps a hook mechanism that allows drivers to register with the
> serror handler and offer to handle specific errors before the generic
> code causes the system panic?

I have some time to work on it.

I'm interested in exploring the solution proposed here. However, I
would appreciate some guidance in understanding how to implement this
type of hook to report the error.

Cheers.

Vincent.
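As a starting point for that discussion, the hook idea can be modelled in user-space C as a small table of callbacks that are consulted before the generic "panic" path. Everything here is hypothetical: serror_hook_register(), serror_dispatch() and demo_pcie_hook() are names invented for this sketch and do not correspond to an existing arm64 kernel interface. SErrors are asynchronous, so a real implementation would also need a reliable way to attribute an error to a driver; the in-flight flag below sidesteps that question.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SERROR_HOOKS 4

/* A hook returns true if it recognised and handled the error. */
typedef bool (*serror_hook_fn)(uint64_t syndrome, void *ctx);

struct serror_hook {
	serror_hook_fn fn;
	void *ctx;
};

static struct serror_hook hooks[MAX_SERROR_HOOKS];
static size_t nr_hooks;

/* Register a hook; returns 0 on success, -1 if fn is NULL or the table is full. */
static int serror_hook_register(serror_hook_fn fn, void *ctx)
{
	if (fn == NULL || nr_hooks == MAX_SERROR_HOOKS)
		return -1;

	hooks[nr_hooks].fn = fn;
	hooks[nr_hooks].ctx = ctx;
	nr_hooks++;
	return 0;
}

/*
 * Offer the error to each hook in registration order. A false return means
 * nobody claimed it, i.e. the generic code would go on to panic.
 */
static bool serror_dispatch(uint64_t syndrome)
{
	for (size_t i = 0; i < nr_hooks; i++)
		if (hooks[i].fn(syndrome, hooks[i].ctx))
			return true;

	return false;
}

/* Example hook: claims the error only while a config read is in flight. */
static bool demo_pcie_hook(uint64_t syndrome, void *ctx)
{
	bool *cfg_read_in_flight = ctx;

	(void)syndrome;
	return *cfg_read_in_flight;
}
```

A driver in this model would set its flag around the config read and clear it afterwards, so that unrelated SErrors still fall through to the generic handling.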
[parent not found: <20230509153912.515218-1-vincenzopalazzodev@gmail.com>]
* Re: [PATCH v1] drivers: pci: introduce configurable delay for Rockchip PCIe bus scan
       [not found] <20230509153912.515218-1-vincenzopalazzodev@gmail.com>
@ 2023-11-20  4:15 ` Tom Fitzhenry
  0 siblings, 0 replies; 5+ messages in thread
From: Tom Fitzhenry @ 2023-11-20 4:15 UTC
To: vincenzopalazzodev
Cc: bhelgaas, broonie, heiko, kw, lgirdwood, linux-arm-kernel, linux-kernel-mentees, linux-kernel, linux-pci, linux-rockchip, lpieralisi, robh, shawn.lin, skhan, strit

My RockPro64 occasionally failed on boot with this crash dump, at
least since Linux 5.15. Since Linux 6.5, every boot fails in this
manner.

I applied a similar patch [0] against Linux 6.6 that sleeps during
probe, and I'm now able to boot successfully each time.

[0] https://gitlab.manjaro.org/manjaro-arm/packages/core/linux/-/blob/44e81d83b7e002e9955ac3c54e276218dc9ac76d/1005-rk3399-rp64-pcie-Reimplement-rockchip-PCIe-bus-scan-delay.patch
Thread overview: 5+ messages
[not found] <CSK8M39MQL2C.3S7JO031H0BA2@vincent-arch>
[not found] ` <ZF7m1npzLZmawT8Y@bhelgaas>
[not found] ` <CAMdYzYqV72=pQa-U3a2N7MZ2ChBNL74QrxHQLbMZJxiftTK9sA@mail.gmail.com>
2023-05-15 11:04 ` [PATCH v1] drivers: pci: introduce configurable delay for Rockchip PCIe bus scan Vincenzo Palazzo
2023-05-15 16:51 ` Bjorn Helgaas
2023-05-15 20:52 ` Peter Geis
2023-07-12 15:42 ` Vincenzo Palazzo
[not found] <20230509153912.515218-1-vincenzopalazzodev@gmail.com>
2023-11-20 4:15 ` Tom Fitzhenry