linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
From: Marc Zyngier <maz@kernel.org>
To: Johan Hovold <johan@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-arm-kernel@lists.infradead.org, linux-pci@vger.kernel.org,
	anna-maria@linutronix.de, shawnguo@kernel.org,
	s.hauer@pengutronix.de, festevam@gmail.com, bhelgaas@google.com,
	rdunlap@infradead.org, vidyas@nvidia.com,
	ilpo.jarvinen@linux.intel.com, apatel@ventanamicro.com,
	kevin.tian@intel.com, nipun.gupta@amd.com, den@valinux.co.jp,
	andrew@lunn.ch, gregory.clement@bootlin.com,
	sebastian.hesselbarth@gmail.com, gregkh@linuxfoundation.org,
	rafael@kernel.org, alex.williamson@redhat.com, will@kernel.org,
	lorenzo.pieralisi@arm.com, jgg@mellanox.com,
	ammarfaizi2@gnuweeb.org, robin.murphy@arm.com,
	lpieralisi@kernel.org, nm@ti.com, kristo@kernel.org,
	vkoul@kernel.org, okaya@kernel.org, agross@kernel.org,
	andersson@kernel.org, mark.rutland@arm.com,
	shameerali.kolothum.thodi@huawei.com, yuzenghui@huawei.com
Subject: Re: [patch V4 00/21] genirq, irqchip: Convert ARM MSI handling to per device MSI domains
Date: Wed, 17 Jul 2024 13:54:40 +0100	[thread overview]
Message-ID: <86msmg2n73.wl-maz@kernel.org> (raw)
In-Reply-To: <Zpdxe4ce-XwDEods@hovoldconsulting.com>

On Wed, 17 Jul 2024 08:23:39 +0100,
Johan Hovold <johan@kernel.org> wrote:
> 
> On Tue, Jul 16, 2024 at 07:21:39PM +0100, Marc Zyngier wrote:
> > On Tue, 16 Jul 2024 15:53:28 +0100,
> > Johan Hovold <johan@kernel.org> wrote:
> > > On Tue, Jul 16, 2024 at 11:30:05AM +0100, Marc Zyngier wrote:
> > > > On Mon, 15 Jul 2024 15:10:01 +0100,
> > > > Johan Hovold <johan@kernel.org> wrote:
> > > > > On Mon, Jul 15, 2024 at 01:58:13PM +0100, Marc Zyngier wrote:
> > > > > > On Mon, 15 Jul 2024 12:18:47 +0100,
> > > > > > Johan Hovold <johan@kernel.org> wrote:
> 
> > > > > > > This series only showed up in linux-next last Friday and broke interrupt
> > > > > > > handling on Qualcomm platforms like sc8280xp (e.g. Lenovo ThinkPad X13s)
> > > > > > > and x1e80100 that use the GIC ITS for PCIe MSIs.
> > > > > > > 
> > > > > > > I've applied the series (21 commits from linux-next) on top of 6.10 and
> > > > > > > can confirm that the breakage is caused by commits:
> > > > > > > 
> > > > > > > 	3d1c927c08fc ("irqchip/gic-v3-its: Switch platform MSI to MSI parent")
> > > > > > > 	233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
> > > > > > > 
> > > > > > > Applying the series up until the change before 3d1c927c08fc unbreaks the
> > > > > > > wifi on one machine:
> > > > > > > 
> > > > > > > 	ath11k_pci 0006:01:00.0: failed to enable msi: -22
> > > > > > > 	ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22
> > > 
> > > Correction, this doesn't fix the wifi, but I'm not seeing these errors
> > > with the commit before cc23d1dfc959 as the ath11k driver doesn't get
> 
> [ This was supposed to say 3d1c927c08fc, which is the mainline hash,
> sorry. ]
> 
> > > this far (or doesn't probe at all).
> > 
> > I think we need to track one thing at a time. The wifi and nvme
> > problems seem subtly different... Which is the exact commit that
> > breaks nvme on your machine?
> 
> Yeah, forget about 3d1c927c08fc for now, which may have been a red
> herring since we're also appear to be dealing with some sort of race and
> (some) symptoms keep changing from boot to boot. The only thing that for
> certain is that the series breaks MSI and that the NVMe breaks with
> commit 233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for
> PCI/MSI[-X]").
> 
> > > > So is this issue actually tied to the async probing? Does it always
> > > > work if you disable it?
> > > 
> > > There seem to multiple issues here.
> > > 
> > > With the full series applied and normal async (i.e. parallel) probing of
> > > the PCIe controllers I sometimes see allocation failing with -ENOSPC
> > > (e.g. the above ath11k errors). This seems to indicate broken locking
> > > somewhere.
> > 
> > Your log doesn't support this theory. At least not from an ITS
> > perspective, as it keeps dishing out INTIDs (and it is very hard to
> > run out of IRQs with the ITS).
> 
> The log I shared was with synchronous probing which takes parallel
> allocation out of the equation (and gives more readable logs) so that is
> expected. See below for a log with normal async probing that may give
> some more insight into the race as well (i.e. when ath11k allocation
> fails with -ENOSPC.)

Huh, this log is actually pointing at something very ugly. Not a race,
but some horrible ID confusion. See below.

> 
> > > With synchronous probing, allocation always seems to succeed but the
> > > ath11k (and modem) drivers time out as no interrupts are received.
> > > 
> > > The NVMe driver sometimes falls back to INTx signalling and can access
> > > the drive, but often end up with an MSIX (?!) allocation and then fails
> > > to probe:
> > > 
> > > 	[  132.084740] nvme nvme0: I/O tag 17 (1011) QID 0 timeout, completion polled
> > 
> > So one of my test boxes (ThunderX) fails this exact way, while another
> > (Synquacer) is pretty happy. Still trying to understand the difference
> > in behaviour.
> > 
> > How do you enforce synchronous probing?
> 
> I believe there is a kernel parameter for this (e.g.
> module.async_probe), but I just disable async probing for the Qualcomm
> PCIe driver I'm using:

I had tried this module parameter, but it didn't change anything on my
end.

> 
> --- a/drivers/pci/controller/dwc/pcie-qcom.c
> +++ b/drivers/pci/controller/dwc/pcie-qcom.c
> @@ -1684,7 +1684,7 @@ static struct platform_driver qcom_pcie_driver = {
>                 .name = "qcom-pcie",
>                 .of_match_table = qcom_pcie_match,
>                 .pm = &qcom_pcie_pm_ops,
> -               .probe_type = PROBE_PREFER_ASYNCHRONOUS,
> +               //.probe_type = PROBE_PREFER_ASYNCHRONOUS,
>         },
>  };

I'll have a look whether the TX1 PCIe driver uses this. It's
positively ancient, so I wouldn't bet that it has been touched
significantly in the past 5 years.

[...]

> [    8.692011] Reusing ITT for devID 0
> [    8.693668] Reusing ITT for devID 0

This is really odd. It indicates that you have several devices sharing
the same DeviceID, which I seriously doubt it is the case in a
laptop. Do you have any non-transparent bridge here? lspci would help.

> [    8.693871] pcieport 0006:00:00.0: PME: Signaling with IRQ 228
> [    8.694116] pcieport 0006:00:00.0: AER: enabled with IRQ 228
> [    8.696453] pci 0004:00:00.0: PCI bridge to [bus 01-ff]
> [    8.703760] IRQ206 -> 0-7 CPU2
> [    8.710986] pci 0004:00:00.0:   bridge window [mem 0x34300000-0x343fffff]
> [    8.711136] Reusing ITT for devID 0

Where is the bus number gone?

> [    8.717093] IRQ207 -> 0-7 CPU3
> [    8.723889] Reusing ITT for devID 0
> [    8.729600] IRQ208 -> 0-7 CPU4
> [    8.736507] pcieport 0004:00:00.0: PME: Signaling with IRQ 229
> [    8.744261] IRQ209 -> 0-7 CPU5
> [    8.750757] pcieport 0004:00:00.0: AER: enabled with IRQ 229
> [    8.758038] IRQ210 -> 0-7 CPU6
> [    9.071793] IRQ211 -> 0-7 CPU7
> [    9.071807] IRQ212 -> 0-7 CPU0
> [    9.071819] IRQ213 -> 0-7 CPU1
> [    9.071831] IRQ214 -> 0-7 CPU2
> [    9.071842] IRQ215 -> 0-7 CPU3
> [    9.071852] IRQ216 -> 0-7 CPU4
> [    9.071863] IRQ217 -> 0-7 CPU5
> [    9.071875] IRQ218 -> 0-7 CPU6
> [    9.071886] IRQ219 -> 0-7 CPU7
> [    9.071897] IRQ220 -> 0-7 CPU0
> [    9.071907] IRQ221 -> 0-7 CPU1
> [    9.071920] IRQ222 -> 0-7 CPU2
> [    9.071930] IRQ223 -> 0-7 CPU3
> [    9.071941] IRQ224 -> 0-7 CPU4
> [    9.071952] IRQ225 -> 0-7 CPU5
> [    9.071962] IRQ226 -> 0-7 CPU6
> [    9.071973] IRQ227 -> 0-7 CPU7
> [    9.073568] Reusing ITT for devID 0
> [    9.073607] ID:0 pID:8192 vID:196
> [    9.073618] IRQ196 -> 0-7 CPU0
> [    9.073717] IRQ196 -> 0-7 CPU0
> [    9.073737] pcieport 0002:00:00.0: PME: Signaling with IRQ 196
> [    9.086532] pcieport 0002:00:00.0: AER: enabled with IRQ 196
> [    9.102057] mhi-pci-generic 0004:01:00.0: MHI PCI device found: foxconn-sdx55
> [    9.109830] mhi-pci-generic 0004:01:00.0: BAR 0 [mem 0x34300000-0x34300fff 64bit]: assigned
> [    9.119027] mhi-pci-generic 0004:01:00.0: enabling device (0000 -> 0002)
> [    9.127271] ITS: alloc 8224:8
> [    9.141500] ITT 8 entries, 3 bits
> [    9.144502] ID:0 pID:8224 vID:198
> [    9.144597] ID:1 pID:8225 vID:199
> [    9.144605] ID:2 pID:8226 vID:200
> [    9.144612] ID:3 pID:8227 vID:201
> [    9.144619] ID:4 pID:8228 vID:202
> [    9.144689] IRQ198 -> 0-7 CPU1
> [    9.144888] IRQ199 -> 0-7 CPU2
> [    9.144901] IRQ200 -> 0-7 CPU3
> [    9.144914] IRQ201 -> 0-7 CPU4
> [    9.144927] IRQ202 -> 0-7 CPU5
> [    9.151264] IRQ198 -> 0-7 CPU1
> [    9.151479] IRQ199 -> 0-7 CPU2
> [    9.151673] IRQ200 -> 0-7 CPU3
> [    9.151849] IRQ201 -> 0-7 CPU4
> [    9.152056] IRQ202 -> 0-7 CPU5
> [    9.159972] mhi mhi0: Requested to power ON
> [    9.165275] mhi mhi0: Power on setup success
> [    9.279951] ath11k_pci 0006:01:00.0: BAR 0 [mem 0x30400000-0x305fffff 64bit]: assigned
> [    9.288208] ath11k_pci 0006:01:00.0: enabling device (0000 -> 0002)
> [    9.301708] nvme nvme0: pci function 0002:01:00.0
> [    9.307052] Reusing ITT for devID 100
> [    9.315457] nvme 0002:01:00.0: enabling device (0000 -> 0002)

This is device 0002:01:00.0...

> [    9.326554] Reusing ITT for devID 100

... seen as device 0000:01:00.0. WTF???

> [    9.336332] ath11k_pci 0006:01:00.0: ath11k_pci_alloc_msi - requesting one vector failed: -28

I'm starting to suspect that the new code doesn't carry all the
required bits for the DevID, and that we end-up trying to allocated
interrupts from the pool allocated to another device, which can never
be a good thing, and would explain why everything dies a painful
death.

Can you run the same trace with the whole thing reverted? I think
we're on something here.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


  reply	other threads:[~2024-07-17 12:56 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-23 15:18 [patch V4 00/21] genirq, irqchip: Convert ARM MSI handling to per device MSI domains Thomas Gleixner
2024-06-23 15:18 ` [patch V4 01/21] PCI/MSI: Provide MSI_FLAG_PCI_MSI_MASK_PARENT Thomas Gleixner
2024-06-26 19:05   ` [patch V4-1 " Thomas Gleixner
2024-06-23 15:18 ` [patch V4 02/21] irqchip: Provide irq-msi-lib Thomas Gleixner
2024-07-01 10:18   ` Lorenzo Pieralisi
2024-07-03 13:57     ` Thomas Gleixner
2024-06-23 15:18 ` [patch V4 03/21] irqchip/gic-v3-its: Provide MSI parent infrastructure Thomas Gleixner
2024-06-23 15:18 ` [patch V4 04/21] irqchip/irq-msi-lib: Prepare for PCI MSI/MSIX Thomas Gleixner
2024-06-23 15:18 ` [patch V4 05/21] irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X] Thomas Gleixner
2024-06-28 22:24   ` Catalin Marinas
2024-06-29  8:37     ` Thomas Gleixner
2024-06-29  9:42       ` Marc Zyngier
2024-06-29  9:50         ` Marc Zyngier
2024-06-29 10:11           ` Marc Zyngier
2024-06-29 10:44             ` Thomas Gleixner
2024-06-29 19:51               ` Thomas Gleixner
2024-06-30  9:55                 ` Catalin Marinas
2024-06-29  9:18     ` Marc Zyngier
2024-06-23 15:18 ` [patch V4 06/21] irqchip/irq-msi-lib: Prepare for DEVICE MSI to replace platform MSI Thomas Gleixner
2024-06-23 15:18 ` [patch V4 07/21] irqchip/mbigen: Prepare for real per device MSI Thomas Gleixner
2024-06-23 15:18 ` [patch V4 08/21] irqchip/irq-msi-lib: Prepare for DOMAIN_BUS_WIRED_TO_MSI Thomas Gleixner
2024-06-23 15:18 ` [patch V4 09/21] irqchip/gic-v3-its: Switch platform MSI to MSI parent Thomas Gleixner
2024-06-23 15:18 ` [patch V4 10/21] irqchip/mbigen: Remove platform_msi_create_device_domain() fallback Thomas Gleixner
2024-06-25 14:42   ` Lorenzo Pieralisi
2024-06-26  9:13     ` Hanjun Guo
2024-06-23 15:18 ` [patch V4 11/21] genirq/msi: Remove platform_msi_create_device_domain() Thomas Gleixner
2024-06-23 15:18 ` [patch V4 12/21] irqchip/gic_v3_mbi: Switch over to parent domain Thomas Gleixner
2024-06-23 15:18 ` [patch V4 13/21] irqchip/gic-v2m: Switch to device MSI Thomas Gleixner
2024-06-23 15:18 ` [patch V4 14/21] irqchip/imx-mu-msi: Switch to MSI parent Thomas Gleixner
2024-06-23 15:18 ` [patch V4 15/21] irqchip/irq-mvebu-icu: Prepare for real per device MSI Thomas Gleixner
2024-06-23 15:18 ` [patch V4 16/21] irqchip/mvebu-gicp: Switch to MSI parent Thomas Gleixner
2024-06-23 15:19 ` [patch V4 17/21] irqchip/mvebu-odmi: Switch to parent MSI Thomas Gleixner
2024-06-23 15:19 ` [patch V4 18/21] irqchip/irq-mvebu-sei: Switch to MSI parent Thomas Gleixner
2024-06-23 15:19 ` [patch V4 19/21] irqchip/irq-mvebu-icu: Remove platform MSI leftovers Thomas Gleixner
2024-06-23 15:19 ` [patch V4 20/21] genirq/msi: " Thomas Gleixner
2024-06-25 10:02   ` Greg KH
2024-06-23 15:19 ` [patch V4 21/21] genirq/msi: Move msi_device_data to core Thomas Gleixner
2024-06-25 19:46 ` [patch V4 00/21] genirq, irqchip: Convert ARM MSI handling to per device MSI domains Rob Herring
2024-06-26 19:03   ` Thomas Gleixner
2024-07-15 11:18 ` Johan Hovold
2024-07-15 12:58   ` Marc Zyngier
2024-07-15 14:10     ` Johan Hovold
2024-07-16 10:30       ` Marc Zyngier
2024-07-16 14:53         ` Johan Hovold
2024-07-16 18:21           ` Marc Zyngier
2024-07-17  7:23             ` Johan Hovold
2024-07-17 12:54               ` Marc Zyngier [this message]
2024-07-17 13:38                 ` Johan Hovold
2024-07-17 18:07                   ` Marc Zyngier
2024-07-17 20:10                   ` Marc Zyngier
2024-07-18  7:30                     ` Johan Hovold
2024-07-15 13:10   ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86msmg2n73.wl-maz@kernel.org \
    --to=maz@kernel.org \
    --cc=agross@kernel.org \
    --cc=alex.williamson@redhat.com \
    --cc=ammarfaizi2@gnuweeb.org \
    --cc=andersson@kernel.org \
    --cc=andrew@lunn.ch \
    --cc=anna-maria@linutronix.de \
    --cc=apatel@ventanamicro.com \
    --cc=bhelgaas@google.com \
    --cc=den@valinux.co.jp \
    --cc=festevam@gmail.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=gregory.clement@bootlin.com \
    --cc=ilpo.jarvinen@linux.intel.com \
    --cc=jgg@mellanox.com \
    --cc=johan@kernel.org \
    --cc=kevin.tian@intel.com \
    --cc=kristo@kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lorenzo.pieralisi@arm.com \
    --cc=lpieralisi@kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=nipun.gupta@amd.com \
    --cc=nm@ti.com \
    --cc=okaya@kernel.org \
    --cc=rafael@kernel.org \
    --cc=rdunlap@infradead.org \
    --cc=robin.murphy@arm.com \
    --cc=s.hauer@pengutronix.de \
    --cc=sebastian.hesselbarth@gmail.com \
    --cc=shameerali.kolothum.thodi@huawei.com \
    --cc=shawnguo@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=vidyas@nvidia.com \
    --cc=vkoul@kernel.org \
    --cc=will@kernel.org \
    --cc=yuzenghui@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).