* [bug report] powerpc: per device MSI irq domain
@ 2025-12-02 11:17 Nilay Shroff
2025-12-04 10:48 ` Nam Cao
0 siblings, 1 reply; 5+ messages in thread
From: Nilay Shroff @ 2025-12-02 11:17 UTC (permalink / raw)
To: Nam Cao
Cc: linuxppc-dev, maddy, mpe, npiggin, christophe.leroy, tglx, maz,
gautam, Gregory Joyce
Hi Nam,
I have been using an NVMe disk on my PowerPC system that supports up to
129 MSI-X interrupt vectors. Everything worked fine until Linux kernel
v6.18, after which the NVMe driver stopped detecting the disk because
the driver probe now fails.
After further investigation, I found that the probe failure in v6.18
occurs during PCI/MSI-X vector allocation. A git bisect identified
commit daaa574aba6f (“powerpc/pseries/msi: Switch to
msi_create_parent_irq_domain()”) as the first bad commit.
Additional debugging showed that the driver probe fails when calling
msi_create_device_irq_domain(). My working hypothesis is that, although
the PCIe NVMe device advertises support for 129 MSI-X vectors, the pSeries
firmware can supply only 128 MSI vectors to the device. This mismatch
appears to cause MSI-X domain creation to fail, which ultimately results
in the NVMe driver failing to probe the device.
Device & MSI-X capability:
==========================
# lspci
0524:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01)
# lspci -vvv -s 0524:28:00.0 | grep -A2 MSI-X
Capabilities: [b0] MSI-X: Enable+ Count=129 Masked-
Vector table: BAR=0 offset=00005200
PBA: BAR=0 offset=0000d600
Relevant device tree excerpt (DTS):
pci@800000020000585 {
...
ibm,pe-total-#msi = <0x80>; /* 128 available under this PHB */
...
pci1014,6d1@0 {
...
ibm,msi-x-ranges = <0x1c 0x01>;
ibm,req#msi-x = <0x81>; /* device supports 0x81 == 129 */
...
}
}
As shown above, the device supports 0x81 (129) MSI-X vectors (ibm,req#msi-x),
but the PHB reports ibm,pe-total-#msi = 0x80 (128), indicating that the platform/firmware
provides only 128 MSI vectors for devices under that PHB.
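The mismatch between the two properties can be written out as a minimal sketch. The helper name `msi_grant()` is hypothetical and only models the arithmetic on the two device-tree values; it is not how the kernel or the firmware computes anything.

```c
/*
 * Hypothetical helper (illustration only): the number of vectors a device
 * could be given when it asks for `requested` but the platform caps the
 * partitionable endpoint at `pe_total`.
 */
static unsigned int msi_grant(unsigned int requested, unsigned int pe_total)
{
	return requested < pe_total ? requested : pe_total;
}
```

With the values above, `msi_grant(0x81, 0x80)` evaluates to 128, one vector fewer than the device advertises.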
Debugfs IRQ domain (on a kernel just before the bad commit):
===========================================================
# cat /sys/kernel/debug/irq/domains/:pci@800000020000524-3
name: :pci@800000020000524-3
size: 0
mapped: 65
flags: 0x00000013
IRQ_DOMAIN_FLAG_HIERARCHY
IRQ_DOMAIN_NAME_ALLOCATED
IRQ_DOMAIN_FLAG_MSI
parent: pSeries-MSI-1316
name: pSeries-MSI-1316
size: 128
mapped: 65
flags: 0x00000003
IRQ_DOMAIN_FLAG_HIERARCHY
IRQ_DOMAIN_NAME_ALLOCATED
parent: :interrupt-controller@400209f0000
...
This shows the parent domain (pSeries-MSI-1316) has size: 128.
From this, it appears the pSeries firmware or parent IRQ domain provides only
128 MSI vectors to the device, even though the device could support 129.
In the end, however, the MSI request was clamped to 65 IRQ vectors, and those
were mapped successfully.
Debugfs IRQ domain (running the latest kernel):
===============================================
# cat /sys/kernel/debug/irq/domains/\:pci@800000020000524-5
name: :pci@800000020000524-5
size: 128
mapped: 0
flags: 0x00000103
IRQ_DOMAIN_FLAG_HIERARCHY
IRQ_DOMAIN_NAME_ALLOCATED
IRQ_DOMAIN_FLAG_MSI_PARENT
parent: :interrupt-controller@400209f0000
name: :interrupt-controller@400209f0000
size: 0
mapped: 135
flags: 0x00000003
IRQ_DOMAIN_FLAG_HIERARCHY
IRQ_DOMAIN_NAME_ALLOCATED
I do not see a per-device domain such as pSeries-PCI-MSI-0524:28:00.0 being created,
and the device probe aborts with -22 (-EINVAL) during MSI/MSI-X allocation, as shown below.
# dmesg | grep "nvme 0524:28:00.0"
[ 15.000370] nvme 0524:28:00.0: ibm,query-pe-dma-windows(53) 280000 8000000 20000524 returned 0, lb=1000000 ps=103 wn=1
[ 15.000772] nvme 0524:28:00.0: ibm,create-pe-dma-window(54) 280000 8000000 20000524 15 25 returned 0 (liobn = 0x70000524 starting addr = 8000000 0)
[ 15.010030] nvme 0524:28:00.0: lsa_required: 0, lsa_enabled: 0, direct mapping: 1
[ 15.015637] nvme 0524:28:00.0: lsa_required: 0, lsa_enabled: 0, direct mapping: 1
[ 15.021223] nvme 0524:28:00.0: enabling device (0140 -> 0142)
[ 15.028379] nvme 0524:28:00.0: probe with driver nvme failed with error -22
Summary / hypothesis:
=====================
- The adapter advertises 129 MSI-X vectors, but the PHB/firmware reports 128 available
MSI vectors for devices in that PCI subtree (ibm,pe-total-#msi = 0x80).
- After the daaa574aba6f change, an allocation request for 129 vectors fails when the
parent only has 128 slots. This leads to msi_create_device_irq_domain() failing and
the NVMe driver probe aborting.
- Previously, the kernel ended up clamping the device’s request to fewer vectors (e.g., 65)
and the probe succeeded; after the change, the strict parent-domain allocation prevents this
graceful fall-back.
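To illustrate the fall-back described in the last bullet, here is a minimal userspace sketch in C. Both `fw_request()` and `request_with_fallback()` are hypothetical names, and the return convention (0 on success, a positive count when fewer vectors are available, a negative errno on error) is an assumption made for illustration, not a description of the RTAS interface.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for a firmware call (assumed convention):
 *   0   the request fits,
 *   > 0 the request is too large; the value is the count that would fit,
 *   < 0 invalid request.
 */
static int fw_request(int nvec, int fw_limit)
{
	if (nvec <= 0 || fw_limit <= 0)
		return -EINVAL;
	if (nvec > fw_limit)
		return fw_limit;	/* positive: retry with this count */
	return 0;
}

/*
 * Try the full request first; if the firmware answers with a smaller
 * positive count, retry once with that count. Returns the number of
 * vectors granted, or a negative errno.
 */
static int request_with_fallback(int nvec, int fw_limit)
{
	int ret = fw_request(nvec, fw_limit);

	if (ret > 0) {
		nvec = ret;
		ret = fw_request(nvec, fw_limit);
	}
	if (ret != 0)
		return ret < 0 ? ret : -ENOSPC;
	return nvec;
}
```

With the numbers from this report, `request_with_fallback(129, 128)` returns 128, whereas a strict check without the retry would simply fail.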
Please let me know if you would like any additional details captured.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [bug report] powerpc: per device MSI irq domain
2025-12-02 11:17 [bug report] powerpc: per device MSI irq domain Nilay Shroff
@ 2025-12-04 10:48 ` Nam Cao
2025-12-04 17:24 ` Nilay Shroff
0 siblings, 1 reply; 5+ messages in thread
From: Nam Cao @ 2025-12-04 10:48 UTC (permalink / raw)
To: Nilay Shroff
Cc: linuxppc-dev, maddy, mpe, npiggin, christophe.leroy, tglx, maz,
gautam, Gregory Joyce

Hi Nilay,

Nilay Shroff <nilay@linux.ibm.com> writes:
> I have been using an NVMe disk on my PowerPC system that supports up to
> 129 MSI-X interrupt vectors. Everything worked fine until Linux kernel
> v6.18, after which the NVMe driver stopped detecting the disk because
> the driver probe now fails.
>
> After further investigation, I found that the probe failure in v6.18
> occurs during PCI/MSI-X vector allocation. A git bisect identified
> commit daaa574aba6f (“powerpc/pseries/msi: Switch to
> msi_create_parent_irq_domain()”) as the first bad commit.

Thanks for the report. I can (kind of) reproduce the problem with QEMU.

I think moving rtas_prepare_msi_irqs() into pseries_irq_domain_alloc()
should resolve the problem. But I'm not sure because I don't understand
how RTAS works.

Does IBM have some documentation describing the RTAS API? I failed to
google it.

Best regards,
Nam

^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [bug report] powerpc: per device MSI irq domain
2025-12-04 10:48 ` Nam Cao
@ 2025-12-04 17:24 ` Nilay Shroff
2025-12-06 14:38 ` Nam Cao
0 siblings, 1 reply; 5+ messages in thread
From: Nilay Shroff @ 2025-12-04 17:24 UTC (permalink / raw)
To: Nam Cao
Cc: linuxppc-dev, maddy, mpe, npiggin, christophe.leroy, tglx, maz,
gautam, Gregory Joyce

On 12/4/25 4:18 PM, Nam Cao wrote:
> Hi Nilay,
>
> Nilay Shroff <nilay@linux.ibm.com> writes:
>> I have been using an NVMe disk on my PowerPC system that supports up to
>> 129 MSI-X interrupt vectors. Everything worked fine until Linux kernel
>> v6.18, after which the NVMe driver stopped detecting the disk because
>> the driver probe now fails.
>>
>> After further investigation, I found that the probe failure in v6.18
>> occurs during PCI/MSI-X vector allocation. A git bisect identified
>> commit daaa574aba6f (“powerpc/pseries/msi: Switch to
>> msi_create_parent_irq_domain()”) as the first bad commit.
>
> Thanks for the report. I can (kind of) reproduce the problem with QEMU.
>
> I think moving rtas_prepare_msi_irqs() into pseries_irq_domain_alloc()
> should resolve the problem. But I'm not sure because I don't understand
> how RTAS works.
>
> Does IBM have some documentation describing the RTAS API? I failed to
> google it.

Yes, you can find the architecture document here:
https://github.com/linuxppc/public-docs/blob/main/LoPAPR/LoPAR-20200812.pdf

You may refer to section 7 in the above document, which describes the RTAS API.

Thanks,
--Nilay

^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [bug report] powerpc: per device MSI irq domain
2025-12-04 17:24 ` Nilay Shroff
@ 2025-12-06 14:38 ` Nam Cao
2025-12-08 12:03 ` Nilay Shroff
0 siblings, 1 reply; 5+ messages in thread
From: Nam Cao @ 2025-12-06 14:38 UTC (permalink / raw)
To: Nilay Shroff
Cc: linuxppc-dev, maddy, mpe, npiggin, christophe.leroy, tglx, maz,
gautam, Gregory Joyce

Nilay Shroff <nilay@linux.ibm.com> writes:
> Yes, you can find the architecture document here:
> https://github.com/linuxppc/public-docs/blob/main/LoPAPR/LoPAR-20200812.pdf
>
> You may refer to section 7 in the above document, which describes the RTAS API.

Thank you, that helped a lot.

Can you please confirm that the below diff fixes the problem? It brings
back the "fallback" thing that you mentioned.

Best regards,
Nam

diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index a82aaa786e9e..8898a968a59b 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -19,6 +19,11 @@

 #include "pseries.h"

+struct pseries_msi_device {
+	unsigned int msi_quota;
+	unsigned int msi_used;
+};
+
 static int query_token, change_token;

 #define RTAS_QUERY_FN		0
@@ -433,8 +438,26 @@ static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device *dev
 	struct msi_domain_info *info = domain->host_data;
 	struct pci_dev *pdev = to_pci_dev(dev);
 	int type = (info->flags & MSI_FLAG_PCI_MSIX) ? PCI_CAP_ID_MSIX : PCI_CAP_ID_MSI;
+	int ret;
+
+	struct pseries_msi_device *pseries_dev __free(kfree)
+		= kmalloc(sizeof(*pseries_dev), GFP_KERNEL);
+	if (!pseries_dev)
+		return -ENOMEM;
+
+	ret = rtas_prepare_msi_irqs(pdev, nvec, type, arg);
+	if (ret > 0) {
+		nvec = ret;
+		ret = rtas_prepare_msi_irqs(pdev, nvec, type, arg);
+	}
+	if (ret < 0)
+		return ret;

-	return rtas_prepare_msi_irqs(pdev, nvec, type, arg);
+	pseries_dev->msi_quota = nvec;
+	pseries_dev->msi_used = 0;
+
+	arg->scratchpad[0].ptr = no_free_ptr(pseries_dev);
+	return 0;
 }

 /*
@@ -443,9 +466,13 @@ static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device *dev
  */
 static void pseries_msi_ops_teardown(struct irq_domain *domain, msi_alloc_info_t *arg)
 {
+	struct pseries_msi_device *pseries_dev = arg->scratchpad[0].ptr;
 	struct pci_dev *pdev = to_pci_dev(domain->dev);

 	rtas_disable_msi(pdev);
+
+	WARN_ON(pseries_dev->msi_used);
+	kfree(pseries_dev);
 }

 static void pseries_msi_shutdown(struct irq_data *d)
@@ -546,12 +573,18 @@ static int pseries_irq_domain_alloc(struct irq_domain *domain, unsigned int virq
 				    unsigned int nr_irqs, void *arg)
 {
 	struct pci_controller *phb = domain->host_data;
+	struct pseries_msi_device *pseries_dev;
 	msi_alloc_info_t *info = arg;
 	struct msi_desc *desc = info->desc;
 	struct pci_dev *pdev = msi_desc_to_pci_dev(desc);
 	int hwirq;
 	int i, ret;

+	pseries_dev = info->scratchpad[0].ptr;
+
+	if (pseries_dev->msi_used + nr_irqs > pseries_dev->msi_quota)
+		return -ENOSPC;
+
 	hwirq = rtas_query_irq_number(pci_get_pdn(pdev), desc->msi_index);
 	if (hwirq < 0) {
 		dev_err(&pdev->dev, "Failed to query HW IRQ: %d\n", hwirq);
@@ -567,9 +600,10 @@ static int pseries_irq_domain_alloc(struct irq_domain *domain, unsigned int virq
 			goto out;

 		irq_domain_set_hwirq_and_chip(domain, virq + i, hwirq + i,
-					      &pseries_msi_irq_chip, domain->host_data);
+					      &pseries_msi_irq_chip, pseries_dev);
 	}

+	pseries_dev->msi_used++;
 	return 0;

 out:
@@ -582,9 +616,11 @@ static void pseries_irq_domain_free(struct irq_domain *domain, unsigned int virq
 				    unsigned int nr_irqs)
 {
 	struct irq_data *d = irq_domain_get_irq_data(domain, virq);
-	struct pci_controller *phb = irq_data_get_irq_chip_data(d);
+	struct pseries_msi_device *pseries_dev = irq_data_get_irq_chip_data(d);
+	struct pci_controller *phb = domain->host_data;

 	pr_debug("%s bridge %pOF %d #%d\n", __func__, phb->dn, virq, nr_irqs);
+	pseries_dev->msi_used -= nr_irqs;
 	irq_domain_free_irqs_parent(domain, virq, nr_irqs);
 }

^ permalink raw reply related [flat|nested] 5+ messages in thread
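The quota bookkeeping that the diff above adds can be modelled in a few lines of userspace C. This is only a sketch: the field names follow the diff, but `struct msi_quota_model`, `model_alloc()` and `model_free()` are hypothetical, and the model charges `nr_irqs` on every allocation, whereas the diff increments `msi_used` by one per call.

```c
#include <errno.h>

/* Userspace model of the per-device counters (field names from the diff). */
struct msi_quota_model {
	unsigned int msi_quota;	/* vectors granted at prepare time */
	unsigned int msi_used;	/* vectors currently allocated */
};

/* Refuse an allocation that would exceed the quota, otherwise account it. */
static int model_alloc(struct msi_quota_model *m, unsigned int nr_irqs)
{
	if (m->msi_used + nr_irqs > m->msi_quota)
		return -ENOSPC;
	m->msi_used += nr_irqs;	/* assumption: charge per vector */
	return 0;
}

/* Return vectors to the pool; the caller must not free more than it holds. */
static void model_free(struct msi_quota_model *m, unsigned int nr_irqs)
{
	m->msi_used -= nr_irqs;
}
```

Rejecting the over-quota request with -ENOSPC, rather than failing domain creation outright, gives the caller the chance to retry with fewer vectors, which is the fall-back behaviour discussed in this thread.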
* Re: [bug report] powerpc: per device MSI irq domain
2025-12-06 14:38 ` Nam Cao
@ 2025-12-08 12:03 ` Nilay Shroff
0 siblings, 0 replies; 5+ messages in thread
From: Nilay Shroff @ 2025-12-08 12:03 UTC (permalink / raw)
To: Nam Cao
Cc: linuxppc-dev, maddy, mpe, npiggin, christophe.leroy, tglx, maz,
gautam, Gregory Joyce

On 12/6/25 8:08 PM, Nam Cao wrote:
> Nilay Shroff <nilay@linux.ibm.com> writes:
>> Yes, you can find the architecture document here:
>> https://github.com/linuxppc/public-docs/blob/main/LoPAPR/LoPAR-20200812.pdf
>>
>> You may refer to section 7 in the above document, which describes the RTAS API.
>
> Thank you, that helped a lot.
>
> Can you please confirm that the below diff fixes the problem? It brings
> back the "fallback" thing that you mentioned.
>
> Best regards,
> Nam
>
> diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
> index a82aaa786e9e..8898a968a59b 100644
> --- a/arch/powerpc/platforms/pseries/msi.c
> +++ b/arch/powerpc/platforms/pseries/msi.c
> @@ -19,6 +19,11 @@
>
>  #include "pseries.h"
>
> +struct pseries_msi_device {
> +	unsigned int msi_quota;
> +	unsigned int msi_used;
> +};
> +
>  static int query_token, change_token;
>
>  #define RTAS_QUERY_FN		0
> @@ -433,8 +438,26 @@ static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device *dev
>  	struct msi_domain_info *info = domain->host_data;
>  	struct pci_dev *pdev = to_pci_dev(dev);
>  	int type = (info->flags & MSI_FLAG_PCI_MSIX) ? PCI_CAP_ID_MSIX : PCI_CAP_ID_MSI;
> +	int ret;
> +
> +	struct pseries_msi_device *pseries_dev __free(kfree)
> +		= kmalloc(sizeof(*pseries_dev), GFP_KERNEL);
> +	if (!pseries_dev)
> +		return -ENOMEM;
> +
> +	ret = rtas_prepare_msi_irqs(pdev, nvec, type, arg);
> +	if (ret > 0) {
> +		nvec = ret;
> +		ret = rtas_prepare_msi_irqs(pdev, nvec, type, arg);
> +	}
> +	if (ret < 0)
> +		return ret;
>
> -	return rtas_prepare_msi_irqs(pdev, nvec, type, arg);
> +	pseries_dev->msi_quota = nvec;
> +	pseries_dev->msi_used = 0;
> +
> +	arg->scratchpad[0].ptr = no_free_ptr(pseries_dev);
> +	return 0;
>  }
>
>  /*
> @@ -443,9 +466,13 @@ static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device *dev
>   */
>  static void pseries_msi_ops_teardown(struct irq_domain *domain, msi_alloc_info_t *arg)
>  {
> +	struct pseries_msi_device *pseries_dev = arg->scratchpad[0].ptr;
>  	struct pci_dev *pdev = to_pci_dev(domain->dev);
>
>  	rtas_disable_msi(pdev);
> +
> +	WARN_ON(pseries_dev->msi_used);
> +	kfree(pseries_dev);
>  }
>
>  static void pseries_msi_shutdown(struct irq_data *d)
> @@ -546,12 +573,18 @@ static int pseries_irq_domain_alloc(struct irq_domain *domain, unsigned int virq
>  				    unsigned int nr_irqs, void *arg)
>  {
>  	struct pci_controller *phb = domain->host_data;
> +	struct pseries_msi_device *pseries_dev;
>  	msi_alloc_info_t *info = arg;
>  	struct msi_desc *desc = info->desc;
>  	struct pci_dev *pdev = msi_desc_to_pci_dev(desc);
>  	int hwirq;
>  	int i, ret;
>
> +	pseries_dev = info->scratchpad[0].ptr;
> +
> +	if (pseries_dev->msi_used + nr_irqs > pseries_dev->msi_quota)
> +		return -ENOSPC;
> +
>  	hwirq = rtas_query_irq_number(pci_get_pdn(pdev), desc->msi_index);
>  	if (hwirq < 0) {
>  		dev_err(&pdev->dev, "Failed to query HW IRQ: %d\n", hwirq);
> @@ -567,9 +600,10 @@ static int pseries_irq_domain_alloc(struct irq_domain *domain, unsigned int virq
>  			goto out;
>
>  		irq_domain_set_hwirq_and_chip(domain, virq + i, hwirq + i,
> -					      &pseries_msi_irq_chip, domain->host_data);
> +					      &pseries_msi_irq_chip, pseries_dev);
>  	}
>
> +	pseries_dev->msi_used++;
>  	return 0;
>
>  out:
> @@ -582,9 +616,11 @@ static void pseries_irq_domain_free(struct irq_domain *domain, unsigned int virq
>  				    unsigned int nr_irqs)
>  {
>  	struct irq_data *d = irq_domain_get_irq_data(domain, virq);
> -	struct pci_controller *phb = irq_data_get_irq_chip_data(d);
> +	struct pseries_msi_device *pseries_dev = irq_data_get_irq_chip_data(d);
> +	struct pci_controller *phb = domain->host_data;
>
>  	pr_debug("%s bridge %pOF %d #%d\n", __func__, phb->dn, virq, nr_irqs);
> +	pseries_dev->msi_used -= nr_irqs;
>  	irq_domain_free_irqs_parent(domain, virq, nr_irqs);
>  }
>

Thanks for the patch! I tested it on my system and confirmed that it
fixes the bug reported earlier. That said, if you're planning to send a
formal patch upstream with the above change, then please feel free to add:

Acked-by: Nilay Shroff <nilay@linux.ibm.com>

^ permalink raw reply [flat|nested] 5+ messages in thread