linuxppc-dev.lists.ozlabs.org archive mirror
* [bug report] powerpc: per device MSI irq domain
@ 2025-12-02 11:17 Nilay Shroff
  2025-12-04 10:48 ` Nam Cao
  0 siblings, 1 reply; 5+ messages in thread
From: Nilay Shroff @ 2025-12-02 11:17 UTC (permalink / raw)
  To: Nam Cao
  Cc: linuxppc-dev, maddy, mpe, npiggin, christophe.leroy, tglx, maz,
	gautam, Gregory Joyce

Hi Nam,

I have been using an NVMe disk on my PowerPC system that supports up to
129 MSI-X interrupt vectors. Everything worked fine until Linux kernel
v6.18, after which the NVMe driver stopped detecting the disk because
the driver probe now fails.

After further investigation, I found that the probe failure in v6.18
occurs during PCI/MSI-X vector allocation. A git bisect identified
commit daaa574aba6f (“powerpc/pseries/msi: Switch to
msi_create_parent_irq_domain()”) as the first bad commit.

Additional debugging showed that the driver probe fails when calling
msi_create_device_irq_domain(). My working hypothesis is that, although
the PCIe NVMe device advertises support for 129 MSI-X vectors, the pSeries
firmware can supply only 128 MSI vectors to the device. This mismatch 
appears to cause MSI-X domain creation to fail, which ultimately results
in the NVMe driver failing to probe the device.

Device & MSI-X capability:
==========================

# lspci 
0524:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01)

# lspci -vvv -s 0524:28:00.0 | grep -A2 MSI-X
	Capabilities: [b0] MSI-X: Enable+ Count=129 Masked-
		Vector table: BAR=0 offset=00005200
		PBA: BAR=0 offset=0000d600

Relevant device tree excerpt (DTS):

pci@800000020000585 {
    ...
    ibm,pe-total-#msi = <0x80>;            /* 128 available under this PHB */
    ...
    pci1014,6d1@0 {
        ...
        ibm,msi-x-ranges = <0x1c 0x01>;
        ibm,req#msi-x        = <0x81>;     /* device supports 0x81 == 129 */
        ...
    }
}

As shown above, the device supports 0x81 (129) MSI-X vectors (ibm,req#msi-x),
but the PHB reports ibm,pe-total-#msi = 0x80 (128), indicating that the
platform/firmware provides only 128 MSI vectors for devices under that PHB.
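
To make the mismatch concrete, here is a minimal sketch in plain userspace C
(not kernel code) of the arithmetic implied by the two properties above. The
helper names effective_msix_budget() and msix_shortfall() are made up for
illustration, and the idea that the usable count is simply the smaller of the
two values is an assumption that follows the hypothesis in this report; it has
not been confirmed from the firmware side.

```c
#include <stdint.h>

/* Values copied from the device tree excerpt above. */
#define REQ_MSIX      0x81u  /* ibm,req#msi-x: what the device asks for */
#define PE_TOTAL_MSI  0x80u  /* ibm,pe-total-#msi: what the PE can get */

/*
 * Hypothetical helper: the number of vectors a request could be granted,
 * assuming the platform limit simply caps the device's request.
 */
static uint32_t effective_msix_budget(uint32_t requested, uint32_t pe_total)
{
	return requested < pe_total ? requested : pe_total;
}

/* Hypothetical helper: how many requested vectors cannot be satisfied. */
static uint32_t msix_shortfall(uint32_t requested, uint32_t pe_total)
{
	return requested > pe_total ? requested - pe_total : 0;
}
```

With the values from this system, effective_msix_budget(REQ_MSIX, PE_TOTAL_MSI)
is 128 and msix_shortfall(REQ_MSIX, PE_TOTAL_MSI) is 1, i.e. the device asks
for exactly one vector more than the PE can provide.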

Debugfs IRQ domain (on a kernel just before the bad commit):
===========================================================

# cat /sys/kernel/debug/irq/domains/:pci@800000020000524-3
name:   :pci@800000020000524-3
 size:   0
 mapped: 65
 flags:  0x00000013
    IRQ_DOMAIN_FLAG_HIERARCHY
    IRQ_DOMAIN_NAME_ALLOCATED
    IRQ_DOMAIN_FLAG_MSI
 parent: pSeries-MSI-1316
    name:   pSeries-MSI-1316
     size:   128
     mapped: 65
     flags:  0x00000003
        IRQ_DOMAIN_FLAG_HIERARCHY
        IRQ_DOMAIN_NAME_ALLOCATED
     parent: :interrupt-controller@400209f0000
        ...

This shows that the parent domain (pSeries-MSI-1316) has size: 128, so it
appears that the pSeries firmware or the parent IRQ domain provides only
128 MSI vectors to the device, even though the device could support 129.
In the end, the request was clamped to 65 IRQ vectors (mapped: 65), and
those were mapped successfully.

Debugfs IRQ domain (running the latest kernel):
===============================================

# cat   /sys/kernel/debug/irq/domains/\:pci@800000020000524-5 
name:   :pci@800000020000524-5
 size:   128
 mapped: 0
 flags:  0x00000103
            IRQ_DOMAIN_FLAG_HIERARCHY
            IRQ_DOMAIN_NAME_ALLOCATED
            IRQ_DOMAIN_FLAG_MSI_PARENT
 parent: :interrupt-controller@400209f0000
    name:   :interrupt-controller@400209f0000
     size:   0
     mapped: 135
     flags:  0x00000003
                IRQ_DOMAIN_FLAG_HIERARCHY
                IRQ_DOMAIN_NAME_ALLOCATED
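
For readers comparing the "flags:" words in the debugfs dumps (0x13 and 0x3
before the bad commit, 0x103 and 0x3 on the latest kernel), the following
userspace C sketch decodes such a word into flag names. The bit positions in
the table are an assumption: they are chosen to be consistent with the names
printed in the dumps and should be checked against include/linux/irqdomain.h
for the kernel in use. The helper decode_domain_flags() is hypothetical.

```c
#include <stdio.h>

struct flag_name {
	unsigned int bit;
	const char *name;
};

/* Assumed bit layout, consistent with the debugfs dumps in this report. */
static const struct flag_name known_flags[] = {
	{ 1u << 0, "IRQ_DOMAIN_FLAG_HIERARCHY" },
	{ 1u << 1, "IRQ_DOMAIN_NAME_ALLOCATED" },
	{ 1u << 4, "IRQ_DOMAIN_FLAG_MSI" },
	{ 1u << 8, "IRQ_DOMAIN_FLAG_MSI_PARENT" },
};

/*
 * Write the space-separated names of the set flags into buf and return
 * the bits that are not in the table (0 if every bit was recognised).
 */
static unsigned int decode_domain_flags(unsigned int flags, char *buf, size_t len)
{
	size_t i, used = 0;

	if (len)
		buf[0] = '\0';

	for (i = 0; i < sizeof(known_flags) / sizeof(known_flags[0]); i++) {
		if (!(flags & known_flags[i].bit))
			continue;
		flags &= ~known_flags[i].bit;
		if (used < len) {
			/* Append the name, separated by a space after the first. */
			int n = snprintf(buf + used, len - used, "%s%s",
					 used ? " " : "", known_flags[i].name);
			if (n > 0)
				used += (size_t)n;
		}
	}
	return flags;
}
```

Under that assumed layout, 0x13 decodes to HIERARCHY, NAME_ALLOCATED and MSI
(the old per-PHB MSI domain), while 0x103 decodes to HIERARCHY, NAME_ALLOCATED
and MSI_PARENT, which matches the switch to a parent MSI domain that the
bisected commit describes.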

I do not see a per-device domain such as pSeries-PCI-MSI-0524:28:00.0 being created,
and the device probe aborts with -22 (-EINVAL) during MSI/MSI-X allocation, as shown below.

# dmesg | grep "nvme 0524:28:00.0"
[   15.000370] nvme 0524:28:00.0: ibm,query-pe-dma-windows(53) 280000 8000000 20000524 returned 0, lb=1000000 ps=103 wn=1
[   15.000772] nvme 0524:28:00.0: ibm,create-pe-dma-window(54) 280000 8000000 20000524 15 25 returned 0 (liobn = 0x70000524 starting addr = 8000000 0)
[   15.010030] nvme 0524:28:00.0: lsa_required: 0, lsa_enabled: 0, direct mapping: 1
[   15.015637] nvme 0524:28:00.0: lsa_required: 0, lsa_enabled: 0, direct mapping: 1
[   15.021223] nvme 0524:28:00.0: enabling device (0140 -> 0142)
[   15.028379] nvme 0524:28:00.0: probe with driver nvme failed with error -22


Summary / hypothesis:
=====================
- The adapter advertises 129 MSI-X vectors, but the PHB/firmware reports 128 available
  MSI vectors for devices in that PCI subtree (ibm,pe-total-#msi = 0x80).

- After the daaa574aba6f change, an allocation request for 129 vectors fails when the
  parent only has 128 slots. This leads to msi_create_device_irq_domain() failing and
  the NVMe driver probe aborting.

- Previously, the kernel ended up clamping the device's request to fewer vectors (e.g., 65)
  and the probe succeeded; after the change, the strict parent-domain allocation prevents
  this graceful fallback.

Please let me know if you would like any additional details to be captured.

Thanks,
--Nilay



* Re: [bug report] powerpc: per device MSI irq domain
  2025-12-02 11:17 [bug report] powerpc: per device MSI irq domain Nilay Shroff
@ 2025-12-04 10:48 ` Nam Cao
  2025-12-04 17:24   ` Nilay Shroff
  0 siblings, 1 reply; 5+ messages in thread
From: Nam Cao @ 2025-12-04 10:48 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linuxppc-dev, maddy, mpe, npiggin, christophe.leroy, tglx, maz,
	gautam, Gregory Joyce

Hi Nilay,

Nilay Shroff <nilay@linux.ibm.com> writes:
> I have been using an NVMe disk on my PowerPC system that supports up to
> 129 MSI-X interrupt vectors. Everything worked fine until Linux kernel
> v6.18, after which the NVMe driver stopped detecting the disk because
> the driver probe now fails.
>
> After further investigation, I found that the probe failure in v6.18
> occurs during PCI/MSI-X vector allocation. A git bisect identified
> commit daaa574aba6f (“powerpc/pseries/msi: Switch to
> msi_create_parent_irq_domain()”) as the first bad commit.

Thanks for the report. I can (kind of) reproduce the problem with QEMU.

I think moving rtas_prepare_msi_irqs() into pseries_irq_domain_alloc()
should resolve the problem. But I'm not sure because I don't understand
how RTAS works.

Does IBM have some documentation describing the RTAS API? I failed to
google it.

Best regards,
Nam



* Re: [bug report] powerpc: per device MSI irq domain
  2025-12-04 10:48 ` Nam Cao
@ 2025-12-04 17:24   ` Nilay Shroff
  2025-12-06 14:38     ` Nam Cao
  0 siblings, 1 reply; 5+ messages in thread
From: Nilay Shroff @ 2025-12-04 17:24 UTC (permalink / raw)
  To: Nam Cao
  Cc: linuxppc-dev, maddy, mpe, npiggin, christophe.leroy, tglx, maz,
	gautam, Gregory Joyce



On 12/4/25 4:18 PM, Nam Cao wrote:
> Hi Nilay,
> 
> Nilay Shroff <nilay@linux.ibm.com> writes:
>> I have been using an NVMe disk on my PowerPC system that supports up to
>> 129 MSI-X interrupt vectors. Everything worked fine until Linux kernel
>> v6.18, after which the NVMe driver stopped detecting the disk because
>> the driver probe now fails.
>>
>> After further investigation, I found that the probe failure in v6.18
>> occurs during PCI/MSI-X vector allocation. A git bisect identified
>> commit daaa574aba6f (“powerpc/pseries/msi: Switch to
>> msi_create_parent_irq_domain()”) as the first bad commit.
> 
> Thanks for the report. I can (kind of) reproduce the problem with QEMU.
> 
> I think moving rtas_prepare_msi_irqs() into pseries_irq_domain_alloc()
> should resolve the problem. But I'm not sure because I don't understand
> how RTAS works.
> 
> Does IBM have some documentation describing the RTAS API? I failed to
> google it.

Yes, you can find the architecture document here:
https://github.com/linuxppc/public-docs/blob/main/LoPAPR/LoPAR-20200812.pdf

Please refer to section 7 of that document, which describes the RTAS API.

Thanks,
--Nilay



* Re: [bug report] powerpc: per device MSI irq domain
  2025-12-04 17:24   ` Nilay Shroff
@ 2025-12-06 14:38     ` Nam Cao
  2025-12-08 12:03       ` Nilay Shroff
  0 siblings, 1 reply; 5+ messages in thread
From: Nam Cao @ 2025-12-06 14:38 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linuxppc-dev, maddy, mpe, npiggin, christophe.leroy, tglx, maz,
	gautam, Gregory Joyce

Nilay Shroff <nilay@linux.ibm.com> writes:
> Yes you can find the architecture document here: 
> https://github.com/linuxppc/public-docs/blob/main/LoPAPR/LoPAR-20200812.pdf
>
> You may refer section 7 in the above document, which describes RTAS API.

Thank you, that helped a lot.

Can you please confirm that the below diff fixes the problem? It brings
back the "fallback" thing that you mentioned.

Best regards,
Nam

diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index a82aaa786e9e..8898a968a59b 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -19,6 +19,11 @@
 
 #include "pseries.h"
 
+struct pseries_msi_device {
+	unsigned int msi_quota;
+	unsigned int msi_used;
+};
+
 static int query_token, change_token;
 
 #define RTAS_QUERY_FN		0
@@ -433,8 +438,26 @@ static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device *dev
 	struct msi_domain_info *info = domain->host_data;
 	struct pci_dev *pdev = to_pci_dev(dev);
 	int type = (info->flags & MSI_FLAG_PCI_MSIX) ? PCI_CAP_ID_MSIX : PCI_CAP_ID_MSI;
+	int ret;
+
+	struct pseries_msi_device *pseries_dev __free(kfree)
+		= kmalloc(sizeof(*pseries_dev), GFP_KERNEL);
+	if (!pseries_dev)
+		return -ENOMEM;
+
+	ret = rtas_prepare_msi_irqs(pdev, nvec, type, arg);
+	if (ret > 0) {
+		nvec = ret;
+		ret = rtas_prepare_msi_irqs(pdev, nvec, type, arg);
+	}
+	if (ret < 0)
+		return ret;
 
-	return rtas_prepare_msi_irqs(pdev, nvec, type, arg);
+	pseries_dev->msi_quota = nvec;
+	pseries_dev->msi_used = 0;
+
+	arg->scratchpad[0].ptr = no_free_ptr(pseries_dev);
+	return 0;
 }
 
 /*
@@ -443,9 +466,13 @@ static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device *dev
  */
 static void pseries_msi_ops_teardown(struct irq_domain *domain, msi_alloc_info_t *arg)
 {
+	struct pseries_msi_device *pseries_dev = arg->scratchpad[0].ptr;
 	struct pci_dev *pdev = to_pci_dev(domain->dev);
 
 	rtas_disable_msi(pdev);
+
+	WARN_ON(pseries_dev->msi_used);
+	kfree(pseries_dev);
 }
 
 static void pseries_msi_shutdown(struct irq_data *d)
@@ -546,12 +573,18 @@ static int pseries_irq_domain_alloc(struct irq_domain *domain, unsigned int virq
 				    unsigned int nr_irqs, void *arg)
 {
 	struct pci_controller *phb = domain->host_data;
+	struct pseries_msi_device *pseries_dev;
 	msi_alloc_info_t *info = arg;
 	struct msi_desc *desc = info->desc;
 	struct pci_dev *pdev = msi_desc_to_pci_dev(desc);
 	int hwirq;
 	int i, ret;
 
+	pseries_dev = info->scratchpad[0].ptr;
+
+	if (pseries_dev->msi_used + nr_irqs > pseries_dev->msi_quota)
+		return -ENOSPC;
+
 	hwirq = rtas_query_irq_number(pci_get_pdn(pdev), desc->msi_index);
 	if (hwirq < 0) {
 		dev_err(&pdev->dev, "Failed to query HW IRQ: %d\n", hwirq);
@@ -567,9 +600,10 @@ static int pseries_irq_domain_alloc(struct irq_domain *domain, unsigned int virq
 			goto out;
 
 		irq_domain_set_hwirq_and_chip(domain, virq + i, hwirq + i,
-					      &pseries_msi_irq_chip, domain->host_data);
+					      &pseries_msi_irq_chip, pseries_dev);
 	}
 
+	pseries_dev->msi_used += nr_irqs;
 	return 0;
 
 out:
@@ -582,9 +616,11 @@ static void pseries_irq_domain_free(struct irq_domain *domain, unsigned int virq
 				    unsigned int nr_irqs)
 {
 	struct irq_data *d = irq_domain_get_irq_data(domain, virq);
-	struct pci_controller *phb = irq_data_get_irq_chip_data(d);
+	struct pseries_msi_device *pseries_dev = irq_data_get_irq_chip_data(d);
+	struct pci_controller *phb = domain->host_data;
 
 	pr_debug("%s bridge %pOF %d #%d\n", __func__, phb->dn, virq, nr_irqs);
+	pseries_dev->msi_used -= nr_irqs;
 	irq_domain_free_irqs_parent(domain, virq, nr_irqs);
 }
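
The quota bookkeeping in the diff above can be modelled in a few lines of
userspace C. This is only a sketch under stated assumptions: fw_request() is a
hypothetical stand-in for rtas_prepare_msi_irqs(), and its contract (0 on
success, a positive count meaning "retry with this many", negative on error)
is inferred from the "if (ret > 0)" retry in the diff rather than taken from
the RTAS documentation. struct fw_model and the quota_*() helpers are likewise
made up for illustration. The sketch adds nr_irqs on allocation so that the
alloc and free paths stay symmetric.

```c
#include <errno.h>

/* Hypothetical platform model: how many vectors firmware will grant. */
struct fw_model {
	unsigned int available;
};

/* Per-device bookkeeping, mirroring the two fields added in the diff. */
struct msi_quota {
	unsigned int msi_quota;
	unsigned int msi_used;
};

/* Mock firmware call; see the assumed contract in the text above. */
static int fw_request(const struct fw_model *fw, unsigned int nvec)
{
	if (nvec == 0)
		return -EINVAL;
	if (fw->available == 0)
		return -ENOSPC;
	return nvec <= fw->available ? 0 : (int)fw->available;
}

/* Prepare step: ask once and, if told a smaller count, retry with it. */
static int quota_prepare(const struct fw_model *fw, struct msi_quota *q,
			 unsigned int nvec)
{
	int ret = fw_request(fw, nvec);

	if (ret > 0) {
		nvec = (unsigned int)ret;
		ret = fw_request(fw, nvec);
	}
	if (ret < 0)
		return ret;

	q->msi_quota = nvec;
	q->msi_used = 0;
	return 0;
}

/* Allocation step: refuse anything that would exceed the granted quota. */
static int quota_alloc(struct msi_quota *q, unsigned int nr_irqs)
{
	if (q->msi_used + nr_irqs > q->msi_quota)
		return -ENOSPC;
	q->msi_used += nr_irqs;
	return 0;
}

/* Free step: give the vectors back to the per-device budget. */
static void quota_free(struct msi_quota *q, unsigned int nr_irqs)
{
	q->msi_used -= nr_irqs;
}
```

In this model, a request for 129 vectors against a platform that grants 128
ends up with a quota of 128 instead of failing outright; the 129th single
vector allocation is then rejected with -ENOSPC, which leaves the caller free
to carry on with fewer vectors.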
 



* Re: [bug report] powerpc: per device MSI irq domain
  2025-12-06 14:38     ` Nam Cao
@ 2025-12-08 12:03       ` Nilay Shroff
  0 siblings, 0 replies; 5+ messages in thread
From: Nilay Shroff @ 2025-12-08 12:03 UTC (permalink / raw)
  To: Nam Cao
  Cc: linuxppc-dev, maddy, mpe, npiggin, christophe.leroy, tglx, maz,
	gautam, Gregory Joyce



On 12/6/25 8:08 PM, Nam Cao wrote:
> Can you please confirm that the below diff fixes the problem? It brings
> back the "fallback" thing that you mentioned.

Thanks for the patch! I tested it on my system and can confirm that
it fixes the bug reported earlier. If you plan to send a formal patch
upstream with the above change, please feel free to add:

Acked-by: Nilay Shroff <nilay@linux.ibm.com>

