public inbox for linux-pci@vger.kernel.org
* [PATCH] PCI: dwc: ep: Flush before unmap in dw_pcie_ep_raise_msix_irq()
@ 2026-02-11 17:55 Niklas Cassel
  2026-02-11 19:26 ` Frank Li
  2026-02-25 21:44 ` Bjorn Helgaas
  0 siblings, 2 replies; 5+ messages in thread
From: Niklas Cassel @ 2026-02-11 17:55 UTC
  To: Jingoo Han, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Bjorn Helgaas,
	Kishon Vijay Abraham I, Gustavo Pimentel
  Cc: Shinichiro Kawasaki, Damien Le Moal, Koichiro Den, Niklas Cassel,
	linux-pci

When running e.g. fio with a larger queue depth against nvmet-pci-epf we
get IOMMU errors on the host, e.g.:

arm-smmu-v3 fc900000.iommu:      0x0000010000000010
arm-smmu-v3 fc900000.iommu:      0x0000020000000000
arm-smmu-v3 fc900000.iommu:      0x000000090000f040
arm-smmu-v3 fc900000.iommu:      0x0000000000000000
arm-smmu-v3 fc900000.iommu: event: F_TRANSLATION client: 0000:01:00.0 sid: 0x100 ssid: 0x0 iova: 0x90000f040 ipa: 0x0
arm-smmu-v3 fc900000.iommu: unpriv data write s1 "Input address caused fault" stag: 0x0

The reason for this is that the writel() is immediately followed by a call
to unmap(), which will tear down the outbound address translation.

PCI writes are posted, i.e. they don't wait for a completion. Thus, when
writel() returns, the write might not have completed yet, and could even
still be buffered in the PCI bridge at the time unmap() is called.

Flush the write by performing a readl() of the same address, to ensure that
the write has reached the destination before calling unmap().

This will add some latency, but that is certainly preferred over corrupting
the host memory.

The same problem was solved for dw_pcie_ep_raise_msi_irq(), in commit
8719c64e76bf ("PCI: dwc: ep: Cache MSI outbound iATU mapping"), however
there it was solved by dedicating an outbound iATU only for MSI. For MSI-X,
we can't do the same, as each vector can have a different msg_addr, and
because the msg_addr is allowed to be changed while the vector is masked.

Fixes: beb4641a787d ("PCI: dwc: Add MSI-X callbacks handler")
Signed-off-by: Niklas Cassel <cassel@kernel.org>
---
 drivers/pci/controller/dwc/pcie-designware-ep.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
index 5d8024d5e5c6..aef41f0218a3 100644
--- a/drivers/pci/controller/dwc/pcie-designware-ep.c
+++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
@@ -1005,6 +1005,9 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
 
 	writel(msg_data, ep->msi_mem + offset);
 
+	/* flush posted write before unmap */
+	readl(ep->msi_mem + offset);
+
 	dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
 
 	return 0;
-- 
2.53.0



* Re: [PATCH] PCI: dwc: ep: Flush before unmap in dw_pcie_ep_raise_msix_irq()
  2026-02-11 17:55 [PATCH] PCI: dwc: ep: Flush before unmap in dw_pcie_ep_raise_msix_irq() Niklas Cassel
@ 2026-02-11 19:26 ` Frank Li
  2026-02-12 12:47   ` Niklas Cassel
  2026-02-25 21:44 ` Bjorn Helgaas
  1 sibling, 1 reply; 5+ messages in thread
From: Frank Li @ 2026-02-11 19:26 UTC
  To: Niklas Cassel
  Cc: Jingoo Han, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Bjorn Helgaas,
	Kishon Vijay Abraham I, Gustavo Pimentel, Shinichiro Kawasaki,
	Damien Le Moal, Koichiro Den, linux-pci

On Wed, Feb 11, 2026 at 06:55:41PM +0100, Niklas Cassel wrote:
> When running e.g. fio with a larger queue depth against nvmet-pci-epf we
> get IOMMU errors on the host, e.g.:
>
> arm-smmu-v3 fc900000.iommu:      0x0000010000000010
> arm-smmu-v3 fc900000.iommu:      0x0000020000000000
> arm-smmu-v3 fc900000.iommu:      0x000000090000f040
> arm-smmu-v3 fc900000.iommu:      0x0000000000000000
> arm-smmu-v3 fc900000.iommu: event: F_TRANSLATION client: 0000:01:00.0 sid: 0x100 ssid: 0x0 iova: 0x90000f040 ipa: 0x0
> arm-smmu-v3 fc900000.iommu: unpriv data write s1 "Input address caused fault" stag: 0x0
>
> The reason for this is that the writel() is immediately followed by a call
> to unmap(), which will tear down the outbound address translation.
>
> PCI writes are posted, i.e. they don't wait for a completion. Thus, when
> writel() returns, the write might not have completed yet, and could even
> still be buffered in the PCI bridge at the time unmap() is called.
>
> Flush the write by performing a readl() of the same address, to ensure that
> the write has reached the destination before calling unmap().
>
> This will add some latency, but that is certainly preferred over corrupting
> the host memory.
>
> The same problem was solved for dw_pcie_ep_raise_msi_irq(), in commit
> 8719c64e76bf ("PCI: dwc: ep: Cache MSI outbound iATU mapping"), however
> there it was solved by dedicating an outbound iATU only for MSI. For MSI-X,
> we can't do the same, as each vector can have a different msg_addr, and
> because the msg_addr is allowed to be changed while the vector is masked.
>
> Fixes: beb4641a787d ("PCI: dwc: Add MSI-X callbacks handler")

Cc stable?

> Signed-off-by: Niklas Cassel <cassel@kernel.org>
> ---

Reviewed-by: Frank Li <Frank.Li@nxp.com>

>  drivers/pci/controller/dwc/pcie-designware-ep.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> index 5d8024d5e5c6..aef41f0218a3 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> @@ -1005,6 +1005,9 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
>
>  	writel(msg_data, ep->msi_mem + offset);
>
> +	/* flush posted write before unmap */
> +	readl(ep->msi_mem + offset);
> +
>  	dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
>
>  	return 0;
> --
> 2.53.0
>


* Re: [PATCH] PCI: dwc: ep: Flush before unmap in dw_pcie_ep_raise_msix_irq()
  2026-02-11 19:26 ` Frank Li
@ 2026-02-12 12:47   ` Niklas Cassel
  0 siblings, 0 replies; 5+ messages in thread
From: Niklas Cassel @ 2026-02-12 12:47 UTC
  To: Frank Li
  Cc: Jingoo Han, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Bjorn Helgaas,
	Kishon Vijay Abraham I, Gustavo Pimentel, Shinichiro Kawasaki,
	Damien Le Moal, Koichiro Den, linux-pci

On Wed, Feb 11, 2026 at 02:26:39PM -0500, Frank Li wrote:
> On Wed, Feb 11, 2026 at 06:55:41PM +0100, Niklas Cassel wrote:
> > When running e.g. fio with a larger queue depth against nvmet-pci-epf we
> > get IOMMU errors on the host, e.g.:
> >
> > arm-smmu-v3 fc900000.iommu:      0x0000010000000010
> > arm-smmu-v3 fc900000.iommu:      0x0000020000000000
> > arm-smmu-v3 fc900000.iommu:      0x000000090000f040
> > arm-smmu-v3 fc900000.iommu:      0x0000000000000000
> > arm-smmu-v3 fc900000.iommu: event: F_TRANSLATION client: 0000:01:00.0 sid: 0x100 ssid: 0x0 iova: 0x90000f040 ipa: 0x0
> > arm-smmu-v3 fc900000.iommu: unpriv data write s1 "Input address caused fault" stag: 0x0
> >
> > The reason for this is that the writel() is immediately followed by a call
> > to unmap(), which will tear down the outbound address translation.
> >
> > PCI writes are posted, i.e. they don't wait for a completion. Thus, when
> > writel() returns, the write might not have completed yet, and could even
> > still be buffered in the PCI bridge at the time unmap() is called.
> >
> > Flush the write by performing a readl() of the same address, to ensure that
> > the write has reached the destination before calling unmap().
> >
> > This will add some latency, but that is certainly preferred over corrupting
> > the host memory.
> >
> > The same problem was solved for dw_pcie_ep_raise_msi_irq(), in commit
> > 8719c64e76bf ("PCI: dwc: ep: Cache MSI outbound iATU mapping"), however
> > there it was solved by dedicating an outbound iATU only for MSI. For MSI-X,
> > we can't do the same, as each vector can have a different msg_addr, and
> > because the msg_addr is allowed to be changed while the vector is masked.
> >
> > Fixes: beb4641a787d ("PCI: dwc: Add MSI-X callbacks handler")
> 
> Cc stable?

Considering that the stable tooling backports everything with a Fixes: tag
nowadays, I don't really see the point in adding Cc: stable anymore.

Perhaps the maintainers can amend the commit message when applying if they
think that it will give it an even higher chance of being backported.


Kind regards,
Niklas


* Re: [PATCH] PCI: dwc: ep: Flush before unmap in dw_pcie_ep_raise_msix_irq()
  2026-02-11 17:55 [PATCH] PCI: dwc: ep: Flush before unmap in dw_pcie_ep_raise_msix_irq() Niklas Cassel
  2026-02-11 19:26 ` Frank Li
@ 2026-02-25 21:44 ` Bjorn Helgaas
  2026-02-25 22:34   ` Niklas Cassel
  1 sibling, 1 reply; 5+ messages in thread
From: Bjorn Helgaas @ 2026-02-25 21:44 UTC
  To: Niklas Cassel
  Cc: Jingoo Han, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Bjorn Helgaas,
	Kishon Vijay Abraham I, Gustavo Pimentel, Shinichiro Kawasaki,
	Damien Le Moal, Koichiro Den, linux-pci

On Wed, Feb 11, 2026 at 06:55:41PM +0100, Niklas Cassel wrote:
> When running e.g. fio with a larger queue depth against nvmet-pci-epf we
> get IOMMU errors on the host, e.g.:
> 
> arm-smmu-v3 fc900000.iommu:      0x0000010000000010
> arm-smmu-v3 fc900000.iommu:      0x0000020000000000
> arm-smmu-v3 fc900000.iommu:      0x000000090000f040
> arm-smmu-v3 fc900000.iommu:      0x0000000000000000
> arm-smmu-v3 fc900000.iommu: event: F_TRANSLATION client: 0000:01:00.0 sid: 0x100 ssid: 0x0 iova: 0x90000f040 ipa: 0x0
> arm-smmu-v3 fc900000.iommu: unpriv data write s1 "Input address caused fault" stag: 0x0
> 
> The reason for this is that the writel() is immediately followed by a call
> to unmap(), which will tear down the outbound address translation.
> 
> PCI writes are posted, i.e. they don't wait for a completion. Thus, when
> writel() returns, the write might not have completed yet, and could even
> still be buffered in the PCI bridge at the time unmap() is called.
> 
> Flush the write by performing a readl() of the same address, to ensure that
> the write has reached the destination before calling unmap().
> 
> This will add some latency, but that is certainly preferred over corrupting
> the host memory.
> 
> The same problem was solved for dw_pcie_ep_raise_msi_irq(), in commit
> 8719c64e76bf ("PCI: dwc: ep: Cache MSI outbound iATU mapping"), however
> there it was solved by dedicating an outbound iATU only for MSI. For MSI-X,
> we can't do the same, as each vector can have a different msg_addr, and
> because the msg_addr is allowed to be changed while the vector is masked.
> 
> Fixes: beb4641a787d ("PCI: dwc: Add MSI-X callbacks handler")
> Signed-off-by: Niklas Cassel <cassel@kernel.org>

beb4641a787d appeared in v4.19 (2018!) so it doesn't strictly qualify
as a post-merge window fix, but I do understand that it fixes a
problem similar to the 8719c64e76bf bug that we added in v7.0.

I don't think 8719c64e76bf and its fix make it any more likely that
we'll hit *this* problem, but it's certainly a trivial low-risk
change.

I put this on pci/for-linus for v7.0, thanks!

> ---
>  drivers/pci/controller/dwc/pcie-designware-ep.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> index 5d8024d5e5c6..aef41f0218a3 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> @@ -1005,6 +1005,9 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
>  
>  	writel(msg_data, ep->msi_mem + offset);
>  
> +	/* flush posted write before unmap */
> +	readl(ep->msi_mem + offset);
> +
>  	dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
>  
>  	return 0;
> -- 
> 2.53.0
> 


* Re: [PATCH] PCI: dwc: ep: Flush before unmap in dw_pcie_ep_raise_msix_irq()
  2026-02-25 21:44 ` Bjorn Helgaas
@ 2026-02-25 22:34   ` Niklas Cassel
  0 siblings, 0 replies; 5+ messages in thread
From: Niklas Cassel @ 2026-02-25 22:34 UTC
  To: Bjorn Helgaas
  Cc: Jingoo Han, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Bjorn Helgaas,
	Kishon Vijay Abraham I, Gustavo Pimentel, Shinichiro Kawasaki,
	Damien Le Moal, Koichiro Den, linux-pci

On Wed, Feb 25, 2026 at 03:44:40PM -0600, Bjorn Helgaas wrote:
> On Wed, Feb 11, 2026 at 06:55:41PM +0100, Niklas Cassel wrote:
> > When running e.g. fio with a larger queue depth against nvmet-pci-epf we
> > get IOMMU errors on the host, e.g.:
> > 
> > arm-smmu-v3 fc900000.iommu:      0x0000010000000010
> > arm-smmu-v3 fc900000.iommu:      0x0000020000000000
> > arm-smmu-v3 fc900000.iommu:      0x000000090000f040
> > arm-smmu-v3 fc900000.iommu:      0x0000000000000000
> > arm-smmu-v3 fc900000.iommu: event: F_TRANSLATION client: 0000:01:00.0 sid: 0x100 ssid: 0x0 iova: 0x90000f040 ipa: 0x0
> > arm-smmu-v3 fc900000.iommu: unpriv data write s1 "Input address caused fault" stag: 0x0
> > 
> > The reason for this is that the writel() is immediately followed by a call
> > to unmap(), which will tear down the outbound address translation.
> > 
> > PCI writes are posted, i.e. they don't wait for a completion. Thus, when
> > writel() returns, the write might not have completed yet, and could even
> > still be buffered in the PCI bridge at the time unmap() is called.
> > 
> > Flush the write by performing a readl() of the same address, to ensure that
> > the write has reached the destination before calling unmap().
> > 
> > This will add some latency, but that is certainly preferred over corrupting
> > the host memory.
> > 
> > The same problem was solved for dw_pcie_ep_raise_msi_irq(), in commit
> > 8719c64e76bf ("PCI: dwc: ep: Cache MSI outbound iATU mapping"), however
> > there it was solved by dedicating an outbound iATU only for MSI. For MSI-X,
> > we can't do the same, as each vector can have a different msg_addr, and
> > because the msg_addr is allowed to be changed while the vector is masked.
> > 
> > Fixes: beb4641a787d ("PCI: dwc: Add MSI-X callbacks handler")
> > Signed-off-by: Niklas Cassel <cassel@kernel.org>
> 
> beb4641a787d appeared in v4.19 (2018!) so it doesn't strictly qualify
> as a post-merge window fix, but I do understand that it fixes a
> problem similar to the 8719c64e76bf bug that we added in v7.0.

Yes, the problem has been there a very long time.
(And I am basically the guilty one, as the commit that implemented
dw_pcie_ep_raise_msix_irq() basically copied dw_pcie_ep_raise_msi_irq()
which was originally written by me.)

However, the problem is extremely easy to reproduce with nvmet-pci-epf.

Just do a fio --rw=randread --bs=4k --iodepth=32
and you trigger it within a few seconds.

While pci-epf-test has a read and a write test case, these test cases
only raise a single IRQ at the end of the test.

nvmet-pci-epf raises an IRQ after each I/O is completed.

The problem is easier to reproduce the more IRQs you trigger.
E.g. when you run fio with --iodepth=1, you don't trigger the bug.
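
To make the race from the commit message concrete, here is a toy userspace
model of it. This is only a sketch under loud assumptions: struct window,
posted_write(), drain(), flushing_read(), unmap() and raise_irq() are all
hypothetical names invented for illustration, not kernel or DWC APIs, and a
one-entry buffer stands in for whatever buffering the real hardware does.
It shows why a read-back between the write and the unmap removes the window
in which the write can leave after the translation is gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one outbound window with a one-entry write buffer. */
struct window {
	bool mapped;		/* is the outbound translation programmed? */
	bool pending;		/* is a posted write still buffered? */
	uint32_t pending_val;	/* data of the buffered write */
	uint32_t target;	/* stands in for the host's MSI-X target */
	unsigned int faults;	/* writes that drained after the unmap */
};

/* Posted write: returns at once, the data only sits in the buffer. */
static void posted_write(struct window *w, uint32_t val)
{
	assert(w->mapped);	/* the caller maps the window before writing */
	w->pending = true;
	w->pending_val = val;
}

/* Models the buffered write finally going out on the link. */
static void drain(struct window *w)
{
	if (!w->pending)
		return;
	w->pending = false;
	if (w->mapped)
		w->target = w->pending_val;
	else
		w->faults++;	/* translation already torn down */
}

/* A read may not pass earlier posted writes, so it drains them first. */
static uint32_t flushing_read(struct window *w)
{
	drain(w);
	return w->target;
}

/* Tear down the outbound translation. */
static void unmap(struct window *w)
{
	w->mapped = false;
}

/* Raise one interrupt; returns the number of writes that faulted. */
static unsigned int raise_irq(bool flush)
{
	struct window w = { .mapped = true };

	posted_write(&w, 0x1234);
	if (flush)
		(void)flushing_read(&w);	/* the read-back from the patch */
	unmap(&w);
	drain(&w);	/* worst case: the write only leaves now */
	return w.faults;
}
```

In this model raise_irq(false) reports one faulting write and
raise_irq(true) reports none. The model always picks the worst-case
ordering; on real hardware the write usually wins the race, which fits the
observation that the bug only shows up when many IRQs are raised.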


At least I am glad that we have finally discovered and fixed this bug
after such a long time.

We have the pci-epf-mhi driver, the pci-epf-ntb, and the pci-epf-vntb
driver, but since this problem has not been discovered before, it is
obvious that they don't raise as many IRQs as nvmet-pci-epf.
And if you look at those EPF drivers, pci-epf-mhi and pci-epf-ntb only
raise an interrupt once after link up.

pci-epf-vntb appears to do it on each doorbell_set(), but that is
probably also not using interrupts nearly as much as nvmet-pci-epf.


Kind regards,
Niklas

