* [PATCH 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device
@ 2025-12-10 17:14 Jinhui Guo
2025-12-10 17:14 ` [PATCH 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode Jinhui Guo
2025-12-10 17:14 ` [PATCH 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in " Jinhui Guo
0 siblings, 2 replies; 6+ messages in thread
From: Jinhui Guo @ 2025-12-10 17:14 UTC (permalink / raw)
To: dwmw2, baolu.lu, joro, will
Cc: haifeng.zhao, guojinhui.liam, iommu, linux-kernel
We hit hard-lockups when the Intel IOMMU waits indefinitely for an ATS invalidation
that cannot complete, especially under GDR high-load conditions.
1. Hard-lock when a passthrough PCIe NIC with ATS enabled goes link-down in Intel IOMMU
non-scalable mode. Two scenarios exist: NIC link-down with an explicit link-down
event and link-down without any event.
a) NIC link-down with an explicit link-down event.
Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
__context_flush_dev_iotlb.part.0
domain_context_clear_one_cb
pci_for_each_dma_alias
device_block_translation
blocking_domain_attach_dev
iommu_deinit_device
__iommu_group_remove_device
iommu_release_device
iommu_bus_notifier
blocking_notifier_call_chain
bus_notify
device_del
pci_remove_bus_device
pci_stop_and_remove_bus_device
pciehp_unconfigure_device
pciehp_disable_slot
pciehp_handle_presence_or_link_change
pciehp_ist
b) NIC link-down without an event - hard-lock on VM destroy.
Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
__context_flush_dev_iotlb.part.0
domain_context_clear_one_cb
pci_for_each_dma_alias
device_block_translation
blocking_domain_attach_dev
__iommu_attach_device
__iommu_device_set_domain
__iommu_group_set_domain_internal
iommu_detach_group
vfio_iommu_type1_detach_group
vfio_group_detach_container
vfio_group_fops_release
__fput
2. Hard-lock when a passthrough PCIe NIC with ATS enabled goes link-down in Intel IOMMU
scalable mode. A NIC link-down without an event hard-locks on VM destroy.
Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
intel_pasid_tear_down_entry
device_block_translation
blocking_domain_attach_dev
__iommu_attach_device
__iommu_device_set_domain
__iommu_group_set_domain_internal
iommu_detach_group
vfio_iommu_type1_detach_group
vfio_group_detach_container
vfio_group_fops_release
__fput
Fix both issues with two patches:
1. Skip dev-IOTLB flush for inaccessible devices in __context_flush_dev_iotlb() using
pci_device_is_present().
2. Use pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to
skip ATS invalidation in devtlb_invalidation_with_pasid().
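The guard that both patches add can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: struct model_dev, model_flush_sync(), flush_unguarded() and flush_guarded() are hypothetical names, the reachable field stands in for the result of pci_device_is_present(), and the synchronous wait is modeled by a return value rather than an actual spin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a passthrough endpoint; not a kernel type. */
struct model_dev {
	bool ats_enabled;   /* mirrors info->ats_enabled in the patches */
	bool reachable;     /* false once the link has dropped */
	int  flushes_sent;  /* number of invalidations actually queued */
};

enum flush_result { FLUSH_SKIPPED, FLUSH_DONE, FLUSH_WOULD_HANG };

/*
 * Model of the synchronous wait: the real code waits until the
 * invalidation completes. An unreachable device never completes it,
 * which the model reports as a result instead of spinning forever.
 */
static enum flush_result model_flush_sync(struct model_dev *dev)
{
	assert(dev != NULL);
	dev->flushes_sent++;
	return dev->reachable ? FLUSH_DONE : FLUSH_WOULD_HANG;
}

/* Unguarded path: only ATS enablement is checked before flushing. */
static enum flush_result flush_unguarded(struct model_dev *dev)
{
	if (!dev->ats_enabled)
		return FLUSH_SKIPPED;
	return model_flush_sync(dev);
}

/* Guarded path: skip the flush when the device is not reachable. */
static enum flush_result flush_guarded(struct model_dev *dev)
{
	if (!dev->ats_enabled)
		return FLUSH_SKIPPED;
	if (!dev->reachable)	/* stands in for !pci_device_is_present() */
		return FLUSH_SKIPPED;
	return model_flush_sync(dev);
}
```

In this model, a device with reachable == false makes flush_unguarded() report FLUSH_WOULD_HANG, while flush_guarded() returns FLUSH_SKIPPED without queuing anything.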
Jinhui Guo (2):
iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without
scalable mode
iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in
scalable mode
drivers/iommu/intel/pasid.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
--
2.20.1
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
2025-12-10 17:14 [PATCH 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Jinhui Guo
@ 2025-12-10 17:14 ` Jinhui Guo
2025-12-11 2:10 ` Baolu Lu
2025-12-10 17:14 ` [PATCH 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in " Jinhui Guo
1 sibling, 1 reply; 6+ messages in thread
From: Jinhui Guo @ 2025-12-10 17:14 UTC (permalink / raw)
To: dwmw2, baolu.lu, joro, will
Cc: haifeng.zhao, guojinhui.liam, iommu, linux-kernel

PCIe endpoints with ATS enabled and passed through to userspace
(e.g., QEMU, DPDK) can hard-lock the host when their link drops,
either by surprise removal or by a link fault.

Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
request when device is disconnected") adds pci_dev_is_disconnected()
to devtlb_invalidation_with_pasid() so ATS invalidation is skipped
only when the device is being safely removed, but it applies only
when Intel IOMMU scalable mode is enabled.

With scalable mode disabled or unsupported, a system hard-lock
occurs when a PCIe endpoint's link drops because the Intel IOMMU
waits indefinitely for an ATS invalidation that cannot complete.

Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
__context_flush_dev_iotlb.part.0
domain_context_clear_one_cb
pci_for_each_dma_alias
device_block_translation
blocking_domain_attach_dev
iommu_deinit_device
__iommu_group_remove_device
iommu_release_device
iommu_bus_notifier
blocking_notifier_call_chain
bus_notify
device_del
pci_remove_bus_device
pci_stop_and_remove_bus_device
pciehp_unconfigure_device
pciehp_disable_slot
pciehp_handle_presence_or_link_change
pciehp_ist

Commit 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release")
adds intel_pasid_teardown_sm_context() to intel_iommu_release_device(),
which calls qi_flush_dev_iotlb() and can also hard-lock the system
when a PCIe endpoint's link drops.

Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
__context_flush_dev_iotlb.part.0
intel_context_flush_no_pasid
device_pasid_table_teardown
pci_pasid_table_teardown
pci_for_each_dma_alias
intel_pasid_teardown_sm_context
intel_iommu_release_device
iommu_deinit_device
__iommu_group_remove_device
iommu_release_device
iommu_bus_notifier
blocking_notifier_call_chain
bus_notify
device_del
pci_remove_bus_device
pci_stop_and_remove_bus_device
pciehp_unconfigure_device
pciehp_disable_slot
pciehp_handle_presence_or_link_change
pciehp_ist

Sometimes the endpoint loses connection without a link-down event
(e.g., due to a link fault); killing the process (virsh destroy)
then hard-locks the host.

Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
__context_flush_dev_iotlb.part.0
domain_context_clear_one_cb
pci_for_each_dma_alias
device_block_translation
blocking_domain_attach_dev
__iommu_attach_device
__iommu_device_set_domain
__iommu_group_set_domain_internal
iommu_detach_group
vfio_iommu_type1_detach_group
vfio_group_detach_container
vfio_group_fops_release
__fput

pci_dev_is_disconnected() only covers safe-removal paths;
pci_device_is_present() tests accessibility by reading
vendor/device IDs and internally calls pci_dev_is_disconnected().
On a ConnectX-5 (8 GT/s, x2) this costs ~70 µs.

Since __context_flush_dev_iotlb() is only called on
{attach,release}_dev paths (not hot), add pci_device_is_present()
there to skip inaccessible devices and avoid the hard-lock.

Fixes: 37764b952e1b ("iommu/vt-d: Global devTLB flush when present context entry changed")
Fixes: 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release")
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
drivers/iommu/intel/pasid.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 3e2255057079..b1e8eb6a6504 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -1099,9 +1099,20 @@ int intel_pasid_setup_sm_context(struct device *dev)
  */
 static void __context_flush_dev_iotlb(struct device_domain_info *info)
 {
+	struct pci_dev *pdev;
+
 	if (!info->ats_enabled)
 		return;
 
+	/*
+	 * Skip dev-IOTLB flush for inaccessible PCIe devices to prevent the
+	 * Intel IOMMU from waiting indefinitely for an ATS invalidation that
+	 * cannot complete.
+	 */
+	pdev = dev_is_pci(info->dev) ? to_pci_dev(info->dev) : NULL;
+	if (pdev && !pci_device_is_present(pdev))
+		return;
+
 	qi_flush_dev_iotlb(info->iommu, PCI_DEVID(info->bus, info->devfn),
 			   info->pfsid, info->ats_qdep, 0, MAX_AGAW_PFN_WIDTH);
 
--
2.20.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
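The difference between the two checks described in the commit message above can be sketched with a simplified userspace model. Everything here is an assumption made for illustration: struct model_pci_dev and the model_* helpers are hypothetical, the ID value is arbitrary, and the model treats only an all-ones read as "no response". It follows the description in the patch (a flag-based check versus a check that also reads the vendor/device IDs) and is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an endpoint; field names are illustrative only. */
struct model_pci_dev {
	bool marked_disconnected; /* set only when a removal path ran */
	bool link_up;             /* actual state of the link */
	uint32_t vendor_device;   /* ID dword returned while reachable */
};

/* Config read of the vendor/device ID dword: all ones if nothing answers. */
static uint32_t model_read_id(const struct model_pci_dev *dev)
{
	return dev->link_up ? dev->vendor_device : UINT32_MAX;
}

/* Flag-only check, in the spirit of pci_dev_is_disconnected(). */
static bool model_is_disconnected(const struct model_pci_dev *dev)
{
	return dev->marked_disconnected;
}

/* Flag check plus a live ID read, in the spirit of pci_device_is_present(). */
static bool model_is_present(const struct model_pci_dev *dev)
{
	assert(dev != NULL);
	if (model_is_disconnected(dev))
		return false;
	return model_read_id(dev) != UINT32_MAX;
}
```

For a device whose link dropped without any event, marked_disconnected is still false, so the flag-only check does not report a problem, while the ID read returns all ones and model_is_present() returns false. That is the case the series aims to cover.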
* Re: [PATCH 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
2025-12-10 17:14 ` [PATCH 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode Jinhui Guo
@ 2025-12-11 2:10 ` Baolu Lu
2025-12-11 4:17 ` Jinhui Guo
0 siblings, 1 reply; 6+ messages in thread
From: Baolu Lu @ 2025-12-11 2:10 UTC (permalink / raw)
To: Jinhui Guo, dwmw2, joro, will; +Cc: haifeng.zhao, iommu, linux-kernel

On 12/11/25 01:14, Jinhui Guo wrote:
> PCIe endpoints with ATS enabled and passed through to userspace
> (e.g., QEMU, DPDK) can hard-lock the host when their link drops,
> either by surprise removal or by a link fault.
>
> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> request when device is disconnected") adds pci_dev_is_disconnected()
> to devtlb_invalidation_with_pasid() so ATS invalidation is skipped
> only when the device is being safely removed, but it applies only
> when Intel IOMMU scalable mode is enabled.
>
> With scalable mode disabled or unsupported, a system hard-lock
> occurs when a PCIe endpoint's link drops because the Intel IOMMU
> waits indefinitely for an ATS invalidation that cannot complete.
>
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> __context_flush_dev_iotlb.part.0
> domain_context_clear_one_cb
> pci_for_each_dma_alias
> device_block_translation
> blocking_domain_attach_dev
> iommu_deinit_device
> __iommu_group_remove_device
> iommu_release_device
> iommu_bus_notifier
> blocking_notifier_call_chain
> bus_notify
> device_del
> pci_remove_bus_device
> pci_stop_and_remove_bus_device
> pciehp_unconfigure_device
> pciehp_disable_slot
> pciehp_handle_presence_or_link_change
> pciehp_ist
>
> Commit 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release")
> adds intel_pasid_teardown_sm_context() to intel_iommu_release_device(),
> which calls qi_flush_dev_iotlb() and can also hard-lock the system
> when a PCIe endpoint's link drops.
>
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> __context_flush_dev_iotlb.part.0
> intel_context_flush_no_pasid
> device_pasid_table_teardown
> pci_pasid_table_teardown
> pci_for_each_dma_alias
> intel_pasid_teardown_sm_context
> intel_iommu_release_device
> iommu_deinit_device
> __iommu_group_remove_device
> iommu_release_device
> iommu_bus_notifier
> blocking_notifier_call_chain
> bus_notify
> device_del
> pci_remove_bus_device
> pci_stop_and_remove_bus_device
> pciehp_unconfigure_device
> pciehp_disable_slot
> pciehp_handle_presence_or_link_change
> pciehp_ist
>
> Sometimes the endpoint loses connection without a link-down event
> (e.g., due to a link fault); killing the process (virsh destroy)
> then hard-locks the host.
>
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> __context_flush_dev_iotlb.part.0
> domain_context_clear_one_cb
> pci_for_each_dma_alias
> device_block_translation
> blocking_domain_attach_dev
> __iommu_attach_device
> __iommu_device_set_domain
> __iommu_group_set_domain_internal
> iommu_detach_group
> vfio_iommu_type1_detach_group
> vfio_group_detach_container
> vfio_group_fops_release
> __fput
>
> pci_dev_is_disconnected() only covers safe-removal paths;
> pci_device_is_present() tests accessibility by reading
> vendor/device IDs and internally calls pci_dev_is_disconnected().
> On a ConnectX-5 (8 GT/s, x2) this costs ~70 µs.
>
> Since __context_flush_dev_iotlb() is only called on
> {attach,release}_dev paths (not hot), add pci_device_is_present()
> there to skip inaccessible devices and avoid the hard-lock.
>
> Fixes: 37764b952e1b ("iommu/vt-d: Global devTLB flush when present context entry changed")
> Fixes: 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release")

Cc: stable@vger.kernel.org

> Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
> ---
> drivers/iommu/intel/pasid.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 3e2255057079..b1e8eb6a6504 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -1099,9 +1099,20 @@ int intel_pasid_setup_sm_context(struct device *dev)
>   */
>  static void __context_flush_dev_iotlb(struct device_domain_info *info)
>  {
> +	struct pci_dev *pdev;
> +
>  	if (!info->ats_enabled)
>  		return;
>
> +	/*
> +	 * Skip dev-IOTLB flush for inaccessible PCIe devices to prevent the
> +	 * Intel IOMMU from waiting indefinitely for an ATS invalidation that
> +	 * cannot complete.
> +	 */
> +	pdev = dev_is_pci(info->dev) ? to_pci_dev(info->dev) : NULL;
> +	if (pdev && !pci_device_is_present(pdev))
> +		return;

Could simply be

	if (dev_is_pci(info->dev) &&
	    !pci_device_is_present(to_pci_dev(info->dev)))
		return;

?

> +
>  	qi_flush_dev_iotlb(info->iommu, PCI_DEVID(info->bus, info->devfn),
>  			   info->pfsid, info->ats_qdep, 0, MAX_AGAW_PFN_WIDTH);
>

Thanks,
baolu
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
2025-12-11 2:10 ` Baolu Lu
@ 2025-12-11 4:17 ` Jinhui Guo
2025-12-11 4:59 ` Baolu Lu
0 siblings, 1 reply; 6+ messages in thread
From: Jinhui Guo @ 2025-12-11 4:17 UTC (permalink / raw)
To: baolu.lu; +Cc: dwmw2, guojinhui.liam, iommu, joro, linux-kernel, will

On Thu, Dec 11, 2025 10:10:52AM +0800, Baolu Lu wrote:
> > Since __context_flush_dev_iotlb() is only called on
> > {attach,release}_dev paths (not hot), add pci_device_is_present()
> > there to skip inaccessible devices and avoid the hard-lock.
> >
> > Fixes: 37764b952e1b ("iommu/vt-d: Global devTLB flush when present context entry changed")
> > Fixes: 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release")
>
> Cc: stable@vger.kernel.org
>
> > Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
> > ---
> > drivers/iommu/intel/pasid.c | 11 +++++++++++
> > 1 file changed, 11 insertions(+)
> >
> > diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> > index 3e2255057079..b1e8eb6a6504 100644
> > --- a/drivers/iommu/intel/pasid.c
> > +++ b/drivers/iommu/intel/pasid.c
> > @@ -1099,9 +1099,20 @@ int intel_pasid_setup_sm_context(struct device *dev)
> >   */
> >  static void __context_flush_dev_iotlb(struct device_domain_info *info)
> >  {
> > +	struct pci_dev *pdev;
> > +
> >  	if (!info->ats_enabled)
> >  		return;
> >
> > +	/*
> > +	 * Skip dev-IOTLB flush for inaccessible PCIe devices to prevent the
> > +	 * Intel IOMMU from waiting indefinitely for an ATS invalidation that
> > +	 * cannot complete.
> > +	 */
> > +	pdev = dev_is_pci(info->dev) ? to_pci_dev(info->dev) : NULL;
> > +	if (pdev && !pci_device_is_present(pdev))
> > +		return;
>
> Could simply be
>
> 	if (dev_is_pci(info->dev) &&
> 	    !pci_device_is_present(to_pci_dev(info->dev)))
> 		return;
>
> ?
>
> > +
> >  	qi_flush_dev_iotlb(info->iommu, PCI_DEVID(info->bus, info->devfn),
> >  			   info->pfsid, info->ats_qdep, 0, MAX_AGAW_PFN_WIDTH);
> >
> Thanks,
> baolu

Hi, baolu

Thanks for your time and suggestions.

I’ve sent v2 (https://lore.kernel.org/all/20251211035946.2071-1-guojinhui.liam@bytedance.com/)
with the following changes:

1. Simplified the pci_device_is_present() check in __context_flush_dev_iotlb().
2. Added Cc: stable@vger.kernel.org to both patches.

Best Regards,
Jinhui
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
2025-12-11 4:17 ` Jinhui Guo
@ 2025-12-11 4:59 ` Baolu Lu
0 siblings, 0 replies; 6+ messages in thread
From: Baolu Lu @ 2025-12-11 4:59 UTC (permalink / raw)
To: Jinhui Guo; +Cc: dwmw2, iommu, joro, linux-kernel, will

On 12/11/25 12:17, Jinhui Guo wrote:
> On Thu, Dec 11, 2025 10:10:52AM +0800, Baolu Lu wrote:
>>> Since __context_flush_dev_iotlb() is only called on
>>> {attach,release}_dev paths (not hot), add pci_device_is_present()
>>> there to skip inaccessible devices and avoid the hard-lock.
>>>
>>> Fixes: 37764b952e1b ("iommu/vt-d: Global devTLB flush when present context entry changed")
>>> Fixes: 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release")
>> Cc: stable@vger.kernel.org
>>
>>> Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
>>> ---
>>> drivers/iommu/intel/pasid.c | 11 +++++++++++
>>> 1 file changed, 11 insertions(+)
>>>
>>> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
>>> index 3e2255057079..b1e8eb6a6504 100644
>>> --- a/drivers/iommu/intel/pasid.c
>>> +++ b/drivers/iommu/intel/pasid.c
>>> @@ -1099,9 +1099,20 @@ int intel_pasid_setup_sm_context(struct device *dev)
>>>   */
>>>  static void __context_flush_dev_iotlb(struct device_domain_info *info)
>>>  {
>>> +	struct pci_dev *pdev;
>>> +
>>>  	if (!info->ats_enabled)
>>>  		return;
>>>
>>> +	/*
>>> +	 * Skip dev-IOTLB flush for inaccessible PCIe devices to prevent the
>>> +	 * Intel IOMMU from waiting indefinitely for an ATS invalidation that
>>> +	 * cannot complete.
>>> +	 */
>>> +	pdev = dev_is_pci(info->dev) ? to_pci_dev(info->dev) : NULL;
>>> +	if (pdev && !pci_device_is_present(pdev))
>>> +		return;
>> Could simply be
>>
>> 	if (dev_is_pci(info->dev) &&
>> 	    !pci_device_is_present(to_pci_dev(info->dev)))
>> 		return;
>>
>> ?
>>
>>> +
>>>  	qi_flush_dev_iotlb(info->iommu, PCI_DEVID(info->bus, info->devfn),
>>>  			   info->pfsid, info->ats_qdep, 0, MAX_AGAW_PFN_WIDTH);
>>>
>> Thanks,
>> baolu
> Hi, baolu
>
> Thanks for your time and suggestions.
>
> I’ve sent v2 (https://lore.kernel.org/all/20251211035946.2071-1-guojinhui.liam@bytedance.com/)
> with the following changes:
>
> 1. Simplified the pci_device_is_present() check in __context_flush_dev_iotlb().
> 2. Added Cc: stable@vger.kernel.org to both patches.

Thanks! I would suggest not updating the patch versions so frequently
next time. It is important to provide the reviewers with more time to
comment on your submissions. :-)

Thanks,
baolu
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
2025-12-10 17:14 [PATCH 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Jinhui Guo
2025-12-10 17:14 ` [PATCH 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode Jinhui Guo
@ 2025-12-10 17:14 ` Jinhui Guo
1 sibling, 0 replies; 6+ messages in thread
From: Jinhui Guo @ 2025-12-10 17:14 UTC (permalink / raw)
To: dwmw2, baolu.lu, joro, will
Cc: haifeng.zhao, guojinhui.liam, iommu, linux-kernel

Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
request when device is disconnected") relies on
pci_dev_is_disconnected() to skip ATS invalidation for
safely-removed devices, but it does not cover link-down caused
by faults, which can still hard-lock the system.

For example, if a VM fails to connect to the PCIe device,
"virsh destroy" is executed to release resources and isolate
the fault, but a hard-lockup occurs while releasing the group fd.

Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
intel_pasid_tear_down_entry
device_block_translation
blocking_domain_attach_dev
__iommu_attach_device
__iommu_device_set_domain
__iommu_group_set_domain_internal
iommu_detach_group
vfio_iommu_type1_detach_group
vfio_group_detach_container
vfio_group_fops_release
__fput

Although pci_device_is_present() is slower than
pci_dev_is_disconnected(), it still takes only ~70 µs on a
ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
and width increase.

Besides, devtlb_invalidation_with_pasid() is called only in the
paths below, which are far less frequent than memory map/unmap.

1. mm-struct release
2. {attach,release}_dev
3. set/remove PASID
4. dirty-tracking setup

The gain in system stability far outweighs the negligible cost
of using pci_device_is_present() instead of pci_dev_is_disconnected()
to decide when to skip ATS invalidation, especially under GDR
high-load conditions.

Fixes: 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected")
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
drivers/iommu/intel/pasid.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index b1e8eb6a6504..e9dea47decdb 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -218,7 +218,7 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 	if (!info || !info->ats_enabled)
 		return;
 
-	if (pci_dev_is_disconnected(to_pci_dev(dev)))
+	if (!pci_device_is_present(to_pci_dev(dev)))
 		return;
 
 	sid = PCI_DEVID(info->bus, info->devfn);
--
2.20.1
^ permalink raw reply related [flat|nested] 6+ messages in thread