* [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device
@ 2025-12-11 3:59 Jinhui Guo
2025-12-11 3:59 ` [PATCH v2 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode Jinhui Guo
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Jinhui Guo @ 2025-12-11 3:59 UTC (permalink / raw)
To: dwmw2, baolu.lu, joro, will; +Cc: guojinhui.liam, iommu, linux-kernel, stable
Hi all,
We hit hard-lockups when the Intel IOMMU waits indefinitely for an ATS invalidation
that cannot complete, especially under GDR high-load conditions.
1. Hard lockup when a passthrough PCIe NIC with ATS enabled goes link-down
while the Intel IOMMU is in non-scalable mode. Two scenarios exist: NIC
link-down with an explicit link-down event, and link-down without any event.
a) NIC link-down with an explicit link-down event.
Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
__context_flush_dev_iotlb.part.0
domain_context_clear_one_cb
pci_for_each_dma_alias
device_block_translation
blocking_domain_attach_dev
iommu_deinit_device
__iommu_group_remove_device
iommu_release_device
iommu_bus_notifier
blocking_notifier_call_chain
bus_notify
device_del
pci_remove_bus_device
pci_stop_and_remove_bus_device
pciehp_unconfigure_device
pciehp_disable_slot
pciehp_handle_presence_or_link_change
pciehp_ist
b) NIC link-down without an event - hard-lock on VM destroy.
Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
__context_flush_dev_iotlb.part.0
domain_context_clear_one_cb
pci_for_each_dma_alias
device_block_translation
blocking_domain_attach_dev
__iommu_attach_device
__iommu_device_set_domain
__iommu_group_set_domain_internal
iommu_detach_group
vfio_iommu_type1_detach_group
vfio_group_detach_container
vfio_group_fops_release
__fput
2. Hard lockup when a passthrough PCIe NIC with ATS enabled goes link-down
while the Intel IOMMU is in scalable mode; NIC link-down without an event
hard-locks the host on VM destroy.
Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
intel_pasid_tear_down_entry
device_block_translation
blocking_domain_attach_dev
__iommu_attach_device
__iommu_device_set_domain
__iommu_group_set_domain_internal
iommu_detach_group
vfio_iommu_type1_detach_group
vfio_group_detach_container
vfio_group_fops_release
__fput
Fix both issues with two patches:
1. Skip dev-IOTLB flush for inaccessible devices in __context_flush_dev_iotlb() using
pci_device_is_present().
2. Use pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to
skip ATS invalidation in devtlb_invalidation_with_pasid().
Best Regards,
Jinhui
---
v1: https://lore.kernel.org/all/20251210171431.1589-1-guojinhui.liam@bytedance.com/
Changes in v1 -> v2 (suggested by Baolu Lu):
- Simplify the pci_device_is_present() check in __context_flush_dev_iotlb().
- Add Cc: stable@vger.kernel.org to both patches.
Jinhui Guo (2):
iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without
scalable mode
iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in
scalable mode
drivers/iommu/intel/pasid.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
--
2.20.1
^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
  2025-12-11  3:59 [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Jinhui Guo
@ 2025-12-11  3:59 ` Jinhui Guo
  2025-12-11  3:59 ` [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in " Jinhui Guo
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Jinhui Guo @ 2025-12-11  3:59 UTC (permalink / raw)
  To: dwmw2, baolu.lu, joro, will; +Cc: guojinhui.liam, iommu, linux-kernel, stable

PCIe endpoints with ATS enabled and passed through to userspace (e.g.,
QEMU, DPDK) can hard-lock the host when their link drops, either by
surprise removal or by a link fault.

Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request
when device is disconnected") adds pci_dev_is_disconnected() to
devtlb_invalidation_with_pasid() so ATS invalidation is skipped only
when the device is being safely removed, but it applies only when Intel
IOMMU scalable mode is enabled.

With scalable mode disabled or unsupported, a system hard-lock occurs
when a PCIe endpoint's link drops because the Intel IOMMU waits
indefinitely for an ATS invalidation that cannot complete.
Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 domain_context_clear_one_cb
 pci_for_each_dma_alias
 device_block_translation
 blocking_domain_attach_dev
 iommu_deinit_device
 __iommu_group_remove_device
 iommu_release_device
 iommu_bus_notifier
 blocking_notifier_call_chain
 bus_notify
 device_del
 pci_remove_bus_device
 pci_stop_and_remove_bus_device
 pciehp_unconfigure_device
 pciehp_disable_slot
 pciehp_handle_presence_or_link_change
 pciehp_ist

Commit 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release")
adds intel_pasid_teardown_sm_context() to intel_iommu_release_device(),
which calls qi_flush_dev_iotlb() and can also hard-lock the system when
a PCIe endpoint's link drops.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 intel_context_flush_no_pasid
 device_pasid_table_teardown
 pci_pasid_table_teardown
 pci_for_each_dma_alias
 intel_pasid_teardown_sm_context
 intel_iommu_release_device
 iommu_deinit_device
 __iommu_group_remove_device
 iommu_release_device
 iommu_bus_notifier
 blocking_notifier_call_chain
 bus_notify
 device_del
 pci_remove_bus_device
 pci_stop_and_remove_bus_device
 pciehp_unconfigure_device
 pciehp_disable_slot
 pciehp_handle_presence_or_link_change
 pciehp_ist

Sometimes the endpoint loses connection without a link-down event
(e.g., due to a link fault); killing the process (virsh destroy) then
hard-locks the host.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 domain_context_clear_one_cb
 pci_for_each_dma_alias
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput

pci_dev_is_disconnected() only covers safe-removal paths;
pci_device_is_present() tests accessibility by reading vendor/device
IDs and internally calls pci_dev_is_disconnected(). On a ConnectX-5
(8 GT/s, x2) this costs ~70 µs.
Since __context_flush_dev_iotlb() is only called on {attach,release}_dev
paths (not hot), add pci_device_is_present() there to skip inaccessible
devices and avoid the hard-lock.

Fixes: 37764b952e1b ("iommu/vt-d: Global devTLB flush when present context entry changed")
Fixes: 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
 drivers/iommu/intel/pasid.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 3e2255057079..a369690f5926 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -1102,6 +1102,15 @@ static void __context_flush_dev_iotlb(struct device_domain_info *info)
 	if (!info->ats_enabled)
 		return;
 
+	/*
+	 * Skip dev-IOTLB flush for inaccessible PCIe devices to prevent the
+	 * Intel IOMMU from waiting indefinitely for an ATS invalidation that
+	 * cannot complete.
+	 */
+	if (dev_is_pci(info->dev) &&
+	    !pci_device_is_present(to_pci_dev(info->dev)))
+		return;
+
 	qi_flush_dev_iotlb(info->iommu, PCI_DEVID(info->bus, info->devfn),
 			   info->pfsid, info->ats_qdep, 0, MAX_AGAW_PFN_WIDTH);
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  2025-12-11  3:59 [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Jinhui Guo
  2025-12-11  3:59 ` [PATCH v2 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode Jinhui Guo
@ 2025-12-11  3:59 ` Jinhui Guo
  2025-12-18  8:04   ` Tian, Kevin
  2026-01-20  6:49 ` [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Baolu Lu
  2026-03-01  3:50 ` Ethan Zhao
  3 siblings, 1 reply; 12+ messages in thread
From: Jinhui Guo @ 2025-12-11  3:59 UTC (permalink / raw)
  To: dwmw2, baolu.lu, joro, will; +Cc: guojinhui.liam, iommu, linux-kernel, stable

Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request
when device is disconnected") relies on pci_dev_is_disconnected() to
skip ATS invalidation for safely-removed devices, but it does not cover
link-down caused by faults, which can still hard-lock the system.

For example, if a VM fails to connect to the PCIe device, "virsh
destroy" is executed to release resources and isolate the fault, but a
hard-lockup occurs while releasing the group fd.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 intel_pasid_tear_down_entry
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput

Although pci_device_is_present() is slower than
pci_dev_is_disconnected(), it still takes only ~70 µs on a ConnectX-5
(8 GT/s, x2) and becomes even faster as PCIe speed and width increase.

Besides, devtlb_invalidation_with_pasid() is called only in the paths
below, which are far less frequent than memory map/unmap.

1. mm-struct release
2. {attach,release}_dev
3. set/remove PASID
4. dirty-tracking setup

The gain in system stability far outweighs the negligible cost of using
pci_device_is_present() instead of pci_dev_is_disconnected() to decide
when to skip ATS invalidation, especially under GDR high-load
conditions.

Fixes: 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
 drivers/iommu/intel/pasid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index a369690f5926..e64d445de964 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -218,7 +218,7 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 	if (!info || !info->ats_enabled)
 		return;
 
-	if (pci_dev_is_disconnected(to_pci_dev(dev)))
+	if (!pci_device_is_present(to_pci_dev(dev)))
 		return;
 
 	sid = PCI_DEVID(info->bus, info->devfn);
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* RE: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  2025-12-11  3:59 ` [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in " Jinhui Guo
@ 2025-12-18  8:04   ` Tian, Kevin
  2025-12-22 11:19     ` Jinhui Guo
  0 siblings, 1 reply; 12+ messages in thread
From: Tian, Kevin @ 2025-12-18  8:04 UTC (permalink / raw)
  To: Guo, Jinhui, dwmw2@infradead.org, baolu.lu@linux.intel.com,
	joro@8bytes.org, will@kernel.org
  Cc: Guo, Jinhui, iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org

> From: Jinhui Guo <guojinhui.liam@bytedance.com>
> Sent: Thursday, December 11, 2025 12:00 PM
>
> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> request when device is disconnected") relies on
> pci_dev_is_disconnected() to skip ATS invalidation for
> safely-removed devices, but it does not cover link-down caused
> by faults, which can still hard-lock the system.

According to the commit msg it actually tries to fix the hard lockup
with surprise removal. For safe removal the device is not removed
before invalidation is done:

"
For safe removal, device wouldn't be removed until the whole software
handling process is done, it wouldn't trigger the hard lock up issue
caused by too long ATS Invalidation timeout wait.
"

Can you help articulate the problem, especially the part about
"link-down caused by faults"? What are those faults? How are they
different from the surprise removal described in that commit msg, such
that pci_dev_is_disconnected() is not set?

>
> For example, if a VM fails to connect to the PCIe device,

'failed' for what reason?

> "virsh destroy" is executed to release resources and isolate
> the fault, but a hard-lockup occurs while releasing the group fd.
>
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> intel_pasid_tear_down_entry
> device_block_translation
> blocking_domain_attach_dev
> __iommu_attach_device
> __iommu_device_set_domain
> __iommu_group_set_domain_internal
> iommu_detach_group
> vfio_iommu_type1_detach_group
> vfio_group_detach_container
> vfio_group_fops_release
> __fput
>
> Although pci_device_is_present() is slower than
> pci_dev_is_disconnected(), it still takes only ~70 µs on a
> ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
> and width increase.
>
> Besides, devtlb_invalidation_with_pasid() is called only in the
> paths below, which are far less frequent than memory map/unmap.
>
> 1. mm-struct release
> 2. {attach,release}_dev
> 3. set/remove PASID
> 4. dirty-tracking setup
>

Surprise removal can happen at any time, e.g. after the check of
pci_device_is_present(). In the end we need the logic in
qi_check_fault() to check the presence upon an ITE timeout error to
break the infinite loop. So in your case, even with that logic in
place, you still observe a lockup (probably because the hardware ITE
timeout is longer than the lockup detection on the CPU?).

In any case this change cannot 100% fix the lockup. It just
reduces the possibility, which should be made clear.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  2025-12-18  8:04   ` Tian, Kevin
@ 2025-12-22 11:19     ` Jinhui Guo
  2025-12-23  4:06       ` Baolu Lu
  0 siblings, 1 reply; 12+ messages in thread
From: Jinhui Guo @ 2025-12-22 11:19 UTC (permalink / raw)
  To: kevin.tian
  Cc: baolu.lu, dwmw2, guojinhui.liam, iommu, joro, linux-kernel,
	stable, will

On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
> > From: Jinhui Guo <guojinhui.liam@bytedance.com>
> > Sent: Thursday, December 11, 2025 12:00 PM
> >
> > Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> > request when device is disconnected") relies on
> > pci_dev_is_disconnected() to skip ATS invalidation for
> > safely-removed devices, but it does not cover link-down caused
> > by faults, which can still hard-lock the system.
>
> According to the commit msg it actually tries to fix the hard lockup
> with surprise removal. For safe removal the device is not removed
> before invalidation is done:
>
> "
> For safe removal, device wouldn't be removed until the whole software
> handling process is done, it wouldn't trigger the hard lock up issue
> caused by too long ATS Invalidation timeout wait.
> "
>
> Can you help articulate the problem especially about the part
> 'link-down caused by faults'? What are those faults? How are
> they different from the said surprise removal in the commit
> msg to not set pci_dev_is_disconnected()?
>

Hi Kevin, sorry for the delayed reply.

A normal or surprise removal of a PCIe device on a hot-plug port
normally triggers an interrupt from the PCIe switch.

We have, however, observed cases where no interrupt is generated when
the device suddenly loses its link; the behaviour is identical to
setting the Link Disable bit in the switch's Link Control register
(offset 10h). Exactly what goes wrong in the LTSSM between the PCIe
switch and the endpoint remains unknown.

> >
> > For example, if a VM fails to connect to the PCIe device,
>
> 'failed' for what reason?
>
> > "virsh destroy" is executed to release resources and isolate
> > the fault, but a hard-lockup occurs while releasing the group fd.
> >
> > Call Trace:
> > qi_submit_sync
> > qi_flush_dev_iotlb
> > intel_pasid_tear_down_entry
> > device_block_translation
> > blocking_domain_attach_dev
> > __iommu_attach_device
> > __iommu_device_set_domain
> > __iommu_group_set_domain_internal
> > iommu_detach_group
> > vfio_iommu_type1_detach_group
> > vfio_group_detach_container
> > vfio_group_fops_release
> > __fput
> >
> > Although pci_device_is_present() is slower than
> > pci_dev_is_disconnected(), it still takes only ~70 µs on a
> > ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
> > and width increase.
> >
> > Besides, devtlb_invalidation_with_pasid() is called only in the
> > paths below, which are far less frequent than memory map/unmap.
> >
> > 1. mm-struct release
> > 2. {attach,release}_dev
> > 3. set/remove PASID
> > 4. dirty-tracking setup
> >
>
> surprise removal can happen at any time, e.g. after the check of
> pci_device_is_present(). In the end we need the logic in
> qi_check_fault() to check the presence upon ITE timeout error
> received to break the infinite loop. So in your case even with
> that logic in place you still observe lockup (probably due to
> hardware ITE timeout is longer than the lockup detection on
> the CPU?

Are you referring to the timeout added in patch
https://lore.kernel.org/all/20240222090251.2849702-4-haifeng.zhao@linux.intel.com/ ?

Our lockup-detection timeout is the default 10 s.

We see ITE-timeout messages in the kernel log. Yet the system still
hard-locks, probably because, as you mentioned, the hardware ITE
timeout is longer than the CPU's lockup-detection window. I'll
reproduce the case and follow up with a deeper analysis.
kernel: [ 2402.642685][ T607] vfio-pci 0000:3f:00.0: Unable to change power state from D0 to D3hot, device inaccessible
kernel: [ 2403.441828][T49880] DMAR: VT-d detected Invalidation Time-out Error: SID 0
kernel: [ 2403.441830][   C0] DMAR: DRHD: handling fault status reg 40
kernel: [ 2403.441831][T49880] DMAR: QI HEAD: Invalidation Wait qw0 = 0x200000025, qw1 = 0x1003a07fc
kernel: [ 2403.441833][T49880] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x1003a07f8
kernel: [ 2403.441879][T49880] DMAR: Invalidation Time-out Error (ITE) cleared
kernel: [ 2423.643527][   C7] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
kernel: [ 2423.643551][   C7] rcu: 8-...0: (0 ticks this GP) idle=198c/1/0x4000000000000000 softirq=19450/19450 fqs=4403
kernel: [ 2423.643567][   C7] rcu: (detected by 7, t=21002 jiffies, g=238909, q=4932 ncpus=96)
kernel: [ 2423.643578][   C7] Sending NMI from CPU 7 to CPUs 8:
kernel: [ 2423.643581][   C8] NMI backtrace for cpu 8
kernel: [ 2423.643585][   C8] CPU: 8 UID: 0 PID: 49880 Comm: vfio_test Kdump: loaded Tainted: G S E 6.18.0 #5 PREEMPT(voluntary)
kernel: [ 2423.643588][   C8] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
kernel: [ 2423.643589][   C8] Hardware name: Inspur NF5468M5/YZMB-01130-105, BIOS 4.2.0 04/28/2021
kernel: [ 2423.643590][   C8] RIP: 0010:qi_submit_sync+0x6cf/0x8d0
kernel: [ 2423.643597][   C8] Code: 89 4c 24 50 89 70 34 48 c7 c7 f0 f5 4a a5 e8 48 15 89 ff 48 8b 4c 24 50 8b 54 24 58 49 8b 76 10 49 63 c7 48 8d 04 86 83 38 01 <75> 06 c7 00 03 00 00 00 41 81 c7 fe 00 00 00 44 89 f8 c1 f8 1f c1
kernel: [ 2423.643598][   C8] RSP: 0018:ffffb5a3bd0a7a30 EFLAGS: 00000097
kernel: [ 2423.643600][   C8] RAX: ffff9dac803a06bc RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 2423.643601][   C8] RDX: 00000000000000fe RSI: ffff9dac803a0400 RDI: ffff9ddb0081d480
kernel: [ 2423.643602][   C8] RBP: ffff9dac8037fe00 R08: 0000000000000000 R09: 0000000000000003
kernel: [ 2423.643603][   C8] R10: ffffb5a3bd0a78e0 R11: ffff9e0bbff3c068 R12: 0000000000000040
kernel: [ 2423.643605][   C8] R13: ffff9dac80314600 R14: ffff9dac8037fe00 R15: 00000000000000af
kernel: [ 2423.643606][   C8] FS: 0000000000000000(0000) GS:ffff9ddb5a262000(0000) knlGS:0000000000000000
kernel: [ 2423.643607][   C8] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 2423.643608][   C8] CR2: 000000002aee3000 CR3: 000000024a27b002 CR4: 00000000007726f0
kernel: [ 2423.643610][   C8] PKRU: 55555554
kernel: [ 2423.643611][   C8] Call Trace:
kernel: [ 2423.643613][   C8] <TASK>
kernel: [ 2423.643616][   C8] ? __pfx_domain_context_clear_one_cb+0x10/0x10
kernel: [ 2423.643620][   C8] qi_flush_dev_iotlb+0xd5/0xe0
kernel: [ 2423.643622][   C8] __context_flush_dev_iotlb.part.0+0x3c/0x80
kernel: [ 2423.643625][   C8] domain_context_clear_one_cb+0x16/0x20
kernel: [ 2423.643626][   C8] pci_for_each_dma_alias+0x3b/0x140
kernel: [ 2423.643631][   C8] device_block_translation+0x122/0x180
kernel: [ 2423.643634][   C8] blocking_domain_attach_dev+0x39/0x50
kernel: [ 2423.643636][   C8] __iommu_attach_device+0x1b/0x90
kernel: [ 2423.643639][   C8] __iommu_device_set_domain+0x5d/0xb0
kernel: [ 2423.643642][   C8] __iommu_group_set_domain_internal+0x60/0x110
kernel: [ 2423.643644][   C8] iommu_detach_group+0x3a/0x60
kernel: [ 2423.643650][   C8] vfio_iommu_type1_detach_group+0x106/0x610 [vfio_iommu_type1]
kernel: [ 2423.643654][   C8] ? __dentry_kill+0x12a/0x180
kernel: [ 2423.643660][   C8] ? __pm_runtime_idle+0x44/0xe0
kernel: [ 2423.643666][   C8] vfio_group_detach_container+0x4f/0x160 [vfio]
kernel: [ 2423.643672][   C8] vfio_group_fops_release+0x3e/0x80 [vfio]
kernel: [ 2423.643677][   C8] __fput+0xe6/0x2b0
kernel: [ 2423.643682][   C8] task_work_run+0x58/0x90
kernel: [ 2423.643688][   C8] do_exit+0x29b/0xa80
kernel: [ 2423.643694][   C8] do_group_exit+0x2c/0x80
kernel: [ 2423.643696][   C8] get_signal+0x8f9/0x900
kernel: [ 2423.643700][   C8] arch_do_signal_or_restart+0x29/0x210
kernel: [ 2423.643704][   C8] ? __schedule+0x582/0xe80
kernel: [ 2423.643708][   C8] exit_to_user_mode_loop+0x8e/0x4f0
kernel: [ 2423.643712][   C8] do_syscall_64+0x262/0x630
kernel: [ 2423.643717][   C8] entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: [ 2423.643720][   C8] RIP: 0033:0x7fde19078514
kernel: [ 2423.643722][   C8] Code: Unable to access opcode bytes at 0x7fde190784ea.
kernel: [ 2423.643723][   C8] RSP: 002b:00007ffd0e1dc7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
kernel: [ 2423.643724][   C8] RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007fde19078514
kernel: [ 2423.643726][   C8] RDX: 00007fde1916e8c0 RSI: 000055b217303260 RDI: 0000000000000000
kernel: [ 2423.643727][   C8] RBP: 00007ffd0e1dc8a0 R08: 00007fde19173500 R09: 0000000000000000
kernel: [ 2423.643728][   C8] R10: fffffffffffffbea R11: 0000000000000246 R12: 000055b1f8d8d0b0
kernel: [ 2423.643729][   C8] R13: 00007ffd0e1dc980 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2423.643731][   C8] </TASK>
kernel: [ 2424.375254][T81463] vfio-pci 0000:3f:00.0: Unable to change power state from D3cold to D0, device inaccessible
...
kernel: [ 2448.327929][   C8] watchdog: CPU8: Watchdog detected hard LOCKUP on cpu 8
kernel: [ 2448.327932][   C8] Modules linked in: vfio_pci(E) vfio_pci_core(E) vfio_iommu_type1(E) vfio(E) udp_diag(E) tcp_diag(E) inet_diag(E) binfmt_misc(E) ip_set_hash_net(E) nft_compat(E) x_tables(E) ip_set(E) msr(E) nf_tables(E) ...
kernel: [ 2448.327963][   C8] ib_core(E) hid_generic(E) usbhid(E) hid(E) ahci(E) libahci(E) xhci_pci(E) libata(E) nvme(E) xhci_hcd(E) i2c_i801(E) nvme_core(E) usbcore(E) scsi_mod(E) mlx5_core(E) i2c_smbus(E) lpc_ich(E) usb_common(E) scsi_common(E) wmi(E)
kernel: [ 2448.327972][   C8] CPU: 8 UID: 0 PID: 49880 Comm: vfio_test Kdump: loaded Tainted: G S EL 6.18.0 #5 PREEMPT(voluntary)
kernel: [ 2448.327975][   C8] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE, [L]=SOFTLOCKUP
kernel: [ 2448.327976][   C8] Hardware name: Inspur NF5468M5/YZMB-01130-105, BIOS 4.2.0 04/28/2021
kernel: [ 2448.327977][   C8] RIP: 0010:qi_submit_sync+0x6e7/0x8d0
kernel: [ 2448.327981][   C8] Code: 8b 54 24 58 49 8b 76 10 49 63 c7 48 8d 04 86 83 38 01 75 06 c7 00 03 00 00 00 41 81 c7 fe 00 00 00 44 89 f8 c1 f8 1f c1 e8 18 <41> 01 c7 45 0f b6 ff 41 29 c7 44 39 fa 75 cb 48 85 c9 0f 85 05 01
kernel: [ 2448.327983][   C8] RSP: 0018:ffffb5a3bd0a7a30 EFLAGS: 00000046
kernel: [ 2448.327984][   C8] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 2448.327985][   C8] RDX: 00000000000000fe RSI: ffff9dac803a0400 RDI: ffff9ddb0081d480
kernel: [ 2448.327986][   C8] RBP: ffff9dac8037fe00 R08: 0000000000000000 R09: 0000000000000003
kernel: [ 2448.327987][   C8] R10: ffffb5a3bd0a78e0 R11: ffff9e0bbff3c068 R12: 0000000000000040
kernel: [ 2448.327988][   C8] R13: ffff9dac80314600 R14: ffff9dac8037fe00 R15: 00000000000001b3
kernel: [ 2448.327989][   C8] FS: 0000000000000000(0000) GS:ffff9ddb5a262000(0000) knlGS:0000000000000000
kernel: [ 2448.327990][   C8] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 2448.327991][   C8] CR2: 000000002aee3000 CR3: 000000024a27b002 CR4: 00000000007726f0
kernel: [ 2448.327992][   C8] PKRU: 55555554
kernel: [ 2448.327993][   C8] Call Trace:
kernel: [ 2448.327995][   C8] <TASK>
kernel: [ 2448.327997][   C8] ? __pfx_domain_context_clear_one_cb+0x10/0x10
kernel: [ 2448.328000][   C8] qi_flush_dev_iotlb+0xd5/0xe0
kernel: [ 2448.328002][   C8] __context_flush_dev_iotlb.part.0+0x3c/0x80
kernel: [ 2448.328004][   C8] domain_context_clear_one_cb+0x16/0x20
kernel: [ 2448.328006][   C8] pci_for_each_dma_alias+0x3b/0x140
kernel: [ 2448.328010][   C8] device_block_translation+0x122/0x180
kernel: [ 2448.328012][   C8] blocking_domain_attach_dev+0x39/0x50
kernel: [ 2448.328014][   C8] __iommu_attach_device+0x1b/0x90
kernel: [ 2448.328017][   C8] __iommu_device_set_domain+0x5d/0xb0
kernel: [ 2448.328019][   C8] __iommu_group_set_domain_internal+0x60/0x110
kernel: [ 2448.328021][   C8] iommu_detach_group+0x3a/0x60
kernel: [ 2448.328023][   C8] vfio_iommu_type1_detach_group+0x106/0x610 [vfio_iommu_type1]
kernel: [ 2448.328026][   C8] ? __dentry_kill+0x12a/0x180
kernel: [ 2448.328030][   C8] ? __pm_runtime_idle+0x44/0xe0
kernel: [ 2448.328035][   C8] vfio_group_detach_container+0x4f/0x160 [vfio]
kernel: [ 2448.328041][   C8] vfio_group_fops_release+0x3e/0x80 [vfio]
kernel: [ 2448.328046][   C8] __fput+0xe6/0x2b0
kernel: [ 2448.328049][   C8] task_work_run+0x58/0x90
kernel: [ 2448.328053][   C8] do_exit+0x29b/0xa80
kernel: [ 2448.328057][   C8] do_group_exit+0x2c/0x80
kernel: [ 2448.328060][   C8] get_signal+0x8f9/0x900
kernel: [ 2448.328064][   C8] arch_do_signal_or_restart+0x29/0x210
kernel: [ 2448.328068][   C8] ? __schedule+0x582/0xe80
kernel: [ 2448.328070][   C8] exit_to_user_mode_loop+0x8e/0x4f0
kernel: [ 2448.328074][   C8] do_syscall_64+0x262/0x630
kernel: [ 2448.328076][   C8] entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: [ 2448.328078][   C8] RIP: 0033:0x7fde19078514
kernel: [ 2448.328080][   C8] Code: Unable to access opcode bytes at 0x7fde190784ea.
kernel: [ 2448.328081][   C8] RSP: 002b:00007ffd0e1dc7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
kernel: [ 2448.328082][   C8] RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007fde19078514
kernel: [ 2448.328083][   C8] RDX: 00007fde1916e8c0 RSI: 000055b217303260 RDI: 0000000000000000
kernel: [ 2448.328085][   C8] RBP: 00007ffd0e1dc8a0 R08: 00007fde19173500 R09: 0000000000000000
kernel: [ 2448.328085][   C8] R10: fffffffffffffbea R11: 0000000000000246 R12: 000055b1f8d8d0b0
kernel: [ 2448.328086][   C8] R13: 00007ffd0e1dc980 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2448.328088][   C8] </TASK>
kernel: [ 2450.245901][   C7] watchdog: BUG: soft lockup - CPU#7 stuck for 41s! [mongoosev3-agen:4727]

>
> In any case this change cannot 100% fix the lockup. It just
> reduces the possibility which should be made clear.

I agree with the above, but it's better to cover more corner cases.

Best Regards,
Jinhui

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  2025-12-22 11:19     ` Jinhui Guo
@ 2025-12-23  4:06       ` Baolu Lu
  2025-12-23 14:58         ` Jinhui Guo
  2025-12-24  3:08         ` Tian, Kevin
  0 siblings, 2 replies; 12+ messages in thread
From: Baolu Lu @ 2025-12-23  4:06 UTC (permalink / raw)
  To: Jinhui Guo, kevin.tian
  Cc: dwmw2, iommu, joro, linux-kernel, stable, will, Bjorn Helgaas

On 12/22/25 19:19, Jinhui Guo wrote:
> On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
>>> From: Jinhui Guo<guojinhui.liam@bytedance.com>
>>> Sent: Thursday, December 11, 2025 12:00 PM
>>>
>>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
>>> request when device is disconnected") relies on
>>> pci_dev_is_disconnected() to skip ATS invalidation for
>>> safely-removed devices, but it does not cover link-down caused
>>> by faults, which can still hard-lock the system.
>> According to the commit msg it actually tries to fix the hard lockup
>> with surprise removal. For safe removal the device is not removed
>> before invalidation is done:
>>
>> "
>> For safe removal, device wouldn't be removed until the whole software
>> handling process is done, it wouldn't trigger the hard lock up issue
>> caused by too long ATS Invalidation timeout wait.
>> "
>>
>> Can you help articulate the problem especially about the part
>> 'link-down caused by faults'? What are those faults? How are
>> they different from the said surprise removal in the commit
>> msg to not set pci_dev_is_disconnected()?
>>
> Hi Kevin, sorry for the delayed reply.
>
> A normal or surprise removal of a PCIe device on a hot-plug port normally
> triggers an interrupt from the PCIe switch.
>
> We have, however, observed cases where no interrupt is generated when the
> device suddenly loses its link; the behaviour is identical to setting the
> Link Disable bit in the switch's Link Control register (offset 10h). Exactly
> what goes wrong in the LTSSM between the PCIe switch and the endpoint remains
> unknown.

In this scenario, the hardware has effectively vanished, yet the device
driver remains bound and the IOMMU resources haven't been released. I'm
just curious whether this stale state could trigger issues in other
places before the kernel fully realizes the device is gone? I'm not
objecting to the fix; I'm just interested in whether this 'zombie'
state creates risks elsewhere.

>
>>> For example, if a VM fails to connect to the PCIe device,
>> 'failed' for what reason?
>>
>>> "virsh destroy" is executed to release resources and isolate
>>> the fault, but a hard-lockup occurs while releasing the group fd.
>>>
>>> Call Trace:
>>> qi_submit_sync
>>> qi_flush_dev_iotlb
>>> intel_pasid_tear_down_entry
>>> device_block_translation
>>> blocking_domain_attach_dev
>>> __iommu_attach_device
>>> __iommu_device_set_domain
>>> __iommu_group_set_domain_internal
>>> iommu_detach_group
>>> vfio_iommu_type1_detach_group
>>> vfio_group_detach_container
>>> vfio_group_fops_release
>>> __fput
>>>
>>> Although pci_device_is_present() is slower than
>>> pci_dev_is_disconnected(), it still takes only ~70 µs on a
>>> ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
>>> and width increase.
>>>
>>> Besides, devtlb_invalidation_with_pasid() is called only in the
>>> paths below, which are far less frequent than memory map/unmap.
>>>
>>> 1. mm-struct release
>>> 2. {attach,release}_dev
>>> 3. set/remove PASID
>>> 4. dirty-tracking setup
>>>
>> surprise removal can happen at any time, e.g. after the check of
>> pci_device_is_present(). In the end we need the logic in
>> qi_check_fault() to check the presence upon ITE timeout error
>> received to break the infinite loop. So in your case even with
>> that logic in place you still observe lockup (probably due to
>> hardware ITE timeout is longer than the lockup detection on
>> the CPU?
> Are you referring to the timeout added in patch
> https://lore.kernel.org/all/20240222090251.2849702-4-haifeng.zhao@linux.intel.com/ ?

This doesn't appear to be a deterministic solution, because ...

> Our lockup-detection timeout is the default 10 s.
>
> We see ITE-timeout messages in the kernel log. Yet the system still
> hard-locks, probably because, as you mentioned, the hardware ITE timeout
> is longer than the CPU's lockup-detection window. I'll reproduce the
> case and follow up with a deeper analysis.

... as you see, neither the PCI nor the VT-d specifications mandate a
specific device-TLB invalidation timeout value for hardware
implementations. Consequently, the ITE timeout value may exceed the CPU
watchdog threshold, meaning a hard lockup will be detected before the
ITE even occurs.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  2025-12-23  4:06       ` Baolu Lu
@ 2025-12-23 14:58         ` Jinhui Guo
  2025-12-24  3:08         ` Tian, Kevin
  1 sibling, 0 replies; 12+ messages in thread
From: Jinhui Guo @ 2025-12-23 14:58 UTC (permalink / raw)
To: baolu.lu
Cc: bhelgaas, dwmw2, guojinhui.liam, iommu, joro, kevin.tian,
	linux-kernel, stable, will

On Tue, Dec 23, 2025 12:06:24 +0800, Baolu Lu wrote:
>
> On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
> >>> From: Jinhui Guo<guojinhui.liam@bytedance.com>
> >>> Sent: Thursday, December 11, 2025 12:00 PM
> >>>
> >>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> >>> request when device is disconnected") relies on
> >>> pci_dev_is_disconnected() to skip ATS invalidation for
> >>> safely-removed devices, but it does not cover link-down caused
> >>> by faults, which can still hard-lock the system.
> >> According to the commit msg it actually tries to fix the hard lockup
> >> with surprise removal. For safe removal the device is not removed
> >> before invalidation is done:
> >>
> >> "
> >> For safe removal, device wouldn't be removed until the whole software
> >> handling process is done, it wouldn't trigger the hard lock up issue
> >> caused by too long ATS Invalidation timeout wait.
> >> "
> >>
> >> Can you help articulate the problem especially about the part
> >> 'link-down caused by faults"? What are those faults? How are
> >> they different from the said surprise removal in the commit
> >> msg to not set pci_dev_is_disconnected()?
> >>
> > Hi, kevin, sorry for the delayed reply.
> >
> > A normal or surprise removal of a PCIe device on a hot-plug port normally
> > triggers an interrupt from the PCIe switch.
> >
> > We have, however, observed cases where no interrupt is generated when the
> > device suddenly loses its link; the behaviour is identical to setting the
> > Link Disable bit in the switch’s Link Control register (offset 10h).
Exactly
> > what goes wrong in the LTSSM between the PCIe switch and the endpoint remains
> > unknown.
>
> In this scenario, the hardware has effectively vanished, yet the device
> driver remains bound and the IOMMU resources haven't been released. I’m
> just curious if this stale state could trigger issues in other places
> before the kernel fully realizes the device is gone? I’m not objecting
> to the fix. I'm just interested in whether this 'zombie' state creates
> risks elsewhere.

Hi, Baolu

In our scenario we see no other issues; a hard-lockup panic is triggered
the moment the Mellanox Ethernet device vanishes. But we can analyze what
happens when we access the Mellanox Ethernet device whose link is
disabled. (If we check whether the PCIe endpoint device (Mellanox
Ethernet) is present before issuing device-IOTLB invalidation to the
Intel IOMMU, no other issues appear.)

According to the PCIe spec, Rev. 5.0 v1.0, Sec. 2.4.1, there are two
kinds of TLPs: posted and non-posted. Non-posted TLPs require a
completion TLP; posted TLPs do not.

- A Posted Request is a Memory Write Request or a Message Request.
- A Read Request is a Configuration Read Request, an I/O Read Request,
  or a Memory Read Request.
- An NPR (Non-Posted Request) with Data is a Configuration Write Request,
  an I/O Write Request, or an AtomicOp Request.
- A Non-Posted Request is a Read Request or an NPR with Data.

When the CPU issues a PCIe memory-write TLP (posted) via a MOV
instruction, the instruction retires immediately after the packet reaches
the Root Complex; no Data-Link ACK/NAK is required. A memory-read TLP
(non-posted), however, stalls the core until the corresponding Completion
TLP is received; if that Completion never arrives, the CPU hangs. (The
CPU hangs if the LTSSM does not enter the Disabled state.) However, if
the LTSSM enters the Disabled state, the Root Port returns Completer
Abort (CA) for any non-posted TLP, so the request completes with status
0xFFFFFFFF without stalling.
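[Editorial aside: the all-ones completion described above is also what a
presence check keys off. The sketch below is a userspace illustration of
that idea; the function names and the set of "invalid" vendor-ID patterns
are an assumption for illustration, not the kernel's actual
pci_device_is_present() implementation.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A config-space read of a device behind a disabled link completes
 * with all-ones data (Completer Abort -> 0xFFFFFFFF).  So a vendor-ID
 * dword that matches one of the known "failed read" patterns means the
 * device is effectively gone and a dev-TLB invalidation would never
 * complete.
 */
static bool vendor_id_valid(uint32_t l)
{
	/* Patterns an aborted or failed config read can produce. */
	return l != 0xffffffff && l != 0x00000000 &&
	       l != 0x0000ffff && l != 0xffff0000;
}

/* Decide whether it is worth issuing a dev-TLB invalidation at all. */
static bool device_seems_present(uint32_t vendor_dword)
{
	return vendor_id_valid(vendor_dword);
}
```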
I ran some tests on the machine after setting the Link Disable bit in the
switch’s Link Control register (offset 10h).

- setpci -s 0000:3c:08.0 CAP_EXP+10.w=0x0010

+-[0000:3a]-+-00.0-[3b-3f]----00.0-[3c-3f]--+-00.0-[3d]----
|           |                               +-04.0-[3e]----
|           |                               \-08.0-[3f]----00.0  Mellanox Technologies MT27800 Family [ConnectX-5]

# lspci -vvv -s 0000:3f:00.0
3f:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
...
	Region 0: Memory at 3af804000000 (64-bit, prefetchable) [size=32M]
...

1) Issue a PCI config-space read request and it returns 0xFFFFFFFF.

# lspci -vvv -s 0000:3f:00.0
3f:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core

2) Issuing a PCI memory read request through /dev/mem also returns 0xFFFFFFFF.

# ./devmem
Usage: ./devmem <phys_addr> <size> <offset> [value]
  phys_addr : physical base address of the BAR (hex or decimal)
  size      : mapping length in bytes (hex or decimal)
  offset    : register offset from BAR base (hex or decimal)
  value     : optional 32-bit value to write (hex or decimal)
Example:
  ./devmem 0x600000000 0x1000 0x0 0xDEADBEEF

# ./devmem 0x3af804000000 0x2000000 0x0
0x3af804000000 = 0xffffffff

Before the link was disabled, we could read 0x3af804000000 with devmem
and obtain a valid result.

# ./devmem 0x3af804000000 0x2000000 0x0
0x3af804000000 = 0x10002300

Besides, after searching the kernel code, I found many EP drivers already
check whether their endpoint is still present. There may be exception
cases in some PCIe endpoint drivers, such as commit 43bb40c5b926
("virtio_pci: Support surprise removal of virtio pci device").

Best Regards,
Jinhui

^ permalink raw reply	[flat|nested] 12+ messages in thread
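[Editorial aside: the core of a devmem-style tool like the one used in
the test above is a single mmap-based read. The helper below is a sketch
under that assumption (read32_at() is a hypothetical name, not the actual
tool); it is written against an arbitrary file path so it can also be
exercised on a regular file, since mapping /dev/mem needs root and real
hardware.]

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes of `path` starting at `base` (must be page-aligned,
 * as required by mmap) and read one 32-bit value at `offset` into the
 * mapping.  Returns 0 on success, -1 on error.  With path="/dev/mem"
 * and base = a BAR's physical address, this is the devmem-style MMIO
 * read; on a dead link such a read completes as 0xffffffff.
 */
static int read32_at(const char *path, off_t base, size_t len,
		     off_t offset, uint32_t *out)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, base);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	/* volatile: an MMIO read must not be optimized away or cached. */
	*out = *(volatile uint32_t *)((char *)map + offset);
	munmap(map, len);
	return 0;
}
```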
* RE: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode 2025-12-23 4:06 ` Baolu Lu 2025-12-23 14:58 ` Jinhui Guo @ 2025-12-24 3:08 ` Tian, Kevin 2026-02-10 23:39 ` Bjorn Helgaas 1 sibling, 1 reply; 12+ messages in thread From: Tian, Kevin @ 2025-12-24 3:08 UTC (permalink / raw) To: Baolu Lu, Guo, Jinhui, Bjorn Helgaas Cc: dwmw2@infradead.org, iommu@lists.linux.dev, joro@8bytes.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, will@kernel.org, Bjorn Helgaas +Bjorn for guidance. quick context - previously intel-iommu driver fixed a lockup issue in surprise removal, by checking pci_dev_is_disconnected(). But Jinhui still observed the lockup issue in a setup where no interrupt is raised to pci core upon surprise removal (so pci_dev_is_disconnected() is false), hence suggesting to replace the check with pci_device_is_present() instead. Bjorn, is it a common practice to fix it directly/only in drivers or should the pci core be notified e.g. simulating a late removal event? By searching the code looks it's the former, but better confirm with you before picking this fix... > From: Baolu Lu <baolu.lu@linux.intel.com> > Sent: Tuesday, December 23, 2025 12:06 PM > > On 12/22/25 19:19, Jinhui Guo wrote: > > On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote: > >>> From: Jinhui Guo<guojinhui.liam@bytedance.com> > >>> Sent: Thursday, December 11, 2025 12:00 PM > >>> > >>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation > >>> request when device is disconnected") relies on > >>> pci_dev_is_disconnected() to skip ATS invalidation for > >>> safely-removed devices, but it does not cover link-down caused > >>> by faults, which can still hard-lock the system. > >> According to the commit msg it actually tries to fix the hard lockup > >> with surprise removal. 
For safe removal the device is not removed > >> before invalidation is done: > >> > >> " > >> For safe removal, device wouldn't be removed until the whole software > >> handling process is done, it wouldn't trigger the hard lock up issue > >> caused by too long ATS Invalidation timeout wait. > >> " > >> > >> Can you help articulate the problem especially about the part > >> 'link-down caused by faults"? What are those faults? How are > >> they different from the said surprise removal in the commit > >> msg to not set pci_dev_is_disconnected()? > >> > > Hi, kevin, sorry for the delayed reply. > > > > A normal or surprise removal of a PCIe device on a hot-plug port normally > > triggers an interrupt from the PCIe switch. > > > > We have, however, observed cases where no interrupt is generated when > the > > device suddenly loses its link; the behaviour is identical to setting the > > Link Disable bit in the switch’s Link Control register (offset 10h). Exactly > > what goes wrong in the LTSSM between the PCIe switch and the endpoint > remains > > unknown. > > In this scenario, the hardware has effectively vanished, yet the device > driver remains bound and the IOMMU resources haven't been released. I’m > just curious if this stale state could trigger issues in other places > before the kernel fully realizes the device is gone? I’m not objecting > to the fix. I'm just interested in whether this 'zombie' state creates > risks elsewhere. > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode 2025-12-24 3:08 ` Tian, Kevin @ 2026-02-10 23:39 ` Bjorn Helgaas 2026-02-27 1:44 ` Samiullah Khawaja 0 siblings, 1 reply; 12+ messages in thread From: Bjorn Helgaas @ 2026-02-10 23:39 UTC (permalink / raw) To: Tian, Kevin Cc: Baolu Lu, Guo, Jinhui, Bjorn Helgaas, dwmw2@infradead.org, iommu@lists.linux.dev, joro@8bytes.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, will@kernel.org, Alex Williamson [+cc Alex, beginning of thread: https://lore.kernel.org/all/20251211035946.2071-1-guojinhui.liam@bytedance.com/] On Wed, Dec 24, 2025 at 03:08:49AM +0000, Tian, Kevin wrote: > +Bjorn for guidance. Sorry for the late response. > quick context - previously intel-iommu driver fixed a lockup issue in surprise > removal, by checking pci_dev_is_disconnected(). But Jinhui still observed the > lockup issue in a setup where no interrupt is raised to pci core upon surprise > removal (so pci_dev_is_disconnected() is false), hence suggesting to replace > the check with pci_device_is_present() instead. I think checking pci_dev_is_disconnected() or pci_device_is_present() in drivers is usually bad practice because it's always racy, as you've already pointed out. I don't think it's possible to avoid Invalidate Completion Timeouts in general, so I think the real solution is to figure out how to gracefully handle them without running into the lockup detection. I assume the lockup is the loop in qi_submit_sync() where we wait for QI_DONE with interrupts disabled. Maybe we need something like watchdog_hardlockup_touch_cpu() there, along with a timeout in that loop? 
The PCIe r7.0, sec 10.3.1, implementation note suggests the timeout might be in the 1-2 minute range, which is pretty extreme, but if we can at least handle timeouts gracefully, we can think about ways to make them less likely, e.g., by coordinating with FLR and VFIO detach (maybe the sort of thing Alex alluded to at https://lore.kernel.org/all/20251223153534.0968cc15.alex@shazbot.org). > Bjorn, is it a common practice to fix it directly/only in drivers or should the > pci core be notified e.g. simulating a late removal event? By searching the > code looks it's the former, but better confirm with you before picking this > fix... I don't know exactly what it would look like to simulate a late removal event, but it sounds like some kind of complicated infrastructure that would still be only a 90% solution, which I wouldn't recommend. > > From: Baolu Lu <baolu.lu@linux.intel.com> > > Sent: Tuesday, December 23, 2025 12:06 PM > > > > On 12/22/25 19:19, Jinhui Guo wrote: > > > On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote: > > >>> From: Jinhui Guo<guojinhui.liam@bytedance.com> > > >>> Sent: Thursday, December 11, 2025 12:00 PM > > >>> > > >>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation > > >>> request when device is disconnected") relies on > > >>> pci_dev_is_disconnected() to skip ATS invalidation for > > >>> safely-removed devices, but it does not cover link-down caused > > >>> by faults, which can still hard-lock the system. > > >> According to the commit msg it actually tries to fix the hard lockup > > >> with surprise removal. For safe removal the device is not removed > > >> before invalidation is done: > > >> > > >> " > > >> For safe removal, device wouldn't be removed until the whole software > > >> handling process is done, it wouldn't trigger the hard lock up issue > > >> caused by too long ATS Invalidation timeout wait. 
> > >> " > > >> > > >> Can you help articulate the problem especially about the part > > >> 'link-down caused by faults"? What are those faults? How are > > >> they different from the said surprise removal in the commit > > >> msg to not set pci_dev_is_disconnected()? > > >> > > > Hi, kevin, sorry for the delayed reply. > > > > > > A normal or surprise removal of a PCIe device on a hot-plug port normally > > > triggers an interrupt from the PCIe switch. > > > > > > We have, however, observed cases where no interrupt is generated when > > the > > > device suddenly loses its link; the behaviour is identical to setting the > > > Link Disable bit in the switch’s Link Control register (offset 10h). Exactly > > > what goes wrong in the LTSSM between the PCIe switch and the endpoint > > remains > > > unknown. > > > > In this scenario, the hardware has effectively vanished, yet the device > > driver remains bound and the IOMMU resources haven't been released. I’m > > just curious if this stale state could trigger issues in other places > > before the kernel fully realizes the device is gone? I’m not objecting > > to the fix. I'm just interested in whether this 'zombie' state creates > > risks elsewhere. > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode 2026-02-10 23:39 ` Bjorn Helgaas @ 2026-02-27 1:44 ` Samiullah Khawaja 0 siblings, 0 replies; 12+ messages in thread From: Samiullah Khawaja @ 2026-02-27 1:44 UTC (permalink / raw) To: Bjorn Helgaas Cc: Tian, Kevin, Baolu Lu, Guo, Jinhui, Bjorn Helgaas, dwmw2@infradead.org, iommu@lists.linux.dev, joro@8bytes.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, will@kernel.org, Alex Williamson On Tue, Feb 10, 2026 at 05:39:12PM -0600, Bjorn Helgaas wrote: >[+cc Alex, beginning of thread: >https://lore.kernel.org/all/20251211035946.2071-1-guojinhui.liam@bytedance.com/] > >On Wed, Dec 24, 2025 at 03:08:49AM +0000, Tian, Kevin wrote: >> +Bjorn for guidance. > >Sorry for the late response. > >> quick context - previously intel-iommu driver fixed a lockup issue in surprise >> removal, by checking pci_dev_is_disconnected(). But Jinhui still observed the >> lockup issue in a setup where no interrupt is raised to pci core upon surprise >> removal (so pci_dev_is_disconnected() is false), hence suggesting to replace >> the check with pci_device_is_present() instead. > >I think checking pci_dev_is_disconnected() or pci_device_is_present() >in drivers is usually bad practice because it's always racy, as you've >already pointed out. > >I don't think it's possible to avoid Invalidate Completion Timeouts in >general, so I think the real solution is to figure out how to >gracefully handle them without running into the lockup detection. > >I assume the lockup is the loop in qi_submit_sync() where we wait for >QI_DONE with interrupts disabled. Maybe we need something like >watchdog_hardlockup_touch_cpu() there, along with a timeout in that >loop? Looking at the AMD IOMMU driver, it has 100ms timeout in wait_on_sem() that basically waits for the completion until the timeout occurs. 
Is this the expected behaviour as per specification, or should the IOMMU
wait for the Invalidation Completion Timeout? Reading the specs (notes of
PCIe r7.0, sec 10.1.1, Figure 10-4), it seems the device is allowed to
send translated TLPs, targeting the address regions being invalidated,
until the Invalidation Completion Timeout (which could be 1-2 minutes as
Bjorn shared below).

>
>The PCIe r7.0, sec 10.3.1, implementation note suggests the timeout
>might be in the 1-2 minute range, which is pretty extreme, but if we
>can at least handle timeouts gracefully, we can think about ways to
>make them less likely, e.g., by coordinating with FLR and VFIO detach
>(maybe the sort of thing Alex alluded to at
>https://lore.kernel.org/all/20251223153534.0968cc15.alex@shazbot.org).
>
>> Bjorn, is it a common practice to fix it directly/only in drivers or should the
>> pci core be notified e.g. simulating a late removal event? By searching the
>> code looks it's the former, but better confirm with you before picking this
>> fix...
>
>I don't know exactly what it would look like to simulate a late
>removal event, but it sounds like some kind of complicated
>infrastructure that would still be only a 90% solution, which I
>wouldn't recommend.
>
>> > From: Baolu Lu <baolu.lu@linux.intel.com>
>> > Sent: Tuesday, December 23, 2025 12:06 PM
>> >
>> > On 12/22/25 19:19, Jinhui Guo wrote:
>> > > On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
>> > >>> From: Jinhui Guo<guojinhui.liam@bytedance.com>
>> > >>> Sent: Thursday, December 11, 2025 12:00 PM
>> > >>>
>> > >>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
>> > >>> request when device is disconnected") relies on
>> > >>> pci_dev_is_disconnected() to skip ATS invalidation for
>> > >>> safely-removed devices, but it does not cover link-down caused
>> > >>> by faults, which can still hard-lock the system.
>> > >> According to the commit msg it actually tries to fix the hard lockup >> > >> with surprise removal. For safe removal the device is not removed >> > >> before invalidation is done: >> > >> >> > >> " >> > >> For safe removal, device wouldn't be removed until the whole software >> > >> handling process is done, it wouldn't trigger the hard lock up issue >> > >> caused by too long ATS Invalidation timeout wait. >> > >> " >> > >> >> > >> Can you help articulate the problem especially about the part >> > >> 'link-down caused by faults"? What are those faults? How are >> > >> they different from the said surprise removal in the commit >> > >> msg to not set pci_dev_is_disconnected()? >> > >> >> > > Hi, kevin, sorry for the delayed reply. >> > > >> > > A normal or surprise removal of a PCIe device on a hot-plug port normally >> > > triggers an interrupt from the PCIe switch. >> > > >> > > We have, however, observed cases where no interrupt is generated when >> > the >> > > device suddenly loses its link; the behaviour is identical to setting the >> > > Link Disable bit in the switch’s Link Control register (offset 10h). Exactly >> > > what goes wrong in the LTSSM between the PCIe switch and the endpoint >> > remains >> > > unknown. >> > >> > In this scenario, the hardware has effectively vanished, yet the device >> > driver remains bound and the IOMMU resources haven't been released. I’m >> > just curious if this stale state could trigger issues in other places >> > before the kernel fully realizes the device is gone? I’m not objecting >> > to the fix. I'm just interested in whether this 'zombie' state creates >> > risks elsewhere. >> > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device
  2025-12-11  3:59 [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Jinhui Guo
  2025-12-11  3:59 ` [PATCH v2 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode Jinhui Guo
  2025-12-11  3:59 ` [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in " Jinhui Guo
@ 2026-01-20  6:49 ` Baolu Lu
  2026-03-01  3:50 ` Ethan Zhao
  3 siblings, 0 replies; 12+ messages in thread
From: Baolu Lu @ 2026-01-20  6:49 UTC (permalink / raw)
To: Jinhui Guo, dwmw2, joro, will; +Cc: iommu, linux-kernel, stable

On 12/11/25 11:59, Jinhui Guo wrote:
> Hi, all
>
> We hit hard-lockups when the Intel IOMMU waits indefinitely for an ATS invalidation
> that cannot complete, especially under GDR high-load conditions.
>
> 1. Hard-lock when a passthrough PCIe NIC with ATS enabled link-down in Intel IOMMU
> non-scalable mode. Two scenarios exist: NIC link-down with an explicit link-down
> event and link-down without any event.
>
> a) NIC link-down with an explicit link-down event.
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> __context_flush_dev_iotlb.part.0
> domain_context_clear_one_cb
> pci_for_each_dma_alias
> device_block_translation
> blocking_domain_attach_dev
> iommu_deinit_device
> __iommu_group_remove_device
> iommu_release_device
> iommu_bus_notifier
> blocking_notifier_call_chain
> bus_notify
> device_del
> pci_remove_bus_device
> pci_stop_and_remove_bus_device
> pciehp_unconfigure_device
> pciehp_disable_slot
> pciehp_handle_presence_or_link_change
> pciehp_ist
>
> b) NIC link-down without an event - hard-lock on VM destroy.
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> __context_flush_dev_iotlb.part.0
> domain_context_clear_one_cb
> pci_for_each_dma_alias
> device_block_translation
> blocking_domain_attach_dev
> __iommu_attach_device
> __iommu_device_set_domain
> __iommu_group_set_domain_internal
> iommu_detach_group
> vfio_iommu_type1_detach_group
> vfio_group_detach_container
> vfio_group_fops_release
> __fput
>
> 2. Hard-lock when a passthrough PCIe NIC with ATS enabled link-down in Intel IOMMU
> scalable mode; NIC link-down without an event hard-locks on VM destroy.
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> intel_pasid_tear_down_entry
> device_block_translation
> blocking_domain_attach_dev
> __iommu_attach_device
> __iommu_device_set_domain
> __iommu_group_set_domain_internal
> iommu_detach_group
> vfio_iommu_type1_detach_group
> vfio_group_detach_container
> vfio_group_fops_release
> __fput
>
> Fix both issues with two patches:
> 1. Skip dev-IOTLB flush for inaccessible devices in __context_flush_dev_iotlb() using
> pci_device_is_present().
> 2. Use pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to
> skip ATS invalidation in devtlb_invalidation_with_pasid().
>
> Best Regards,
> Jinhui
>
> ---
> v1: https://lore.kernel.org/all/20251210171431.1589-1-guojinhui.liam@bytedance.com/
>
> Changelog in v1 -> v2 (suggested by Baolu Lu)
> - Simplify the pci_device_is_present() check in __context_flush_dev_iotlb().
> - Add Cc: stable@vger.kernel.org to both patches.
>
> Jinhui Guo (2):
>   iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without
>     scalable mode
>   iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in
>     scalable mode

Queued for iommu next.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device
  2025-12-11  3:59 [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Jinhui Guo
                   ` (2 preceding siblings ...)
  2026-01-20  6:49 ` [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Baolu Lu
@ 2026-03-01  3:50 ` Ethan Zhao
  3 siblings, 0 replies; 12+ messages in thread
From: Ethan Zhao @ 2026-03-01  3:50 UTC (permalink / raw)
To: Jinhui Guo, dwmw2, baolu.lu, joro, will; +Cc: iommu, linux-kernel, stable

On 12/11/2025 11:59 AM, Jinhui Guo wrote:
> Hi, all
>
> We hit hard-lockups when the Intel IOMMU waits indefinitely for an ATS invalidation
> that cannot complete, especially under GDR high-load conditions.
>
> 1. Hard-lock when a passthrough PCIe NIC with ATS enabled link-down in Intel IOMMU
> non-scalable mode. Two scenarios exist: NIC link-down with an explicit link-down
> event and link-down without any event.
>
> a) NIC link-down with an explicit link-down event.
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> __context_flush_dev_iotlb.part.0
> domain_context_clear_one_cb
> pci_for_each_dma_alias
> device_block_translation
> blocking_domain_attach_dev
> iommu_deinit_device
> __iommu_group_remove_device
> iommu_release_device
> iommu_bus_notifier
> blocking_notifier_call_chain
> bus_notify
> device_del
> pci_remove_bus_device
> pci_stop_and_remove_bus_device
> pciehp_unconfigure_device
> pciehp_disable_slot
> pciehp_handle_presence_or_link_change
> pciehp_ist
>
> b) NIC link-down without an event - hard-lock on VM destroy.
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> __context_flush_dev_iotlb.part.0
> domain_context_clear_one_cb
> pci_for_each_dma_alias
> device_block_translation
> blocking_domain_attach_dev
> __iommu_attach_device
> __iommu_device_set_domain
> __iommu_group_set_domain_internal
> iommu_detach_group
> vfio_iommu_type1_detach_group
> vfio_group_detach_container
> vfio_group_fops_release
> __fput
>
> 2.
Hard-lock when a passthrough PCIe NIC with ATS enabled link-down in Intel IOMMU
> scalable mode; NIC link-down without an event hard-locks on VM destroy.
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> intel_pasid_tear_down_entry
> device_block_translation
> blocking_domain_attach_dev
> __iommu_attach_device
> __iommu_device_set_domain
> __iommu_group_set_domain_internal
> iommu_detach_group
> vfio_iommu_type1_detach_group
> vfio_group_detach_container
> vfio_group_fops_release
> __fput
>
> Fix both issues with two patches:
> 1. Skip dev-IOTLB flush for inaccessible devices in __context_flush_dev_iotlb() using
> pci_device_is_present().
> 2. Use pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to
> skip ATS invalidation in devtlb_invalidation_with_pasid().

If I remember right, using pci_device_is_present() to replace
pci_dev_is_disconnected() might not be the correct choice for the
link-down case; you might misunderstand the function of
pci_device_is_present() when the device is there but the link is not up.
If you want to check link status, just check link status.

Bjorn, correct me if I am wrong.

Thanks,
Ethan

>
> Best Regards,
> Jinhui
>
> ---
> v1: https://lore.kernel.org/all/20251210171431.1589-1-guojinhui.liam@bytedance.com/
>
> Changelog in v1 -> v2 (suggested by Baolu Lu)
> - Simplify the pci_device_is_present() check in __context_flush_dev_iotlb().
> - Add Cc: stable@vger.kernel.org to both patches.
>
> Jinhui Guo (2):
>   iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without
>     scalable mode
>   iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in
>     scalable mode
>
>  drivers/iommu/intel/pasid.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2026-03-01 3:50 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-12-11 3:59 [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Jinhui Guo 2025-12-11 3:59 ` [PATCH v2 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode Jinhui Guo 2025-12-11 3:59 ` [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in " Jinhui Guo 2025-12-18 8:04 ` Tian, Kevin 2025-12-22 11:19 ` Jinhui Guo 2025-12-23 4:06 ` Baolu Lu 2025-12-23 14:58 ` Jinhui Guo 2025-12-24 3:08 ` Tian, Kevin 2026-02-10 23:39 ` Bjorn Helgaas 2026-02-27 1:44 ` Samiullah Khawaja 2026-01-20 6:49 ` [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Baolu Lu 2026-03-01 3:50 ` Ethan Zhao
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox