* [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device
@ 2025-12-11 3:59 Jinhui Guo
2025-12-11 3:59 ` [PATCH v2 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode Jinhui Guo
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Jinhui Guo @ 2025-12-11 3:59 UTC (permalink / raw)
To: dwmw2, baolu.lu, joro, will; +Cc: guojinhui.liam, iommu, linux-kernel, stable
Hi all,
We hit hard-lockups when the Intel IOMMU waits indefinitely for an ATS invalidation
that cannot complete, especially under GDR high-load conditions.
1. Hard lockup when a passthrough PCIe NIC with ATS enabled goes link-down
while the Intel IOMMU is in non-scalable mode. Two scenarios exist: NIC
link-down with an explicit link-down event, and link-down without any event.
a) NIC link-down with an explicit link-down event.
Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
__context_flush_dev_iotlb.part.0
domain_context_clear_one_cb
pci_for_each_dma_alias
device_block_translation
blocking_domain_attach_dev
iommu_deinit_device
__iommu_group_remove_device
iommu_release_device
iommu_bus_notifier
blocking_notifier_call_chain
bus_notify
device_del
pci_remove_bus_device
pci_stop_and_remove_bus_device
pciehp_unconfigure_device
pciehp_disable_slot
pciehp_handle_presence_or_link_change
pciehp_ist
b) NIC link-down without an event - hard-lock on VM destroy.
Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
__context_flush_dev_iotlb.part.0
domain_context_clear_one_cb
pci_for_each_dma_alias
device_block_translation
blocking_domain_attach_dev
__iommu_attach_device
__iommu_device_set_domain
__iommu_group_set_domain_internal
iommu_detach_group
vfio_iommu_type1_detach_group
vfio_group_detach_container
vfio_group_fops_release
__fput
2. Hard lockup when a passthrough PCIe NIC with ATS enabled goes link-down
while the Intel IOMMU is in scalable mode; NIC link-down without an event
hard-locks the host on VM destroy.
Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
intel_pasid_tear_down_entry
device_block_translation
blocking_domain_attach_dev
__iommu_attach_device
__iommu_device_set_domain
__iommu_group_set_domain_internal
iommu_detach_group
vfio_iommu_type1_detach_group
vfio_group_detach_container
vfio_group_fops_release
__fput
Fix both issues with two patches:
1. Skip dev-IOTLB flush for inaccessible devices in __context_flush_dev_iotlb() using
pci_device_is_present().
2. Use pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to
skip ATS invalidation in devtlb_invalidation_with_pasid().
Best Regards,
Jinhui
---
v1: https://lore.kernel.org/all/20251210171431.1589-1-guojinhui.liam@bytedance.com/
Changes in v1 -> v2 (suggested by Baolu Lu):
- Simplify the pci_device_is_present() check in __context_flush_dev_iotlb().
- Add Cc: stable@vger.kernel.org to both patches.
Jinhui Guo (2):
iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without
scalable mode
iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in
scalable mode
drivers/iommu/intel/pasid.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
--
2.20.1
^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
  2025-12-11  3:59 [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Jinhui Guo
@ 2025-12-11  3:59 ` Jinhui Guo
  2025-12-11  3:59 ` [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in " Jinhui Guo
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Jinhui Guo @ 2025-12-11  3:59 UTC (permalink / raw)
  To: dwmw2, baolu.lu, joro, will; +Cc: guojinhui.liam, iommu, linux-kernel, stable

PCIe endpoints with ATS enabled and passed through to userspace (e.g.,
QEMU, DPDK) can hard-lock the host when their link drops, either by
surprise removal or by a link fault.

Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request
when device is disconnected") adds pci_dev_is_disconnected() to
devtlb_invalidation_with_pasid() so ATS invalidation is skipped only
when the device is being safely removed, but it applies only when Intel
IOMMU scalable mode is enabled.

With scalable mode disabled or unsupported, a system hard-lock occurs
when a PCIe endpoint's link drops because the Intel IOMMU waits
indefinitely for an ATS invalidation that cannot complete.
Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 domain_context_clear_one_cb
 pci_for_each_dma_alias
 device_block_translation
 blocking_domain_attach_dev
 iommu_deinit_device
 __iommu_group_remove_device
 iommu_release_device
 iommu_bus_notifier
 blocking_notifier_call_chain
 bus_notify
 device_del
 pci_remove_bus_device
 pci_stop_and_remove_bus_device
 pciehp_unconfigure_device
 pciehp_disable_slot
 pciehp_handle_presence_or_link_change
 pciehp_ist

Commit 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release")
adds intel_pasid_teardown_sm_context() to intel_iommu_release_device(),
which calls qi_flush_dev_iotlb() and can also hard-lock the system when
a PCIe endpoint's link drops.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 intel_context_flush_no_pasid
 device_pasid_table_teardown
 pci_pasid_table_teardown
 pci_for_each_dma_alias
 intel_pasid_teardown_sm_context
 intel_iommu_release_device
 iommu_deinit_device
 __iommu_group_remove_device
 iommu_release_device
 iommu_bus_notifier
 blocking_notifier_call_chain
 bus_notify
 device_del
 pci_remove_bus_device
 pci_stop_and_remove_bus_device
 pciehp_unconfigure_device
 pciehp_disable_slot
 pciehp_handle_presence_or_link_change
 pciehp_ist

Sometimes the endpoint loses connection without a link-down event
(e.g., due to a link fault); killing the process (virsh destroy) then
hard-locks the host.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 domain_context_clear_one_cb
 pci_for_each_dma_alias
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput

pci_dev_is_disconnected() only covers safe-removal paths;
pci_device_is_present() tests accessibility by reading vendor/device
IDs and internally calls pci_dev_is_disconnected(). On a ConnectX-5
(8 GT/s, x2) this costs ~70 µs.
Since __context_flush_dev_iotlb() is only called on {attach,release}_dev
paths (not hot), add pci_device_is_present() there to skip inaccessible
devices and avoid the hard-lock.

Fixes: 37764b952e1b ("iommu/vt-d: Global devTLB flush when present context entry changed")
Fixes: 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
 drivers/iommu/intel/pasid.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 3e2255057079..a369690f5926 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -1102,6 +1102,15 @@ static void __context_flush_dev_iotlb(struct device_domain_info *info)
 	if (!info->ats_enabled)
 		return;
 
+	/*
+	 * Skip dev-IOTLB flush for inaccessible PCIe devices to prevent the
+	 * Intel IOMMU from waiting indefinitely for an ATS invalidation that
+	 * cannot complete.
+	 */
+	if (dev_is_pci(info->dev) &&
+	    !pci_device_is_present(to_pci_dev(info->dev)))
+		return;
+
 	qi_flush_dev_iotlb(info->iommu, PCI_DEVID(info->bus, info->devfn),
 			   info->pfsid, info->ats_qdep, 0, MAX_AGAW_PFN_WIDTH);
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  2025-12-11  3:59 [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Jinhui Guo
  2025-12-11  3:59 ` [PATCH v2 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode Jinhui Guo
@ 2025-12-11  3:59 ` Jinhui Guo
  2025-12-18  8:04   ` Tian, Kevin
  2026-01-20  6:49 ` [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Baolu Lu
  2026-03-01  3:50 ` Ethan Zhao
  3 siblings, 1 reply; 12+ messages in thread
From: Jinhui Guo @ 2025-12-11  3:59 UTC (permalink / raw)
  To: dwmw2, baolu.lu, joro, will; +Cc: guojinhui.liam, iommu, linux-kernel, stable

Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request
when device is disconnected") relies on pci_dev_is_disconnected() to
skip ATS invalidation for safely-removed devices, but it does not cover
link-down caused by faults, which can still hard-lock the system.

For example, if a VM fails to connect to the PCIe device, "virsh
destroy" is executed to release resources and isolate the fault, but a
hard-lockup occurs while releasing the group fd.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 intel_pasid_tear_down_entry
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput

Although pci_device_is_present() is slower than
pci_dev_is_disconnected(), it still takes only ~70 µs on a ConnectX-5
(8 GT/s, x2) and becomes even faster as PCIe speed and width increase.

Besides, devtlb_invalidation_with_pasid() is called only in the paths
below, which are far less frequent than memory map/unmap.

1. mm-struct release
2. {attach,release}_dev
3. set/remove PASID
4. dirty-tracking setup

The gain in system stability far outweighs the negligible cost of using
pci_device_is_present() instead of pci_dev_is_disconnected() to decide
when to skip ATS invalidation, especially under GDR high-load
conditions.

Fixes: 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
 drivers/iommu/intel/pasid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index a369690f5926..e64d445de964 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -218,7 +218,7 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 	if (!info || !info->ats_enabled)
 		return;
 
-	if (pci_dev_is_disconnected(to_pci_dev(dev)))
+	if (!pci_device_is_present(to_pci_dev(dev)))
 		return;
 
 	sid = PCI_DEVID(info->bus, info->devfn);
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* RE: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  2025-12-11  3:59 ` [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in " Jinhui Guo
@ 2025-12-18  8:04   ` Tian, Kevin
  2025-12-22 11:19     ` Jinhui Guo
  0 siblings, 1 reply; 12+ messages in thread
From: Tian, Kevin @ 2025-12-18  8:04 UTC (permalink / raw)
  To: Guo, Jinhui, dwmw2@infradead.org, baolu.lu@linux.intel.com,
	joro@8bytes.org, will@kernel.org
  Cc: Guo, Jinhui, iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org

> From: Jinhui Guo <guojinhui.liam@bytedance.com>
> Sent: Thursday, December 11, 2025 12:00 PM
>
> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> request when device is disconnected") relies on
> pci_dev_is_disconnected() to skip ATS invalidation for
> safely-removed devices, but it does not cover link-down caused
> by faults, which can still hard-lock the system.

According to the commit msg it actually tries to fix the hard lockup
with surprise removal. For safe removal the device is not removed
before invalidation is done:

"
For safe removal, device wouldn't be removed until the whole software
handling process is done, it wouldn't trigger the hard lock up issue
caused by too long ATS Invalidation timeout wait.
"

Can you help articulate the problem, especially the part about
"link-down caused by faults"? What are those faults? How are they
different from the surprise removal described in that commit msg, such
that pci_dev_is_disconnected() is not set?

>
> For example, if a VM fails to connect to the PCIe device,

'failed' for what reason?

> "virsh destroy" is executed to release resources and isolate
> the fault, but a hard-lockup occurs while releasing the group fd.
>
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> intel_pasid_tear_down_entry
> device_block_translation
> blocking_domain_attach_dev
> __iommu_attach_device
> __iommu_device_set_domain
> __iommu_group_set_domain_internal
> iommu_detach_group
> vfio_iommu_type1_detach_group
> vfio_group_detach_container
> vfio_group_fops_release
> __fput
>
> Although pci_device_is_present() is slower than
> pci_dev_is_disconnected(), it still takes only ~70 µs on a
> ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
> and width increase.
>
> Besides, devtlb_invalidation_with_pasid() is called only in the
> paths below, which are far less frequent than memory map/unmap.
>
> 1. mm-struct release
> 2. {attach,release}_dev
> 3. set/remove PASID
> 4. dirty-tracking setup
>

Surprise removal can happen at any time, e.g. after the check of
pci_device_is_present(). In the end we need the logic in
qi_check_fault() to check the presence upon an ITE timeout error to
break the infinite loop. So in your case, even with that logic in
place, you still observe a lockup (probably because the hardware ITE
timeout is longer than the lockup detection on the CPU?).

In any case this change cannot 100% fix the lockup. It just
reduces the possibility, which should be made clear.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  2025-12-18  8:04   ` Tian, Kevin
@ 2025-12-22 11:19     ` Jinhui Guo
  2025-12-23  4:06       ` Baolu Lu
  0 siblings, 1 reply; 12+ messages in thread
From: Jinhui Guo @ 2025-12-22 11:19 UTC (permalink / raw)
  To: kevin.tian
  Cc: baolu.lu, dwmw2, guojinhui.liam, iommu, joro, linux-kernel,
	stable, will

On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
> > From: Jinhui Guo <guojinhui.liam@bytedance.com>
> > Sent: Thursday, December 11, 2025 12:00 PM
> >
> > Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> > request when device is disconnected") relies on
> > pci_dev_is_disconnected() to skip ATS invalidation for
> > safely-removed devices, but it does not cover link-down caused
> > by faults, which can still hard-lock the system.
>
> According to the commit msg it actually tries to fix the hard lockup
> with surprise removal. For safe removal the device is not removed
> before invalidation is done:
>
> "
> For safe removal, device wouldn't be removed until the whole software
> handling process is done, it wouldn't trigger the hard lock up issue
> caused by too long ATS Invalidation timeout wait.
> "
>
> Can you help articulate the problem especially about the part
> 'link-down caused by faults'? What are those faults? How are
> they different from the said surprise removal in the commit
> msg to not set pci_dev_is_disconnected()?
>

Hi Kevin, sorry for the delayed reply.

A normal or surprise removal of a PCIe device on a hot-plug port
normally triggers an interrupt from the PCIe switch.

We have, however, observed cases where no interrupt is generated when
the device suddenly loses its link; the behaviour is identical to
setting the Link Disable bit in the switch's Link Control register
(offset 10h). Exactly what goes wrong in the LTSSM between the PCIe
switch and the endpoint remains unknown.

> >
> > For example, if a VM fails to connect to the PCIe device,
>
> 'failed' for what reason?
>
> > "virsh destroy" is executed to release resources and isolate
> > the fault, but a hard-lockup occurs while releasing the group fd.
> >
> > Call Trace:
> > qi_submit_sync
> > qi_flush_dev_iotlb
> > intel_pasid_tear_down_entry
> > device_block_translation
> > blocking_domain_attach_dev
> > __iommu_attach_device
> > __iommu_device_set_domain
> > __iommu_group_set_domain_internal
> > iommu_detach_group
> > vfio_iommu_type1_detach_group
> > vfio_group_detach_container
> > vfio_group_fops_release
> > __fput
> >
> > Although pci_device_is_present() is slower than
> > pci_dev_is_disconnected(), it still takes only ~70 µs on a
> > ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
> > and width increase.
> >
> > Besides, devtlb_invalidation_with_pasid() is called only in the
> > paths below, which are far less frequent than memory map/unmap.
> >
> > 1. mm-struct release
> > 2. {attach,release}_dev
> > 3. set/remove PASID
> > 4. dirty-tracking setup
> >
>
> surprise removal can happen at any time, e.g. after the check of
> pci_device_is_present(). In the end we need the logic in
> qi_check_fault() to check the presence upon ITE timeout error
> received to break the infinite loop. So in your case even with
> that logic in place you still observe lockup (probably due to
> hardware ITE timeout is longer than the lockup detection on
> the CPU?

Are you referring to the timeout added in patch
https://lore.kernel.org/all/20240222090251.2849702-4-haifeng.zhao@linux.intel.com/ ?

Our lockup-detection timeout is the default 10 s.

We see ITE-timeout messages in the kernel log. Yet the system still
hard-locks, probably because, as you mentioned, the hardware ITE
timeout is longer than the CPU's lockup-detection window. I'll
reproduce the case and follow up with a deeper analysis.
kernel: [ 2402.642685][ T607] vfio-pci 0000:3f:00.0: Unable to change power state from D0 to D3hot, device inaccessible
kernel: [ 2403.441828][T49880] DMAR: VT-d detected Invalidation Time-out Error: SID 0
kernel: [ 2403.441830][   C0] DMAR: DRHD: handling fault status reg 40
kernel: [ 2403.441831][T49880] DMAR: QI HEAD: Invalidation Wait qw0 = 0x200000025, qw1 = 0x1003a07fc
kernel: [ 2403.441833][T49880] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x1003a07f8
kernel: [ 2403.441879][T49880] DMAR: Invalidation Time-out Error (ITE) cleared
kernel: [ 2423.643527][   C7] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
kernel: [ 2423.643551][   C7] rcu: 8-...0: (0 ticks this GP) idle=198c/1/0x4000000000000000 softirq=19450/19450 fqs=4403
kernel: [ 2423.643567][   C7] rcu: (detected by 7, t=21002 jiffies, g=238909, q=4932 ncpus=96)
kernel: [ 2423.643578][   C7] Sending NMI from CPU 7 to CPUs 8:
kernel: [ 2423.643581][   C8] NMI backtrace for cpu 8
kernel: [ 2423.643585][   C8] CPU: 8 UID: 0 PID: 49880 Comm: vfio_test Kdump: loaded Tainted: G S E 6.18.0 #5 PREEMPT(voluntary)
kernel: [ 2423.643588][   C8] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
kernel: [ 2423.643589][   C8] Hardware name: Inspur NF5468M5/YZMB-01130-105, BIOS 4.2.0 04/28/2021
kernel: [ 2423.643590][   C8] RIP: 0010:qi_submit_sync+0x6cf/0x8d0
kernel: [ 2423.643597][   C8] Code: 89 4c 24 50 89 70 34 48 c7 c7 f0 f5 4a a5 e8 48 15 89 ff 48 8b 4c 24 50 8b 54 24 58 49 8b 76 10 49 63 c7 48 8d 04 86 83 38 01 <75> 06 c7 00 03 00 00 00 41 81 c7 fe 00 00 00 44 89 f8 c1 f8 1f c1
kernel: [ 2423.643598][   C8] RSP: 0018:ffffb5a3bd0a7a30 EFLAGS: 00000097
kernel: [ 2423.643600][   C8] RAX: ffff9dac803a06bc RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 2423.643601][   C8] RDX: 00000000000000fe RSI: ffff9dac803a0400 RDI: ffff9ddb0081d480
kernel: [ 2423.643602][   C8] RBP: ffff9dac8037fe00 R08: 0000000000000000 R09: 0000000000000003
kernel: [ 2423.643603][   C8] R10: ffffb5a3bd0a78e0 R11: ffff9e0bbff3c068 R12: 0000000000000040
kernel: [ 2423.643605][   C8] R13: ffff9dac80314600 R14: ffff9dac8037fe00 R15: 00000000000000af
kernel: [ 2423.643606][   C8] FS: 0000000000000000(0000) GS:ffff9ddb5a262000(0000) knlGS:0000000000000000
kernel: [ 2423.643607][   C8] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 2423.643608][   C8] CR2: 000000002aee3000 CR3: 000000024a27b002 CR4: 00000000007726f0
kernel: [ 2423.643610][   C8] PKRU: 55555554
kernel: [ 2423.643611][   C8] Call Trace:
kernel: [ 2423.643613][   C8] <TASK>
kernel: [ 2423.643616][   C8] ? __pfx_domain_context_clear_one_cb+0x10/0x10
kernel: [ 2423.643620][   C8] qi_flush_dev_iotlb+0xd5/0xe0
kernel: [ 2423.643622][   C8] __context_flush_dev_iotlb.part.0+0x3c/0x80
kernel: [ 2423.643625][   C8] domain_context_clear_one_cb+0x16/0x20
kernel: [ 2423.643626][   C8] pci_for_each_dma_alias+0x3b/0x140
kernel: [ 2423.643631][   C8] device_block_translation+0x122/0x180
kernel: [ 2423.643634][   C8] blocking_domain_attach_dev+0x39/0x50
kernel: [ 2423.643636][   C8] __iommu_attach_device+0x1b/0x90
kernel: [ 2423.643639][   C8] __iommu_device_set_domain+0x5d/0xb0
kernel: [ 2423.643642][   C8] __iommu_group_set_domain_internal+0x60/0x110
kernel: [ 2423.643644][   C8] iommu_detach_group+0x3a/0x60
kernel: [ 2423.643650][   C8] vfio_iommu_type1_detach_group+0x106/0x610 [vfio_iommu_type1]
kernel: [ 2423.643654][   C8] ? __dentry_kill+0x12a/0x180
kernel: [ 2423.643660][   C8] ? __pm_runtime_idle+0x44/0xe0
kernel: [ 2423.643666][   C8] vfio_group_detach_container+0x4f/0x160 [vfio]
kernel: [ 2423.643672][   C8] vfio_group_fops_release+0x3e/0x80 [vfio]
kernel: [ 2423.643677][   C8] __fput+0xe6/0x2b0
kernel: [ 2423.643682][   C8] task_work_run+0x58/0x90
kernel: [ 2423.643688][   C8] do_exit+0x29b/0xa80
kernel: [ 2423.643694][   C8] do_group_exit+0x2c/0x80
kernel: [ 2423.643696][   C8] get_signal+0x8f9/0x900
kernel: [ 2423.643700][   C8] arch_do_signal_or_restart+0x29/0x210
kernel: [ 2423.643704][   C8] ? __schedule+0x582/0xe80
kernel: [ 2423.643708][   C8] exit_to_user_mode_loop+0x8e/0x4f0
kernel: [ 2423.643712][   C8] do_syscall_64+0x262/0x630
kernel: [ 2423.643717][   C8] entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: [ 2423.643720][   C8] RIP: 0033:0x7fde19078514
kernel: [ 2423.643722][   C8] Code: Unable to access opcode bytes at 0x7fde190784ea.
kernel: [ 2423.643723][   C8] RSP: 002b:00007ffd0e1dc7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
kernel: [ 2423.643724][   C8] RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007fde19078514
kernel: [ 2423.643726][   C8] RDX: 00007fde1916e8c0 RSI: 000055b217303260 RDI: 0000000000000000
kernel: [ 2423.643727][   C8] RBP: 00007ffd0e1dc8a0 R08: 00007fde19173500 R09: 0000000000000000
kernel: [ 2423.643728][   C8] R10: fffffffffffffbea R11: 0000000000000246 R12: 000055b1f8d8d0b0
kernel: [ 2423.643729][   C8] R13: 00007ffd0e1dc980 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2423.643731][   C8] </TASK>
kernel: [ 2424.375254][T81463] vfio-pci 0000:3f:00.0: Unable to change power state from D3cold to D0, device inaccessible
...
kernel: [ 2448.327929][   C8] watchdog: CPU8: Watchdog detected hard LOCKUP on cpu 8
kernel: [ 2448.327932][   C8] Modules linked in: vfio_pci(E) vfio_pci_core(E) vfio_iommu_type1(E) vfio(E) udp_diag(E) tcp_diag(E) inet_diag(E) binfmt_misc(E) ip_set_hash_net(E) nft_compat(E) x_tables(E) ip_set(E) msr(E) nf_tables(E) ...
kernel: [ 2448.327963][   C8] ib_core(E) hid_generic(E) usbhid(E) hid(E) ahci(E) libahci(E) xhci_pci(E) libata(E) nvme(E) xhci_hcd(E) i2c_i801(E) nvme_core(E) usbcore(E) scsi_mod(E) mlx5_core(E) i2c_smbus(E) lpc_ich(E) usb_common(E) scsi_common(E) wmi(E)
kernel: [ 2448.327972][   C8] CPU: 8 UID: 0 PID: 49880 Comm: vfio_test Kdump: loaded Tainted: G S EL 6.18.0 #5 PREEMPT(voluntary)
kernel: [ 2448.327975][   C8] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE, [L]=SOFTLOCKUP
kernel: [ 2448.327976][   C8] Hardware name: Inspur NF5468M5/YZMB-01130-105, BIOS 4.2.0 04/28/2021
kernel: [ 2448.327977][   C8] RIP: 0010:qi_submit_sync+0x6e7/0x8d0
kernel: [ 2448.327981][   C8] Code: 8b 54 24 58 49 8b 76 10 49 63 c7 48 8d 04 86 83 38 01 75 06 c7 00 03 00 00 00 41 81 c7 fe 00 00 00 44 89 f8 c1 f8 1f c1 e8 18 <41> 01 c7 45 0f b6 ff 41 29 c7 44 39 fa 75 cb 48 85 c9 0f 85 05 01
kernel: [ 2448.327983][   C8] RSP: 0018:ffffb5a3bd0a7a30 EFLAGS: 00000046
kernel: [ 2448.327984][   C8] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 2448.327985][   C8] RDX: 00000000000000fe RSI: ffff9dac803a0400 RDI: ffff9ddb0081d480
kernel: [ 2448.327986][   C8] RBP: ffff9dac8037fe00 R08: 0000000000000000 R09: 0000000000000003
kernel: [ 2448.327987][   C8] R10: ffffb5a3bd0a78e0 R11: ffff9e0bbff3c068 R12: 0000000000000040
kernel: [ 2448.327988][   C8] R13: ffff9dac80314600 R14: ffff9dac8037fe00 R15: 00000000000001b3
kernel: [ 2448.327989][   C8] FS: 0000000000000000(0000) GS:ffff9ddb5a262000(0000) knlGS:0000000000000000
kernel: [ 2448.327990][   C8] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 2448.327991][   C8] CR2: 000000002aee3000 CR3: 000000024a27b002 CR4: 00000000007726f0
kernel: [ 2448.327992][   C8] PKRU: 55555554
kernel: [ 2448.327993][   C8] Call Trace:
kernel: [ 2448.327995][   C8] <TASK>
kernel: [ 2448.327997][   C8] ? __pfx_domain_context_clear_one_cb+0x10/0x10
kernel: [ 2448.328000][   C8] qi_flush_dev_iotlb+0xd5/0xe0
kernel: [ 2448.328002][   C8] __context_flush_dev_iotlb.part.0+0x3c/0x80
kernel: [ 2448.328004][   C8] domain_context_clear_one_cb+0x16/0x20
kernel: [ 2448.328006][   C8] pci_for_each_dma_alias+0x3b/0x140
kernel: [ 2448.328010][   C8] device_block_translation+0x122/0x180
kernel: [ 2448.328012][   C8] blocking_domain_attach_dev+0x39/0x50
kernel: [ 2448.328014][   C8] __iommu_attach_device+0x1b/0x90
kernel: [ 2448.328017][   C8] __iommu_device_set_domain+0x5d/0xb0
kernel: [ 2448.328019][   C8] __iommu_group_set_domain_internal+0x60/0x110
kernel: [ 2448.328021][   C8] iommu_detach_group+0x3a/0x60
kernel: [ 2448.328023][   C8] vfio_iommu_type1_detach_group+0x106/0x610 [vfio_iommu_type1]
kernel: [ 2448.328026][   C8] ? __dentry_kill+0x12a/0x180
kernel: [ 2448.328030][   C8] ? __pm_runtime_idle+0x44/0xe0
kernel: [ 2448.328035][   C8] vfio_group_detach_container+0x4f/0x160 [vfio]
kernel: [ 2448.328041][   C8] vfio_group_fops_release+0x3e/0x80 [vfio]
kernel: [ 2448.328046][   C8] __fput+0xe6/0x2b0
kernel: [ 2448.328049][   C8] task_work_run+0x58/0x90
kernel: [ 2448.328053][   C8] do_exit+0x29b/0xa80
kernel: [ 2448.328057][   C8] do_group_exit+0x2c/0x80
kernel: [ 2448.328060][   C8] get_signal+0x8f9/0x900
kernel: [ 2448.328064][   C8] arch_do_signal_or_restart+0x29/0x210
kernel: [ 2448.328068][   C8] ? __schedule+0x582/0xe80
kernel: [ 2448.328070][   C8] exit_to_user_mode_loop+0x8e/0x4f0
kernel: [ 2448.328074][   C8] do_syscall_64+0x262/0x630
kernel: [ 2448.328076][   C8] entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: [ 2448.328078][   C8] RIP: 0033:0x7fde19078514
kernel: [ 2448.328080][   C8] Code: Unable to access opcode bytes at 0x7fde190784ea.
kernel: [ 2448.328081][   C8] RSP: 002b:00007ffd0e1dc7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
kernel: [ 2448.328082][   C8] RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007fde19078514
kernel: [ 2448.328083][   C8] RDX: 00007fde1916e8c0 RSI: 000055b217303260 RDI: 0000000000000000
kernel: [ 2448.328085][   C8] RBP: 00007ffd0e1dc8a0 R08: 00007fde19173500 R09: 0000000000000000
kernel: [ 2448.328085][   C8] R10: fffffffffffffbea R11: 0000000000000246 R12: 000055b1f8d8d0b0
kernel: [ 2448.328086][   C8] R13: 00007ffd0e1dc980 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2448.328088][   C8] </TASK>
kernel: [ 2450.245901][   C7] watchdog: BUG: soft lockup - CPU#7 stuck for 41s! [mongoosev3-agen:4727]

>
> In any case this change cannot 100% fix the lockup. It just
> reduces the possibility which should be made clear.

I agree with the above, but it's better to cover more corner cases.

Best Regards,
Jinhui

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  2025-12-22 11:19     ` Jinhui Guo
@ 2025-12-23  4:06       ` Baolu Lu
  2025-12-23 14:58         ` Jinhui Guo
  2025-12-24  3:08         ` Tian, Kevin
  0 siblings, 2 replies; 12+ messages in thread
From: Baolu Lu @ 2025-12-23  4:06 UTC (permalink / raw)
  To: Jinhui Guo, kevin.tian
  Cc: dwmw2, iommu, joro, linux-kernel, stable, will, Bjorn Helgaas

On 12/22/25 19:19, Jinhui Guo wrote:
> On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
>>> From: Jinhui Guo<guojinhui.liam@bytedance.com>
>>> Sent: Thursday, December 11, 2025 12:00 PM
>>>
>>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
>>> request when device is disconnected") relies on
>>> pci_dev_is_disconnected() to skip ATS invalidation for
>>> safely-removed devices, but it does not cover link-down caused
>>> by faults, which can still hard-lock the system.
>> According to the commit msg it actually tries to fix the hard lockup
>> with surprise removal. For safe removal the device is not removed
>> before invalidation is done:
>>
>> "
>> For safe removal, device wouldn't be removed until the whole software
>> handling process is done, it wouldn't trigger the hard lock up issue
>> caused by too long ATS Invalidation timeout wait.
>> "
>>
>> Can you help articulate the problem especially about the part
>> 'link-down caused by faults'? What are those faults? How are
>> they different from the said surprise removal in the commit
>> msg to not set pci_dev_is_disconnected()?
>>
> Hi Kevin, sorry for the delayed reply.
>
> A normal or surprise removal of a PCIe device on a hot-plug port normally
> triggers an interrupt from the PCIe switch.
>
> We have, however, observed cases where no interrupt is generated when the
> device suddenly loses its link; the behaviour is identical to setting the
> Link Disable bit in the switch's Link Control register (offset 10h). Exactly
> what goes wrong in the LTSSM between the PCIe switch and the endpoint remains
> unknown.

In this scenario, the hardware has effectively vanished, yet the device
driver remains bound and the IOMMU resources haven't been released. I'm
just curious whether this stale state could trigger issues in other
places before the kernel fully realizes the device is gone? I'm not
objecting to the fix; I'm just interested in whether this 'zombie'
state creates risks elsewhere.

>
>>> For example, if a VM fails to connect to the PCIe device,
>> 'failed' for what reason?
>>
>>> "virsh destroy" is executed to release resources and isolate
>>> the fault, but a hard-lockup occurs while releasing the group fd.
>>>
>>> Call Trace:
>>> qi_submit_sync
>>> qi_flush_dev_iotlb
>>> intel_pasid_tear_down_entry
>>> device_block_translation
>>> blocking_domain_attach_dev
>>> __iommu_attach_device
>>> __iommu_device_set_domain
>>> __iommu_group_set_domain_internal
>>> iommu_detach_group
>>> vfio_iommu_type1_detach_group
>>> vfio_group_detach_container
>>> vfio_group_fops_release
>>> __fput
>>>
>>> Although pci_device_is_present() is slower than
>>> pci_dev_is_disconnected(), it still takes only ~70 µs on a
>>> ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
>>> and width increase.
>>>
>>> Besides, devtlb_invalidation_with_pasid() is called only in the
>>> paths below, which are far less frequent than memory map/unmap.
>>>
>>> 1. mm-struct release
>>> 2. {attach,release}_dev
>>> 3. set/remove PASID
>>> 4. dirty-tracking setup
>>>
>> surprise removal can happen at any time, e.g. after the check of
>> pci_device_is_present(). In the end we need the logic in
>> qi_check_fault() to check the presence upon ITE timeout error
>> received to break the infinite loop. So in your case even with
>> that logic in place you still observe lockup (probably due to
>> hardware ITE timeout is longer than the lockup detection on
>> the CPU?
> Are you referring to the timeout added in patch
> https://lore.kernel.org/all/20240222090251.2849702-4-haifeng.zhao@linux.intel.com/ ?

This doesn't appear to be a deterministic solution, because ...

> Our lockup-detection timeout is the default 10 s.
>
> We see ITE-timeout messages in the kernel log. Yet the system still
> hard-locks, probably because, as you mentioned, the hardware ITE timeout
> is longer than the CPU's lockup-detection window. I'll reproduce the
> case and follow up with a deeper analysis.

... as you see, neither the PCI nor the VT-d specifications mandate a
specific device-TLB invalidation timeout value for hardware
implementations. Consequently, the ITE timeout value may exceed the CPU
watchdog threshold, meaning a hard lockup will be detected before the
ITE even occurs.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  2025-12-23  4:06       ` Baolu Lu
@ 2025-12-23 14:58         ` Jinhui Guo
  2025-12-24  3:08         ` Tian, Kevin
  1 sibling, 0 replies; 12+ messages in thread
From: Jinhui Guo @ 2025-12-23 14:58 UTC (permalink / raw)
To: baolu.lu
Cc: bhelgaas, dwmw2, guojinhui.liam, iommu, joro, kevin.tian,
	linux-kernel, stable, will

On Tue, Dec 23, 2025 12:06:24 +0800, Baolu Lu wrote:
>
> On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
> >>> From: Jinhui Guo<guojinhui.liam@bytedance.com>
> >>> Sent: Thursday, December 11, 2025 12:00 PM
> >>>
> >>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> >>> request when device is disconnected") relies on
> >>> pci_dev_is_disconnected() to skip ATS invalidation for
> >>> safely-removed devices, but it does not cover link-down caused
> >>> by faults, which can still hard-lock the system.
> >> According to the commit msg it actually tries to fix the hard lockup
> >> with surprise removal. For safe removal the device is not removed
> >> before invalidation is done:
> >>
> >> "
> >> For safe removal, device wouldn't be removed until the whole software
> >> handling process is done, it wouldn't trigger the hard lock up issue
> >> caused by too long ATS Invalidation timeout wait.
> >> "
> >>
> >> Can you help articulate the problem especially about the part
> >> 'link-down caused by faults"? What are those faults? How are
> >> they different from the said surprise removal in the commit
> >> msg to not set pci_dev_is_disconnected()?
> >>
> > Hi, kevin, sorry for the delayed reply.
> >
> > A normal or surprise removal of a PCIe device on a hot-plug port normally
> > triggers an interrupt from the PCIe switch.
> >
> > We have, however, observed cases where no interrupt is generated when the
> > device suddenly loses its link; the behaviour is identical to setting the
> > Link Disable bit in the switch’s Link Control register (offset 10h).
Exactly
> > what goes wrong in the LTSSM between the PCIe switch and the endpoint remains
> > unknown.
>
> In this scenario, the hardware has effectively vanished, yet the device
> driver remains bound and the IOMMU resources haven't been released. I’m
> just curious if this stale state could trigger issues in other places
> before the kernel fully realizes the device is gone? I’m not objecting
> to the fix. I'm just interested in whether this 'zombie' state creates
> risks elsewhere.

Hi, Baolu

In our scenario we see no other issues; a hard-lockup panic is triggered
the moment the Mellanox Ethernet device vanishes. But we can analyze what
happens when we access the Mellanox Ethernet device whose link is
disabled. (If we check whether the PCIe endpoint device (Mellanox
Ethernet) is present before issuing device-IOTLB invalidation to the
Intel IOMMU, no other issues appear.)

According to the PCIe spec, Rev. 5.0 v1.0, Sec. 2.4.1, there are two
kinds of TLPs: posted and non-posted. Non-posted TLPs require a
completion TLP; posted TLPs do not.

- A Posted Request is a Memory Write Request or a Message Request.
- A Read Request is a Configuration Read Request, an I/O Read Request,
  or a Memory Read Request.
- An NPR (Non-Posted Request) with Data is a Configuration Write Request,
  an I/O Write Request, or an AtomicOp Request.
- A Non-Posted Request is a Read Request or an NPR with Data.

When the CPU issues a PCIe memory-write TLP (posted) via a MOV
instruction, the instruction retires immediately after the packet reaches
the Root Complex; no Data-Link ACK/NAK is required. A memory-read TLP
(non-posted), however, stalls the core until the corresponding Completion
TLP is received; if that Completion never arrives, the CPU hangs. (The
CPU hangs if the LTSSM does not enter the Disabled state.) However, if
the LTSSM enters the Disabled state, the Root Port returns Completer
Abort (CA) for any non-posted TLP, so the request completes with status
0xFFFFFFFF without stalling.
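[Editorial aside: the all-ones completion described above is also what a
presence check keys off. The sketch below is a userspace illustration of
that idea; the function names and the set of "invalid" vendor-ID patterns
are an assumption for illustration, not the kernel's actual
pci_device_is_present() implementation.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A config-space read of a device behind a disabled link completes
 * with all-ones data (Completer Abort -> 0xFFFFFFFF).  So a vendor-ID
 * dword that matches one of the known "failed read" patterns means the
 * device is effectively gone and a dev-TLB invalidation would never
 * complete.
 */
static bool vendor_id_valid(uint32_t l)
{
	/* Patterns an aborted or failed config read can produce. */
	return l != 0xffffffff && l != 0x00000000 &&
	       l != 0x0000ffff && l != 0xffff0000;
}

/* Decide whether it is worth issuing a dev-TLB invalidation at all. */
static bool device_seems_present(uint32_t vendor_dword)
{
	return vendor_id_valid(vendor_dword);
}
```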
I ran some tests on the machine after setting the Link Disable bit in the
switch’s Link Control register (offset 10h).

- setpci -s 0000:3c:08.0 CAP_EXP+10.w=0x0010

+-[0000:3a]-+-00.0-[3b-3f]----00.0-[3c-3f]--+-00.0-[3d]----
|           |                               +-04.0-[3e]----
|           |                               \-08.0-[3f]----00.0  Mellanox Technologies MT27800 Family [ConnectX-5]

# lspci -vvv -s 0000:3f:00.0
3f:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
...
	Region 0: Memory at 3af804000000 (64-bit, prefetchable) [size=32M]
...

1) Issue a PCI config-space read request and it returns 0xFFFFFFFF.

# lspci -vvv -s 0000:3f:00.0
3f:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core

2) Issuing a PCI memory read request through /dev/mem also returns 0xFFFFFFFF.

# ./devmem
Usage: ./devmem <phys_addr> <size> <offset> [value]
  phys_addr : physical base address of the BAR (hex or decimal)
  size      : mapping length in bytes (hex or decimal)
  offset    : register offset from BAR base (hex or decimal)
  value     : optional 32-bit value to write (hex or decimal)
Example:
  ./devmem 0x600000000 0x1000 0x0 0xDEADBEEF

# ./devmem 0x3af804000000 0x2000000 0x0
0x3af804000000 = 0xffffffff

Before the link was disabled, we could read 0x3af804000000 with devmem
and obtain a valid result.

# ./devmem 0x3af804000000 0x2000000 0x0
0x3af804000000 = 0x10002300

Besides, after searching the kernel code, I found many EP drivers already
check whether their endpoint is still present. There may be exception
cases in some PCIe endpoint drivers, such as commit 43bb40c5b926
("virtio_pci: Support surprise removal of virtio pci device").

Best Regards,
Jinhui

^ permalink raw reply	[flat|nested] 12+ messages in thread
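[Editorial aside: the core of a devmem-style tool like the one used in
the test above is a single mmap-based read. The helper below is a sketch
under that assumption (read32_at() is a hypothetical name, not the actual
tool); it is written against an arbitrary file path so it can also be
exercised on a regular file, since mapping /dev/mem needs root and real
hardware.]

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes of `path` starting at `base` (must be page-aligned,
 * as required by mmap) and read one 32-bit value at `offset` into the
 * mapping.  Returns 0 on success, -1 on error.  With path="/dev/mem"
 * and base = a BAR's physical address, this is the devmem-style MMIO
 * read; on a dead link such a read completes as 0xffffffff.
 */
static int read32_at(const char *path, off_t base, size_t len,
		     off_t offset, uint32_t *out)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, base);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	/* volatile: an MMIO read must not be optimized away or cached. */
	*out = *(volatile uint32_t *)((char *)map + offset);
	munmap(map, len);
	return 0;
}
```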
* RE: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode 2025-12-23 4:06 ` Baolu Lu 2025-12-23 14:58 ` Jinhui Guo @ 2025-12-24 3:08 ` Tian, Kevin 2026-02-10 23:39 ` Bjorn Helgaas 1 sibling, 1 reply; 12+ messages in thread From: Tian, Kevin @ 2025-12-24 3:08 UTC (permalink / raw) To: Baolu Lu, Guo, Jinhui, Bjorn Helgaas Cc: dwmw2@infradead.org, iommu@lists.linux.dev, joro@8bytes.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, will@kernel.org, Bjorn Helgaas +Bjorn for guidance. quick context - previously intel-iommu driver fixed a lockup issue in surprise removal, by checking pci_dev_is_disconnected(). But Jinhui still observed the lockup issue in a setup where no interrupt is raised to pci core upon surprise removal (so pci_dev_is_disconnected() is false), hence suggesting to replace the check with pci_device_is_present() instead. Bjorn, is it a common practice to fix it directly/only in drivers or should the pci core be notified e.g. simulating a late removal event? By searching the code looks it's the former, but better confirm with you before picking this fix... > From: Baolu Lu <baolu.lu@linux.intel.com> > Sent: Tuesday, December 23, 2025 12:06 PM > > On 12/22/25 19:19, Jinhui Guo wrote: > > On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote: > >>> From: Jinhui Guo<guojinhui.liam@bytedance.com> > >>> Sent: Thursday, December 11, 2025 12:00 PM > >>> > >>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation > >>> request when device is disconnected") relies on > >>> pci_dev_is_disconnected() to skip ATS invalidation for > >>> safely-removed devices, but it does not cover link-down caused > >>> by faults, which can still hard-lock the system. > >> According to the commit msg it actually tries to fix the hard lockup > >> with surprise removal. 
For safe removal the device is not removed > >> before invalidation is done: > >> > >> " > >> For safe removal, device wouldn't be removed until the whole software > >> handling process is done, it wouldn't trigger the hard lock up issue > >> caused by too long ATS Invalidation timeout wait. > >> " > >> > >> Can you help articulate the problem especially about the part > >> 'link-down caused by faults"? What are those faults? How are > >> they different from the said surprise removal in the commit > >> msg to not set pci_dev_is_disconnected()? > >> > > Hi, kevin, sorry for the delayed reply. > > > > A normal or surprise removal of a PCIe device on a hot-plug port normally > > triggers an interrupt from the PCIe switch. > > > > We have, however, observed cases where no interrupt is generated when > the > > device suddenly loses its link; the behaviour is identical to setting the > > Link Disable bit in the switch’s Link Control register (offset 10h). Exactly > > what goes wrong in the LTSSM between the PCIe switch and the endpoint > remains > > unknown. > > In this scenario, the hardware has effectively vanished, yet the device > driver remains bound and the IOMMU resources haven't been released. I’m > just curious if this stale state could trigger issues in other places > before the kernel fully realizes the device is gone? I’m not objecting > to the fix. I'm just interested in whether this 'zombie' state creates > risks elsewhere. > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode 2025-12-24 3:08 ` Tian, Kevin @ 2026-02-10 23:39 ` Bjorn Helgaas 2026-02-27 1:44 ` Samiullah Khawaja 0 siblings, 1 reply; 12+ messages in thread From: Bjorn Helgaas @ 2026-02-10 23:39 UTC (permalink / raw) To: Tian, Kevin Cc: Baolu Lu, Guo, Jinhui, Bjorn Helgaas, dwmw2@infradead.org, iommu@lists.linux.dev, joro@8bytes.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, will@kernel.org, Alex Williamson [+cc Alex, beginning of thread: https://lore.kernel.org/all/20251211035946.2071-1-guojinhui.liam@bytedance.com/] On Wed, Dec 24, 2025 at 03:08:49AM +0000, Tian, Kevin wrote: > +Bjorn for guidance. Sorry for the late response. > quick context - previously intel-iommu driver fixed a lockup issue in surprise > removal, by checking pci_dev_is_disconnected(). But Jinhui still observed the > lockup issue in a setup where no interrupt is raised to pci core upon surprise > removal (so pci_dev_is_disconnected() is false), hence suggesting to replace > the check with pci_device_is_present() instead. I think checking pci_dev_is_disconnected() or pci_device_is_present() in drivers is usually bad practice because it's always racy, as you've already pointed out. I don't think it's possible to avoid Invalidate Completion Timeouts in general, so I think the real solution is to figure out how to gracefully handle them without running into the lockup detection. I assume the lockup is the loop in qi_submit_sync() where we wait for QI_DONE with interrupts disabled. Maybe we need something like watchdog_hardlockup_touch_cpu() there, along with a timeout in that loop? 
The PCIe r7.0, sec 10.3.1, implementation note suggests the timeout might be in the 1-2 minute range, which is pretty extreme, but if we can at least handle timeouts gracefully, we can think about ways to make them less likely, e.g., by coordinating with FLR and VFIO detach (maybe the sort of thing Alex alluded to at https://lore.kernel.org/all/20251223153534.0968cc15.alex@shazbot.org). > Bjorn, is it a common practice to fix it directly/only in drivers or should the > pci core be notified e.g. simulating a late removal event? By searching the > code looks it's the former, but better confirm with you before picking this > fix... I don't know exactly what it would look like to simulate a late removal event, but it sounds like some kind of complicated infrastructure that would still be only a 90% solution, which I wouldn't recommend. > > From: Baolu Lu <baolu.lu@linux.intel.com> > > Sent: Tuesday, December 23, 2025 12:06 PM > > > > On 12/22/25 19:19, Jinhui Guo wrote: > > > On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote: > > >>> From: Jinhui Guo<guojinhui.liam@bytedance.com> > > >>> Sent: Thursday, December 11, 2025 12:00 PM > > >>> > > >>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation > > >>> request when device is disconnected") relies on > > >>> pci_dev_is_disconnected() to skip ATS invalidation for > > >>> safely-removed devices, but it does not cover link-down caused > > >>> by faults, which can still hard-lock the system. > > >> According to the commit msg it actually tries to fix the hard lockup > > >> with surprise removal. For safe removal the device is not removed > > >> before invalidation is done: > > >> > > >> " > > >> For safe removal, device wouldn't be removed until the whole software > > >> handling process is done, it wouldn't trigger the hard lock up issue > > >> caused by too long ATS Invalidation timeout wait. 
> > >> " > > >> > > >> Can you help articulate the problem especially about the part > > >> 'link-down caused by faults"? What are those faults? How are > > >> they different from the said surprise removal in the commit > > >> msg to not set pci_dev_is_disconnected()? > > >> > > > Hi, kevin, sorry for the delayed reply. > > > > > > A normal or surprise removal of a PCIe device on a hot-plug port normally > > > triggers an interrupt from the PCIe switch. > > > > > > We have, however, observed cases where no interrupt is generated when > > the > > > device suddenly loses its link; the behaviour is identical to setting the > > > Link Disable bit in the switch’s Link Control register (offset 10h). Exactly > > > what goes wrong in the LTSSM between the PCIe switch and the endpoint > > remains > > > unknown. > > > > In this scenario, the hardware has effectively vanished, yet the device > > driver remains bound and the IOMMU resources haven't been released. I’m > > just curious if this stale state could trigger issues in other places > > before the kernel fully realizes the device is gone? I’m not objecting > > to the fix. I'm just interested in whether this 'zombie' state creates > > risks elsewhere. > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode 2026-02-10 23:39 ` Bjorn Helgaas @ 2026-02-27 1:44 ` Samiullah Khawaja 0 siblings, 0 replies; 12+ messages in thread From: Samiullah Khawaja @ 2026-02-27 1:44 UTC (permalink / raw) To: Bjorn Helgaas Cc: Tian, Kevin, Baolu Lu, Guo, Jinhui, Bjorn Helgaas, dwmw2@infradead.org, iommu@lists.linux.dev, joro@8bytes.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, will@kernel.org, Alex Williamson On Tue, Feb 10, 2026 at 05:39:12PM -0600, Bjorn Helgaas wrote: >[+cc Alex, beginning of thread: >https://lore.kernel.org/all/20251211035946.2071-1-guojinhui.liam@bytedance.com/] > >On Wed, Dec 24, 2025 at 03:08:49AM +0000, Tian, Kevin wrote: >> +Bjorn for guidance. > >Sorry for the late response. > >> quick context - previously intel-iommu driver fixed a lockup issue in surprise >> removal, by checking pci_dev_is_disconnected(). But Jinhui still observed the >> lockup issue in a setup where no interrupt is raised to pci core upon surprise >> removal (so pci_dev_is_disconnected() is false), hence suggesting to replace >> the check with pci_device_is_present() instead. > >I think checking pci_dev_is_disconnected() or pci_device_is_present() >in drivers is usually bad practice because it's always racy, as you've >already pointed out. > >I don't think it's possible to avoid Invalidate Completion Timeouts in >general, so I think the real solution is to figure out how to >gracefully handle them without running into the lockup detection. > >I assume the lockup is the loop in qi_submit_sync() where we wait for >QI_DONE with interrupts disabled. Maybe we need something like >watchdog_hardlockup_touch_cpu() there, along with a timeout in that >loop? Looking at the AMD IOMMU driver, it has 100ms timeout in wait_on_sem() that basically waits for the completion until the timeout occurs. 
Is this the expected behaviour as per specification, or should the IOMMU
wait for the Invalidation Completion Timeout? Reading the specs (notes of
PCIe r7.0, sec 10.1.1, Figure 10-4), it seems the device is allowed to
send translated TLPs, targeting the address regions being invalidated,
until the Invalidation Completion Timeout (which could be 1-2 minutes as
Bjorn shared below).

>
>The PCIe r7.0, sec 10.3.1, implementation note suggests the timeout
>might be in the 1-2 minute range, which is pretty extreme, but if we
>can at least handle timeouts gracefully, we can think about ways to
>make them less likely, e.g., by coordinating with FLR and VFIO detach
>(maybe the sort of thing Alex alluded to at
>https://lore.kernel.org/all/20251223153534.0968cc15.alex@shazbot.org).
>
>> Bjorn, is it a common practice to fix it directly/only in drivers or should the
>> pci core be notified e.g. simulating a late removal event? By searching the
>> code looks it's the former, but better confirm with you before picking this
>> fix...
>
>I don't know exactly what it would look like to simulate a late
>removal event, but it sounds like some kind of complicated
>infrastructure that would still be only a 90% solution, which I
>wouldn't recommend.
>
>> > From: Baolu Lu <baolu.lu@linux.intel.com>
>> > Sent: Tuesday, December 23, 2025 12:06 PM
>> >
>> > On 12/22/25 19:19, Jinhui Guo wrote:
>> > > On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
>> > >>> From: Jinhui Guo<guojinhui.liam@bytedance.com>
>> > >>> Sent: Thursday, December 11, 2025 12:00 PM
>> > >>>
>> > >>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
>> > >>> request when device is disconnected") relies on
>> > >>> pci_dev_is_disconnected() to skip ATS invalidation for
>> > >>> safely-removed devices, but it does not cover link-down caused
>> > >>> by faults, which can still hard-lock the system.
>> > >> According to the commit msg it actually tries to fix the hard lockup >> > >> with surprise removal. For safe removal the device is not removed >> > >> before invalidation is done: >> > >> >> > >> " >> > >> For safe removal, device wouldn't be removed until the whole software >> > >> handling process is done, it wouldn't trigger the hard lock up issue >> > >> caused by too long ATS Invalidation timeout wait. >> > >> " >> > >> >> > >> Can you help articulate the problem especially about the part >> > >> 'link-down caused by faults"? What are those faults? How are >> > >> they different from the said surprise removal in the commit >> > >> msg to not set pci_dev_is_disconnected()? >> > >> >> > > Hi, kevin, sorry for the delayed reply. >> > > >> > > A normal or surprise removal of a PCIe device on a hot-plug port normally >> > > triggers an interrupt from the PCIe switch. >> > > >> > > We have, however, observed cases where no interrupt is generated when >> > the >> > > device suddenly loses its link; the behaviour is identical to setting the >> > > Link Disable bit in the switch’s Link Control register (offset 10h). Exactly >> > > what goes wrong in the LTSSM between the PCIe switch and the endpoint >> > remains >> > > unknown. >> > >> > In this scenario, the hardware has effectively vanished, yet the device >> > driver remains bound and the IOMMU resources haven't been released. I’m >> > just curious if this stale state could trigger issues in other places >> > before the kernel fully realizes the device is gone? I’m not objecting >> > to the fix. I'm just interested in whether this 'zombie' state creates >> > risks elsewhere. >> > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device
  2025-12-11  3:59 [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Jinhui Guo
  2025-12-11  3:59 ` [PATCH v2 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode Jinhui Guo
  2025-12-11  3:59 ` [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in " Jinhui Guo
@ 2026-01-20  6:49 ` Baolu Lu
  2026-03-01  3:50 ` Ethan Zhao
  3 siblings, 0 replies; 12+ messages in thread
From: Baolu Lu @ 2026-01-20  6:49 UTC (permalink / raw)
To: Jinhui Guo, dwmw2, joro, will; +Cc: iommu, linux-kernel, stable

On 12/11/25 11:59, Jinhui Guo wrote:
> Hi, all
>
> We hit hard-lockups when the Intel IOMMU waits indefinitely for an ATS invalidation
> that cannot complete, especially under GDR high-load conditions.
>
> 1. Hard-lock when a passthrough PCIe NIC with ATS enabled link-down in Intel IOMMU
> non-scalable mode. Two scenarios exist: NIC link-down with an explicit link-down
> event and link-down without any event.
>
> a) NIC link-down with an explicit link-down event.
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> __context_flush_dev_iotlb.part.0
> domain_context_clear_one_cb
> pci_for_each_dma_alias
> device_block_translation
> blocking_domain_attach_dev
> iommu_deinit_device
> __iommu_group_remove_device
> iommu_release_device
> iommu_bus_notifier
> blocking_notifier_call_chain
> bus_notify
> device_del
> pci_remove_bus_device
> pci_stop_and_remove_bus_device
> pciehp_unconfigure_device
> pciehp_disable_slot
> pciehp_handle_presence_or_link_change
> pciehp_ist
>
> b) NIC link-down without an event - hard-lock on VM destroy.
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> __context_flush_dev_iotlb.part.0
> domain_context_clear_one_cb
> pci_for_each_dma_alias
> device_block_translation
> blocking_domain_attach_dev
> __iommu_attach_device
> __iommu_device_set_domain
> __iommu_group_set_domain_internal
> iommu_detach_group
> vfio_iommu_type1_detach_group
> vfio_group_detach_container
> vfio_group_fops_release
> __fput
>
> 2. Hard-lock when a passthrough PCIe NIC with ATS enabled link-down in Intel IOMMU
> scalable mode; NIC link-down without an event hard-locks on VM destroy.
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> intel_pasid_tear_down_entry
> device_block_translation
> blocking_domain_attach_dev
> __iommu_attach_device
> __iommu_device_set_domain
> __iommu_group_set_domain_internal
> iommu_detach_group
> vfio_iommu_type1_detach_group
> vfio_group_detach_container
> vfio_group_fops_release
> __fput
>
> Fix both issues with two patches:
> 1. Skip dev-IOTLB flush for inaccessible devices in __context_flush_dev_iotlb() using
> pci_device_is_present().
> 2. Use pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to
> skip ATS invalidation in devtlb_invalidation_with_pasid().
>
> Best Regards,
> Jinhui
>
> ---
> v1: https://lore.kernel.org/all/20251210171431.1589-1-guojinhui.liam@bytedance.com/
>
> Changelog in v1 -> v2 (suggested by Baolu Lu)
> - Simplify the pci_device_is_present() check in __context_flush_dev_iotlb().
> - Add Cc: stable@vger.kernel.org to both patches.
>
> Jinhui Guo (2):
>   iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without
>     scalable mode
>   iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in
>     scalable mode

Queued for iommu next.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device
  2025-12-11  3:59 [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Jinhui Guo
                   ` (2 preceding siblings ...)
  2026-01-20  6:49 ` [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Baolu Lu
@ 2026-03-01  3:50 ` Ethan Zhao
  3 siblings, 0 replies; 12+ messages in thread
From: Ethan Zhao @ 2026-03-01  3:50 UTC (permalink / raw)
To: Jinhui Guo, dwmw2, baolu.lu, joro, will; +Cc: iommu, linux-kernel, stable

On 12/11/2025 11:59 AM, Jinhui Guo wrote:
> Hi, all
>
> We hit hard-lockups when the Intel IOMMU waits indefinitely for an ATS invalidation
> that cannot complete, especially under GDR high-load conditions.
>
> 1. Hard-lock when a passthrough PCIe NIC with ATS enabled link-down in Intel IOMMU
> non-scalable mode. Two scenarios exist: NIC link-down with an explicit link-down
> event and link-down without any event.
>
> a) NIC link-down with an explicit link-down event.
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> __context_flush_dev_iotlb.part.0
> domain_context_clear_one_cb
> pci_for_each_dma_alias
> device_block_translation
> blocking_domain_attach_dev
> iommu_deinit_device
> __iommu_group_remove_device
> iommu_release_device
> iommu_bus_notifier
> blocking_notifier_call_chain
> bus_notify
> device_del
> pci_remove_bus_device
> pci_stop_and_remove_bus_device
> pciehp_unconfigure_device
> pciehp_disable_slot
> pciehp_handle_presence_or_link_change
> pciehp_ist
>
> b) NIC link-down without an event - hard-lock on VM destroy.
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> __context_flush_dev_iotlb.part.0
> domain_context_clear_one_cb
> pci_for_each_dma_alias
> device_block_translation
> blocking_domain_attach_dev
> __iommu_attach_device
> __iommu_device_set_domain
> __iommu_group_set_domain_internal
> iommu_detach_group
> vfio_iommu_type1_detach_group
> vfio_group_detach_container
> vfio_group_fops_release
> __fput
>
> 2.
Hard-lock when a passthrough PCIe NIC with ATS enabled link-down in Intel IOMMU
> scalable mode; NIC link-down without an event hard-locks on VM destroy.
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> intel_pasid_tear_down_entry
> device_block_translation
> blocking_domain_attach_dev
> __iommu_attach_device
> __iommu_device_set_domain
> __iommu_group_set_domain_internal
> iommu_detach_group
> vfio_iommu_type1_detach_group
> vfio_group_detach_container
> vfio_group_fops_release
> __fput
>
> Fix both issues with two patches:
> 1. Skip dev-IOTLB flush for inaccessible devices in __context_flush_dev_iotlb() using
> pci_device_is_present().
> 2. Use pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to
> skip ATS invalidation in devtlb_invalidation_with_pasid().

If I remember right, using pci_device_is_present() to replace
pci_dev_is_disconnected() might not be the correct choice for the
link-down case; you might misunderstand the function of
pci_device_is_present() when the device is there but the link is not up.
If you want to check link status, just check link status.

Bjorn, correct me if I am wrong.

Thanks,
Ethan

>
> Best Regards,
> Jinhui
>
> ---
> v1: https://lore.kernel.org/all/20251210171431.1589-1-guojinhui.liam@bytedance.com/
>
> Changelog in v1 -> v2 (suggested by Baolu Lu)
> - Simplify the pci_device_is_present() check in __context_flush_dev_iotlb().
> - Add Cc: stable@vger.kernel.org to both patches.
>
> Jinhui Guo (2):
>   iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without
>     scalable mode
>   iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in
>     scalable mode
>
>  drivers/iommu/intel/pasid.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2026-03-01 3:50 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-12-11 3:59 [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Jinhui Guo 2025-12-11 3:59 ` [PATCH v2 1/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode Jinhui Guo 2025-12-11 3:59 ` [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in " Jinhui Guo 2025-12-18 8:04 ` Tian, Kevin 2025-12-22 11:19 ` Jinhui Guo 2025-12-23 4:06 ` Baolu Lu 2025-12-23 14:58 ` Jinhui Guo 2025-12-24 3:08 ` Tian, Kevin 2026-02-10 23:39 ` Bjorn Helgaas 2026-02-27 1:44 ` Samiullah Khawaja 2026-01-20 6:49 ` [PATCH v2 0/2] iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device Baolu Lu 2026-03-01 3:50 ` Ethan Zhao
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox