From: Baolu Lu <baolu.lu@linux.intel.com>
To: Jinhui Guo <guojinhui.liam@bytedance.com>, kevin.tian@intel.com
Cc: dwmw2@infradead.org, iommu@lists.linux.dev, joro@8bytes.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org,
will@kernel.org, Bjorn Helgaas <bhelgaas@google.com>
Subject: Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
Date: Tue, 23 Dec 2025 12:06:24 +0800
Message-ID: <aa1eda8a-4463-467a-b157-c6155882f293@linux.intel.com>
In-Reply-To: <20251222111935.489-1-guojinhui.liam@bytedance.com>

On 12/22/25 19:19, Jinhui Guo wrote:
> On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
>>> From: Jinhui Guo <guojinhui.liam@bytedance.com>
>>> Sent: Thursday, December 11, 2025 12:00 PM
>>>
>>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
>>> request when device is disconnected") relies on
>>> pci_dev_is_disconnected() to skip ATS invalidation for
>>> safely-removed devices, but it does not cover link-down caused
>>> by faults, which can still hard-lock the system.
>> According to the commit msg it actually tries to fix the hard lockup
>> with surprise removal. For safe removal the device is not removed
>> before invalidation is done:
>>
>> "
>> For safe removal, device wouldn't be removed until the whole software
>> handling process is done, it wouldn't trigger the hard lock up issue
>> caused by too long ATS Invalidation timeout wait.
>> "
>>
>> Can you help articulate the problem, especially the part about
>> 'link-down caused by faults'? What are those faults? How do
>> they differ from the surprise removal described in the commit
>> msg, such that pci_dev_is_disconnected() is not set?
>>
> Hi Kevin, sorry for the delayed reply.
>
> A normal or surprise removal of a PCIe device on a hot-plug port normally
> triggers an interrupt from the PCIe switch.
>
> We have, however, observed cases where no interrupt is generated when the
> device suddenly loses its link; the behaviour is identical to setting the
> Link Disable bit in the switch’s Link Control register (offset 10h). Exactly
> what goes wrong in the LTSSM between the PCIe switch and the endpoint remains
> unknown.

In this scenario, the hardware has effectively vanished, yet the device
driver remains bound and the IOMMU resources haven't been released. Could
this stale state trigger issues elsewhere before the kernel fully
realizes the device is gone? I'm not objecting to the fix; I'm just
interested in whether this 'zombie' state creates risks in other paths.

>
>>> For example, if a VM fails to connect to the PCIe device,
>> 'failed' for what reason?
>>
>>> "virsh destroy" is executed to release resources and isolate
>>> the fault, but a hard-lockup occurs while releasing the group fd.
>>>
>>> Call Trace:
>>> qi_submit_sync
>>> qi_flush_dev_iotlb
>>> intel_pasid_tear_down_entry
>>> device_block_translation
>>> blocking_domain_attach_dev
>>> __iommu_attach_device
>>> __iommu_device_set_domain
>>> __iommu_group_set_domain_internal
>>> iommu_detach_group
>>> vfio_iommu_type1_detach_group
>>> vfio_group_detach_container
>>> vfio_group_fops_release
>>> __fput
>>>
>>> Although pci_device_is_present() is slower than
>>> pci_dev_is_disconnected(), it still takes only ~70 µs on a
>>> ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
>>> and width increase.
>>>
>>> Besides, devtlb_invalidation_with_pasid() is called only in the
>>> paths below, which are far less frequent than memory map/unmap.
>>>
>>> 1. mm-struct release
>>> 2. {attach,release}_dev
>>> 3. set/remove PASID
>>> 4. dirty-tracking setup
>>>
>> Surprise removal can happen at any time, e.g. after the check of
>> pci_device_is_present(). In the end we need the logic in
>> qi_check_fault() to check for presence when an ITE timeout error is
>> received, in order to break the infinite loop. So in your case, even
>> with that logic in place, you still observe the lockup (probably
>> because the hardware ITE timeout is longer than the lockup detection
>> on the CPU)?
>
> Are you referring to the timeout added in patch
> https://lore.kernel.org/all/20240222090251.2849702-4-haifeng.zhao@linux.intel.com/ ?

This doesn't appear to be a deterministic solution, because ...

> Our lockup-detection timeout is the default 10 s.
>
> We see ITE-timeout messages in the kernel log. Yet the system still
> hard-locks—probably because, as you mentioned, the hardware ITE timeout
> is longer than the CPU’s lockup-detection window. I’ll reproduce the
> case and follow up with a deeper analysis.

... as you see, neither the PCI nor the VT-d specification mandates a
specific device-TLB invalidation timeout value for hardware
implementations. Consequently, the ITE timeout value may exceed the CPU
watchdog threshold, meaning a hard lockup will be detected before the
ITE even occurs.

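To make the ordering argument concrete, here is a rough userspace model in C. It is only a sketch and not kernel code: model_flush_wait(), enum wait_outcome and all parameter names are hypothetical. The 10 s watchdog value is the default mentioned in this thread, while any ITE timeout passed in is an assumed number, since no specification fixes one. The model also ignores the race where the device disappears right after the presence check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Possible results of one device-TLB invalidation attempt (hypothetical). */
enum wait_outcome {
	WAIT_COMPLETED,           /* device answered, invalidation finished */
	WAIT_SKIPPED_DEVICE_GONE, /* presence check failed, nothing submitted */
	WAIT_ENDED_BY_ITE,        /* hardware reported ITE before the watchdog */
	WAIT_HARD_LOCKUP,         /* CPU watchdog threshold was reached first */
};

/*
 * Model of a CPU spinning on an invalidation wait descriptor.
 *
 * device_present:       whether the endpoint can still respond
 * check_presence_first: stands in for a pci_device_is_present()-style
 *                       check made before the flush is submitted
 * ite_timeout_ms:       assumed, implementation-specific ITE timeout
 * watchdog_ms:          hard-lockup detection threshold
 */
static enum wait_outcome model_flush_wait(bool device_present,
					  bool check_presence_first,
					  uint64_t ite_timeout_ms,
					  uint64_t watchdog_ms)
{
	assert(watchdog_ms > 0);

	if (device_present)
		return WAIT_COMPLETED;

	/* Device is gone: an up-front check avoids the wait entirely. */
	if (check_presence_first)
		return WAIT_SKIPPED_DEVICE_GONE;

	/* Otherwise the spin ends with whichever timer expires first. */
	return ite_timeout_ms < watchdog_ms ? WAIT_ENDED_BY_ITE
					    : WAIT_HARD_LOCKUP;
}
```

With a 10000 ms watchdog and an assumed 60000 ms ITE timeout, the model reports WAIT_HARD_LOCKUP for a vanished device unless the presence check runs first, which is the non-deterministic behaviour described above.
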
Thanks,
baolu