From: Ethan Zhao <etzhao1900@gmail.com>
To: Nicolin Chen <nicolinc@nvidia.com>,
robin.murphy@arm.com, joro@8bytes.org, bhelgaas@google.com,
jgg@nvidia.com
Cc: will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com,
thierry.reding@gmail.com, vdumpa@nvidia.com,
jonathanh@nvidia.com, rafael@kernel.org, lenb@kernel.org,
kevin.tian@intel.com, yi.l.liu@intel.com,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-msm@vger.kernel.org,
linux-mediatek@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
patches@lists.linux.dev, pjaroszynski@nvidia.com,
vsethi@nvidia.com, helgaas@kernel.org
Subject: Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
Date: Tue, 19 Aug 2025 22:12:41 +0800 [thread overview]
Message-ID: <550635db-00ce-410e-add0-77c1a75adb11@gmail.com> (raw)
In-Reply-To: <3749cd6a1430ac36d1af1fadaa4d90ceffef9c62.1754952762.git.nicolinc@nvidia.com>
On 8/12/2025 6:59 AM, Nicolin Chen wrote:
> PCIe permits a device to ignore ATS invalidation TLPs, while processing a
> reset. This creates a problem visible to the OS where an ATS invalidation
> command will time out: e.g. an SVA domain will have no coordination with a
> reset event and can racily issue ATS invalidations to a resetting device.
>
> The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable and
> block ATS before initiating a Function Level Reset. It also mentions that
> other reset methods could have the same vulnerability as well.
>
> Now iommu_dev_reset_prepare/done() helpers are introduced for this matter.
> Use them in all the existing reset functions, which will attach the device
> to an IOMMU_DOMAIN_BLOCKED during a reset, so as to allow IOMMU driver to:
> - invoke pci_disable_ats() and pci_enable_ats(), if necessary
> - wait for all ATS invalidations to complete
> - stop issuing new ATS invalidations
> - fence any incoming ATS queries
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/pci/pci-acpi.c | 17 +++++++++--
> drivers/pci/pci.c | 68 ++++++++++++++++++++++++++++++++++++++----
> drivers/pci/quirks.c | 23 +++++++++++++-
> 3 files changed, 100 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
> index ddb25960ea47d..adaf46422c05d 100644
> --- a/drivers/pci/pci-acpi.c
> +++ b/drivers/pci/pci-acpi.c
> @@ -9,6 +9,7 @@
>
> #include <linux/delay.h>
> #include <linux/init.h>
> +#include <linux/iommu.h>
> #include <linux/irqdomain.h>
> #include <linux/pci.h>
> #include <linux/msi.h>
> @@ -969,6 +970,7 @@ void pci_set_acpi_fwnode(struct pci_dev *dev)
> int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
> {
> acpi_handle handle = ACPI_HANDLE(&dev->dev);
> + int ret = 0;
>
> if (!handle || !acpi_has_method(handle, "_RST"))
> return -ENOTTY;
> @@ -976,12 +978,23 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
> if (probe)
> return 0;
>
> + /*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> + * before initiating a reset. Notify the iommu driver that enabled ATS.
> + */
> + ret = iommu_dev_reset_prepare(&dev->dev);
> + if (ret) {
> + pci_err(dev, "failed to stop IOMMU\n");
> + return ret;
> + }
> +
> if (ACPI_FAILURE(acpi_evaluate_object(handle, "_RST", NULL, NULL))) {
> pci_warn(dev, "ACPI _RST failed\n");
> - return -ENOTTY;
> + ret = -ENOTTY;
> }
>
> - return 0;
> + iommu_dev_reset_done(&dev->dev);
> + return ret;
> }
>
> bool acpi_pci_power_manageable(struct pci_dev *dev)
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b0f4d98036cdd..d6d87e22d81b3 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -13,6 +13,7 @@
> #include <linux/delay.h>
> #include <linux/dmi.h>
> #include <linux/init.h>
> +#include <linux/iommu.h>
> #include <linux/msi.h>
> #include <linux/of.h>
> #include <linux/pci.h>
> @@ -4529,13 +4530,26 @@ EXPORT_SYMBOL(pci_wait_for_pending_transaction);
> */
> int pcie_flr(struct pci_dev *dev)
> {
> + int ret = 0;
> +
> if (!pci_wait_for_pending_transaction(dev))
> pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
>
> + /*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> + * before initiating a reset. Notify the iommu driver that enabled ATS.
> + * Have to call it after waiting for pending DMA transaction.
> + */
> + ret = iommu_dev_reset_prepare(&dev->dev);
If we dont' consider the association between IOMMU and devices in FLR(),
it can be understood that more complex processing logic resides outside
this function. However, if the FLR() function already synchironizes and
handles the association with IOMMU like this (disabling ATS by attaching
device to blocking domain), then how would the following scenarios
behave ?
1. Reset one of PCIe alias devices.
2. Reset PF when its VFs are actvie.
....
Thanks,
Ethan> + if (ret) {
> + pci_err(dev, "failed to stop IOMMU\n");
> + return ret;
> + }
> +
> pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
>
> if (dev->imm_ready)
> - return 0;
> + goto done;
>
> /*
> * Per PCIe r4.0, sec 6.6.2, a device must complete an FLR within
> @@ -4544,7 +4558,11 @@ int pcie_flr(struct pci_dev *dev)
> */
> msleep(100);
>
> - return pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
> + ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
> +
> +done:
> + iommu_dev_reset_done(&dev->dev);
> + return ret;
> }
> EXPORT_SYMBOL_GPL(pcie_flr);
>
> @@ -4572,6 +4590,7 @@ EXPORT_SYMBOL_GPL(pcie_reset_flr);
>
> static int pci_af_flr(struct pci_dev *dev, bool probe)
> {
> + int ret = 0;
> int pos;
> u8 cap;
>
> @@ -4598,10 +4617,21 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
> PCI_AF_STATUS_TP << 8))
> pci_err(dev, "timed out waiting for pending transaction; performing AF function level reset anyway\n");
>
> + /*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> + * before initiating a reset. Notify the iommu driver that enabled ATS.
> + * Have to call it after waiting for pending DMA transaction.
> + */
> + ret = iommu_dev_reset_prepare(&dev->dev);
> + if (ret) {
> + pci_err(dev, "failed to stop IOMMU\n");
> + return ret;
> + }
> +
> pci_write_config_byte(dev, pos + PCI_AF_CTRL, PCI_AF_CTRL_FLR);
>
> if (dev->imm_ready)
> - return 0;
> + goto done;
>
> /*
> * Per Advanced Capabilities for Conventional PCI ECN, 13 April 2006,
> @@ -4611,7 +4641,11 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
> */
> msleep(100);
>
> - return pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
> + ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
> +
> +done:
> + iommu_dev_reset_done(&dev->dev);
> + return ret;
> }
>
> /**
> @@ -4632,6 +4666,7 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
> static int pci_pm_reset(struct pci_dev *dev, bool probe)
> {
> u16 csr;
> + int ret;
>
> if (!dev->pm_cap || dev->dev_flags & PCI_DEV_FLAGS_NO_PM_RESET)
> return -ENOTTY;
> @@ -4646,6 +4681,16 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
> if (dev->current_state != PCI_D0)
> return -EINVAL;
>
> + /*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> + * before initiating a reset. Notify the iommu driver that enabled ATS.
> + */
> + ret = iommu_dev_reset_prepare(&dev->dev);
> + if (ret) {
> + pci_err(dev, "failed to stop IOMMU\n");
> + return ret;
> + }
> +
> csr &= ~PCI_PM_CTRL_STATE_MASK;
> csr |= PCI_D3hot;
> pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
> @@ -4656,7 +4701,9 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
> pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
> pci_dev_d3_sleep(dev);
>
> - return pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
> + ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
> + iommu_dev_reset_done(&dev->dev);
> + return ret;
> }
>
> /**
> @@ -5111,6 +5158,16 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
> if (rc)
> return -ENOTTY;
>
> + /*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> + * before initiating a reset. Notify the iommu driver that enabled ATS.
> + */
> + rc = iommu_dev_reset_prepare(&dev->dev);
> + if (rc) {
> + pci_err(dev, "failed to stop IOMMU\n");
> + return rc;
> + }
> +
> if (reg & PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR) {
> val = reg;
> } else {
> @@ -5125,6 +5182,7 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
> pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
> reg);
>
> + iommu_dev_reset_done(&dev->dev);
> return rc;
> }
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index d97335a401930..6157c6c02bdb0 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -21,6 +21,7 @@
> #include <linux/pci.h>
> #include <linux/isa-dma.h> /* isa_dma_bridge_buggy */
> #include <linux/init.h>
> +#include <linux/iommu.h>
> #include <linux/delay.h>
> #include <linux/acpi.h>
> #include <linux/dmi.h>
> @@ -4225,6 +4226,26 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
> { 0 }
> };
>
> +static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
> + const struct pci_dev_reset_methods *i)
> +{
> + int ret;
> +
> + /*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> + * before initiating a reset. Notify the iommu driver that enabled ATS.
> + */
> + ret = iommu_dev_reset_prepare(&dev->dev);
> + if (ret) {
> + pci_err(dev, "failed to stop IOMMU\n");
> + return ret;
> + }
> +
> + ret = i->reset(dev, probe);
> + iommu_dev_reset_done(&dev->dev);
> + return ret;
> +}
> +
> /*
> * These device-specific reset methods are here rather than in a driver
> * because when a host assigns a device to a guest VM, the host may need
> @@ -4239,7 +4260,7 @@ int pci_dev_specific_reset(struct pci_dev *dev, bool probe)
> i->vendor == (u16)PCI_ANY_ID) &&
> (i->device == dev->device ||
> i->device == (u16)PCI_ANY_ID))
> - return i->reset(dev, probe);
> + return __pci_dev_specific_reset(dev, probe, i);
> }
>
> return -ENOTTY;
next prev parent reply other threads:[~2025-08-19 14:12 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-11 22:59 [PATCH v3 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
2025-08-11 22:59 ` [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach Nicolin Chen
2025-08-15 5:09 ` Baolu Lu
2025-08-15 8:27 ` Tian, Kevin
2025-08-15 8:24 ` Tian, Kevin
2025-08-15 19:26 ` Nicolin Chen
2025-08-18 14:17 ` Jason Gunthorpe
2025-08-18 17:45 ` Nicolin Chen
2025-08-18 18:09 ` Jason Gunthorpe
2025-08-18 18:29 ` Nicolin Chen
2025-08-11 22:59 ` [PATCH v3 2/5] iommu: Pass in gdev to __iommu_device_set_domain Nicolin Chen
2025-08-15 5:18 ` Baolu Lu
2025-08-15 8:29 ` Tian, Kevin
2025-08-11 22:59 ` [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper Nicolin Chen
2025-08-15 5:28 ` Baolu Lu
2025-08-15 18:48 ` Nicolin Chen
2025-08-18 14:40 ` Jason Gunthorpe
2025-08-19 2:09 ` Baolu Lu
2025-08-15 8:55 ` Tian, Kevin
2025-08-15 18:45 ` Nicolin Chen
2025-08-21 8:07 ` Tian, Kevin
2025-08-21 14:41 ` Nicolin Chen
2025-08-31 23:24 ` Nicolin Chen
2025-08-18 14:39 ` Jason Gunthorpe
2025-08-18 17:22 ` Nicolin Chen
2025-08-18 23:42 ` Jason Gunthorpe
2025-08-19 5:09 ` Nicolin Chen
2025-08-19 12:52 ` Jason Gunthorpe
2025-08-19 17:22 ` Nicolin Chen
2025-08-21 13:13 ` Jason Gunthorpe
2025-08-21 15:22 ` Nicolin Chen
2025-08-21 18:37 ` Jason Gunthorpe
2025-08-22 5:11 ` Nicolin Chen
2025-08-21 8:11 ` Tian, Kevin
2025-08-21 13:14 ` Jason Gunthorpe
2025-08-22 9:45 ` Tian, Kevin
2025-08-22 13:21 ` Jason Gunthorpe
2025-08-11 22:59 ` [PATCH v3 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
2025-08-15 5:49 ` Baolu Lu
2025-08-15 20:10 ` Nicolin Chen
2025-08-15 9:14 ` Tian, Kevin
2025-08-15 19:45 ` Nicolin Chen
2025-08-11 22:59 ` [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device Nicolin Chen
2025-08-19 14:12 ` Ethan Zhao [this message]
2025-08-19 21:59 ` Nicolin Chen
2025-08-20 3:18 ` Ethan Zhao
2025-08-22 6:14 ` Nicolin Chen
2025-08-21 13:07 ` Jason Gunthorpe
2025-08-22 6:35 ` Nicolin Chen
2025-08-22 14:08 ` Jason Gunthorpe
2025-08-22 18:50 ` Nicolin Chen
2025-08-28 12:51 ` Jason Gunthorpe
2025-08-28 15:08 ` Nicolin Chen
2025-08-28 18:46 ` Jason Gunthorpe
2025-08-28 19:35 ` Nicolin Chen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=550635db-00ce-410e-add0-77c1a75adb11@gmail.com \
--to=etzhao1900@gmail.com \
--cc=angelogioacchino.delregno@collabora.com \
--cc=baolu.lu@linux.intel.com \
--cc=bhelgaas@google.com \
--cc=helgaas@kernel.org \
--cc=iommu@lists.linux.dev \
--cc=jgg@nvidia.com \
--cc=jonathanh@nvidia.com \
--cc=joro@8bytes.org \
--cc=kevin.tian@intel.com \
--cc=lenb@kernel.org \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-arm-msm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mediatek@lists.infradead.org \
--cc=linux-pci@vger.kernel.org \
--cc=linux-tegra@vger.kernel.org \
--cc=matthias.bgg@gmail.com \
--cc=nicolinc@nvidia.com \
--cc=patches@lists.linux.dev \
--cc=pjaroszynski@nvidia.com \
--cc=rafael@kernel.org \
--cc=robin.clark@oss.qualcomm.com \
--cc=robin.murphy@arm.com \
--cc=thierry.reding@gmail.com \
--cc=vdumpa@nvidia.com \
--cc=vsethi@nvidia.com \
--cc=will@kernel.org \
--cc=yi.l.liu@intel.com \
--cc=yong.wu@mediatek.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).