Linux PCI subsystem development

Linux PCI subsystem development
 help / color / mirror / Atom feed

* Re: [PATCH v2 05/11] iommu/arm-smmu-v3: Submit CMDQ_OP_PRI_RESP for IOPF event
From: Nicolin Chen @ 2026-06-27  0:44 UTC (permalink / raw)
  To: Robin Murphy
  Cc: will, jgg, joro, bhelgaas, praan, kevin.tian, kees, smostafa,
	baolu.lu, linux-arm-kernel, iommu, linux-kernel, linux-pci,
	skaestle, mmarrid, skolothumtho, bbiber
In-Reply-To: <b5085e09-6533-4f88-938d-8d46751cf7da@arm.com>

On Fri, Jun 26, 2026 at 05:15:13PM +0100, Robin Murphy wrote:
> On 28/05/2026 8:59 am, Nicolin Chen wrote:
> > From: Malak Marrid <mmarrid@nvidia.com>
> > 
> > To handle IOMMU_FAULT_PAGE_REQ from the PRI queue, arm_smmu_page_response()
> > must issue a CMDQ_OP_PRI_RESP back to the SMMU.
> > 
> > However, either a stall event in the EVTQ or a PRI request in the PRIQ can
> > surface to the IOPF infrastructure with fault.type == IOMMU_FAULT_PAGE_REQ,
> > and a single master can in principle be both stall-capable and PRI-capable
> 
> No, the SMMU architecture does all it can to specifically forbid this, see
> 3.12.1 and 16.4, it just can't be made architecturally ILLEGAL to enable
> stalls for PCIe devices because there's no strict architectural definition
> for what "a PCIe device" actually is. Similarly with the note in the
> definition of STE.EATS about the relationship with CD.S - the unwritten
> implication is that defining specific behaviours would only create an
> unreasonable burden for hardware validation, for the sake of something that
> nobody in their right mind should ever do anyway.
> 
> The expectation is that RCiEPs which do speak stallable non-PCIe bus
> protocols will not go to the effort of implementing ATS/PRI capabilities
> (not least because there's every chance that such protocols simply doesn't
> have that kind of transaction flow anyway). And conversely that it can be
> considered an egregious firmware (or system design) error to even claim (let
> alone force) stall capability for a real PCIe root port which may be
> deadlocked by blocking its requirement for free-flowing writes. Thus I think
> we could go so far as to refuse to handle any endpoint which did somehow
> claim both.

Oh, I missed that. This certainly can simplify things here.

I will fix it.

Thanks!
Nicolin

^ permalink raw reply

* Re: [PATCH v2 00/11] iommu/arm-smmu-v3: Add PRI support
From: Nicolin Chen @ 2026-06-27  0:43 UTC (permalink / raw)
  To: harsha.v
  Cc: will, robin.murphy, jgg, joro, bhelgaas, praan, kevin.tian, kees,
	smostafa, baolu.lu, linux-arm-kernel, iommu, linux-kernel,
	linux-pci, skaestle, mmarrid, skolothumtho, bbiber
In-Reply-To: <2a52ecb1-be8a-4522-aa23-f3277f0555fa@oss.qualcomm.com>

On Fri, Jun 26, 2026 at 08:24:31PM +0530, harsha.v@oss.qualcomm.com wrote:
> One thing I noticed while reviewing: when arm_smmu_priq_thread() detects
> PRIQ overflow, partial faults (non-LAST pages stored via
> report_partial_fault()) whose LAST page was lost in the overflow remain
> permanently in iopf_param->partial. This is a monotonic memory leak —
> it grows with each overflow event.
> 
> Intel VT-d handles this in prq_event_thread() (drivers/iommu/intel/prq.c):
>     if (head == tail) {
>         iopf_queue_discard_partial(iommu->iopf_queue);
>         writel(DMA_PRS_PRO, iommu->reg + DMAR_PRS_REG);
>     }
> 
> iopf_queue_discard_partial() was written for exactly this scenario.
> Could we add the same here arm_smmu_priq_thread()
> (drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c) ?
> 
>     if (queue_sync_prod_in(q) == -EOVERFLOW) {
>         dev_err(smmu->dev, "PRIQ overflow detected -- requests lost\n");
> +       iopf_queue_discard_partial(smmu->evtq.iopf);
>     }
> 
> At this point all surviving entries have already been consumed by the
> loop above, so discarding unconditionally is safe — implicitly matching
> Intel's "head == tail" guard.

I will fold it in.

Thanks!
Nicolin

^ permalink raw reply

* Re: [PATCH v7 10/11] PCI/cxl: Expose CXL Reset as a PCI reset method
From: Alex Williamson @ 2026-06-26 22:08 UTC (permalink / raw)
  To: Srirangan Madhavan
  Cc: Alison Schofield, Bjorn Helgaas, Dan Williams, Dave Jiang,
	Davidlohr Bueso, Ira Weiny, Jonathan Cameron, Vishal Verma,
	linux-cxl, linux-pci, linux-kernel, vsethi, Dan Williams,
	Sai Yashwanth Reddy Kancherla, Vishal Aslot, Manish Honap,
	Jiandi An, Richard Cheng, linux-tegra
In-Reply-To: <20260623032453.3404772-11-smadhavan@nvidia.com>

On Tue, 23 Jun 2026 03:24:52 +0000
Srirangan Madhavan <smadhavan@nvidia.com> wrote:
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 360f2aaee10c..b1ec20126390 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -33,6 +33,7 @@
>  #include <asm/dma.h>
>  #include <linux/aer.h>
>  #include <linux/bitfield.h>
> +#include <cxl/cxl.h>
>  #include "pci.h"
>  
>  DEFINE_MUTEX(pci_slot_mutex);
> @@ -5081,6 +5082,7 @@ const struct pci_reset_fn_method pci_reset_fn_methods[] = {
>  	{ pci_dev_acpi_reset, .name = "acpi" },
>  	{ pcie_reset_flr, .name = "flr" },
>  	{ pci_af_flr, .name = "af_flr" },
> +	{ cxl_reset_function, .name = "cxl_reset" },
>  	{ pci_pm_reset, .name = "pm" },
>  	{ pci_reset_bus_function, .name = "bus" },
>  	{ cxl_reset_bus_function, .name = "cxl_bus" },

Follow-on to the previous comments, for a device supporting cxl.mem
and/or cxl.cache, cxl_reset is the best option, perhaps with the
exception of device specific or acpi.  Why isn't this placed above FLR?

If we decide cxl.io scoped function resets should not be exposed for
cxl.cache/mem devices this might be moot as we'll filter through to
cxl_reset anyway.  Thanks,

Alex

^ permalink raw reply

* Re: [PATCH v7 08/11] cxl: Coordinate sibling functions for CXL reset
From: Alex Williamson @ 2026-06-26 22:08 UTC (permalink / raw)
  To: Dan Williams (nvidia)
  Cc: Srirangan Madhavan, Alison Schofield, Bjorn Helgaas, Dave Jiang,
	Davidlohr Bueso, Ira Weiny, Jonathan Cameron, Vishal Verma,
	linux-cxl, linux-pci, linux-kernel, vsethi, Dan Williams,
	Sai Yashwanth Reddy Kancherla, Vishal Aslot, Manish Honap,
	Jiandi An, Richard Cheng, linux-tegra
In-Reply-To: <6a3b100798704_3c9f100a9@djbw-dev.notmuch>

On Tue, 23 Jun 2026 16:00:23 -0700
"Dan Williams (nvidia)" <djbw@kernel.org> wrote:

> Srirangan Madhavan wrote:
> > CXL Device Reset affects all CXL.cache and CXL.mem functions in the reset
> > scope. Lock same-scope siblings with pci_dev_trylock(), save/disable them,
> > drain pending transactions, and hold IOMMU reset blocks until recovery.
> > 
> > Also include mem-capable siblings in HDM range validation and CPU cache
> > invalidation. Cache-only siblings are quiesced, but skipped for HDM range
> > handling.  
> 
> PCI reset locking and ordering is already a source of some burden
> without adding this new sibling model to consider.
> 
> Is there evidence that multi-function CXL devices, where most of the
> functions are non-CXL, is going to be a common occurrence?
> 
> In other words if CXL reset borrowed the bus reset locking model:
> 
>     if (pci_bus_trylock(bus)) {
>         pci_bus_save_and_disable_locked(bus);
>         might_sleep();
> 	rc = cxl_request_and_flush_hdm(bus);
> 	if (rc == 0) {
> 		rc = cxl_reset_execute(pdev);
> 		cxl_release_and_flush_hdm(bus);
> 	}
>         pci_bus_restore_locked(bus);
>         pci_bus_unlock(bus);
>     }
> 
> The cost is disturbing some non-CXL functions, the benefit is reusing an
> existing reset order / locking model.

I'd say further that this exceeds the boundaries of what
pci_reset_function(), or the @reset sysfs attribute per pci_dev, is
scoped to do.  pci_reset_function() must limit the scope to the
pci_dev (and in this case the CXL state associated with only that
pci_dev).  See for instance how bus and slot use cases through
pci_reset_function() are limited to non-multifunction devices.

For multiple functions, the precedent is something more like
pci_reset_bus(), where the caller is responsible for coordinating the
set of affected devices.  The locking is still complicated, but at
least it's managed in vfio-pci-core, with a variant driver that
actually owns the device, rather than pci-core.

Also note that there's currently no mechanism for performing a
multi-function scoped reset through sysfs (excluding raw access to the
parent bridge that bypasses all save/restore mechanics).  I'd suggest
that cxl_reset can only be available as a function scoped reset when
only function 0 supports cxl.mem or cxl.cache, but that may also lead
to the question of whether the reset sysfs attribute should be exposed
at all if it only resets the cxl.io state, for example via FLR.  Thanks,

Alex

^ permalink raw reply

* Re: [PATCH 3/3] pci: fix UAF when probe runs concurrent to dyn ID removal
From: Gary Guo @ 2026-06-26 21:55 UTC (permalink / raw)
  To: Gary Guo, Bjorn Helgaas, Zhenzhong Duan, Greg Kroah-Hartman,
	Rafael J. Wysocki, Danilo Krummrich
  Cc: linux-pci, driver-core, linux-kernel, Sashiko
In-Reply-To: <20260626-pci_id_fix-v1-3-a35c803f1b95@garyguo.net>

On Fri Jun 26, 2026 at 8:49 PM BST, Gary Guo wrote:
> Dynamic IDs are only guaranteed to be valid when dynids.lock is held,
> as remove_id_store can free the node. Thus, make a copy in
> pci_match_device.
>
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Link: https://lore.kernel.org/all/20260619170503.518F61F00A3A@smtp.kernel.org/
> Fixes: 0994375e9614 ("PCI: add remove_id sysfs entry")
> Signed-off-by: Gary Guo <gary@garyguo.net>

Sashiko reports driver UAF issue with this:
https://sashiko.dev/#/patchset/20260626-pci_id_fix-v1-0-a35c803f1b95%40garyguo.net?part=3

All of these cases are already existing UAF issues, albeit now it is much easier
to trigger because this patch converts them to heap UAF for dynamic ID only to
stack UAF for all IDs.

One obvious fix is to instead keep a copy of the ID in pci_dev. However, in
practice I think it doesn't make much sense for drivers to keep pci_device_id
pointers at all; everything except driver_data can just be retrieved from
pci_dev.

So I think it's better to fix drivers that stores these IDs. I used this
coccinelle script to find cases:

---
@store@
identifier fn;
identifier id;
expression E;
parameter list[n] ps;
@@
  fn(ps, const struct pci_device_id *id, ...)
  {
    ...
*   E = id
    ...
  }

@store_cast@
identifier fn;
identifier id;
expression E;
type T;
parameter list[n] ps;
@@
  fn(ps, const struct pci_device_id *id, ...)
  {
    ...
*   E = (T)id
    ...
  }

@struct_const@
identifier s, fld;
@@
  struct s {
    ...
*   const struct pci_device_id *fld;
    ...
  };

@struct_mut@
identifier s, fld;
@@
  struct s {
    ...
*   struct pci_device_id *fld;
    ...
  };
---

And it looks like there're only 4 drivers that need fixing:
- mlxsw
- ipack/tpci200
- nsp32
- ata

Best,
Gary

^ permalink raw reply

* [PATCH v1] sysfs: Include bridge window address size flags
From: Bjorn Helgaas @ 2026-06-26 21:30 UTC (permalink / raw)
  To: Martin Mares
  Cc: Pali Rohár, David Matlack, linux-pci, linux-kernel,
	Bjorn Helgaas

sysfs_get_resources() fills d->base_addr[] with CPU addresses of PCI BARs
(the CPU addresses in sysfs 'resource' differ from the BARs when a host
bridge applies an offset between CPU and PCI addresses).

119c1376f9ca ("libpci: Add support for filling bridge resources") fills
d->bridge_base_addr[] for bridge windows, but unlike for d->base_addr[], it
did not include the low-order four bits of 'flags'.  For I/O windows, those
bits indicate 16- or 32-bit addressing; for prefetchable memory windows,
they indicate 32- or 64-bit addressing.

ccf68033a452 ("lspci: Use PCI_FILL_BRIDGE_BASES to detect if range behind
bridge is disabled or unsupported") tests the d->bridge_base_addr[] flag
bits to learn those properties, but since they weren't included by
119c1376f9ca, 32-bit I/O windows were labeled as "[16-bit]" and 64-bit
prefetchable windows were labeled as "[32-bit]".

Previous output of "lspci -v" on a bridge:

  I/O behind bridge: 2000-2fff [size=4K] [16-bit]
  Prefetchable memory behind bridge: 48030000000-480401fffff [size=258M] [32-bit]

Similar output after this commit:

  I/O behind bridge: 00002000-00002fff [size=4K] [32-bit]
  Prefetchable memory behind bridge: 0000048030000000-00000480401fffff [size=258M] [64-bit]

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 lib/sysfs.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/lib/sysfs.c b/lib/sysfs.c
index 55ac7a6f0699..77ab99e2d5d2 100644
--- a/lib/sysfs.c
+++ b/lib/sysfs.c
@@ -219,11 +219,14 @@ sysfs_get_resources(struct pci_dev *d)
 	}
       if (i == 7+4 || i == 7+6+4)
 	{
+	  unsigned long long flags;
 	  int offset = (i == 7+6+4) ? 6 : 0;
 	  for (i = 0; i < 4; i++)
 	    {
-	      d->bridge_flags[i] = lines[offset+i].flags;
-	      d->bridge_base_addr[i] = lines[offset+i].base_addr;
+	      flags = lines[offset+i].flags;
+	      d->bridge_flags[i] = flags;
+	      flags &= PCI_ADDR_FLAG_MASK;
+	      d->bridge_base_addr[i] = lines[offset+i].base_addr | flags;
 	      d->bridge_size[i] = lines[offset+i].size;
 	    }
 	  have_bridge_bases = 1;
-- 
2.53.0

^ permalink raw reply related

* Re: [PATCH 2/3] pci: fix dyn_id add TOCTOU
From: sashiko-bot @ 2026-06-26 20:02 UTC (permalink / raw)
  To: Gary Guo; +Cc: linux-pci
In-Reply-To: <20260626-pci_id_fix-v1-2-a35c803f1b95@garyguo.net>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] new_id_store unconditionally rejects dynamic IDs that overlap with override_only static IDs (e.g., in vfio-pci), breaking device assignment.
--

commit 65f57bc2128a4f1b44e3036aee55956006935cca
Author: Gary Guo <gary@garyguo.net>
pci: fix dyn_id add TOCTOU

This patch fixes a TOCTOU issue in new_id_store() by moving the duplicate
dynamic ID check inside do_pci_add_dynid() under the lock. It also replaces
the dummy pci_dev allocation and pci_match_device() call in new_id_store()
with a direct call to do_pci_match_id() against the static table.

> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
[ ... ]
> @@ -215,38 +244,22 @@ static ssize_t new_id_store(struct device_driver *driver, const char *buf,
[ ... ]
>  	if (fields != 7) {
> -		struct pci_dev *pdev = kzalloc_obj(*pdev);
[ ... ]
> -		if (pci_match_device(pdrv, pdev))
> -			retval = -EEXIST;
> -
> -		put_device(&pdev->dev);
> -
> -		if (retval)
> -			return retval;
> +		if (do_pci_match_id(pdrv->id_table, &id))
> +			return -EEXIST;
>  	}

[Severity: High]
Does this change break dynamic ID insertion for drivers that rely on
override_only static IDs (like vfio-pci)?

Previously, pci_match_device() was used, which explicitly checks the
override_only flag:

	for (ids = drv->id_table; (found_id = do_pci_match_id(ids, &dev_id));
	     ids = found_id + 1) {
		if (!found_id->override_only || ret > 0) {
			*id = *found_id;
			return true;
		}
	}

By directly calling do_pci_match_id(), which performs purely structural
matching and ignores the override_only flag, it seems like new_id_store()
will now unconditionally reject dynamic IDs for drivers whose static table
contains a catch-all override_only entry, returning -EEXIST instead of
allowing the assignment.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260626-pci_id_fix-v1-0-a35c803f1b95@garyguo.net?part=2

^ permalink raw reply

* Re: [PATCH 3/3] pci: fix UAF when probe runs concurrent to dyn ID removal
From: sashiko-bot @ 2026-06-26 20:00 UTC (permalink / raw)
  To: Gary Guo; +Cc: linux-pci
In-Reply-To: <20260626-pci_id_fix-v1-3-a35c803f1b95@garyguo.net>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Passing a pointer to a stack-allocated `struct pci_device_id` to driver `probe` functions introduces a stack use-after-free vulnerability for drivers that store the pointer for later use.
--

commit 2537209ca17dda70306c583f96a659580e419b4a
Author: Gary Guo <gary@garyguo.net>

pci: fix UAF when probe runs concurrent to dyn ID removal

Dynamic IDs are only guaranteed to be valid when dynids.lock is held, as
remove_id_store can free the node. This patch copies the matched ID in
pci_match_device to avoid the UAF.

Link: https://lore.kernel.org/all/20260619170503.518F61F00A3A@smtp.kernel.org/

> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 464ee52346fa8..2c76bba9a7156 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c

[ ... ]

> @@ -474,15 +475,14 @@ void pci_probe_flush_workqueue(void)
>   */
>  static int __pci_device_probe(struct pci_driver *drv, struct pci_dev *pci_dev)
>  {
> -	const struct pci_device_id *id;
> +	struct pci_device_id id;
>  	int error = 0;
>  
>  	if (drv->probe) {
>  		error = -ENODEV;
>  
> -		id = pci_match_device(drv, pci_dev);
> -		if (id)
> -			error = pci_call_probe(drv, pci_dev, id);
> +		if (pci_match_device(drv, pci_dev, &id))
> +			error = pci_call_probe(drv, pci_dev, &id);

[Severity: Critical]
Could passing a pointer to the stack-allocated struct pci_device_id to the
driver's probe function introduce a stack use-after-free regression?

Some drivers, like nsp32 and ata_generic, save the pointer to the ID passed
into their probe function for later use. For example:

drivers/scsi/nsp32.c:nsp32_probe() {
    ...
    data->pci_devid   = id;
    ...
}

When normal operations later access this saved pointer, won't they dereference
invalid stack memory since the __pci_device_probe() stack frame has already
been destroyed? 

For example, when reading from procfs:

drivers/scsi/nsp32.c:nsp32_show_info() {
    ...
    model    = data->pci_devid->driver_data;
    ...
}

Similarly in ata_generic_init_one(), the id pointer is passed as host private
data for later use in ATA operations:

drivers/ata/ata_generic.c:ata_generic_init_one() {
    ...
    return ata_pci_bmdma_init_one(dev, ppi, &generic_sht, (void *)id, 0);
}

>  	}
>  	return error;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260626-pci_id_fix-v1-0-a35c803f1b95@garyguo.net?part=3

^ permalink raw reply

* Re: [PATCH 1/3] pci: make pci_match_one_device match on ID instead of device
From: sashiko-bot @ 2026-06-26 19:56 UTC (permalink / raw)
  To: Gary Guo; +Cc: linux-pci
In-Reply-To: <20260626-pci_id_fix-v1-1-a35c803f1b95@garyguo.net>

> There is a need to match just IDs instead of against devices. Thus rename
> this function to pci_match_one_id, and add a pci_id_from_device helper to
> make it easy to convert users.
> 
> Similar convert pci_match_id to do_pci_match_id, however the existing API
> is kept due to quite a few users.
> 
> Signed-off-by: Gary Guo <gary@garyguo.net>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260626-pci_id_fix-v1-0-a35c803f1b95@garyguo.net?part=1


^ permalink raw reply

* [PATCH 1/3] pci: make pci_match_one_device match on ID instead of device
From: Gary Guo @ 2026-06-26 19:49 UTC (permalink / raw)
  To: Bjorn Helgaas, Zhenzhong Duan, Greg Kroah-Hartman,
	Rafael J. Wysocki, Danilo Krummrich
  Cc: linux-pci, driver-core, linux-kernel, Gary Guo
In-Reply-To: <20260626-pci_id_fix-v1-0-a35c803f1b95@garyguo.net>

There is a need to match just IDs instead of against devices. Thus rename
this function to pci_match_one_id, and add a pci_id_from_device helper to
make it easy to convert users.

Similar convert pci_match_id to do_pci_match_id, however the existing API
is kept due to quite a few users.

Signed-off-by: Gary Guo <gary@garyguo.net>
---
 drivers/pci/pci-driver.c | 38 ++++++++++++++++++++++++++++----------
 drivers/pci/pci.h        | 36 ++++++++++++++++++++++++++----------
 drivers/pci/search.c     |  6 ++++--
 3 files changed, 58 insertions(+), 22 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index f36778e62ac1..0507cb801310 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -90,6 +90,27 @@ static void pci_free_dynids(struct pci_driver *drv)
 	spin_unlock(&drv->dynids.lock);
 }
 
+/**
+ * do_pci_match_id - See if a PCI ID matches a given pci_id table
+ * @ids: array of PCI device ID structures to search in
+ * @dev_id: the actual PCI device ID structure to match against.
+ *
+ * Returns the matching pci_device_id structure or
+ * %NULL if there is no match.
+ */
+static const struct pci_device_id *do_pci_match_id(const struct pci_device_id *ids,
+						   const struct pci_device_id *dev_id)
+{
+	if (ids) {
+		while (ids->vendor || ids->subvendor || ids->class_mask) {
+			if (pci_match_one_id(ids, dev_id))
+				return ids;
+			ids++;
+		}
+	}
+	return NULL;
+}
+
 /**
  * pci_match_id - See if a PCI device matches a given pci_id table
  * @ids: array of PCI device ID structures to search in
@@ -105,14 +126,9 @@ static void pci_free_dynids(struct pci_driver *drv)
 const struct pci_device_id *pci_match_id(const struct pci_device_id *ids,
 					 struct pci_dev *dev)
 {
-	if (ids) {
-		while (ids->vendor || ids->subvendor || ids->class_mask) {
-			if (pci_match_one_device(ids, dev))
-				return ids;
-			ids++;
-		}
-	}
-	return NULL;
+	struct pci_device_id dev_id = pci_id_from_device(dev);
+
+	return do_pci_match_id(ids, &dev_id);
 }
 EXPORT_SYMBOL(pci_match_id);
 
@@ -138,6 +154,7 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
 {
 	struct pci_dynid *dynid;
 	const struct pci_device_id *found_id = NULL, *ids;
+	struct pci_device_id dev_id;
 	int ret;
 
 	/* When driver_override is set, only bind to the matching driver */
@@ -145,10 +162,11 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
 	if (ret == 0)
 		return NULL;
 
+	dev_id = pci_id_from_device(dev);
 	/* Look at the dynamic ids first, before the static ones */
 	spin_lock(&drv->dynids.lock);
 	list_for_each_entry(dynid, &drv->dynids.list, node) {
-		if (pci_match_one_device(&dynid->id, dev)) {
+		if (pci_match_one_id(&dynid->id, &dev_id)) {
 			found_id = &dynid->id;
 			break;
 		}
@@ -158,7 +176,7 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
 	if (found_id)
 		return found_id;
 
-	for (ids = drv->id_table; (found_id = pci_match_id(ids, dev));
+	for (ids = drv->id_table; (found_id = do_pci_match_id(ids, &dev_id));
 	     ids = found_id + 1) {
 		/*
 		 * The match table is split based on driver_override.
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4469e1a77f3c..0567a8762baa 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -442,21 +442,37 @@ static inline int pci_setup_cardbus(char *str) { return -ENOENT; }
 #endif /* CONFIG_CARDBUS */
 
 /**
- * pci_match_one_device - Tell if a PCI device structure has a matching
- *			  PCI device id structure
- * @id: single PCI device id structure to match
- * @dev: the PCI device structure to match against
+ * pci_id_from_device - Obtain a pci_device_id from a PCI device
+ * @dev: the PCI device
+ *
+ * Returns a pci_device_id filled.
+ */
+static inline struct pci_device_id pci_id_from_device(const struct pci_dev *dev)
+{
+	return (struct pci_device_id) {
+		.vendor = dev->vendor,
+		.device = dev->device,
+		.subvendor = dev->subsystem_vendor,
+		.subdevice = dev->subsystem_device,
+		.class = dev->class,
+	};
+}
+
+/**
+ * pci_match_one_id - Tell if a PCI device ID matches a needle PCI device id
+ * @id: single PCI device id structure to match against (needle)
+ * @dev_id: the actual ID from the PCI device (can be created via pci_id_from_device)
  *
  * Returns the matching pci_device_id structure or %NULL if there is no match.
  */
 static inline const struct pci_device_id *
-pci_match_one_device(const struct pci_device_id *id, const struct pci_dev *dev)
+pci_match_one_id(const struct pci_device_id *id, const struct pci_device_id *dev_id)
 {
-	if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
-	    (id->device == PCI_ANY_ID || id->device == dev->device) &&
-	    (id->subvendor == PCI_ANY_ID || id->subvendor == dev->subsystem_vendor) &&
-	    (id->subdevice == PCI_ANY_ID || id->subdevice == dev->subsystem_device) &&
-	    !((id->class ^ dev->class) & id->class_mask))
+	if ((id->vendor == PCI_ANY_ID || id->vendor == dev_id->vendor) &&
+	    (id->device == PCI_ANY_ID || id->device == dev_id->device) &&
+	    (id->subvendor == PCI_ANY_ID || id->subvendor == dev_id->subvendor) &&
+	    (id->subdevice == PCI_ANY_ID || id->subdevice == dev_id->subdevice) &&
+	    !((id->class ^ dev_id->class) & id->class_mask))
 		return id;
 	return NULL;
 }
diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index e3d3177fce54..c8c4bfe7817b 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -245,8 +245,10 @@ static int match_pci_dev_by_id(struct device *dev, const void *data)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	const struct pci_device_id *id = data;
+	struct pci_device_id dev_id;
 
-	if (pci_match_one_device(id, pdev))
+	dev_id = pci_id_from_device(pdev);
+	if (pci_match_one_id(id, &dev_id))
 		return 1;
 	return 0;
 }
@@ -418,7 +420,7 @@ EXPORT_SYMBOL(pci_get_class);
  *
  * Iterates through the list of known PCI devices. If a PCI device is found
  * with a matching base class code, the reference count to the device is
- * incremented. See pci_match_one_device() to figure out how does this works.
+ * incremented. See pci_match_one_id() to figure out how does this works.
  * A new search is initiated by passing %NULL as the @from argument.
  * Otherwise if @from is not %NULL, searches continue from next device on the
  * global list. The reference count for @from is always decremented if it is

-- 
2.54.0


^ permalink raw reply related

* [PATCH 0/3] pci: fix UAF and TOCTOU related to dynamic ID
From: Gary Guo @ 2026-06-26 19:49 UTC (permalink / raw)
  To: Bjorn Helgaas, Zhenzhong Duan, Greg Kroah-Hartman,
	Rafael J. Wysocki, Danilo Krummrich
  Cc: linux-pci, driver-core, linux-kernel, Gary Guo, Sashiko

While working on improving the Rust abstractions [1], Sashiko reported that
an existing UAF issue related to dynamic ID, which I find to be genuine.
When taking a look at the code I also find a TOCTOU issue where the
existence check of dynamic ID happens in a separate critical section as the
actual insertion. This series fix both issues.

There are two exported functions "pci_match_id" and "pci_add_dynid" which I
have to tweak to implement this cleanly; I created separate "do_xxx"
functions to keep the existing APIs because they all have multiple users.

Link: https://lore.kernel.org/all/20260618-id_info-v1-0-96af1e559ef9@garyguo.net/ [1]
Link: https://lore.kernel.org/all/20260619170503.518F61F00A3A@smtp.kernel.org/ [2]

Signed-off-by: Gary Guo <gary@garyguo.net>
---
Gary Guo (3):
      pci: make pci_match_one_device match on ID instead of device
      pci: fix dyn_id add TOCTOU
      pci: fix UAF when probe runs concurrent to dyn ID removal

 drivers/pci/pci-driver.c | 208 ++++++++++++++++++++++++++---------------------
 drivers/pci/pci.h        |  36 +++++---
 drivers/pci/search.c     |   6 +-
 3 files changed, 147 insertions(+), 103 deletions(-)
---
base-commit: 6c94b38b83a04c43ea49004275f0391404051093
change-id: 20260626-pci_id_fix-83eaec007674

Best regards,
--  
Gary Guo <gary@garyguo.net>

^ permalink raw reply

* [PATCH 3/3] pci: fix UAF when probe runs concurrent to dyn ID removal
From: Gary Guo @ 2026-06-26 19:49 UTC (permalink / raw)
  To: Bjorn Helgaas, Zhenzhong Duan, Greg Kroah-Hartman,
	Rafael J. Wysocki, Danilo Krummrich
  Cc: linux-pci, driver-core, linux-kernel, Gary Guo, Sashiko
In-Reply-To: <20260626-pci_id_fix-v1-0-a35c803f1b95@garyguo.net>

Dynamic IDs are only guaranteed to be valid when dynids.lock is held,
as remove_id_store can free the node. Thus, make a copy in
pci_match_device.

Reported-by: Sashiko <sashiko-bot@kernel.org>
Link: https://lore.kernel.org/all/20260619170503.518F61F00A3A@smtp.kernel.org/
Fixes: 0994375e9614 ("PCI: add remove_id sysfs entry")
Signed-off-by: Gary Guo <gary@garyguo.net>
---
 drivers/pci/pci-driver.c | 60 ++++++++++++++++++++++--------------------------
 1 file changed, 28 insertions(+), 32 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 464ee52346fa..2c76bba9a715 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -177,14 +177,16 @@ static const struct pci_device_id pci_device_id_any = {
  * pci_match_device - See if a device matches a driver's list of IDs
  * @drv: the PCI driver to match against
  * @dev: the PCI device structure to match against
+ * @id: Matched pci_device_id
  *
  * Used by a driver to check whether a PCI device is in its list of
  * supported devices or in the dynids list, which may have been augmented
- * via the sysfs "new_id" file.  Returns the matching pci_device_id
- * structure or %NULL if there is no match.
+ * via the sysfs "new_id" file.  Returns true if there is a match, the matched
+ * ID is stored in @id.
  */
-static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
-						    struct pci_dev *dev)
+static bool pci_match_device(struct pci_driver *drv,
+			     struct pci_dev *dev,
+			     struct pci_device_id *id)
 {
 	struct pci_dynid *dynid;
 	const struct pci_device_id *found_id = NULL, *ids;
@@ -194,21 +196,19 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
 	/* When driver_override is set, only bind to the matching driver */
 	ret = device_match_driver_override(&dev->dev, &drv->driver);
 	if (ret == 0)
-		return NULL;
+		return false;
 
 	dev_id = pci_id_from_device(dev);
 	/* Look at the dynamic ids first, before the static ones */
-	spin_lock(&drv->dynids.lock);
-	list_for_each_entry(dynid, &drv->dynids.list, node) {
-		if (pci_match_one_id(&dynid->id, &dev_id)) {
-			found_id = &dynid->id;
-			break;
+	{
+		guard(spinlock)(&drv->dynids.lock);
+		list_for_each_entry(dynid, &drv->dynids.list, node) {
+			if (pci_match_one_id(&dynid->id, &dev_id)) {
+				*id = dynid->id;
+				return true;
+			}
 		}
 	}
-	spin_unlock(&drv->dynids.lock);
-
-	if (found_id)
-		return found_id;
 
 	for (ids = drv->id_table; (found_id = do_pci_match_id(ids, &dev_id));
 	     ids = found_id + 1) {
@@ -217,18 +217,19 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
 		 * In case override_only was set, enforce driver_override
 		 * matching.
 		 */
-		if (found_id->override_only) {
-			if (ret > 0)
-				return found_id;
-		} else {
-			return found_id;
+		if (!found_id->override_only || ret > 0) {
+			*id = *found_id;
+			return true;
 		}
 	}
 
 	/* driver_override will always match, send a dummy id */
-	if (ret > 0)
-		return &pci_device_id_any;
-	return NULL;
+	if (ret > 0) {
+		*id = pci_device_id_any;
+		return true;
+	}
+
+	return false;
 }
 
 /**
@@ -474,15 +475,14 @@ void pci_probe_flush_workqueue(void)
  */
 static int __pci_device_probe(struct pci_driver *drv, struct pci_dev *pci_dev)
 {
-	const struct pci_device_id *id;
+	struct pci_device_id id;
 	int error = 0;
 
 	if (drv->probe) {
 		error = -ENODEV;
 
-		id = pci_match_device(drv, pci_dev);
-		if (id)
-			error = pci_call_probe(drv, pci_dev, id);
+		if (pci_match_device(drv, pci_dev, &id))
+			error = pci_call_probe(drv, pci_dev, &id);
 	}
 	return error;
 }
@@ -1567,17 +1567,13 @@ static int pci_bus_match(struct device *dev, const struct device_driver *drv)
 {
 	struct pci_dev *pci_dev = to_pci_dev(dev);
 	struct pci_driver *pci_drv;
-	const struct pci_device_id *found_id;
+	struct pci_device_id id;
 
 	if (pci_dev_binding_disallowed(pci_dev))
 		return 0;
 
 	pci_drv = (struct pci_driver *)to_pci_driver(drv);
-	found_id = pci_match_device(pci_drv, pci_dev);
-	if (found_id)
-		return 1;
-
-	return 0;
+	return pci_match_device(pci_drv, pci_dev, &id);
 }
 
 /**

-- 
2.54.0


^ permalink raw reply related

* [PATCH 2/3] pci: fix dyn_id add TOCTOU
From: Gary Guo @ 2026-06-26 19:49 UTC (permalink / raw)
  To: Bjorn Helgaas, Zhenzhong Duan, Greg Kroah-Hartman,
	Rafael J. Wysocki, Danilo Krummrich
  Cc: linux-pci, driver-core, linux-kernel, Gary Guo
In-Reply-To: <20260626-pci_id_fix-v1-0-a35c803f1b95@garyguo.net>

Currently there is a TOCTOU issue in new_id_store as the dyn ID insertion
in pci_add_dynid and the pci_match_device are in separate critical
sections.

Fix this by moving the existing ID check to inside pci_add_dynid and only
check against the static ID table outside the critical section.

Fixes: 3853f9123c18 ("PCI: Avoid duplicate IDs in driver dynamic IDs list")
Signed-off-by: Gary Guo <gary@garyguo.net>
---
 drivers/pci/pci-driver.c | 112 ++++++++++++++++++++++++++---------------------
 1 file changed, 62 insertions(+), 50 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 0507cb801310..464ee52346fa 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -29,6 +29,48 @@ struct pci_dynid {
 	struct pci_device_id id;
 };
 
+/**
+ * do_pci_add_dynid - add a new PCI device ID to this driver and re-probe devices
+ * @drv: target pci driver
+ * @id: ID to be added
+ * @check_dup: whether to check if matching ID is already present
+ *
+ * Adds a new dynamic pci device ID to this driver and causes the
+ * driver to probe for all devices again.  @drv must have been
+ * registered prior to calling this function.
+ *
+ * CONTEXT:
+ * Does GFP_KERNEL allocation.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+static int do_pci_add_dynid(struct pci_driver *drv, const struct pci_device_id *id, bool check_dup)
+{
+	struct pci_dynid *dynid, *existing_dynid;
+
+	dynid = kzalloc_obj(*dynid);
+	if (!dynid)
+		return -ENOMEM;
+
+	dynid->id = *id;
+
+	{
+		guard(spinlock)(&drv->dynids.lock);
+		if (check_dup) {
+			list_for_each_entry(existing_dynid, &drv->dynids.list, node) {
+				if (pci_match_one_id(&existing_dynid->id, id)) {
+					kfree(dynid);
+					return -EEXIST;
+				}
+			}
+		}
+		list_add_tail(&dynid->node, &drv->dynids.list);
+	}
+
+	return driver_attach(&drv->driver);
+}
+
 /**
  * pci_add_dynid - add a new PCI device ID to this driver and re-probe devices
  * @drv: target pci driver
@@ -56,25 +98,17 @@ int pci_add_dynid(struct pci_driver *drv,
 		  unsigned int class, unsigned int class_mask,
 		  unsigned long driver_data)
 {
-	struct pci_dynid *dynid;
-
-	dynid = kzalloc_obj(*dynid);
-	if (!dynid)
-		return -ENOMEM;
-
-	dynid->id.vendor = vendor;
-	dynid->id.device = device;
-	dynid->id.subvendor = subvendor;
-	dynid->id.subdevice = subdevice;
-	dynid->id.class = class;
-	dynid->id.class_mask = class_mask;
-	dynid->id.driver_data = driver_data;
-
-	spin_lock(&drv->dynids.lock);
-	list_add_tail(&dynid->node, &drv->dynids.list);
-	spin_unlock(&drv->dynids.lock);
+	struct pci_device_id id = {
+		.vendor = vendor,
+		.device = device,
+		.subvendor = subvendor,
+		.subdevice = subdevice,
+		.class = class,
+		.class_mask = class_mask,
+		.driver_data = driver_data,
+	};
 
-	return driver_attach(&drv->driver);
+	return do_pci_add_dynid(drv, &id, false);
 }
 EXPORT_SYMBOL_GPL(pci_add_dynid);
 
@@ -197,11 +231,6 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
 	return NULL;
 }
 
-static void _pci_free_device(struct device *dev)
-{
-	kfree(to_pci_dev(dev));
-}
-
 /**
  * new_id_store - sysfs frontend to pci_add_dynid()
  * @driver: target device driver
@@ -215,38 +244,22 @@ static ssize_t new_id_store(struct device_driver *driver, const char *buf,
 {
 	struct pci_driver *pdrv = to_pci_driver(driver);
 	const struct pci_device_id *ids = pdrv->id_table;
-	u32 vendor, device, subvendor = PCI_ANY_ID,
-		subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
-	unsigned long driver_data = 0;
+	struct pci_device_id id = {
+		.subvendor = PCI_ANY_ID,
+		.subdevice = PCI_ANY_ID
+	};
 	int fields;
 	int retval = 0;
 
 	fields = sscanf(buf, "%x %x %x %x %x %x %lx",
-			&vendor, &device, &subvendor, &subdevice,
-			&class, &class_mask, &driver_data);
+			&id.vendor, &id.device, &id.subvendor, &id.subdevice,
+			&id.class, &id.class_mask, &id.driver_data);
 	if (fields < 2)
 		return -EINVAL;
 
 	if (fields != 7) {
-		struct pci_dev *pdev = kzalloc_obj(*pdev);
-		if (!pdev)
-			return -ENOMEM;
-
-		pdev->vendor = vendor;
-		pdev->device = device;
-		pdev->subsystem_vendor = subvendor;
-		pdev->subsystem_device = subdevice;
-		pdev->class = class;
-		pdev->dev.release = _pci_free_device;
-
-		device_initialize(&pdev->dev);
-		if (pci_match_device(pdrv, pdev))
-			retval = -EEXIST;
-
-		put_device(&pdev->dev);
-
-		if (retval)
-			return retval;
+		if (do_pci_match_id(pdrv->id_table, &id))
+			return -EEXIST;
 	}
 
 	/* Only accept driver_data values that match an existing id_table
@@ -254,7 +267,7 @@ static ssize_t new_id_store(struct device_driver *driver, const char *buf,
 	if (ids) {
 		retval = -EINVAL;
 		while (ids->vendor || ids->subvendor || ids->class_mask) {
-			if (driver_data == ids->driver_data) {
+			if (id.driver_data == ids->driver_data) {
 				retval = 0;
 				break;
 			}
@@ -264,8 +277,7 @@ static ssize_t new_id_store(struct device_driver *driver, const char *buf,
 			return retval;
 	}
 
-	retval = pci_add_dynid(pdrv, vendor, device, subvendor, subdevice,
-			       class, class_mask, driver_data);
+	retval = do_pci_add_dynid(pdrv, &id, fields != 7);
 	if (retval)
 		return retval;
 	return count;

-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH V3 1/8] PCI: imx6: Add skip_pwrctrl_off flag support
From: Frank Li @ 2026-06-26 19:40 UTC (permalink / raw)
  To: Sherry Sun (OSS)
  Cc: robh, krzk+dt, conor+dt, Frank.Li, s.hauer, kernel, festevam,
	amitkumar.karwar, neeraj.sanjaykale, marcel, luiz.dentz,
	hongxing.zhu, l.stach, lpieralisi, kwilczynski, mani, bhelgaas,
	brgl, imx, linux-pci, linux-arm-kernel, devicetree, linux-kernel,
	linux-bluetooth, linux-pm, sherry.sun
In-Reply-To: <20260626023126.2189931-2-sherry.sun@oss.nxp.com>

On Fri, Jun 26, 2026 at 10:31:19AM +0800, Sherry Sun (OSS) wrote:
> From: Sherry Sun <sherry.sun@nxp.com>
>
> Use dw_pcie_rp::skip_pwrctrl_off to avoid powering off devices during
> suspend to preserve wakeup capability of the devices and also not to power
> on the devices in the init path.

Need empty line here.

> This allows controller power-off to be skipped when some devices(e.g. M.2
> cards key E without auxiliary power) required to support PCIe L2 link state
> and wake-up mechanisms.
>
> Move pci_pwrctrl_create_devices() to imx_pcie_probe() so that it is only
> called once during probe, similar to other regulator_get calls.
>
> Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
> ---
>  drivers/pci/controller/dwc/pci-imx6.c | 43 ++++++++++++++++-----------
>  1 file changed, 25 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/pci/controller/dwc/pci-imx6.c b/drivers/pci/controller/dwc/pci-imx6.c
> index 0fa716d1ed75..0685573fee71 100644
> --- a/drivers/pci/controller/dwc/pci-imx6.c
> +++ b/drivers/pci/controller/dwc/pci-imx6.c
> @@ -1382,16 +1382,12 @@ static int imx_pcie_host_init(struct dw_pcie_rp *pp)
>  		}
>  	}
>
> -	ret = pci_pwrctrl_create_devices(dev);
> -	if (ret) {
> -		dev_err(dev, "failed to create pwrctrl devices\n");
> -		goto err_reg_disable;
> -	}
> -

Please two patch do that. one patch move pci_pwrctrl_create_devices() to
probe

one patch check skip_power_off.

> -	ret = pci_pwrctrl_power_on_devices(dev);
> -	if (ret) {
> -		dev_err(dev, "failed to power on pwrctrl devices\n");
> -		goto err_pwrctrl_destroy;
> +	if (!pp->skip_pwrctrl_off) {
> +		ret = pci_pwrctrl_power_on_devices(dev);
> +		if (ret) {
> +			dev_err(dev, "failed to power on pwrctrl devices\n");
> +			goto err_reg_disable;
> +		}
>  	}
>
>  	ret = imx_pcie_clk_enable(imx_pcie);
> @@ -1460,10 +1456,8 @@ static int imx_pcie_host_init(struct dw_pcie_rp *pp)
>  err_clk_disable:
>  	imx_pcie_clk_disable(imx_pcie);
>  err_pwrctrl_power_off:
> -	pci_pwrctrl_power_off_devices(dev);
> -err_pwrctrl_destroy:
> -	if (ret != -EPROBE_DEFER)
> -		pci_pwrctrl_destroy_devices(dev);
> +	if (!pp->skip_pwrctrl_off)
> +		pci_pwrctrl_power_off_devices(dev);
>  err_reg_disable:
>  	if (imx_pcie->vpcie)
>  		regulator_disable(imx_pcie->vpcie);
> @@ -1482,7 +1476,8 @@ static void imx_pcie_host_exit(struct dw_pcie_rp *pp)
>  	}
>  	imx_pcie_clk_disable(imx_pcie);
>
> -	pci_pwrctrl_power_off_devices(pci->dev);
> +	if (!pci->pp.skip_pwrctrl_off)
> +		pci_pwrctrl_power_off_devices(pci->dev);
>  	if (imx_pcie->vpcie)
>  		regulator_disable(imx_pcie->vpcie);
>  }
> @@ -1954,11 +1949,15 @@ static int imx_pcie_probe(struct platform_device *pdev)
>  	if (ret)
>  		return ret;
>
> +	ret = pci_pwrctrl_create_devices(dev);
> +	if (ret)
> +		return dev_err_probe(dev, ret, "failed to create pwrctrl devices\n");
> +
>  	pci->use_parent_dt_ranges = true;
>  	if (imx_pcie->drvdata->mode == DW_PCIE_EP_TYPE) {
>  		ret = imx_add_pcie_ep(imx_pcie, pdev);
>  		if (ret < 0)
> -			return ret;
> +			goto err_pwrctrl_destroy;
>
>  		/*
>  		 * FIXME: Only single Device (EPF) is supported due to the
> @@ -1973,7 +1972,7 @@ static int imx_pcie_probe(struct platform_device *pdev)
>  		pci->pp.use_atu_msg = true;
>  		ret = dw_pcie_host_init(&pci->pp);
>  		if (ret < 0)
> -			return ret;
> +			goto err_pwrctrl_destroy;
>
>  		if (pci_msi_enabled()) {
>  			u8 offset = dw_pcie_find_capability(pci, PCI_CAP_ID_MSI);
> @@ -1985,16 +1984,24 @@ static int imx_pcie_probe(struct platform_device *pdev)
>  	}
>
>  	return 0;
> +
> +err_pwrctrl_destroy:
> +	if (ret != -EPROBE_DEFER)
> +		pci_pwrctrl_destroy_devices(dev);
> +	return ret;

Mani said he will fix DEFER problem soon.

Frank

>  }
>
>  static void imx_pcie_shutdown(struct platform_device *pdev)
>  {
>  	struct imx_pcie *imx_pcie = platform_get_drvdata(pdev);
> +	struct dw_pcie *pci = imx_pcie->pci;
> +	struct dw_pcie_rp *pp = &pci->pp;
>
>  	/* bring down link, so bootloader gets clean state in case of reboot */
>  	imx_pcie_assert_core_reset(imx_pcie);
>  	imx_pcie_assert_perst(imx_pcie, true);
> -	pci_pwrctrl_power_off_devices(&pdev->dev);
> +	if (!pp->skip_pwrctrl_off)
> +		pci_pwrctrl_power_off_devices(&pdev->dev);
>  	pci_pwrctrl_destroy_devices(&pdev->dev);
>  }
>
> --
> 2.50.1
>
>

^ permalink raw reply

* Re: [PATCH v2 05/11] iommu/arm-smmu-v3: Submit CMDQ_OP_PRI_RESP for IOPF event
From: Robin Murphy @ 2026-06-26 16:15 UTC (permalink / raw)
  To: Nicolin Chen, will, jgg
  Cc: joro, bhelgaas, praan, kevin.tian, kees, smostafa, baolu.lu,
	linux-arm-kernel, iommu, linux-kernel, linux-pci, skaestle,
	mmarrid, skolothumtho, bbiber
In-Reply-To: <6c713c724fa09bf5a1b5e2247c633e516036f079.1779944354.git.nicolinc@nvidia.com>

On 28/05/2026 8:59 am, Nicolin Chen wrote:
> From: Malak Marrid <mmarrid@nvidia.com>
> 
> To handle IOMMU_FAULT_PAGE_REQ from the PRI queue, arm_smmu_page_response()
> must issue a CMDQ_OP_PRI_RESP back to the SMMU.
> 
> However, either a stall event in the EVTQ or a PRI request in the PRIQ can
> surface to the IOPF infrastructure with fault.type == IOMMU_FAULT_PAGE_REQ,
> and a single master can in principle be both stall-capable and PRI-capable

No, the SMMU architecture does all it can to specifically forbid this, 
see 3.12.1 and 16.4, it just can't be made architecturally ILLEGAL to 
enable stalls for PCIe devices because there's no strict architectural 
definition for what "a PCIe device" actually is. Similarly with the note 
in the definition of STE.EATS about the relationship with CD.S - the 
unwritten implication is that defining specific behaviours would only 
create an unreasonable burden for hardware validation, for the sake of 
something that nobody in their right mind should ever do anyway.

The expectation is that RCiEPs which do speak stallable non-PCIe bus 
protocols will not go to the effort of implementing ATS/PRI capabilities 
(not least because there's every chance that such protocols simply 
doesn't have that kind of transaction flow anyway). And conversely that 
it can be considered an egregious firmware (or system design) error to 
even claim (let alone force) stall capability for a real PCIe root port 
which may be deadlocked by blocking its requirement for free-flowing 
writes. Thus I think we could go so far as to refuse to handle any 
endpoint which did somehow claim both.

Thanks,
Robin.

> (e.g. FEAT_STALL_FORCE on a PCIe device with PRI), so master state is not a
> reliable discriminator.
> 
> Add IOMMU_FAULT_PAGE_REQUEST_STALLS_TRANS to the generic flags so the fault
> reporter can mark a page request that is holding the device's transaction:
>      arm_smmu_handle_event() sets it on STALL events
>      arm_smmu_handle_ppr() leaves it clear for PRI events
> 
> Note: streams[0].id remains the RID because arm_smmu_enable_iopf() rejects
> num_streams != 1.
> 
> Co-developed-by: Barak Biber <bbiber@nvidia.com>
> Signed-off-by: Barak Biber <bbiber@nvidia.com>
> Co-developed-by: Stefan Kaestle <skaestle@nvidia.com>
> Signed-off-by: Stefan Kaestle <skaestle@nvidia.com>
> Signed-off-by: Malak Marrid <mmarrid@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
>   include/linux/iommu.h                       |  1 +
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 75 +++++++++++++++------
>   3 files changed, 58 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 2bb810e4d5fce..1083621705f16 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -1007,6 +1007,7 @@ struct arm_smmu_master {
>   	/* Locked by the iommu core using the group mutex */
>   	struct arm_smmu_ctx_desc_cfg	cd_table;
>   	unsigned int			num_streams;
> +	bool				pri_enabled : 1;
>   	bool				ats_enabled : 1;
>   	bool				ste_ats_enabled : 1;
>   	bool				stall_enabled;
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index e587d4ac4d331..83c4dfcf20637 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -76,6 +76,7 @@ struct iommu_fault_page_request {
>   #define IOMMU_FAULT_PAGE_REQUEST_PASID_VALID	(1 << 0)
>   #define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE	(1 << 1)
>   #define IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID	(1 << 2)
> +#define IOMMU_FAULT_PAGE_REQUEST_STALLS_TRANS	(1 << 3)
>   	u32	flags;
>   	u32	pasid;
>   	u32	grpid;
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index ffc9621cd2288..061f1d46fda0d 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -921,32 +921,68 @@ static int arm_smmu_drain_queue_for_iopf(struct arm_smmu_device *smmu,
>   	return ret;
>   }
>   
> -static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused,
> +static void arm_smmu_page_response(struct device *dev, struct iopf_fault *evt,
>   				   struct iommu_page_response *resp)
>   {
>   	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> -	u8 resume_resp;
> +	struct arm_smmu_cmd cmd;
> +	int sid;
>   
> -	if (WARN_ON(!master->stall_enabled))
> +	if (WARN_ON_ONCE(evt->fault.type != IOMMU_FAULT_PAGE_REQ))
>   		return;
>   
> -	switch (resp->code) {
> -	case IOMMU_PAGE_RESP_INVALID:
> -	case IOMMU_PAGE_RESP_FAILURE:
> -		resume_resp = CMDQ_RESUME_0_RESP_ABORT;
> -		break;
> -	case IOMMU_PAGE_RESP_SUCCESS:
> -		resume_resp = CMDQ_RESUME_0_RESP_RETRY;
> -		break;
> -	default:
> -		resume_resp = CMDQ_RESUME_0_RESP_TERM;
> -		break;
> +	/* IOPF is gated to num_streams == 1 in arm_smmu_enable_iopf() */
> +	sid = master->streams[0].id;
> +
> +	if (evt->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_STALLS_TRANS) {
> +		u8 resume_resp;
> +
> +		if (WARN_ON_ONCE(!master->stall_enabled))
> +			return;
> +		switch (resp->code) {
> +		case IOMMU_PAGE_RESP_INVALID:
> +		case IOMMU_PAGE_RESP_FAILURE:
> +			resume_resp = CMDQ_RESUME_0_RESP_ABORT;
> +			break;
> +		case IOMMU_PAGE_RESP_SUCCESS:
> +			resume_resp = CMDQ_RESUME_0_RESP_RETRY;
> +			break;
> +		default:
> +			resume_resp = CMDQ_RESUME_0_RESP_TERM;
> +			break;
> +		}
> +		cmd = arm_smmu_make_cmd_resume(sid, resp->grpid, resume_resp);
> +	} else {
> +		enum pri_resp pri_resp;
> +		bool ssv;
> +
> +		if (WARN_ON_ONCE(!master->pri_enabled))
> +			return;
> +		/* PCIe allows only one PRG Response per group */
> +		if (!(evt->fault.prm.flags &
> +		      IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
> +			return;
> +		switch (resp->code) {
> +		case IOMMU_PAGE_RESP_SUCCESS:
> +			pri_resp = PRI_RESP_SUCC;
> +			break;
> +		case IOMMU_PAGE_RESP_FAILURE:
> +			pri_resp = PRI_RESP_FAIL;
> +			break;
> +		case IOMMU_PAGE_RESP_INVALID:
> +			pri_resp = PRI_RESP_DENY;
> +			break;
> +		default:
> +			WARN_ON(true);
> +			return;
> +		}
> +		ssv = !!(evt->fault.prm.flags &
> +			 IOMMU_FAULT_PAGE_REQUEST_PASID_VALID);
> +		cmd = arm_smmu_make_cmd_pri_resp(sid, resp->pasid, ssv,
> +						 resp->grpid, pri_resp);
>   	}
>   
> -	arm_smmu_cmdq_issue_cmd(master->smmu,
> -				arm_smmu_make_cmd_resume(master->streams[0].id,
> -							 resp->grpid,
> -							 resume_resp));
> +	arm_smmu_cmdq_issue_cmd(master->smmu, cmd);
>   	/*
>   	 * Don't send a SYNC, it doesn't do anything for RESUME or PRI_RESP.
>   	 * RESUME consumption guarantees that the stalled transaction will be
> @@ -2081,7 +2117,8 @@ static int arm_smmu_handle_event(struct arm_smmu_device *smmu, u64 *evt,
>   
>   		flt->type = IOMMU_FAULT_PAGE_REQ;
>   		flt->prm = (struct iommu_fault_page_request){
> -			.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
> +			.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE |
> +				 IOMMU_FAULT_PAGE_REQUEST_STALLS_TRANS,
>   			.grpid = event->stag,
>   			.perm = perm,
>   			.addr = event->iova,


^ permalink raw reply

* Re: [PATCH] fix: PCI: pci_iov_remove_virtfn: fix unmatched pci_dev_put for PF device
From: sashiko-bot @ 2026-06-26 16:02 UTC (permalink / raw)
  To: WenTao Liang; +Cc: linux-pci
In-Reply-To: <20260626155042.53862-1-vulab@iscas.ac.cn>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [High] Refcount Leak of PF device

Pre-existing issues:
- [High] Premature Release / Use-After-Free of PF device in VF teardown
--

commit f1226b4e04690771e292377b3796f15f2ab401f4
Author: WenTao Liang <vulab@iscas.ac.cn>

fix: PCI: pci_iov_remove_virtfn: fix unmatched pci_dev_put for PF device

This patch removes a pci_dev_put() call for the PF device in
pci_iov_remove_virtfn(). The commit message states that the reference was never
acquired by the function, potentially causing a refcount underflow and
premature release.

> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index b0d24839c0840..87a2f3b278683 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -426,7 +426,6 @@ void pci_iov_remove_virtfn(struct pci_dev *dev, int id)
>  
>  	/* balance pci_get_domain_bus_and_slot() */
>  	pci_dev_put(virtfn);
> -	pci_dev_put(dev);

[Severity: High]
Does removing this call cause a refcount leak for the PF device?

Looking at the VF creation in pci_iov_scan_device(), it appears a reference
to the PF is explicitly acquired:

drivers/pci/iov.c:pci_iov_scan_device() {
    ...
    virtfn->physfn = pci_dev_get(dev);
    ...
}

Because the core release function for the VF (pci_release_dev()) doesn't
currently drop this reference, doesn't removing this cleanup result in a
persistent leak of the PF device every time SR-IOV is disabled?

[Severity: High]
This is a pre-existing issue, but there appears to be a lifecycle design flaw
with when this PF reference is dropped.

If the VF object outlives pci_iov_remove_virtfn() due to elevated references
(like an open sysfs file or concurrent VFIO access pinning the VF's kobject),
does dropping the PF reference here leave a dangling virtfn->physfn pointer?

If the PF is subsequently removed and freed, any later access to virtfn->physfn
by the surviving VF object could trigger a use-after-free.

Should the balancing pci_dev_put(pci_dev->physfn) instead be moved into the
VF's actual release handler to ensure the PF is kept alive exactly as long as
the VF memory object?

drivers/pci/probe.c:pci_release_dev() {
    ...
    if (pci_dev->is_virtfn)
        pci_dev_put(pci_dev->physfn);
    ...
}

>  }
>  
>  static ssize_t sriov_totalvfs_show(struct device *dev,

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260626155042.53862-1-vulab@iscas.ac.cn?part=1

^ permalink raw reply

* Re: [PATCH] fix: PCI: dra7xx: dra7xx_pcie_probe: fix missing device_link_del on phy   error paths
From: sashiko-bot @ 2026-06-26 15:57 UTC (permalink / raw)
  To: WenTao Liang; +Cc: linux-pci
In-Reply-To: <20260626154809.53770-1-vulab@iscas.ac.cn>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Double cleanup of PHY resources on error path.
- [Low] Compilation error due to undefined label `err_link`.
--

commit 51479d1472b27aa065c936360d150e8e275d37b6
Author: WenTao Liang <vulab@iscas.ac.cn>

fix: PCI: dra7xx: dra7xx_pcie_probe: fix missing device_link_del on phy error paths

This commit addresses a device link leak in the phy error paths of
dra7xx_pcie_probe(). It changes the error paths to jump to the err_link
label to properly call device_link_del() for each link. It also updates
the phy enable error paths.

> diff --git a/drivers/pci/controller/dwc/pci-dra7xx.c b/drivers/pci/controller/dwc/pci-dra7xx.c
> index 3fc889944f021..c54d68cf0e412 100644
> --- a/drivers/pci/controller/dwc/pci-dra7xx.c
> +++ b/drivers/pci/controller/dwc/pci-dra7xx.c

[ ... ]

> @@ -766,7 +768,8 @@ static int dra7xx_pcie_probe(struct platform_device *pdev)
>  	ret = dra7xx_pcie_enable_phy(dra7xx);
>  	if (ret) {
>  		dev_err(dev, "failed to enable phy\n");
> -		return ret;
> +		dra7xx_pcie_disable_phy(dra7xx);
> +		goto err_link;
>  	}

[Severity: High]
Does this cause a double cleanup of the PHY resources?

When dra7xx_pcie_enable_phy() fails, it already cleans up any partially 
enabled PHYs internally before returning. Calling dra7xx_pcie_disable_phy() 
here will repeat the cleanup on the same PHYs, which could lead to 
unbalanced reference counts in the PHY framework (underflowing power_count 
and init_count).

[ ... ]

> @@ -909,7 +912,8 @@ static int dra7xx_pcie_resume_noirq(struct device *dev)
>  	ret = dra7xx_pcie_enable_phy(dra7xx);
>  	if (ret) {
>  		dev_err(dev, "failed to enable phy\n");
> -		return ret;
> +		dra7xx_pcie_disable_phy(dra7xx);

[Severity: High]
Similar to the probe function, could this cause a double cleanup of the 
PHY resources since dra7xx_pcie_enable_phy() handles its own cleanup on 
failure?

> +		goto err_link;

[Severity: Low]
Does this label exist in this function?

It appears that the err_link label is not defined anywhere in 
dra7xx_pcie_resume_noirq(), which will result in a compilation error.

>  	}
>  
>  	return 0;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260626154809.53770-1-vulab@iscas.ac.cn?part=1

^ permalink raw reply

* [PATCH] fix: PCI: pci_iov_remove_virtfn: fix unmatched pci_dev_put for PF device
From: WenTao Liang @ 2026-06-26 15:50 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci; +Cc: stable, linux-kernel, WenTao Liang

In pci_iov_remove_virtfn(), the function calls pci_dev_put(dev) on the PF
  device at the end, but never acquired a reference to it via
  pci_dev_get(). The function parameter dev is passed by the caller who
  manages the reference. This unbalanced pci_dev_put causes a refcount
  underflow, potentially leading to premature PF device release or
  use-after-free.

Remove the spurious pci_dev_put(dev) call.

Cc: stable@vger.kernel.org
Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
---
 drivers/pci/iov.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 91ac4e37ecb9..24a4fb3cd042 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -426,7 +426,6 @@ void pci_iov_remove_virtfn(struct pci_dev *dev, int id)

 	/* balance pci_get_domain_bus_and_slot() */
 	pci_dev_put(virtfn);
-	pci_dev_put(dev);
 }

 static ssize_t sriov_totalvfs_show(struct device *dev,
-- 
2.39.5 (Apple Git-154)

^ permalink raw reply related

* [PATCH] fix: PCI: dra7xx: dra7xx_pcie_probe: fix missing device_link_del on phy   error paths
From: WenTao Liang @ 2026-06-26 15:48 UTC (permalink / raw)
  To: Kishon Vijay Abraham I, linux-pci
  Cc: Lorenzo Pieralisi, Krzysztof Wilczynski, Bjorn Helgaas, stable,
	linux-kernel, WenTao Liang

In dra7xx_pcie_probe(), when devm_phy_get() fails for a later phy device
  (i > 0), the function directly returns PTR_ERR(phy[i]) without calling
  device_link_del() on previously added device links. Similarly, when
  dra7xx_pcie_enable_phy() fails, all phy_count device links are leaked as
  the function returns directly without cleanup.

Change the error paths to jump to the existing err_link label which
  properly iterates through link[0..i) and calls device_link_del() for
  each.

Cc: stable@vger.kernel.org
Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
---
 drivers/pci/controller/dwc/pci-dra7xx.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/controller/dwc/pci-dra7xx.c b/drivers/pci/controller/dwc/pci-dra7xx.c
index cd904659c321..153ee6f7dbfc 100644
--- a/drivers/pci/controller/dwc/pci-dra7xx.c
+++ b/drivers/pci/controller/dwc/pci-dra7xx.c
@@ -743,8 +743,10 @@ static int dra7xx_pcie_probe(struct platform_device *pdev)
 	for (i = 0; i < phy_count; i++) {
 		snprintf(name, sizeof(name), "pcie-phy%d", i);
 		phy[i] = devm_phy_get(dev, name);
-		if (IS_ERR(phy[i]))
-			return PTR_ERR(phy[i]);
+		if (IS_ERR(phy[i])) {
+			ret = PTR_ERR(phy[i]);
+			goto err_link;
+		}
 
 		link[i] = device_link_add(dev, &phy[i]->dev, DL_FLAG_STATELESS);
 		if (!link[i]) {
@@ -767,7 +769,8 @@ static int dra7xx_pcie_probe(struct platform_device *pdev)
 	ret = dra7xx_pcie_enable_phy(dra7xx);
 	if (ret) {
 		dev_err(dev, "failed to enable phy\n");
-		return ret;
+		dra7xx_pcie_disable_phy(dra7xx);
+		goto err_link;
 	}
 
 	platform_set_drvdata(pdev, dra7xx);
@@ -910,7 +913,8 @@ static int dra7xx_pcie_resume_noirq(struct device *dev)
 	ret = dra7xx_pcie_enable_phy(dra7xx);
 	if (ret) {
 		dev_err(dev, "failed to enable phy\n");
-		return ret;
+		dra7xx_pcie_disable_phy(dra7xx);
+		goto err_link;
 	}
 
 	return 0;
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related

* Re: [PATCH v17 08/12] PCI/TPH: Add sysfs binary file to export CPU to steering-tag mapping
From: Alex Williamson @ 2026-06-26 15:22 UTC (permalink / raw)
  To: fengchengwen
  Cc: helgaas, wathsala.vithanage, wei.huang2, zhipingz, wangzhou1,
	wangyushan12, liuyonglong, kvm, linux-pci, Jason Gunthorpe, alex
In-Reply-To: <9105ceef-5e27-4e3c-8903-d46aef52a2bd@huawei.com>

On Tue, 23 Jun 2026 17:56:51 +0800
fengchengwen <fengchengwen@huawei.com> wrote:
> 
> I’ve sent v18-resend of the PCI TPH sysfs patchset with all prior review
> comments fully addressed.
> 
> I remain open to both discussed implementation options. This v18-resend
> implements the sysfs binary blob approach on Root Ports as you previously
> suggested.

There's a problem with the pci-sysfs binary attribute beyond whether
it's acceptable from a sysfs perspective: it only solves the CPU-ID to
ST lookup.  We already know that this interface should support dma-buf
provided TPH ST and PH values from Zhiping's series.  If we take the
pci-sysfs route for CPU-ID, no subsystem exposes the value embedded
into the dma-buf.  Another solution is required.  Repeat for each
possible ST/PH source we might see in the future.

Therefore, I'm coming around to the idea that vfio-pci provides this
interface, but we need to do so in a way that it's not a wart on the
side of the interface.  I think we can do this based on an incremental
progression of features and support, where the ST can be provided
through an extensible set of objects, CPU ID, dma-buf, no-preference
(zero), and literal user-provided ST values.

The key components of the uAPI are similar to ones you've proposed:

 - An opt-in/feature flags (SET/GET): VFIO_DEVICE_FEATURE_TPH

   The TPH uAPI necessarily changes the ABI of the vfio-pci device;
   capabilities and control are virtualized.  An empty SET is the
   opt-in to the ABI change.  There are also feature flags necessary
   for uAPI surfaces not discovered through the PCI TPH capability, GET
   on this feature returns a flags field indicating supported features.
   (SET actually takes a zeroed flags field for symmetry in the vfio
   device feature API).

 - A mechanism to program indexes in architected ST locations (SET):
   VFIO_DEVICE_FEATURE_TPH_ST

   This includes flags to define the type of source provided, such as:

    - VFIO_DEVICE_TPH_SRC_DMABUF
    - VFIO_DEVICE_TPH_SRC_NONE
    - VFIO_DEVICE_TPH_SRC_CPU_VOLATILE
    - VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT
    - VFIO_DEVICE_TPH_SRC_LITERAL

   The requested namespace for the tag:

    - VFIO_DEVICE_TPH_EXTENDED

   And a failure policy, ex. if a translation results in zero, no-pref:

    - VFIO_DEVICE_TPH_REQUIRE_ST

   This would also include a start/count and pointer to user buffer to
   allow batch settings, if the TPH subsystem can be extended to
   support these, otherwise a single u32 holds all the decodable
   sources.

 - A mechanism to decode a source (GET): VFIO_DEVICE_FEATURE_TPH_RESOLVE

   This is necessary for DS mode, where the driver may store STs in
   device specific locations, but also for dma-buf sources even in IV
   mode, where the PH needs to be decoded for the driver's use.  This
   proposal also extends the API, through progressive opt-ins, to
   support virtualization, where a VMM may (in specific scenarios)
   implement a _DSM returning host values, and write through literal ST
   values from this source, or from a guest-based userspace driver for
   a completer, where the guest kernel can source a valid ST.

   In order to prevent this from becoming the side-car/wart, returning
   ST values is limited to configurations where the resulting ST may
   actually be used within the interface.  For example, No-ST and IV
   mode would operate only with DMABUF, NONE, and CPU_* sources, the
   literal ST need not be exposed to userspace.  The RESOLVE feature
   would be limited to returning the PH value for a DMABUF source.

   The structure includes input fields of u32 flags and source, output
   fields of u8 valid (bits declaring each of the next fields
   validity), u8 ph, u16 st.  Flag bits are defined the same as for
   TPH_ST, modulo NONE, LITERAL, REQUIRE_ST are not applicable and
   reserved.

With that rough uAPI, we can step through an incremental support path.

 1. The most basic level of support, and default sans module options,
    is No-ST mode.  After opt-in, the TPH capability is virtualized to
    report only No-ST mode supported, the remainder of the register
    zero.  Writes to ST Mode allow only 000b.  Writes to Requester
    Enable allow only valid, non-zero values.  In combination, this
    enables the device to operate only in No-ST mode.  The TPH_ST
    interface is inoperative in this mode.  The RESOLVE interface only
    accepts DMABUF sources and only provides the PH as valid.  This
    allows only the PH aspect of TPH to be used and does not risk any
    exposure or malicious use of ST values.

    NB. All modes, including this one, must also validate support for
    all reachable features through to the root port, ie. TPH Completer.

 2. Via a module option (proposed below), IV mode can be enabled where
    supported by the device.  This allows the unvirtualized interrupt
    vector mode supported bit, ST table location, and ST table size
    fields to be exposed in the capability register, and allows 001b to
    be written in the mode select control register.  Additionally, this
    brings onboard the TPH_ST feature, supporting a selection of
    source flags.  NONE and DMABUF can be assumed, based on sequencing
    of the dma-buf based TPH completer support series.  CPU sources
    rely on root port _DSM support.  GET on the noted feature flags
    above can indicate this support.  LITERAL is not enabled at this
    point, this mode only supports ST programming via objects and
    instance numbers.  RESOLVE support is identical to step 1.

    Via SET of TPH_ST, sources are decoded and ST values are written
    directly to the ST storage location.  The REQUIRE_ST flag indicates
    whether the user accepts a translation resulting in zero (no-pref)
    or if the API is to generate an error on this result.

    Support for translating CPU_* sources relies on platform support
    for _DSM, while other sources like DMABUF have no such requirement.
    Therefore, for this and following steps, the returned flags on
    FEATURE_TPH GET will include a bit indicating whether CPU_* sources
    are supported.

    There are some important considerations to resolve/document here
    between the TPH specification and Linux implementation, such as
    requiring TPH is enabled (and for MSI-X vector location, the
    programmed vector is enabled), before the TPH_ST.  We need to be
    particularly careful of a uAPI that relies on mutable
    implementation.

 3. Again, via a module option, DS mode can be enabled where supported
    by the device.  On such a supported device (and only on such a
    supported device), the device specific mode supported flag is now
    reported unvirtualized in the capability register and the control
    register supports writes of 010b to the mode field.  The RESOLVE
    interface now supports CPU_* as a source.  RESOLVE will now return
    both the PH and ST for a DMABUF, indicated through both valid bits
    set, and the ST for a CPU ID, only the ST valid bit set.  The
    TPH_ST interface still accepts only DMABUF, CPUID, and NONE for
    configuring architected ST storage locations.

 4. Finally, literal mode can be enabled via another progression of the
    module option.  Literal mode allows the TPH_ST feature to accept an
    actual, raw, ie. literal, ST value and also allows RESOLVE to
    return a valid ST even for devices only supporting IV mode.  This
    is intended to support VM use cases, such as a VMM configured with
    1:1 vCPU:pCPU mappings that exposes a _DSM to the guest with host
    CPU ST values, or if a completer exists in the guest that provides
    ST values through a dmabuf.  This is the only mode that allows
    writing unchecked ST values to architected ST storage locations.
    The user discovers support for this mode via a LITERAL flag
    returned from GET on the TPH feature.

The proposed module option is therefore a progression where the user is
granted increasing privilege to make use of STs:

vfio_pci_core.tph=:
	0: No-ST Mode (default)
	1: + IV Mode
	2: + DS Mode
	3: + LITERAL

It should be noted that even in IV mode, the user can R/W the ST value
via mmap covering the MSI-X vector table and could read the value in
the capability storage location unless we explicitly prevent it.
Removing the latter is trivial, removing the former costs reverting
previous decisions to allow mmap of the MSI-X vector table for
performance, esp. in larger PAGE_SIZE systems.  If this is an issue,
(tph >= 1) would need to reintroduce sparse mmap for MSI-X, but higher
tph values already allow for increasingly direct ST programming.

In general, while the proposal here presents progressively increasing
access and direct use of raw ST values, this access is not considered a
security issue.  TPH only allows addressing specific caching
structures, it does not present a correctness issue.  Abusive use of ST
values is at best a QoS issue, where access to a device capable of
these features imposes some inherent risk of exposure.

The feature ioctls therefore take this sort of form:

VFIO_DEVICE_FEATURE_TPH (SET/GET):

    struct vfio_device_feature_tph {
            __u32       flags;
    }

    @flags is reserved (0) on SET, returns VFIO_DEVICE_TPH_CAP_* on GET:

    #define VFIO_DEVICE_TPH_CAP_CPU             (1 << 0) /* CPU sources resolvable (_DSM) */
    #define VFIO_DEVICE_TPH_CAP_LITERAL         (1 << 1) /* LITERAL source available */

VFIO_DEVICE_FEATURE_TPH_RESOLVE (GET):

    struct vfio_device_feature_tph_resolve {
            __u32       flags;          /* IN: VFIO_DEVICE_TPH_* source + namespace */
            __u32       src;            /* IN: CPU id or dma-buf fd, per @flags */
            __u8        valid;          /* OUT: VFIO_DEVICE_TPH_VALID_* */
    #define VFIO_DEVICE_TPH_VALID_PH    (1 << 0) /* @ph holds a processing hint */
    #define VFIO_DEVICE_TPH_VALID_ST    (1 << 1) /* @st holds a non-zero steering tag */
            __u8        ph;             /* OUT: processing hint */
            __u16       st;             /* OUT: raw ST (DS presented or LITERAL) */
    };

    @flags is common for both RESOLVE and the following TPH_ST feature,
    bits are allocated according to their order of introduction.  Some
    bits are only applicable to the TPH_ST interface and are reserved
    for RESOLVE:

    #define VFIO_DEVICE_TPH_SRC_DMABUF          (1 << 0) /* dma-buf fd */
    #define VFIO_DEVICE_TPH_SRC_NONE            (1 << 1) /* write ST 0 (TPH_ST only) */
    #define VFIO_DEVICE_TPH_REQUIRE_ST          (1 << 2) /* modifier (TPH_ST): stop on
                                                            resolved ST 0 */
    #define VFIO_DEVICE_TPH_SRC_CPU_VOLATILE    (1 << 3) /* CPU id, volatile memory */
    #define VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT  (1 << 4) /* CPU id, persistent memory */
    #define VFIO_DEVICE_TPH_EXTENDED            (1 << 5) /* modifier: Extended ST
                                                            namespace */
    #define VFIO_DEVICE_TPH_SRC_LITERAL         (1 << 6) /* literal ST value
                                                            (TPH_ST only) */

    The source type must be uniquely defined, therefore only one bit in
    the source mask can be set:

    #define VFIO_DEVICE_TPH_SRC_MASK \
            (VFIO_DEVICE_TPH_SRC_DMABUF | \
             VFIO_DEVICE_TPH_SRC_NONE | \
             VFIO_DEVICE_TPH_SRC_CPU_VOLATILE | \
             VFIO_DEVICE_TPH_SRC_CPU_PERSISTENT | \
             VFIO_DEVICE_TPH_SRC_LITERAL)

VFIO_DEVICE_FEATURE_TPH_ST (SET):

    struct vfio_device_feature_tph_st {
            __u32               flags;  /* IN: VFIO_DEVICE_TPH_* source + namespace */
            __u16               start;  /* IN: first ST table index
                                               (IV: interrupt vector) */
            __u16               count;  /* IN: number of contiguous entries */
            __aligned_u64       dests;  /* IN: @count __u32 (CPU/DMABUF/LITERAL;
                                               unused for NONE) */
    };

    Batching here is intended to shorten the TPH disabled/quiescent
    window, but this is not currently supported by the TPH subsystem.
    This could degrade to a single inline entry if necessary.

    Batching also introduces a risk of partial success, where some
    sources may resolve, while others generate an error or are promoted
    to an error based on the REQUIRE_ST modifier.  The proposed
    solution for this is to allow vfio device features to return values
    other than 0 or -errno, as currently described.  This interface
    would adopt a solution similar to KVM_SET_MSRS, where the ioctl
    returns the number of successfully programmed entries.  Therefore
    if (ret == count) the full set was successful, otherwise the user
    knows exactly which entry failed.

There are various intricate details involved in validating
configurations where TPH can be enabled, and I'm glossing over the uAPI
dependencies on implementation versus specification, but I think this
provides a good basis for exposing TPH support on the device.  I'd
strongly recommend the incremental enablement stages proposed here,
even if we stop before LITERAL support based on current use cases.
Please comment if there are gaps or anything is unclear.  Thanks,

Alex

^ permalink raw reply

* Re: [PATCH v5 18/20] rust: dma: drop `dma_read!` and `dma_write!` API
From: sashiko-bot @ 2026-06-26 15:12 UTC (permalink / raw)
  To: Gary Guo; +Cc: Danilo Krummrich, ojeda, Alexandre Courbot, linux-pci, dri-devel
In-Reply-To: <20260626-io_projection-v5-18-d0961471ae50@garyguo.net>

> The primitive read/write use case is covered by the `io_read!` and
> `io_write!` macro. The non-primitive use case was finicky; they should
> either be achieved using `CoherentBox` or `as_ref()/as_mut()` to assert the
> lack of concurrent access, or should be using memcpy-like APIs to express
> the non-atomic and tearable nature.
> 
> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
> Signed-off-by: Gary Guo <gary@garyguo.net>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260626-io_projection-v5-0-d0961471ae50@garyguo.net?part=18


^ permalink raw reply

* Re: [PATCH v5 17/20] gpu: nova-core: use I/O projection for cleaner encapsulation
From: sashiko-bot @ 2026-06-26 15:06 UTC (permalink / raw)
  To: Gary Guo; +Cc: ojeda, Danilo Krummrich, Alexandre Courbot, dri-devel, linux-pci
In-Reply-To: <20260626-io_projection-v5-17-d0961471ae50@garyguo.net>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Data corruption in GSP command queue due to missing memory barrier before updating the ring buffer write pointer.
--

commit abd2f0cd14180222d86d47f37c313be60a30d773
Author: Gary Guo <gary@garyguo.net>

gpu: nova-core: use I/O projection for cleaner encapsulation

This commit refactors the nova-core GSP driver to use the io_project!,
io_read!, and io_write! macros for memory-mapped I/O access. It replaces
the previous dma_read! and dma_write! usage to enforce better encapsulation
for PTE arrays and message queues.

> diff --git a/drivers/gpu/nova-core/gsp/cmdq.rs b/drivers/gpu/nova-core/gsp/cmdq.rs
> --- a/drivers/gpu/nova-core/gsp/cmdq.rs
> +++ b/drivers/gpu/nova-core/gsp/cmdq.rs

[ ... ]

> @@ -438,12 +441,17 @@
>      // Informs the GSP that it can process `elem_count` new pages from the command queue.
>      fn advance_cpu_write_ptr(&mut self, elem_count: u32) {
> -        super::fw::gsp_mem::advance_cpu_write_ptr(&self.0, elem_count)
> +        let tx = io_project!(self.0, .cpuq.tx);
> +        let wptr = MsgqTxHeader::write_ptr(tx).wrapping_add(elem_count) % MSGQ_NUM_PAGES;
> +        MsgqTxHeader::set_write_ptr(tx, wptr);
> +
> +        // Ensure all command data is visible before triggering the GSP read.
> +        fence(Ordering::SeqCst);

[Severity: High]
This is a pre-existing issue, but is it possible that the memory barrier is
placed too late in this function?

In advance_cpu_write_ptr(), the fence(Ordering::SeqCst) is placed after
updating the ring buffer's write pointer via set_write_ptr(). Since the
payload is written to coherent memory using normal memory stores while the
write pointer uses a volatile store, weakly-ordered CPUs might reorder the
operations.

If the GSP firmware polls the write pointer before the payload writes are
globally visible, could it read uninitialized or stale data? Should the
fence be placed before the set_write_ptr() call to prevent this race?

>      }
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260626-io_projection-v5-0-d0961471ae50@garyguo.net?part=17

^ permalink raw reply

* Re: [PATCH v5 15/20] rust: io: implement a view type for `Coherent`
From: sashiko-bot @ 2026-06-26 15:05 UTC (permalink / raw)
  To: Gary Guo; +Cc: Danilo Krummrich, Alexandre Courbot, linux-pci, ojeda, dri-devel
In-Reply-To: <20260626-io_projection-v5-15-d0961471ae50@garyguo.net>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] `CoherentView<'a, T>` automatically implements `Send` when `T: Sync`, but provides an `as_mut()` capability to generate `&mut T`, allowing thread-safety bounds bypass.
- [High] `CoherentView::as_mut` consumes `self` by value on a `Copy` type and returns `&'a mut T`, but its safety documentation fails to require strict exclusivity.
--

commit 12e6f6d7abcd0a26719f3c4f268f24a07774b118
Author: Gary Guo <gary@garyguo.net>

rust: io: implement a view type for `Coherent`

Implement a `CoherentView` type which is a view of `Coherent`. To be able
to give out DMA handles, the view type contains both CPU and DMA pointers,
and the projection method projects both at once.

Delegate most of the `Io` implementation to `SysMemBackend`. Provide a
method to erase the DMA handle and give out a `SysMem` view, if the user
does not need the `dma_handle`.

> diff --git a/rust/kernel/dma.rs b/rust/kernel/dma.rs
> index 200def84fb69e..ab6504910e4f5 100644
> --- a/rust/kernel/dma.rs
> +++ b/rust/kernel/dma.rs

[ ... ]

> @@ -1133,6 +1140,132 @@ unsafe impl Send for CoherentHandle {}
>  // plain `Copy` values.
>  unsafe impl Sync for CoherentHandle {}
>  
> +/// View type for `Coherent`.
> +///
> +/// This is same as [`SysMem`] but with additional information that allows handing out a DMA handle.
> +pub struct CoherentView<'a, T: ?Sized> {
> +    cpu_addr: SysMem<'a, T>,
> +    dma_handle: DmaAddress,
> +}

[Severity: High]
Since SysMem<'a, T> automatically implements Send when T: Sync, CoherentView
also implements Send when T: Sync.

If T is !Send, can a caller move CoherentView to another thread and call
as_mut() to get a mutable reference? This appears to allow bypassing Rust's
concurrency guarantees and could lead to data races or memory corruption if
the type relies on thread-local invariants.

Note that this same pattern seems to pre-exist in Coherent<T> as well.

[ ... ]

> +    /// Returns a mutable reference to the data in the region.
> +    ///
> +    /// # Safety
> +    ///
> +    /// * Callers must ensure that the device does not read/write to/from memory while the returned
> +    ///   reference is live.
> +    /// * Callers must ensure that this call does not race with a read or write to the same region
> +    ///   while the returned reference is live.
> +    #[inline]
> +    pub unsafe fn as_mut(self) -> &'a mut T {

[Severity: High]
Because CoherentView derives Copy and as_mut() takes self by value instead of
taking a mutable borrow, can a caller call as_mut() multiple times in
succession to create overlapping mutable references?

While the safety documentation requires callers to prevent races, it does
not explicitly require strict exclusivity. Could a caller sequentially use
two aliases without data races, believing they adhered to the safety
contract, but still trigger undefined behavior due to aliasing violations?

This documentation omission might also exist in the preexisting
Coherent::as_mut.

> +        // SAFETY: pointer is aligned and valid per type invariant. Aliasing rule is satisfied per
> +        // safety requirement.
> +        unsafe { &mut *self.cpu_addr.as_ptr() }
> +    }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260626-io_projection-v5-0-d0961471ae50@garyguo.net?part=15

^ permalink raw reply

* Re: [PATCH v5 03/20] rust: io: restrict untyped IO access and `register!` to `Region`
From: sashiko-bot @ 2026-06-26 15:03 UTC (permalink / raw)
  To: Gary Guo; +Cc: Alexandre Courbot, Danilo Krummrich, ojeda, dri-devel, linux-pci
In-Reply-To: <20260626-io_projection-v5-3-d0961471ae50@garyguo.net>

> Currently the `Io` trait exposes a bunch of untyped IO accesses, but if the
> `Io` region itself is typed, then it might be weird to have
> 
>     let io: Mmio<u32> = /* ... */;
>     io.read8(1);
> 
> while not unsound, it is surely strange. Thus, restrict the untyped methods
> and also the register macro to `Region` type only.
> 
> Implement it by adding a generic type to `IoLoc` indicating allowed base
> types. This also paves the way to add typed register blocks in the future;
> for example, we could use this mechanism to block driver A's `register!()`
> generated macro from being used on driver B's MMIO. The same mechanism
> could be used for relative IO registers. These are future opportunities,
> and for now restrict everything to require `IoLoc<Region<SIZE>, _>`.
> 
> Suggested-by: Alexandre Courbot <acourbot@nvidia.com>
> Link: https://lore.kernel.org/rust-for-linux/DHLB3RO3OSF5.2R7F27U99BKLN@nvidia.com/
> Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
> Signed-off-by: Gary Guo <gary@garyguo.net>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260626-io_projection-v5-0-d0961471ae50@garyguo.net?part=3


^ permalink raw reply

* Re: [PATCH v5 19/20] rust: io: add copying methods
From: sashiko-bot @ 2026-06-26 15:02 UTC (permalink / raw)
  To: Gary Guo; +Cc: ojeda, dri-devel, Alexandre Courbot, linux-pci, Danilo Krummrich
In-Reply-To: <20260626-io_projection-v5-19-d0961471ae50@garyguo.net>

> One feature that was lost from the old `dma_read!` and `dma_write!` when
> moving to `io_read!` and `io_write!` was the ability to read/write a large
> structs. However, the semantics was unclear to begin with, as there was no
> guarantee about their atomicity even for structs that were small enough to
> fit in u32. Re-introduce the capability in the form of copying methods.
> 
>     dma_read!(foo, bar) -> io_project!(foo, bar).copy_read()
>     dma_write!(foo, bar, baz) -> io_project!(foo, bar).copy_write(baz)
> 
> Model these semantics after memcpy so user has clear expectation of lack of
> atomicity. As an additional benefit of this change, this now works for MMIO
> as well by mapping them to `memcpy_{from,to}io`.
> 
> For slices which is DST so the `copy_read` and `copy_write` API above can't
> work, add `copy_from_slice` and `copy_to_slice` to copy from/to normal
> memory.
> 
> Signed-off-by: Gary Guo <gary@garyguo.net>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260626-io_projection-v5-0-d0961471ae50@garyguo.net?part=19


^ permalink raw reply

page: next (older)
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox