* [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS
@ 2025-09-05 18:06 Jason Gunthorpe
2025-09-05 18:06 ` [PATCH v3 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
` (12 more replies)
0 siblings, 13 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:06 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
The patches in this series have extensive descriptions of the problem and
solution, but in short: the ACS flags are not analyzed according to the
spec when forming the iommu_groups that VFIO expects for security.
ACS is an egress control only. Along a path, the ACS flags on each hop only
affect which other devices the TLP is allowed to reach; they do not prevent
other devices from reaching into this path.
For VFIO, if device A is permitted to access device B's MMIO then A and B
must be grouped together. This means that even if a path has isolating ACS
flags on each hop, off-path devices with non-isolating ACS can still reach
into that path and must be grouped together.
For switches, a PCIe topology like:
                               -- DSP 02:00.0 -> End Point A
 Root 00:00.0 -> USP 01:00.0 --|
                               -- DSP 02:03.0 -> End Point B
Will generate unique single device groups for every device even if ACS is
not enabled on the two DSP ports. It should at least group A/B together
because no ACS means A can reach the MMIO of B. This is a serious failure
for the VFIO security model.
For multi-function-devices, a PCIe topology like:
                 -- MFD 00:1f.0 ACS not supported
 Root 00:00.00 --|- MFD 00:1f.2 ACS not supported
                 |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
Will group [1f.0, 1f.2] together while 1f.6 gets a single device group.
However, from a spec perspective each function should get its own group: a
function that does not support ACS can be assumed, per spec, to be
incapable of loopback.
For root-ports a PCIe topology like:
                                       -- Dev 01:00.0
 Root 00:00.00 --- Root Port 00:01.0 --|
              |                        -- Dev 01:00.1
              |- Dev 00:17.0
Previously would group [00:01.0, 01:00.0, 01:00.1] together if there is no
ACS capability in the root port.
While ACS on root ports is underspecified in the spec, it should still
function as an egress control and limit access to either the MMIO of the
root port itself, or perhaps some other devices upstream of the root
complex - 00:17.0 perhaps in this example.
Historically the grouping in Linux has assumed the root port routes all
traffic into the TA/IOMMU and never bypasses the TA to go to other
functions in the root complex. Following the new understanding that the ACS
capability is what governs internal loopback, treat root ports with no ACS
capability as lacking internal loopback as well.
There is also some confusing spec language about how ACS and SRIOV works
which this series does not address.
Beyond the fix, the series makes some additional improvements to the ACS
validation that were found while studying this problem. The groups around a
PCIe to PCI bridge are shrunk to no longer include the PCIe bridge.
The last patches implement "ACS Enhanced" on top of it. Because ACS
Enhanced was defined as a non-backward-compatible feature, it is important
to get SW support out there.
Due to the potential of iommu_groups becoming wider and thus non-usable
for VFIO this should go to a linux-next tree to give it some more
exposure.
I have now tested this on a few systems I could get:
- Various Intel client systems:
* Raptor Lake, with VMD enabled and using the real_dev mechanism
* 6/7th generation 100 Series/C320
* 5/6th generation 100 Series/C320 with a NIC MFD quirk
* Tiger Lake
* 5/6th generation Sunrise Point
The 6/7th gen system has a root port without an ACS capability and it
becomes ungrouped as described above.
All systems have changes, the MFDs in the root complex all become ungrouped.
- NVIDIA Grace system with 5 different PCI switches from two vendors
The bug fix widening the iommu_groups works as expected here.
This is on github: https://github.com/jgunthorpe/linux/commits/pcie_switch_groups
v3:
- Rebase to v6.17-rc4
- Drop the quirks related patches
- Change the MFD logic to process no ACS cap as meaning no internal
loopback. This avoids creating non-isolated groups for MFD root ports in
common AMD and Intel systems
- Fix matching MFDs to ignore SRIOV VFs
- Fix some kbuild splats
v2: https://patch.msgid.link/r/0-v2-4a9b9c983431+10e2-pcie_switch_groups_jgg@nvidia.com
- Revise comments and commit messages
- Rename struct pci_alias_set to pci_reachable_set
- Make more sense of the special bus->self = NULL case for SRIOV
- Add pci_group_alloc_non_isolated() for readability
- Rename BUS_DATA_PCI_UNISOLATED to BUS_DATA_PCI_NON_ISOLATED
- Propagate BUS_DATA_PCI_NON_ISOLATED downstream from a MFD in case a MFD
function is a bridge
- New patches to add pci_mfd_isolation() to retain more cases of narrow
groups on MFDs with missing ACS.
- Redescribe the MFD related change as a bug fix. For a MFD to be
isolated all functions must have egress control on their P2P.
v1: https://patch.msgid.link/r/0-v1-74184c5043c6+195-pcie_switch_groups_jgg@nvidia.com
Cc: galshalom@nvidia.com
Cc: tdave@nvidia.com
Cc: maorg@nvidia.com
Cc: kvm@vger.kernel.org
Cc: "Cédric Le Goater" <clg@redhat.com>
Cc: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Jason Gunthorpe (11):
PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED
PCI: Add pci_bus_isolated()
iommu: Compute iommu_groups properly for PCIe switches
iommu: Organize iommu_group by member size
PCI: Add pci_reachable_set()
iommu: Compute iommu_groups properly for PCIe MFDs
iommu: Validate that pci_for_each_dma_alias() matches the groups
PCI: Add the ACS Enhanced Capability definitions
PCI: Enable ACS Enhanced bits for enable_acs and config_acs
PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
PCI: Check ACS Extended flags for pci_bus_isolated()
drivers/iommu/iommu.c | 510 +++++++++++++++++++++++-----------
drivers/pci/ats.c | 4 +-
drivers/pci/pci.c | 73 ++++-
drivers/pci/search.c | 274 ++++++++++++++++++
include/linux/pci.h | 46 +++
include/uapi/linux/pci_regs.h | 18 ++
6 files changed, 759 insertions(+), 166 deletions(-)
base-commit: b320789d6883cc00ac78ce83bccbfe7ed58afcf0
--
2.43.0
* [PATCH v3 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
@ 2025-09-05 18:06 ` Jason Gunthorpe
2025-09-09 4:08 ` Donald Dutile
2025-09-05 18:06 ` [PATCH v3 02/11] PCI: Add pci_bus_isolated() Jason Gunthorpe
` (11 subsequent siblings)
12 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:06 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
The next patch wants to use this constant, share it.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommu.c | 16 +++-------------
include/uapi/linux/pci_regs.h | 10 ++++++++++
2 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 060ebe330ee163..2a47ddb01799c1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1408,16 +1408,6 @@ EXPORT_SYMBOL_GPL(iommu_group_id);
static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
unsigned long *devfns);
-/*
- * To consider a PCI device isolated, we require ACS to support Source
- * Validation, Request Redirection, Completer Redirection, and Upstream
- * Forwarding. This effectively means that devices cannot spoof their
- * requester ID, requests and completions cannot be redirected, and all
- * transactions are forwarded upstream, even as it passes through a
- * bridge where the target device is downstream.
- */
-#define REQ_ACS_FLAGS (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
-
/*
* For multifunction devices which are not isolated from each other, find
* all the other non-isolated functions and look for existing groups. For
@@ -1430,13 +1420,13 @@ static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
struct pci_dev *tmp = NULL;
struct iommu_group *group;
- if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
+ if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
return NULL;
for_each_pci_dev(tmp) {
if (tmp == pdev || tmp->bus != pdev->bus ||
PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) ||
- pci_acs_enabled(tmp, REQ_ACS_FLAGS))
+ pci_acs_enabled(tmp, PCI_ACS_ISOLATED))
continue;
group = get_pci_alias_group(tmp, devfns);
@@ -1580,7 +1570,7 @@ struct iommu_group *pci_device_group(struct device *dev)
if (!bus->self)
continue;
- if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
+ if (pci_acs_path_enabled(bus->self, NULL, PCI_ACS_ISOLATED))
break;
pdev = bus->self;
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index f5b17745de607d..6095e7d7d4cc48 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1009,6 +1009,16 @@
#define PCI_ACS_CTRL 0x06 /* ACS Control Register */
#define PCI_ACS_EGRESS_CTL_V 0x08 /* ACS Egress Control Vector */
+/*
+ * To consider a PCI device isolated, we require ACS to support Source
+ * Validation, Request Redirection, Completer Redirection, and Upstream
+ * Forwarding. This effectively means that devices cannot spoof their
+ * requester ID, requests and completions cannot be redirected, and all
+ * transactions are forwarded upstream, even as it passes through a
+ * bridge where the target device is downstream.
+ */
+#define PCI_ACS_ISOLATED (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
+
/* SATA capability */
#define PCI_SATA_REGS 4 /* SATA REGs specifier */
#define PCI_SATA_REGS_MASK 0xF /* location - BAR#/inline */
--
2.43.0
* [PATCH v3 02/11] PCI: Add pci_bus_isolated()
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
2025-09-05 18:06 ` [PATCH v3 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
@ 2025-09-05 18:06 ` Jason Gunthorpe
2025-09-09 4:09 ` Donald Dutile
2025-09-09 19:54 ` Bjorn Helgaas
2025-09-05 18:06 ` [PATCH v3 03/11] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
` (10 subsequent siblings)
12 siblings, 2 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:06 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
Prepare to move the calculation of the bus P2P isolation out of the iommu
code and into the PCI core. This allows using the faster list iteration
under the pci_bus_sem, and the code has a kinship with the logic in
pci_for_each_dma_alias().
Bus isolation is the concept that drives the iommu_groups for the purposes
of VFIO. Stated simply, if device A can send traffic to device B then they
must be in the same group.
Only PCIe provides isolation. The multi-drop electrical topology in
classical PCI allows any bus member to claim the transaction.
In PCIe, isolation comes from ACS. If a PCIe switch and root complex have
ACS flags that prevent peer-to-peer traffic and funnel all operations to
the IOMMU, then devices can be isolated.
Multi-function devices also have an isolation concern with self-loopback
between the functions, though pci_bus_isolated() does not deal with that
device-internal case.
As a property of a bus, there are several positive cases:
- The point to point "bus" on a physical PCIe link is isolated if the
bridge/root device has something preventing self-access to its own
MMIO.
- A Root Port is usually isolated
- A PCIe switch can be isolated if all its Downstream Ports have good
ACS flags
pci_bus_isolated() implements these rules and returns an enum indicating
the level of isolation the bus has, with five possibilities:
PCIE_ISOLATED: Traffic on this PCIe bus cannot do any P2P.
PCIE_SWITCH_DSP_NON_ISOLATED: The bus is the internal bus of a PCIe
switch and the USP is isolated but the DSPs are not.
PCIE_NON_ISOLATED: The PCIe bus has no isolation between the bridge or
any downstream devices.
PCI_BUS_NON_ISOLATED: It is a PCI/PCI-X bus but the bridge is PCIe, has
no aliases, and the bridge is isolated from the bus.
PCI_BRIDGE_NON_ISOLATED: It is a PCI/PCI-X bus with no isolation; the
bridge is part of the group.
The calculation is done per-bus, so it is possible for a transaction from
a PCI device to travel through different bus isolation types on its way
upstream. PCIE_SWITCH_DSP_NON_ISOLATED/PCI_BUS_NON_ISOLATED and
PCIE_NON_ISOLATED/PCI_BRIDGE_NON_ISOLATED are the same for the purposes of
creating iommu_groups. The distinction between PCIe and PCI allows easier
understanding and debugging of why the groups were chosen.
For the iommu groups if all busses on the upstream path are PCIE_ISOLATED
then the end device has a chance to have a single-device iommu_group. Once
any non-isolated bus segment is found that bus segment will have an
iommu_group that captures all downstream devices, and sometimes the
upstream bridge.
pci_bus_isolated() is principally about isolation, but there is an
overlap with grouping requirements for legacy PCI aliasing. For purely
legacy PCI environments pci_bus_isolated() returns
PCI_BRIDGE_NON_ISOLATED for everything and all devices within a hierarchy
are in one group. No need to worry about bridge aliasing.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/search.c | 174 +++++++++++++++++++++++++++++++++++++++++++
include/linux/pci.h | 31 ++++++++
2 files changed, 205 insertions(+)
diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index 53840634fbfc2b..fe6c07e67cb8ce 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -113,6 +113,180 @@ int pci_for_each_dma_alias(struct pci_dev *pdev,
return ret;
}
+static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
+{
+ struct pci_dev *pdev;
+
+ /*
+ * Within a PCIe switch we have an interior bus that has the Upstream
+ * port as the bridge and a set of Downstream port bridging to the
+ * egress ports.
+ *
+ * Each DSP has an ACS setting which controls where its traffic is
+ * permitted to go. Any DSP with a permissive ACS setting can send
+ * traffic flowing upstream back downstream through another DSP.
+ *
+ * Thus any non-permissive DSP spoils the whole bus.
+ */
+ guard(rwsem_read)(&pci_bus_sem);
+ list_for_each_entry(pdev, &bus->devices, bus_list) {
+ /* Don't understand what this is, be conservative */
+ if (!pci_is_pcie(pdev) ||
+ pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM ||
+ pdev->dma_alias_mask)
+ return PCIE_NON_ISOLATED;
+
+ if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
+ return PCIE_SWITCH_DSP_NON_ISOLATED;
+ }
+ return PCIE_ISOLATED;
+}
+
+static bool pci_has_mmio(struct pci_dev *pdev)
+{
+ unsigned int i;
+
+ for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+ struct resource *res = pci_resource_n(pdev, i);
+
+ if (resource_size(res) && resource_type(res) == IORESOURCE_MEM)
+ return true;
+ }
+ return false;
+}
+
+/**
+ * pci_bus_isolated - Determine how isolated connected devices are
+ * @bus: The bus to check
+ *
+ * Isolation is the ability of devices to talk to each other. Full isolation
+ * means that a device can only communicate with the IOMMU and can not do peer
+ * to peer within the fabric.
+ *
+ * We consider isolation on a bus by bus basis. If the bus will permit a
+ * transaction originated downstream to complete on anything other than the
+ * IOMMU then the bus is not isolated.
+ *
+ * Non-isolation includes all the downstream devices on this bus, and it may
+ * include the upstream bridge or port that is creating this bus.
+ *
+ * The various cases are returned in an enum.
+ *
+ * Broadly speaking this function evaluates the ACS settings in a PCI switch to
+ * determine if a PCI switch is configured to have full isolation.
+ *
+ * Old PCI/PCI-X busses cannot have isolation due to their physical properties,
+ * but they do have some aliasing properties that affect group creation.
+ *
+ * pci_bus_isolated() does not consider loopback internal to devices, like
+ * multi-function devices performing a self-loopback. The caller must check
+ * this separately. It does not consider aliasing within the bus.
+ *
+ * It does not currently support the ACS P2P Egress Control Vector;
+ * Linux does not yet have any way to enable this feature. EC would
+ * create subsets of the bus that are isolated from other subsets.
+ */
+enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
+{
+ struct pci_dev *bridge = bus->self;
+ int type;
+
+ /*
+ * This bus was created by pci_register_host_bridge(). The spec provides
+ * no way to tell what kind of bus this is, for PCIe we expect this to
+ * be internal to the root complex and not covered by any spec behavior.
+ * Linux has historically been optimistic about this bus and treated it
+ * as isolating. Given that the behavior of the root complex and the ACS
+ * behavior of RCiEP's is explicitly not specified we hope that the
+ * implementation is directing everything that reaches the root bus to
+ * the IOMMU.
+ */
+ if (pci_is_root_bus(bus))
+ return PCIE_ISOLATED;
+
+ /*
+ * bus->self is only NULL for SRIOV VFs, it represents a "virtual" bus
+ * within Linux to hold any bus numbers consumed by VF RIDs. Caller must
+ * use pci_physfn() to get the bus for calling this function.
+ */
+ if (WARN_ON(!bridge))
+ return PCI_BRIDGE_NON_ISOLATED;
+
+ /*
+ * The bridge is not a PCIe bridge therefore this bus is PCI/PCI-X.
+ *
+ * PCI does not have anything like ACS. Any downstream device can bus
+ * master an address that any other downstream device can claim. No
+ * isolation is possible.
+ */
+ if (!pci_is_pcie(bridge)) {
+ if (bridge->dev_flags & PCI_DEV_FLAG_PCIE_BRIDGE_ALIAS)
+ type = PCI_EXP_TYPE_PCI_BRIDGE;
+ else
+ return PCI_BRIDGE_NON_ISOLATED;
+ } else {
+ type = pci_pcie_type(bridge);
+ }
+
+ switch (type) {
+ /*
+ * Since PCIe links are point to point, root ports are isolated if there
+ * is no internal loopback to the root port's MMIO. Like MFDs, assume if
+ * there is no ACS cap then there is no loopback.
+ */
+ case PCI_EXP_TYPE_ROOT_PORT:
+ if (bridge->acs_cap &&
+ !pci_acs_enabled(bridge, PCI_ACS_ISOLATED))
+ return PCIE_NON_ISOLATED;
+ return PCIE_ISOLATED;
+
+ /*
+ * Since PCIe links are point to point, a DSP is always considered
+ * isolated. The internal bus of the switch will be non-isolated if the
+ * DSPs have any ACS that allows upstream traffic to flow back
+ * downstream to any DSP, including back to this DSP or its MMIO.
+ */
+ case PCI_EXP_TYPE_DOWNSTREAM:
+ return PCIE_ISOLATED;
+
+ /*
+ * bus is the interior bus of a PCIe switch where ACS rules apply.
+ */
+ case PCI_EXP_TYPE_UPSTREAM:
+ return pcie_switch_isolated(bus);
+
+ /*
+ * PCIe to PCI/PCI-X - this bus is PCI.
+ */
+ case PCI_EXP_TYPE_PCI_BRIDGE:
+ /*
+ * A PCI Express bridge will use the subordinate bus number
+ * with a 0 devfn as the RID in some cases. This causes all
+ * subordinate devfns to alias with 0, which is the same
+ * grouping as PCI_BUS_NON_ISOLATED. The RID of the bridge
+ * itself is only used by the bridge.
+ *
+ * However, if the bridge has MMIO then we will assume the MMIO
+ * is not isolated due to no ACS controls on this bridge type.
+ */
+ if (pci_has_mmio(bridge))
+ return PCI_BRIDGE_NON_ISOLATED;
+ return PCI_BUS_NON_ISOLATED;
+
+ /*
+ * PCI/PCI-X to PCIe - this bus is PCIe. We already know there must be a
+ * PCI bus upstream of this bus, so just return non-isolated. If
+ * upstream is PCI-X the PCIe RID should be preserved, but for PCI the
+ * RID will be lost.
+ */
+ case PCI_EXP_TYPE_PCIE_BRIDGE:
+ return PCI_BRIDGE_NON_ISOLATED;
+
+ default:
+ return PCI_BRIDGE_NON_ISOLATED;
+ }
+}
+
static struct pci_bus *pci_do_find_bus(struct pci_bus *bus, unsigned char busnr)
{
struct pci_bus *child;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 59876de13860db..c36fff9d2254f8 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -855,6 +855,32 @@ struct pci_dynids {
struct list_head list; /* For IDs added at runtime */
};
+enum pci_bus_isolation {
+ /*
+ * The bus is off a root port and the root port has isolated ACS flags
+ * or the bus is part of a PCIe switch and the switch has isolated ACS
+ * flags.
+ */
+ PCIE_ISOLATED,
+ /*
+ * The switch's DSPs are not isolated from each other but are isolated
+ * from the USP.
+ */
+ PCIE_SWITCH_DSP_NON_ISOLATED,
+ /* The above and the USP's MMIO is not isolated. */
+ PCIE_NON_ISOLATED,
+ /*
+ * A PCI/PCI-X bus, no isolation. This is like
+ * PCIE_SWITCH_DSP_NON_ISOLATED in that the upstream bridge is isolated
+ * from the bus. The bus itself may also have a shared alias of devfn=0.
+ */
+ PCI_BUS_NON_ISOLATED,
+ /*
+ * The above and the bridge's MMIO is not isolated and the bridge's RID
+ * may be an alias.
+ */
+ PCI_BRIDGE_NON_ISOLATED,
+};
/*
* PCI Error Recovery System (PCI-ERS). If a PCI device driver provides
@@ -1243,6 +1269,8 @@ struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
struct pci_dev *pci_get_base_class(unsigned int class, struct pci_dev *from);
+enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus);
+
int pci_dev_present(const struct pci_device_id *ids);
int pci_bus_read_config_byte(struct pci_bus *bus, unsigned int devfn,
@@ -2056,6 +2084,9 @@ static inline struct pci_dev *pci_get_base_class(unsigned int class,
struct pci_dev *from)
{ return NULL; }
+static inline enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
+{ return PCIE_NON_ISOLATED; }
+
static inline int pci_dev_present(const struct pci_device_id *ids)
{ return 0; }
--
2.43.0
* [PATCH v3 03/11] iommu: Compute iommu_groups properly for PCIe switches
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
2025-09-05 18:06 ` [PATCH v3 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
2025-09-05 18:06 ` [PATCH v3 02/11] PCI: Add pci_bus_isolated() Jason Gunthorpe
@ 2025-09-05 18:06 ` Jason Gunthorpe
2025-09-09 4:14 ` Donald Dutile
2025-09-09 20:27 ` Bjorn Helgaas
2025-09-05 18:06 ` [PATCH v3 04/11] iommu: Organize iommu_group by member size Jason Gunthorpe
` (9 subsequent siblings)
12 siblings, 2 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:06 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
The current algorithm does not work if ACS is turned off, and it is not
clear how this has been missed for so long. I think it has been avoided
because the kernel command line options to target specific devices and
disable ACS are rarely used.
For discussion, let's consider a simple topology like the one below:
                               -- DSP 02:00.0 -> End Point A
 Root 00:00.0 -> USP 01:00.0 --|
                               -- DSP 02:03.0 -> End Point B
If ACS is fully activated we expect 00:00.0, 01:00.0, 02:00.0, 02:03.0, A,
B to all have unique single device groups.
If both DSPs have ACS off then we expect 00:00.0 and 01:00.0 to have
unique single device groups while 02:00.0, 02:03.0, A, B are part of one
multi-device group.
If the DSPs have asymmetric ACS, with one fully isolating and one
non-isolating we also expect the above multi-device group result.
Instead the current algorithm always creates unique single device groups
for this topology. It happens because the pci_device_group(DSP)
immediately moves to the USP and computes pci_acs_path_enabled(USP) ==
true and decides the DSP can get a unique group. The pci_device_group(A)
immediately moves to the DSP, sees pci_acs_path_enabled(DSP) == false and
then takes the DSPs group.
For root-ports a PCIe topology like:
                                       -- Dev 01:00.0
 Root 00:00.00 --- Root Port 00:01.0 --|
              |                        -- Dev 01:00.1
              |- Dev 00:17.0
Previously would group [00:01.0, 01:00.0, 01:00.1] together if there is no
ACS capability in the root port.
While ACS on root ports is underspecified in the spec, it should still
function as an egress control and limit access to either the MMIO of the
root port itself, or perhaps some other devices upstream of the root
complex - 00:17.0 perhaps in this example.
Historically the grouping in Linux has assumed the root port routes all
traffic into the TA/IOMMU and never bypasses the TA to go to other
functions in the root complex. Following the new understanding that the ACS
capability is what governs internal loopback, treat root ports with no ACS
capability as lacking internal loopback as well.
The current algorithm has several issues:
1) It implicitly depends on ordering. Since the existing group discovery
only goes in the upstream direction discovering a downstream device
before its upstream will cause the wrong creation of narrower groups.
2) It assumes that if the path from the end point to the root is entirely
ACS isolated then that end point is isolated. This misses cross-traffic
in the asymmetric ACS case.
3) When evaluating a non-isolated DSP it does not check peer DSPs for an
already established group unless the multi-function feature does it.
4) It does not understand the aliasing rule for PCIe to PCI bridges
where the alias is to the subordinate bus. The bridge's RID on the
primary bus is not aliased. This causes the PCIe to PCI bridge to be
wrongly joined to the group with the downstream devices.
As grouping is a security property for VFIO creating incorrectly narrowed
groups is a security problem for the system.
Revise the design to solve these problems.
Explicitly require ordering, or return EPROBE_DEFER if things are out of
order. This avoids silent errors that created smaller groups and solves
problem #1.
Work on busses, not devices. Isolation is a property of the bus, and the
first non-isolated bus should form a group containing all devices
downstream of that bus. If all busses on the path to an end device are
isolated then the end device has a chance to make a single-device group.
Use pci_bus_isolated() to compute the bus's isolation status based on the
ACS flags and technology. pci_bus_isolated() touches a lot of PCI
internals to get the information in the right format.
Add a new flag in the iommu_group to record that the group contains a
non-isolated bus. Any downstream pci_device_group() will see that
bus->self->iommu_group is non-isolated and unconditionally join it. This
makes the first non-isolation apply to all downstream devices and solves
problem #2.
The bus's non-isolated iommu_group will be stored in either the DSP of a
PCIe switch or the bus->self upstream device, depending on the situation.
When storing in the DSP, all the DSPs are checked first for a pre-existing
non-isolated iommu_group. When storing in the upstream, the flag forces it
onto all downstreams. This solves problem #3.
Put the handling of end-device aliases and MFD into pci_get_alias_group()
and only call it in cases where we have a fully isolated path. Otherwise
every downstream device on the bus is going to be joined to the group of
bus->self.
Finally, replace the initial pci_for_each_dma_alias() with a combination
of:
- Directly checking pci_real_dma_dev() and enforcing ordering.
The group should contain both pdev and pci_real_dma_dev(pdev) which is
only possible if pdev is ordered after real_dma_dev. This solves a case
of #1.
- Indirectly relying on pci_bus_isolated() to report legacy PCI busses
as non-isolated, with the enum including the distinction of the PCIe to
PCI bridge being isolated from the downstream. This solves problem #4.
It is very likely this is going to expand iommu_group membership on
existing systems; after all, that is the security bug being fixed.
Expanding the iommu_groups risks problems for users using VFIO. The
intention is to have a more accurate reflection of the security properties
of the system, and this should be seen as a security fix. However, people
who have ACS disabled may now need to enable it. As such users may have
had good reason to disable ACS, I strongly recommend that backports of
this also include the new config_acs option so that such users can
minimally enable ACS only where needed.
Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface")
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommu.c | 279 ++++++++++++++++++++++++++++++++----------
include/linux/pci.h | 3 +
2 files changed, 217 insertions(+), 65 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2a47ddb01799c1..1874bbdc73b75e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -65,8 +65,16 @@ struct iommu_group {
struct list_head entry;
unsigned int owner_cnt;
void *owner;
+
+ /* Used by the device_group() callbacks */
+ u32 bus_data;
};
+/*
+ * Everything downstream of this group should share it.
+ */
+#define BUS_DATA_PCI_NON_ISOLATED BIT(0)
+
struct group_device {
struct list_head list;
struct device *dev;
@@ -1484,25 +1492,6 @@ static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
return NULL;
}
-struct group_for_pci_data {
- struct pci_dev *pdev;
- struct iommu_group *group;
-};
-
-/*
- * DMA alias iterator callback, return the last seen device. Stop and return
- * the IOMMU group if we find one along the way.
- */
-static int get_pci_alias_or_group(struct pci_dev *pdev, u16 alias, void *opaque)
-{
- struct group_for_pci_data *data = opaque;
-
- data->pdev = pdev;
- data->group = iommu_group_get(&pdev->dev);
-
- return data->group != NULL;
-}
-
/*
* Generic device_group call-back function. It just allocates one
* iommu-group per device.
@@ -1534,57 +1523,31 @@ struct iommu_group *generic_single_device_group(struct device *dev)
}
EXPORT_SYMBOL_GPL(generic_single_device_group);
-/*
- * Use standard PCI bus topology, isolation features, and DMA alias quirks
- * to find or create an IOMMU group for a device.
- */
-struct iommu_group *pci_device_group(struct device *dev)
+static struct iommu_group *pci_group_alloc_non_isolated(void)
{
- struct pci_dev *pdev = to_pci_dev(dev);
- struct group_for_pci_data data;
- struct pci_bus *bus;
- struct iommu_group *group = NULL;
- u64 devfns[4] = { 0 };
+ struct iommu_group *group;
- if (WARN_ON(!dev_is_pci(dev)))
- return ERR_PTR(-EINVAL);
+ group = iommu_group_alloc();
+ if (IS_ERR(group))
+ return group;
+ group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
+ return group;
+}
- /*
- * Find the upstream DMA alias for the device. A device must not
- * be aliased due to topology in order to have its own IOMMU group.
- * If we find an alias along the way that already belongs to a
- * group, use it.
- */
- if (pci_for_each_dma_alias(pdev, get_pci_alias_or_group, &data))
- return data.group;
-
- pdev = data.pdev;
-
- /*
- * Continue upstream from the point of minimum IOMMU granularity
- * due to aliases to the point where devices are protected from
- * peer-to-peer DMA by PCI ACS. Again, if we find an existing
- * group, use it.
- */
- for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
- if (!bus->self)
- continue;
-
- if (pci_acs_path_enabled(bus->self, NULL, PCI_ACS_ISOLATED))
- break;
-
- pdev = bus->self;
-
- group = iommu_group_get(&pdev->dev);
- if (group)
- return group;
- }
+/*
+ * Return a group if the function has isolation restrictions related to
+ * aliases or MFD ACS.
+ */
+static struct iommu_group *pci_get_function_group(struct pci_dev *pdev)
+{
+ struct iommu_group *group;
+ DECLARE_BITMAP(devfns, 256) = {};
/*
* Look for existing groups on device aliases. If we alias another
* device or another device aliases us, use the same group.
*/
- group = get_pci_alias_group(pdev, (unsigned long *)devfns);
+ group = get_pci_alias_group(pdev, devfns);
if (group)
return group;
@@ -1593,12 +1556,198 @@ struct iommu_group *pci_device_group(struct device *dev)
* slot and aliases of those funcions, if any. No need to clear
* the search bitmap, the tested devfns are still valid.
*/
- group = get_pci_function_alias_group(pdev, (unsigned long *)devfns);
+ group = get_pci_function_alias_group(pdev, devfns);
if (group)
return group;
- /* No shared group found, allocate new */
- return iommu_group_alloc();
+ /*
+ * When MFD's are included in the set due to ACS we assume that if ACS
+ * permits an internal loopback between functions it also permits the
+ * loopback to go downstream if a function is a bridge.
+ *
+ * It is less clear what aliases mean when applied to a bridge. For now
+ * be conservative and also propagate the group downstream.
+ */
+ __clear_bit(pdev->devfn & 0xFF, devfns);
+ if (!bitmap_empty(devfns, sizeof(devfns) * BITS_PER_BYTE))
+ return pci_group_alloc_non_isolated();
+ return NULL;
+}
+
+/* Return a group if the upstream hierarchy has isolation restrictions. */
+static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
+{
+ /*
+ * SRIOV functions may reside on a virtual bus, jump directly to the PFs
+ * bus in all cases.
+ */
+ struct pci_bus *bus = pci_physfn(pdev)->bus;
+ struct iommu_group *group;
+
+ /* Nothing upstream of this */
+ if (pci_is_root_bus(bus))
+ return NULL;
+
+ /*
+ * !self is only for SRIOV virtual busses which should have been
+ * excluded by pci_physfn()
+ */
+ if (WARN_ON(!bus->self))
+ return ERR_PTR(-EINVAL);
+
+ group = iommu_group_get(&bus->self->dev);
+ if (!group) {
+ /*
+ * If the upstream bridge needs the same group as pdev then
+ * there is no way for its pci_device_group() to discover it.
+ */
+ dev_err(&pdev->dev,
+ "PCI device is probing out of order, upstream bridge device of %s is not probed yet\n",
+ pci_name(bus->self));
+ return ERR_PTR(-EPROBE_DEFER);
+ }
+ if (group->bus_data & BUS_DATA_PCI_NON_ISOLATED)
+ return group;
+ iommu_group_put(group);
+ return NULL;
+}
+
+/*
+ * For legacy PCI we have two main considerations when forming groups:
+ *
+ * 1) In PCI we can lose the RID inside the fabric, or some devices will use
+ * the wrong RID. The PCI core calls this aliasing, but from an IOMMU
+ * perspective it means that a PCI device may have multiple RIDs and a
+ * single RID may represent many PCI devices. This effectively means all the
+ * aliases must share a translation, thus group, because the IOMMU cannot
+ * tell devices apart.
+ *
+ * 2) PCI permits a bus segment to claim an address even if the transaction
+ * originates from an end point not the CPU. When it happens it is called
+ * peer to peer. Claiming a transaction in the middle of the bus hierarchy
+ * bypasses the IOMMU translation. The IOMMU subsystem rules require these
+ * devices to be placed in the same group because they lack isolation from
+ * each other. In PCI Express the ACS system can be used to inhibit this and
+ * force transactions to go to the IOMMU.
+ *
+ * From a PCI perspective any given PCI bus is either isolating or
+ * non-isolating. Isolating means downstream originated transactions always
+ * progress toward the CPU and do not go to other devices on the bus
+ * segment, while non-isolating means downstream originated transactions can
+ * progress back downstream through another device on the bus segment.
+ *
+ * Beyond buses a multi-function device or bridge can also allow
+ * transactions to loop back internally from one function to another.
+ *
+ * Once a PCI bus becomes non-isolating the entire downstream hierarchy of
+ * that bus becomes a single group.
+ */
+struct iommu_group *pci_device_group(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct iommu_group *group;
+ struct pci_dev *real_pdev;
+
+ if (WARN_ON(!dev_is_pci(dev)))
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * Arches can supply a completely different PCI device that actually
+ * does DMA.
+ */
+ real_pdev = pci_real_dma_dev(pdev);
+ if (real_pdev != pdev) {
+ group = iommu_group_get(&real_pdev->dev);
+ if (!group) {
+ /*
+ * The real_pdev has not had an iommu probed to it. We
+ * can't create a new group here because there is no way
+ * for pci_device_group(real_pdev) to pick it up.
+ */
+ dev_err(dev,
+ "PCI device is probing out of order, real device of %s is not probed yet\n",
+ pci_name(real_pdev));
+ return ERR_PTR(-EPROBE_DEFER);
+ }
+ return group;
+ }
+
+ if (pdev->dev_flags & PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT)
+ return iommu_group_alloc();
+
+ /* Anything upstream of this enforcing non-isolated? */
+ group = pci_hierarchy_group(pdev);
+ if (group)
+ return group;
+
+ switch (pci_bus_isolated(pci_physfn(pdev)->bus)) {
+ case PCIE_ISOLATED:
+ /* Check multi-function groups and same-bus devfn aliases */
+ group = pci_get_function_group(pdev);
+ if (group)
+ return group;
+
+ /* No shared group found, allocate new */
+ return iommu_group_alloc();
+
+ /*
+ * On legacy PCI there is no RID at an electrical level. On PCI-X the
+ * RID of the bridge may be used in some cases instead of the
+ * downstream's RID. This creates aliasing problems. PCI/PCI-X doesn't
+ * provide isolation either. The end result is that as soon as we hit a
+ * PCI/PCI-X bus we switch to non-isolated for the whole downstream for
+ * both aliasing and isolation reasons. The bridge has to be included in
+ * the group because of the aliasing.
+ */
+ case PCI_BRIDGE_NON_ISOLATED:
+ /* A PCIe switch where the USP has MMIO and is not isolated. */
+ case PCIE_NON_ISOLATED:
+ group = iommu_group_get(&pdev->bus->self->dev);
+ if (WARN_ON(!group))
+ return ERR_PTR(-EINVAL);
+ /*
+ * No need to be concerned with aliases here since we are going
+ * to put the entire downstream tree in the bridge/USP's group.
+ */
+ group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
+ return group;
+
+ /*
+ * It is a PCI bus and the upstream bridge/port does not alias or allow
+ * P2P.
+ */
+ case PCI_BUS_NON_ISOLATED:
+ /*
+ * It is a PCIe switch and the DSP cannot reach the USP. The DSP's
+ * are not isolated from each other and share a group.
+ */
+ case PCIE_SWITCH_DSP_NON_ISOLATED: {
+ struct pci_dev *piter = NULL;
+
+ /*
+ * All the downstream devices on the bus share a group. If this
+ * is a PCIe switch then they will all be DSPs
+ */
+ for_each_pci_dev(piter) {
+ if (piter->bus != pdev->bus)
+ continue;
+ group = iommu_group_get(&piter->dev);
+ if (group) {
+ pci_dev_put(piter);
+ if (WARN_ON(!(group->bus_data &
+ BUS_DATA_PCI_NON_ISOLATED)))
+ group->bus_data |=
+ BUS_DATA_PCI_NON_ISOLATED;
+ return group;
+ }
+ }
+ return pci_group_alloc_non_isolated();
+ }
+ default:
+ break;
+ }
+ WARN_ON(true);
+ return ERR_PTR(-EINVAL);
}
EXPORT_SYMBOL_GPL(pci_device_group);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index c36fff9d2254f8..fb9adf0562f8ef 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2093,6 +2093,9 @@ static inline int pci_dev_present(const struct pci_device_id *ids)
#define no_pci_devices() (1)
#define pci_dev_put(dev) do { } while (0)
+static inline struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
+{ return dev; }
+
static inline void pci_set_master(struct pci_dev *dev) { }
static inline void pci_clear_master(struct pci_dev *dev) { }
static inline int pci_enable_device(struct pci_dev *dev) { return -EIO; }
--
2.43.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH v3 04/11] iommu: Organize iommu_group by member size
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (2 preceding siblings ...)
2025-09-05 18:06 ` [PATCH v3 03/11] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
@ 2025-09-05 18:06 ` Jason Gunthorpe
2025-09-09 4:16 ` Donald Dutile
2025-09-05 18:06 ` [PATCH v3 05/11] PCI: Add pci_reachable_set() Jason Gunthorpe
` (8 subsequent siblings)
12 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:06 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
To avoid some internal padding.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1874bbdc73b75e..543d6347c0e5e3 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -58,13 +58,13 @@ struct iommu_group {
void *iommu_data;
void (*iommu_data_release)(void *iommu_data);
char *name;
- int id;
struct iommu_domain *default_domain;
struct iommu_domain *blocking_domain;
struct iommu_domain *domain;
struct list_head entry;
- unsigned int owner_cnt;
void *owner;
+ unsigned int owner_cnt;
+ int id;
/* Used by the device_group() callbacks */
u32 bus_data;
--
2.43.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH v3 05/11] PCI: Add pci_reachable_set()
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (3 preceding siblings ...)
2025-09-05 18:06 ` [PATCH v3 04/11] iommu: Organize iommu_group by member size Jason Gunthorpe
@ 2025-09-05 18:06 ` Jason Gunthorpe
2025-09-09 21:03 ` Bjorn Helgaas
2025-09-05 18:06 ` [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs Jason Gunthorpe
` (7 subsequent siblings)
12 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:06 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
Implement pci_reachable_set() to efficiently compute a set of devices on
the same bus that are "reachable" from a starting device. The meaning of
reachability is defined by the caller through a callback function.
This is a faster implementation of the same logic in
pci_device_group(). Being inside the PCI core allows use of pci_bus_sem so
it can use list_for_each_entry() on a small list of devices instead of the
expensive for_each_pci_dev(). Server systems can now have hundreds of PCI
devices, but typically only a very small number of devices per bus.
An example of a reachability function would be pci_devs_are_dma_aliases()
which would compute a set of devices on the same bus that are
aliases. This would also be useful in future support for the ACS P2P
Egress Vector which has a similar reachability problem.
This is effectively a graph algorithm where the devices on the bus are
vertices and the reachable() function defines the edges. It returns the set
of vertices that forms the connected component containing the start device.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/search.c | 90 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/pci.h | 12 ++++++
2 files changed, 102 insertions(+)
diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index fe6c07e67cb8ce..dac6b042fd5f5d 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -595,3 +595,93 @@ int pci_dev_present(const struct pci_device_id *ids)
return 0;
}
EXPORT_SYMBOL(pci_dev_present);
+
+/**
+ * pci_reachable_set - Generate a bitmap of devices within a reachability set
+ * @start: First device in the set
+ * @devfns: The set of devices on the bus
+ * @reachable: Callback to tell if two devices can reach each other
+ *
+ * Compute a bitmap where every set bit is a device on the bus that is reachable
+ * from the start device, including the start device. Reachability between two
+ * devices is determined by a callback function.
+ *
+ * This is a non-recursive implementation that invokes the callback once per
+ * pair. The callback must be commutative:
+ * reachable(a, b) == reachable(b, a)
+ * reachable() can form a cyclic graph:
+ * reachable(a,b) == reachable(b,c) == reachable(c,a) == true
+ *
+ * Since this function is limited to a single bus, the set can contain at
+ * most 256 devices (one per possible devfn).
+ */
+void pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
+ bool (*reachable)(struct pci_dev *deva,
+ struct pci_dev *devb))
+{
+ struct pci_reachable_set todo_devfns = {};
+ struct pci_reachable_set next_devfns = {};
+ struct pci_bus *bus = start->bus;
+ bool again;
+
+ /* Assume devfn of all PCI devices is bounded by MAX_NR_DEVFNS */
+ static_assert(sizeof(next_devfns.devfns) * BITS_PER_BYTE >=
+ MAX_NR_DEVFNS);
+
+ memset(devfns, 0, sizeof(devfns->devfns));
+ __set_bit(start->devfn, devfns->devfns);
+ __set_bit(start->devfn, next_devfns.devfns);
+
+ down_read(&pci_bus_sem);
+ while (true) {
+ unsigned int devfna;
+ unsigned int i;
+
+ /*
+ * For each device that hasn't been checked compare every
+ * device on the bus against it.
+ */
+ again = false;
+ for_each_set_bit(devfna, next_devfns.devfns, MAX_NR_DEVFNS) {
+ struct pci_dev *deva = NULL;
+ struct pci_dev *devb;
+
+ list_for_each_entry(devb, &bus->devices, bus_list) {
+ if (devb->devfn == devfna)
+ deva = devb;
+
+ if (test_bit(devb->devfn, devfns->devfns))
+ continue;
+
+ if (!deva) {
+ deva = devb;
+ list_for_each_entry_continue(
+ deva, &bus->devices, bus_list)
+ if (deva->devfn == devfna)
+ break;
+ }
+
+ if (!reachable(deva, devb))
+ continue;
+
+ __set_bit(devb->devfn, todo_devfns.devfns);
+ again = true;
+ }
+ }
+
+ if (!again)
+ break;
+
+ /*
+ * Every new bit adds a new deva to check, reloop the whole
+ * thing. Expect this to be rare.
+ */
+ for (i = 0; i != ARRAY_SIZE(devfns->devfns); i++) {
+ devfns->devfns[i] |= todo_devfns.devfns[i];
+ next_devfns.devfns[i] = todo_devfns.devfns[i];
+ todo_devfns.devfns[i] = 0;
+ }
+ }
+ up_read(&pci_bus_sem);
+}
+EXPORT_SYMBOL_GPL(pci_reachable_set);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index fb9adf0562f8ef..21f6b20b487f8d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -855,6 +855,10 @@ struct pci_dynids {
struct list_head list; /* For IDs added at runtime */
};
+struct pci_reachable_set {
+ DECLARE_BITMAP(devfns, 256);
+};
+
enum pci_bus_isolation {
/*
* The bus is off a root port and the root port has isolated ACS flags
@@ -1269,6 +1273,9 @@ struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
struct pci_dev *pci_get_base_class(unsigned int class, struct pci_dev *from);
+void pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
+ bool (*reachable)(struct pci_dev *deva,
+ struct pci_dev *devb));
enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus);
int pci_dev_present(const struct pci_device_id *ids);
@@ -2084,6 +2091,11 @@ static inline struct pci_dev *pci_get_base_class(unsigned int class,
struct pci_dev *from)
{ return NULL; }
+static inline void
+pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
+ bool (*reachable)(struct pci_dev *deva, struct pci_dev *devb))
+{ }
+
static inline enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
{ return PCIE_NON_ISOLATED; }
--
2.43.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (4 preceding siblings ...)
2025-09-05 18:06 ` [PATCH v3 05/11] PCI: Add pci_reachable_set() Jason Gunthorpe
@ 2025-09-05 18:06 ` Jason Gunthorpe
2025-09-09 4:57 ` Donald Dutile
2025-09-09 21:24 ` Bjorn Helgaas
2025-09-05 18:06 ` [PATCH v3 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
` (6 subsequent siblings)
12 siblings, 2 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:06 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
Like with switches the current MFD algorithm does not consider asymmetric
ACS within a MFD. If any MFD function has ACS that permits P2P the spec
says it can reach through the MFD internal loopback any other function in
the device.
For discussion let's consider a simple MFD topology like the below:
-- MFD 00:1f.0 ACS != REQ_ACS_FLAGS
Root 00:00.00 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS
|- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
This asymmetric ACS could be created using the config_acs kernel command
line parameter, from quirks, or from a poorly thought out device that has
ACS flags only on some functions.
Since ACS is an egress property the asymmetric flags allow for 00:1f.0 to
do memory accesses into 00:1f.6's BARs, but 00:1f.6 cannot reach any other
function. Thus we expect an iommu_group to contain all three
devices. Instead the current algorithm gives a group of [1f.0, 1f.2] and a
single device group of 1f.6.
The current algorithm sees the good ACS flags on 00:1f.6 and does not
consider ACS on any other MFD functions.
For path properties the ACS flags say that 00:1f.6 is safe to use with
PASID and supports SVA as it will not have any portions of its address
space routed away from the IOMMU, this part of the ACS system is working
correctly.
Further, if one of the MFD functions is a bridge, eg like 1f.2:
-- MFD 00:1f.0
Root 00:00.00 --|- MFD 00:1f.2 Root Port --- 01:01.0
|- MFD 00:1f.6
Then the correct grouping will include 01:01.0, 00:1f.0/2/6 together in a
group if there is any internal loopback within the MFD 00:1f. The current
algorithm does not understand this and gives 01:01.0 its own group even
if it thinks there is an internal loopback in the MFD.
Unfortunately this detail makes it hard to fix. Currently the code assumes
that any MFD without an ACS cap has an internal loopback which will cause
a large number of modern real systems to group in a pessimistic way.
However, the PCI spec does not really support this:
PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and Multi-Function
Devices
ACS P2P Request Redirect: must be implemented by Functions that
support peer-to-peer traffic with other Functions.
Meaning from a spec perspective the absence of ACS indicates the absence
of internal loopback. Granted I think we are aware of older real devices
that ignore this, but it seems to be the only way forward.
So, rely on 6.12.1.2 and assume functions without ACS do not have internal
loopback. This resolves the common issue with modern systems and MFD root
ports, but it makes the ACS quirks system less used. Instead we'd want
quirks that say self-loopback is actually present, not like today's quirks
that say it is absent. This is surely negative for older hardware, but
positive for new HW that complies with the spec.
Use pci_reachable_set() in pci_device_group() to make the resulting
algorithm faster and easier to understand.
Add pci_mfds_are_same_group() which looks pair-wise at all functions in the
MFD. Any function with an ACS capability and non-isolating ACS flags is
considered reachable from all other functions.
pci_reachable_set() does the calculations for figuring out the set of
devices under the pci_bus_sem, which is better than repeatedly searching
across all PCI devices.
Once the set of devices is determined and the set has more than one device
use pci_get_slot() to search for any existing groups in the reachable set.
Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommu.c | 189 +++++++++++++++++++-----------------------
1 file changed, 87 insertions(+), 102 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 543d6347c0e5e3..fc3c71b243a850 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1413,85 +1413,6 @@ int iommu_group_id(struct iommu_group *group)
}
EXPORT_SYMBOL_GPL(iommu_group_id);
-static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
- unsigned long *devfns);
-
-/*
- * For multifunction devices which are not isolated from each other, find
- * all the other non-isolated functions and look for existing groups. For
- * each function, we also need to look for aliases to or from other devices
- * that may already have a group.
- */
-static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
- unsigned long *devfns)
-{
- struct pci_dev *tmp = NULL;
- struct iommu_group *group;
-
- if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
- return NULL;
-
- for_each_pci_dev(tmp) {
- if (tmp == pdev || tmp->bus != pdev->bus ||
- PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) ||
- pci_acs_enabled(tmp, PCI_ACS_ISOLATED))
- continue;
-
- group = get_pci_alias_group(tmp, devfns);
- if (group) {
- pci_dev_put(tmp);
- return group;
- }
- }
-
- return NULL;
-}
-
-/*
- * Look for aliases to or from the given device for existing groups. DMA
- * aliases are only supported on the same bus, therefore the search
- * space is quite small (especially since we're really only looking at pcie
- * device, and therefore only expect multiple slots on the root complex or
- * downstream switch ports). It's conceivable though that a pair of
- * multifunction devices could have aliases between them that would cause a
- * loop. To prevent this, we use a bitmap to track where we've been.
- */
-static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
- unsigned long *devfns)
-{
- struct pci_dev *tmp = NULL;
- struct iommu_group *group;
-
- if (test_and_set_bit(pdev->devfn & 0xff, devfns))
- return NULL;
-
- group = iommu_group_get(&pdev->dev);
- if (group)
- return group;
-
- for_each_pci_dev(tmp) {
- if (tmp == pdev || tmp->bus != pdev->bus)
- continue;
-
- /* We alias them or they alias us */
- if (pci_devs_are_dma_aliases(pdev, tmp)) {
- group = get_pci_alias_group(tmp, devfns);
- if (group) {
- pci_dev_put(tmp);
- return group;
- }
-
- group = get_pci_function_alias_group(tmp, devfns);
- if (group) {
- pci_dev_put(tmp);
- return group;
- }
- }
- }
-
- return NULL;
-}
-
/*
* Generic device_group call-back function. It just allocates one
* iommu-group per device.
@@ -1534,44 +1455,108 @@ static struct iommu_group *pci_group_alloc_non_isolated(void)
return group;
}
+/*
+ * All functions in the MFD need to be isolated from each other and get their
+ * own groups, otherwise the whole MFD will share a group.
+ */
+static bool pci_mfds_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
+{
+ /*
+ * SRIOV VFs will use the group of the PF if it has
+ * BUS_DATA_PCI_NON_ISOLATED. We don't support VFs that also have ACS
+ * that are set to non-isolating.
+ */
+ if (deva->is_virtfn || devb->is_virtfn)
+ return false;
+
+ /* Are deva/devb functions in the same MFD? */
+ if (PCI_SLOT(deva->devfn) != PCI_SLOT(devb->devfn))
+ return false;
+ /* Don't understand what is happening, be conservative */
+ if (deva->multifunction != devb->multifunction)
+ return true;
+ if (!deva->multifunction)
+ return false;
+
+ /*
+ * PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and
+ * Multi-Function Devices
+ * ...
+ * ACS P2P Request Redirect: must be implemented by Functions that
+ * support peer-to-peer traffic with other Functions.
+ *
+ * Therefore assume if a MFD has no ACS capability then it does not
+ * support a loopback. This is the reverse of what Linux <= v6.16
+ * assumed - that any MFD was capable of P2P and used quirks to identify
+ * devices that complied with the above.
+ */
+ if (deva->acs_cap && !pci_acs_enabled(deva, PCI_ACS_ISOLATED))
+ return true;
+ if (devb->acs_cap && !pci_acs_enabled(devb, PCI_ACS_ISOLATED))
+ return true;
+ return false;
+}
+
+static bool pci_devs_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
+{
+ /*
+ * This is allowed to return cycles: a,b -> b,c -> c,a can be aliases.
+ */
+ if (pci_devs_are_dma_aliases(deva, devb))
+ return true;
+
+ return pci_mfds_are_same_group(deva, devb);
+}
+
/*
* Return a group if the function has isolation restrictions related to
* aliases or MFD ACS.
*/
static struct iommu_group *pci_get_function_group(struct pci_dev *pdev)
{
- struct iommu_group *group;
- DECLARE_BITMAP(devfns, 256) = {};
+ struct pci_reachable_set devfns;
+ const unsigned int NR_DEVFNS = sizeof(devfns.devfns) * BITS_PER_BYTE;
+ unsigned int devfn;
/*
- * Look for existing groups on device aliases. If we alias another
- * device or another device aliases us, use the same group.
+ * Look for existing groups on device aliases and multi-function ACS. If
+ * we alias another device or another device aliases us, use the same
+ * group.
+ *
+ * pci_reachable_set() should return the same bitmap if called for any
+ * device in the set and we want all devices in the set to have the same
+ * group.
*/
- group = get_pci_alias_group(pdev, devfns);
- if (group)
- return group;
+ pci_reachable_set(pdev, &devfns, pci_devs_are_same_group);
+ /* start is known to have iommu_group_get() == NULL */
+ __clear_bit(pdev->devfn, devfns.devfns);
/*
- * Look for existing groups on non-isolated functions on the same
- * slot and aliases of those funcions, if any. No need to clear
- * the search bitmap, the tested devfns are still valid.
- */
- group = get_pci_function_alias_group(pdev, devfns);
- if (group)
- return group;
-
- /*
- * When MFD's are included in the set due to ACS we assume that if ACS
- * permits an internal loopback between functions it also permits the
- * loopback to go downstream if a function is a bridge.
+ * When MFD functions are included in the set due to ACS we assume that
+ * if ACS permits an internal loopback between functions it also permits
+ * the loopback to go downstream if any function is a bridge.
*
* It is less clear what aliases mean when applied to a bridge. For now
* be conservative and also propagate the group downstream.
*/
- __clear_bit(pdev->devfn & 0xFF, devfns);
- if (!bitmap_empty(devfns, sizeof(devfns) * BITS_PER_BYTE))
- return pci_group_alloc_non_isolated();
- return NULL;
+ if (bitmap_empty(devfns.devfns, NR_DEVFNS))
+ return NULL;
+
+ for_each_set_bit(devfn, devfns.devfns, NR_DEVFNS) {
+ struct iommu_group *group;
+ struct pci_dev *pdev_slot;
+
+ pdev_slot = pci_get_slot(pdev->bus, devfn);
+ group = iommu_group_get(&pdev_slot->dev);
+ pci_dev_put(pdev_slot);
+ if (group) {
+ if (WARN_ON(!(group->bus_data &
+ BUS_DATA_PCI_NON_ISOLATED)))
+ group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
+ return group;
+ }
+ }
+ return pci_group_alloc_non_isolated();
}
/* Return a group if the upstream hierarchy has isolation restrictions. */
--
2.43.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH v3 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (5 preceding siblings ...)
2025-09-05 18:06 ` [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs Jason Gunthorpe
@ 2025-09-05 18:06 ` Jason Gunthorpe
2025-09-09 5:00 ` Donald Dutile
2025-09-05 18:06 ` [PATCH v3 08/11] PCI: Add the ACS Enhanced Capability definitions Jason Gunthorpe
` (5 subsequent siblings)
12 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:06 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
Directly check that the devices touched by pci_for_each_dma_alias() match
the groups that were built by pci_device_group(). This helps validate that
pci_for_each_dma_alias() and pci_bus_isolated() are consistent.
This should eventually be hidden behind a debug kconfig, but for now it is
good to get feedback from more diverse systems if there are any problems.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommu.c | 76 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 75 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index fc3c71b243a850..2bd43a5a9ad8d8 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1627,7 +1627,7 @@ static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
* Once a PCI bus becomes non isolating the entire downstream hierarchy of
* that bus becomes a single group.
*/
-struct iommu_group *pci_device_group(struct device *dev)
+static struct iommu_group *__pci_device_group(struct device *dev)
{
struct pci_dev *pdev = to_pci_dev(dev);
struct iommu_group *group;
@@ -1734,6 +1734,80 @@ struct iommu_group *pci_device_group(struct device *dev)
WARN_ON(true);
return ERR_PTR(-EINVAL);
}
+
+struct check_group_aliases_data {
+ struct pci_dev *pdev;
+ struct iommu_group *group;
+};
+
+static void pci_check_group(const struct check_group_aliases_data *data,
+ u16 alias, struct pci_dev *pdev)
+{
+ struct iommu_group *group;
+
+ group = iommu_group_get(&pdev->dev);
+ if (!group)
+ return;
+
+ if (group != data->group)
+ dev_err(&data->pdev->dev,
+ "During group construction alias processing needed dev %s alias %x to have the same group but %u != %u\n",
+ pci_name(pdev), alias, data->group->id, group->id);
+ iommu_group_put(group);
+}
+
+static int pci_check_group_aliases(struct pci_dev *pdev, u16 alias,
+ void *opaque)
+{
+ const struct check_group_aliases_data *data = opaque;
+
+ /*
+ * Sometimes when a PCIe-PCI bridge is performing transactions on behalf
+ * of its subordinate bus it uses devfn=0 on the subordinate bus as the
+ * alias. This means that 0 will alias with all devfns on the
+ * subordinate bus and so we expect to see those in the same group. pdev
+ * in this case is the bridge itself and pdev->bus is the primary bus of
+ * the bridge.
+ */
+ if (pdev->bus->number != PCI_BUS_NUM(alias)) {
+ struct pci_dev *piter = NULL;
+
+ for_each_pci_dev(piter) {
+ if (pci_domain_nr(pdev->bus) ==
+ pci_domain_nr(piter->bus) &&
+ PCI_BUS_NUM(alias) == piter->bus->number)
+ pci_check_group(data, alias, piter);
+ }
+ } else {
+ pci_check_group(data, alias, pdev);
+ }
+
+ return 0;
+}
+
+struct iommu_group *pci_device_group(struct device *dev)
+{
+ struct check_group_aliases_data data = {
+ .pdev = to_pci_dev(dev),
+ };
+ struct iommu_group *group;
+
+ if (!IS_ENABLED(CONFIG_PCI))
+ return ERR_PTR(-EINVAL);
+
+ group = __pci_device_group(dev);
+ if (IS_ERR(group))
+ return group;
+
+ /*
+ * The IOMMU driver should use pci_for_each_dma_alias() to figure out
+ * what RIDs to program and the core requires all the RIDs to fall
+ * within the same group. Validate that everything worked properly.
+ */
+ data.group = group;
+ pci_for_each_dma_alias(data.pdev, pci_check_group_aliases, &data);
+ return group;
+}
EXPORT_SYMBOL_GPL(pci_device_group);
/* Get the IOMMU group for device on fsl-mc bus */
--
2.43.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH v3 08/11] PCI: Add the ACS Enhanced Capability definitions
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (6 preceding siblings ...)
2025-09-05 18:06 ` [PATCH v3 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
@ 2025-09-05 18:06 ` Jason Gunthorpe
2025-09-09 5:01 ` Donald Dutile
2025-09-05 18:06 ` [PATCH v3 09/11] PCI: Enable ACS Enhanced bits for enable_acs and config_acs Jason Gunthorpe
` (4 subsequent siblings)
12 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:06 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
This brings the definitions up to PCI Express revision 5.0:
* ACS I/O Request Blocking Enable
* ACS DSP Memory Target Access Control
* ACS USP Memory Target Access Control
* ACS Unclaimed Request Redirect
Support for this entire grouping is advertised by the ACS Enhanced
Capability bit.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
include/uapi/linux/pci_regs.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 6095e7d7d4cc48..54621e6e83572e 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1005,8 +1005,16 @@
#define PCI_ACS_UF 0x0010 /* Upstream Forwarding */
#define PCI_ACS_EC 0x0020 /* P2P Egress Control */
#define PCI_ACS_DT 0x0040 /* Direct Translated P2P */
+#define PCI_ACS_ENHANCED 0x0080 /* IORB, DSP_MT_xx, USP_MT_xx. Capability only */
+#define PCI_ACS_EGRESS_CTL_SZ GENMASK(15, 8) /* Egress Control Vector Size */
#define PCI_ACS_EGRESS_BITS 0x05 /* ACS Egress Control Vector Size */
#define PCI_ACS_CTRL 0x06 /* ACS Control Register */
+#define PCI_ACS_IORB 0x0080 /* I/O Request Blocking */
+#define PCI_ACS_DSP_MT_RB 0x0100 /* DSP Memory Target Access Control Request Blocking */
+#define PCI_ACS_DSP_MT_RR 0x0200 /* DSP Memory Target Access Control Request Redirect */
+#define PCI_ACS_USP_MT_RB 0x0400 /* USP Memory Target Access Control Request Blocking */
+#define PCI_ACS_USP_MT_RR 0x0800 /* USP Memory Target Access Control Request Redirect */
+#define PCI_ACS_UNCLAIMED_RR 0x1000 /* Unclaimed Request Redirect Control */
#define PCI_ACS_EGRESS_CTL_V 0x08 /* ACS Egress Control Vector */
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH v3 09/11] PCI: Enable ACS Enhanced bits for enable_acs and config_acs
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (7 preceding siblings ...)
2025-09-05 18:06 ` [PATCH v3 08/11] PCI: Add the ACS Enhanced Capability definitions Jason Gunthorpe
@ 2025-09-05 18:06 ` Jason Gunthorpe
2025-09-09 5:01 ` Donald Dutile
2025-09-05 18:06 ` [PATCH v3 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
` (3 subsequent siblings)
12 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:06 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
The ACS Enhanced bits are intended to address a lack of precision in the
spec about what ACS P2P Request Redirect is supposed to do. While Linux
has long assumed that PCI_ACS_RR would cover MMIO BARs located in the root
port and PCIe Switch ports, the spec took the position that it is
implementation specific.
To get the behavior Linux has long assumed it should be setting:
PCI_ACS_RR | PCI_ACS_DSP_MT_RR | PCI_ACS_USP_MT_RR | PCI_ACS_UNCLAIMED_RR
Follow this guidance in enable_acs and set the additional bits if ACS
Enhanced is supported.
Allow config_acs to control these bits if the device has ACS Enhanced.
The spec permits the HW to wire the bits, so after setting them
pci_acs_flags_enabled() does a pci_read_config_word() to read back the
actual value in effect.
Note that Linux currently sets these bits to 0, so any new HW supporting
ACS Enhanced will find historical Linux kernels disabling these
functions. Devices wanting to be compatible with old Linux will need to
wire the ctrl bits to follow ACS_RR. Devices that implement ACS Enhanced
and honor the ctrl=0 behavior will break PASID SVA support and VFIO
isolation when ACS_RR is enabled.
Due to the above, I strongly encourage backporting this change;
otherwise old kernels may have issues with new generations of PCI
switches.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/pci.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b0f4d98036cddd..983f71211f0055 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -957,6 +957,7 @@ static void __pci_config_acs(struct pci_dev *dev, struct pci_acs *caps,
const char *p, const u16 acs_mask, const u16 acs_flags)
{
u16 flags = acs_flags;
+ u16 supported_flags;
u16 mask = acs_mask;
char *delimit;
int ret = 0;
@@ -1001,8 +1002,14 @@ static void __pci_config_acs(struct pci_dev *dev, struct pci_acs *caps,
}
}
- if (mask & ~(PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR |
- PCI_ACS_UF | PCI_ACS_EC | PCI_ACS_DT)) {
+ supported_flags = PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
+ PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_EC |
+ PCI_ACS_DT;
+ if (caps->cap & PCI_ACS_ENHANCED)
+ supported_flags |= PCI_ACS_USP_MT_RR |
+ PCI_ACS_DSP_MT_RR |
+ PCI_ACS_UNCLAIMED_RR;
+ if (mask & ~supported_flags) {
pci_err(dev, "Invalid ACS flags specified\n");
return;
}
@@ -1062,6 +1069,14 @@ static void pci_std_enable_acs(struct pci_dev *dev, struct pci_acs *caps)
/* Upstream Forwarding */
caps->ctrl |= (caps->cap & PCI_ACS_UF);
+ /*
+ * USP/DSP Memory Target Access Control and Unclaimed Request Redirect
+ */
+ if (caps->cap & PCI_ACS_ENHANCED) {
+ caps->ctrl |= PCI_ACS_USP_MT_RR | PCI_ACS_DSP_MT_RR |
+ PCI_ACS_UNCLAIMED_RR;
+ }
+
/* Enable Translation Blocking for external devices and noats */
if (pci_ats_disabled() || dev->external_facing || dev->untrusted)
caps->ctrl |= (caps->cap & PCI_ACS_TB);
--
2.43.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH v3 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (8 preceding siblings ...)
2025-09-05 18:06 ` [PATCH v3 09/11] PCI: Enable ACS Enhanced bits for enable_acs and config_acs Jason Gunthorpe
@ 2025-09-05 18:06 ` Jason Gunthorpe
2025-09-09 5:02 ` Donald Dutile
2025-09-09 21:43 ` Bjorn Helgaas
2025-09-05 18:06 ` [PATCH v3 11/11] PCI: Check ACS Extended flags for pci_bus_isolated() Jason Gunthorpe
` (2 subsequent siblings)
12 siblings, 2 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:06 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
Switches ignore the PASID when routing TLPs. This means the path from
the PASID-issuing endpoint to the IOMMU must be direct, with no
possibility for another device to claim the addresses.
This is done using ACS flags and pci_enable_pasid() checks for this.
The new ACS Enhanced bits clarify some undefined behaviors in the spec
around what P2P Request Redirect means.
Linux has long assumed that PCI_ACS_RR implies PCI_ACS_DSP_MT_RR |
PCI_ACS_USP_MT_RR | PCI_ACS_UNCLAIMED_RR.
If the device supports ACS Enhanced then use the information it reports to
determine if PASID SVA is supported or not.
PCI_ACS_DSP_MT_RR: Prevents Downstream Port BARs from claiming
upstream-flowing transactions
PCI_ACS_USP_MT_RR: Prevents Upstream Port BARs from claiming
upstream-flowing transactions
PCI_ACS_UNCLAIMED_RR: Prevents a hole in the USP bridge window, compared
to all the DSP bridge windows, from generating an
error.
Each of these cases would poke a hole in the PASID address space which is
not permitted.
Enhance the comments around pci_acs_flags_enabled() to better explain the
reasoning for its logic. Continue to take the approach of assuming the
device is doing the "right ACS" if it does not explicitly declare
otherwise.
Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/ats.c | 4 +++-
drivers/pci/pci.c | 54 +++++++++++++++++++++++++++++++++++++++++------
2 files changed, 50 insertions(+), 8 deletions(-)
diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index ec6c8dbdc5e9c9..00603c2c4ff0ea 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -416,7 +416,9 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
if (!pasid)
return -EINVAL;
- if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
+ if (!pci_acs_path_enabled(pdev, NULL,
+ PCI_ACS_RR | PCI_ACS_UF | PCI_ACS_USP_MT_RR |
+ PCI_ACS_DSP_MT_RR | PCI_ACS_UNCLAIMED_RR))
return -EINVAL;
pci_read_config_word(pdev, pasid + PCI_PASID_CAP, &supported);
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 983f71211f0055..620b7f79093854 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3606,6 +3606,52 @@ void pci_configure_ari(struct pci_dev *dev)
}
}
+
+/*
+ * The spec is not clear about what it means if the capability bit is 0. One
+ * view is that the device acts as though the ctrl bit is zero; another view is
+ * that the device behavior is undefined.
+ *
+ * Historically Linux has taken the position that a capability bit of 0 means
+ * the device supports the most favorable interpretation of the spec - i.e. that
+ * things like P2P RR are always on. As this is security sensitive we expect
+ * devices that do not follow this rule to be quirked.
+ *
+ * ACS Enhanced eliminates undefined areas of the spec around MMIO in root ports
+ * and switch ports. If those ports have no MMIO then it is not relevant.
+ * PCI_ACS_UNCLAIMED_RR eliminates the undefined area around an upstream switch
+ * window that is not fully decoded by the downstream windows.
+ *
+ * This takes the same approach with ACS Enhanced: if the device does not
+ * support it then we assume the ACS P2P RR has all the enhanced behaviors too.
+ *
+ * Because older Linux kernels force the ACS Enhanced bits to 0, and honoring
+ * those values would break old kernels on the edge cases they cover, the only
+ * compatible thing for a new device to implement is ACS Enhanced supported with
+ * the control bits (except PCI_ACS_IORB) wired to follow ACS_RR.
+ */
+static u16 pci_acs_ctrl_mask(struct pci_dev *pdev, u16 hw_cap)
+{
+ /*
+ * Egress Control enables use of the Egress Control Vector which is not
+ * present without the cap.
+ */
+ u16 mask = PCI_ACS_EC;
+
+ mask |= hw_cap & (PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
+ PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
+
+ /*
+ * If ACS Enhanced is supported the device reports what it is doing
+ * through these bits which may not be settable.
+ */
+ if (hw_cap & PCI_ACS_ENHANCED)
+ mask |= PCI_ACS_IORB | PCI_ACS_DSP_MT_RB | PCI_ACS_DSP_MT_RR |
+ PCI_ACS_USP_MT_RB | PCI_ACS_USP_MT_RR |
+ PCI_ACS_UNCLAIMED_RR;
+ return mask;
+}
+
static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
{
int pos;
@@ -3615,15 +3661,9 @@ static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
if (!pos)
return false;
- /*
- * Except for egress control, capabilities are either required
- * or only required if controllable. Features missing from the
- * capability field can therefore be assumed as hard-wired enabled.
- */
pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
- acs_flags &= (cap | PCI_ACS_EC);
-
pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+ acs_flags &= pci_acs_ctrl_mask(pdev, cap);
return (ctrl & acs_flags) == acs_flags;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH v3 11/11] PCI: Check ACS Extended flags for pci_bus_isolated()
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (9 preceding siblings ...)
2025-09-05 18:06 ` [PATCH v3 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
@ 2025-09-05 18:06 ` Jason Gunthorpe
2025-09-09 5:04 ` Donald Dutile
2025-09-15 9:41 ` [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Cédric Le Goater
2025-09-22 22:39 ` Alex Williamson
12 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:06 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
When looking at a PCIe switch we want to see that the USP/DSP MMIO have
request redirect enabled. Detect the case where the USP is expressly not
isolated from the DSP and ensure the USP is included in the group.
The DSP Memory Target control also applies to the Root Port, so check
it there too. If upstream-directed transactions can reach the root
port's MMIO then it is not isolated.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/search.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index dac6b042fd5f5d..cba417cbe3476e 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -127,6 +127,8 @@ static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
* traffic flowing upstream back downstream through another DSP.
*
* Thus any non-permissive DSP spoils the whole bus.
+ * PCI_ACS_UNCLAIMED_RR is not required since rejecting requests with an
+ * error is still isolation.
*/
guard(rwsem_read)(&pci_bus_sem);
list_for_each_entry(pdev, &bus->devices, bus_list) {
@@ -136,8 +138,14 @@ static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
pdev->dma_alias_mask)
return PCIE_NON_ISOLATED;
- if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
+ if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED |
+ PCI_ACS_DSP_MT_RR |
+ PCI_ACS_USP_MT_RR)) {
+ /* The USP is isolated from the DSP */
+ if (!pci_acs_enabled(pdev, PCI_ACS_USP_MT_RR))
+ return PCIE_NON_ISOLATED;
return PCIE_SWITCH_DSP_NON_ISOLATED;
+ }
}
return PCIE_ISOLATED;
}
@@ -232,11 +240,13 @@ enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
/*
* Since PCIe links are point to point root ports are isolated if there
* is no internal loopback to the root port's MMIO. Like MFDs assume if
- * there is no ACS cap then there is no loopback.
+ * there is no ACS cap then there is no loopback. The root port uses
+ * DSP_MT_RR for its own MMIO.
*/
case PCI_EXP_TYPE_ROOT_PORT:
if (bridge->acs_cap &&
- !pci_acs_enabled(bridge, PCI_ACS_ISOLATED))
+ !pci_acs_enabled(bridge,
+ PCI_ACS_ISOLATED | PCI_ACS_DSP_MT_RR))
return PCIE_NON_ISOLATED;
return PCIE_ISOLATED;
--
2.43.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH v3 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED
2025-09-05 18:06 ` [PATCH v3 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
@ 2025-09-09 4:08 ` Donald Dutile
0 siblings, 0 replies; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 4:08 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
Jason,
Hi.
On 9/5/25 2:06 PM, Jason Gunthorpe wrote:
> The next patch wants to use this constant, share it.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/iommu/iommu.c | 16 +++-------------
> include/uapi/linux/pci_regs.h | 10 ++++++++++
> 2 files changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 060ebe330ee163..2a47ddb01799c1 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1408,16 +1408,6 @@ EXPORT_SYMBOL_GPL(iommu_group_id);
> static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> unsigned long *devfns);
>
> -/*
> - * To consider a PCI device isolated, we require ACS to support Source
> - * Validation, Request Redirection, Completer Redirection, and Upstream
> - * Forwarding. This effectively means that devices cannot spoof their
> - * requester ID, requests and completions cannot be redirected, and all
> - * transactions are forwarded upstream, even as it passes through a
> - * bridge where the target device is downstream.
> - */
> -#define REQ_ACS_FLAGS (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
> -
> /*
> * For multifunction devices which are not isolated from each other, find
> * all the other non-isolated functions and look for existing groups. For
> @@ -1430,13 +1420,13 @@ static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
> struct pci_dev *tmp = NULL;
> struct iommu_group *group;
>
> - if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> + if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
> return NULL;
>
> for_each_pci_dev(tmp) {
> if (tmp == pdev || tmp->bus != pdev->bus ||
> PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) ||
> - pci_acs_enabled(tmp, REQ_ACS_FLAGS))
> + pci_acs_enabled(tmp, PCI_ACS_ISOLATED))
> continue;
>
> group = get_pci_alias_group(tmp, devfns);
> @@ -1580,7 +1570,7 @@ struct iommu_group *pci_device_group(struct device *dev)
> if (!bus->self)
> continue;
>
> - if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
> + if (pci_acs_path_enabled(bus->self, NULL, PCI_ACS_ISOLATED))
> break;
>
> pdev = bus->self;
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index f5b17745de607d..6095e7d7d4cc48 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1009,6 +1009,16 @@
> #define PCI_ACS_CTRL 0x06 /* ACS Control Register */
> #define PCI_ACS_EGRESS_CTL_V 0x08 /* ACS Egress Control Vector */
>
> +/*
> + * To consider a PCI device isolated, we require ACS to support Source
> + * Validation, Request Redirection, Completer Redirection, and Upstream
> + * Forwarding. This effectively means that devices cannot spoof their
> + * requester ID, requests and completions cannot be redirected, and all
> + * transactions are forwarded upstream, even as it passes through a
> + * bridge where the target device is downstream.
> + */
> +#define PCI_ACS_ISOLATED (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
> +
> /* SATA capability */
> #define PCI_SATA_REGS 4 /* SATA REGs specifier */
> #define PCI_SATA_REGS_MASK 0xF /* location - BAR#/inline */
like the move & rename...
Reviewed-by: Donald Dutile <ddutile@redhat.com>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 02/11] PCI: Add pci_bus_isolated()
2025-09-05 18:06 ` [PATCH v3 02/11] PCI: Add pci_bus_isolated() Jason Gunthorpe
@ 2025-09-09 4:09 ` Donald Dutile
2025-09-09 19:54 ` Bjorn Helgaas
1 sibling, 0 replies; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 4:09 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 9/5/25 2:06 PM, Jason Gunthorpe wrote:
> Prepare to move the calculation of the bus P2P isolation out of the iommu
> code and into the PCI core. This allows using the faster list iteration
> under the pci_bus_sem, and the code has a kinship with the logic in
> pci_for_each_dma_alias().
>
> Bus isolation is the concept that drives the iommu_groups for the purposes
> of VFIO. Stated simply, if device A can send traffic to device B then they
> must be in the same group.
>
> Only PCIe provides isolation. The multi-drop electrical topology in
> classical PCI allows any bus member to claim the transaction.
>
> In PCIe isolation comes out of ACS. If a PCIe Switch and Root Complex has
> ACS flags that prevent peer to peer traffic and funnel all operations to
> the IOMMU then devices can be isolated.
>
> Multi-function devices also have an isolation concern with self loopback
> between the functions, though pci_bus_isolated() does not deal with
> devices.
>
> As a property of a bus, there are several positive cases:
>
> - The point to point "bus" on a physical PCIe link is isolated if the
> bridge/root device has something preventing self-access to its own
> MMIO.
>
> - A Root Port is usually isolated
>
> - A PCIe switch can be isolated if all its Downstream Ports have good
> ACS flags
>
> pci_bus_isolated() implements these rules and returns an enum indicating
> the level of isolation the bus has, with five possibilities:
>
> PCIE_ISOLATED: Traffic on this PCIE bus can not do any P2P.
>
> PCIE_SWITCH_DSP_NON_ISOLATED: The bus is the internal bus of a PCIE
> switch and the USP is isolated but the DSPs are not.
>
> PCIE_NON_ISOLATED: The PCIe bus has no isolation between the bridge or
> any downstream devices.
>
> PCI_BUS_NON_ISOLATED: It is a PCI/PCI-X bus but the bridge is PCIe, has no
> aliases and the bridge is isolated from the bus.
>
> PCI_BRIDGE_NON_ISOLATED: It is a PCI/PCI-X bus and has no isolation, the
> bridge is part of the group.
>
> The calculation is done per-bus, so it is possible for transactions from
> a PCI device to travel through different bus isolation types on its way
> upstream. PCIE_SWITCH_DSP_NON_ISOLATED/PCI_BUS_NON_ISOLATED and
> PCIE_NON_ISOLATED/PCI_BRIDGE_NON_ISOLATED are the same for the purposes of
> creating iommu groups. The distinction between PCIe and PCI allows for
> easier understanding and debugging as to why the groups are chosen.
>
> For the iommu groups if all busses on the upstream path are PCIE_ISOLATED
> then the end device has a chance to have a single-device iommu_group. Once
> any non-isolated bus segment is found that bus segment will have an
> iommu_group that captures all downstream devices, and sometimes the
> upstream bridge.
>
> pci_bus_isolated() is principally about isolation, but there is an
> overlap with grouping requirements for legacy PCI aliasing. For purely
> legacy PCI environments pci_bus_isolated() returns
> PCI_BRIDGE_NON_ISOLATED for everything and all devices within a hierarchy
> are in one group. No need to worry about bridge aliasing.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/pci/search.c | 174 +++++++++++++++++++++++++++++++++++++++++++
> include/linux/pci.h | 31 ++++++++
> 2 files changed, 205 insertions(+)
>
> diff --git a/drivers/pci/search.c b/drivers/pci/search.c
> index 53840634fbfc2b..fe6c07e67cb8ce 100644
> --- a/drivers/pci/search.c
> +++ b/drivers/pci/search.c
> @@ -113,6 +113,180 @@ int pci_for_each_dma_alias(struct pci_dev *pdev,
> return ret;
> }
>
> +static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
> +{
> + struct pci_dev *pdev;
> +
> + /*
> + * Within a PCIe switch we have an interior bus that has the Upstream
> + * port as the bridge and a set of Downstream port bridging to the
> + * egress ports.
> + *
> + * Each DSP has an ACS setting which controls where its traffic is
> + * permitted to go. Any DSP with a permissive ACS setting can send
> + * traffic flowing upstream back downstream through another DSP.
> + *
> + * Thus any non-permissive DSP spoils the whole bus.
> + */
> + guard(rwsem_read)(&pci_bus_sem);
> + list_for_each_entry(pdev, &bus->devices, bus_list) {
> + /* Don't understand what this is, be conservative */
> + if (!pci_is_pcie(pdev) ||
> + pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM ||
> + pdev->dma_alias_mask)
> + return PCIE_NON_ISOLATED;
> +
> + if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
> + return PCIE_SWITCH_DSP_NON_ISOLATED;
> + }
> + return PCIE_ISOLATED;
> +}
> +
> +static bool pci_has_mmio(struct pci_dev *pdev)
> +{
> + unsigned int i;
> +
> + for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
> + struct resource *res = pci_resource_n(pdev, i);
> +
> + if (resource_size(res) && resource_type(res) == IORESOURCE_MEM)
> + return true;
> + }
> + return false;
> +}
> +
> +/**
> + * pci_bus_isolated - Determine how isolated connected devices are
> + * @bus: The bus to check
> + *
> + * Isolation is the ability of devices to talk to each other. Full isolation
> + * means that a device can only communicate with the IOMMU and can not do peer
> + * to peer within the fabric.
> + *
> + * We consider isolation on a bus-by-bus basis. If the bus will permit a
> + * transaction originated downstream to complete on anything other than the
> + * IOMMU then the bus is not isolated.
> + *
> + * Non-isolation includes all the downstream devices on this bus, and it may
> + * include the upstream bridge or port that is creating this bus.
> + *
> + * The various cases are returned in an enum.
> + *
> + * Broadly speaking this function evaluates the ACS settings in a PCI switch to
> + * determine if a PCI switch is configured to have full isolation.
> + *
> + * Old PCI/PCI-X busses cannot have isolation due to their physical properties,
> + * but they do have some aliasing properties that affect group creation.
> + *
> + * pci_bus_isolated() does not consider loopback internal to devices, like
> + * multi-function devices performing a self-loopback. The caller must check
> + * this separately. It does not consider aliasing within the bus.
> + *
> + * It does not currently support the ACS P2P Egress Control Vector, Linux does
> + * not yet have any way to enable this feature. EC will create subsets of the
> + * bus that are isolated from other subsets.
> + */
> +enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> +{
> + struct pci_dev *bridge = bus->self;
> + int type;
> +
> + /*
> + * This bus was created by pci_register_host_bridge(). The spec provides
> + * no way to tell what kind of bus this is, for PCIe we expect this to
> + * be internal to the root complex and not covered by any spec behavior.
> + * Linux has historically been optimistic about this bus and treated it
> + * as isolating. Given that the behavior of the root complex and the ACS
> + * behavior of RCiEP's is explicitly not specified we hope that the
> + * implementation is directing everything that reaches the root bus to
> + * the IOMMU.
> + */
> + if (pci_is_root_bus(bus))
> + return PCIE_ISOLATED;
> +
> + /*
> + * bus->self is only NULL for SRIOV VFs, it represents a "virtual" bus
> + * within Linux to hold any bus numbers consumed by VF RIDs. Caller must
> + * use pci_physfn() to get the bus for calling this function.
> + */
> + if (WARN_ON(!bridge))
> + return PCI_BRIDGE_NON_ISOLATED;
> +
> + /*
> + * The bridge is not a PCIe bridge therefore this bus is PCI/PCI-X.
> + *
> + * PCI does not have anything like ACS. Any downstream device can bus
> + * master an address that any other downstream device can claim. No
> + * isolation is possible.
> + */
> + if (!pci_is_pcie(bridge)) {
> + if (bridge->dev_flags & PCI_DEV_FLAG_PCIE_BRIDGE_ALIAS)
> + type = PCI_EXP_TYPE_PCI_BRIDGE;
> + else
> + return PCI_BRIDGE_NON_ISOLATED;
> + } else {
> + type = pci_pcie_type(bridge);
> + }
> +
> + switch (type) {
> + /*
> + * Since PCIe links are point to point root ports are isolated if there
> + * is no internal loopback to the root port's MMIO. Like MFDs assume if
> + * there is no ACS cap then there is no loopback.
> + */
> + case PCI_EXP_TYPE_ROOT_PORT:
> + if (bridge->acs_cap &&
> + !pci_acs_enabled(bridge, PCI_ACS_ISOLATED))
> + return PCIE_NON_ISOLATED;
> + return PCIE_ISOLATED;
> +
> + /*
> + * Since PCIe links are point to point a DSP is always considered
> + * isolated. The internal bus of the switch will be non-isolated if the
> + * DSP's have any ACS that allows upstream traffic to flow back
> + * downstream to any DSP, including back to this DSP or its MMIO.
> + */
> + case PCI_EXP_TYPE_DOWNSTREAM:
> + return PCIE_ISOLATED;
> +
> + /*
> + * bus is the interior bus of a PCIe switch where ACS rules apply.
> + */
> + case PCI_EXP_TYPE_UPSTREAM:
> + return pcie_switch_isolated(bus);
> +
> + /*
> + * PCIe to PCI/PCI-X - this bus is PCI.
> + */
> + case PCI_EXP_TYPE_PCI_BRIDGE:
> + /*
> + * A PCIe-to-PCI bridge will use the subordinate bus number
> + * with a 0 devfn as the RID in some cases. This causes all
> + * subordinate devfns to alias with 0, which is the same
> + * grouping as PCI_BUS_NON_ISOLATED. The RID of the bridge
> + * itself is only used by the bridge.
> + *
> + * However, if the bridge has MMIO then we will assume the MMIO
> + * is not isolated due to no ACS controls on this bridge type.
> + */
> + if (pci_has_mmio(bridge))
> + return PCI_BRIDGE_NON_ISOLATED;
> + return PCI_BUS_NON_ISOLATED;
> +
> + /*
> + * PCI/PCI-X to PCIe - this bus is PCIe. We already know there must be a
> + * PCI bus upstream of this bus, so just return non-isolated. If
> + * upstream is PCI-X the PCIe RID should be preserved, but for PCI the
> + * RID will be lost.
> + */
> + case PCI_EXP_TYPE_PCIE_BRIDGE:
> + return PCI_BRIDGE_NON_ISOLATED;
> +
> + default:
> + return PCI_BRIDGE_NON_ISOLATED;
> + }
> +}
> +
> static struct pci_bus *pci_do_find_bus(struct pci_bus *bus, unsigned char busnr)
> {
> struct pci_bus *child;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 59876de13860db..c36fff9d2254f8 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -855,6 +855,32 @@ struct pci_dynids {
> struct list_head list; /* For IDs added at runtime */
> };
>
> +enum pci_bus_isolation {
> + /*
> + * The bus is off a root port and the root port has isolated ACS flags
> + * or the bus is part of a PCIe switch and the switch has isolated ACS
> + * flags.
> + */
> + PCIE_ISOLATED,
> + /*
> + * The switch's DSPs are not isolated from each other but are isolated
> + * from the USP.
> + */
> + PCIE_SWITCH_DSP_NON_ISOLATED,
> + /* The above and the USP's MMIO is not isolated. */
> + PCIE_NON_ISOLATED,
> + /*
> + * A PCI/PCI-X bus, no isolation. This is like
> + * PCIE_SWITCH_DSP_NON_ISOLATED in that the upstream bridge is isolated
> + * from the bus. The bus itself may also have a shared alias of devfn=0.
> + */
> + PCI_BUS_NON_ISOLATED,
> + /*
> + * The above and the bridge's MMIO is not isolated and the bridge's RID
> + * may be an alias.
> + */
> + PCI_BRIDGE_NON_ISOLATED,
> +};
>
> /*
> * PCI Error Recovery System (PCI-ERS). If a PCI device driver provides
> @@ -1243,6 +1269,8 @@ struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
> struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
> struct pci_dev *pci_get_base_class(unsigned int class, struct pci_dev *from);
>
> +enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus);
> +
> int pci_dev_present(const struct pci_device_id *ids);
>
> int pci_bus_read_config_byte(struct pci_bus *bus, unsigned int devfn,
> @@ -2056,6 +2084,9 @@ static inline struct pci_dev *pci_get_base_class(unsigned int class,
> struct pci_dev *from)
> { return NULL; }
>
> +static inline enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> +{ return PCIE_NON_ISOLATED; }
> +
> static inline int pci_dev_present(const struct pci_device_id *ids)
> { return 0; }
>
clarity a +1.
Reviewed-by: Donald Dutile <ddutile@redhat.com>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 03/11] iommu: Compute iommu_groups properly for PCIe switches
2025-09-05 18:06 ` [PATCH v3 03/11] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
@ 2025-09-09 4:14 ` Donald Dutile
2025-09-09 12:18 ` Jason Gunthorpe
2025-09-09 20:27 ` Bjorn Helgaas
1 sibling, 1 reply; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 4:14 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
one nit below...
On 9/5/25 2:06 PM, Jason Gunthorpe wrote:
> The current algorithm does not work if ACS is turned off, and it is not
> clear how this has been missed for so long. I think it has been avoided
> because the kernel command line options to target specific devices and
> disable ACS are rarely used.
>
> For discussion let's consider a simple topology like the below:
>
> -- DSP 02:00.0 -> End Point A
> Root 00:00.0 -> USP 01:00.0 --|
> -- DSP 02:03.0 -> End Point B
>
> If ACS is fully activated we expect 00:00.0, 01:00.0, 02:00.0, 02:03.0, A,
> B to all have unique single device groups.
>
> If both DSPs have ACS off then we expect 00:00.0 and 01:00.0 to have
> unique single device groups while 02:00.0, 02:03.0, A, B are part of one
> multi-device group.
>
> If the DSPs have asymmetric ACS, with one fully isolating and one
> non-isolating we also expect the above multi-device group result.
>
> Instead the current algorithm always creates unique single device groups
> for this topology. It happens because the pci_device_group(DSP)
> immediately moves to the USP and computes pci_acs_path_enabled(USP) ==
> true and decides the DSP can get a unique group. The pci_device_group(A)
> immediately moves to the DSP, sees pci_acs_path_enabled(DSP) == false and
> then takes the DSPs group.
>
> For root-ports a PCIe topology like:
> -- Dev 01:00.0
> Root 00:00.00 --- Root Port 00:01.0 --|
> | -- Dev 01:00.1
> |- Dev 00:17.0
>
> Previously would group [00:01.0, 01:00.0, 01:00.1] together if there is no
> ACS capability in the root port.
>
> While ACS on root ports is underspecified in the spec, it should still
> function as an egress control and limit access to either the MMIO of the
> root port itself, or perhaps some other devices upstream of the root
> complex - 00:17.0 perhaps in this example.
>
> Historically the grouping in Linux has assumed the root port routes all
> traffic into the TA/IOMMU and never bypasses the TA to go to other
> functions in the root complex. Following the new understanding that ACS is
> required for internal loopback, also treat root ports with no ACS
> capability as lacking internal loopback.
>
> The current algorithm has several issues:
>
> 1) It implicitly depends on ordering. Since the existing group discovery
> only goes in the upstream direction, discovering a downstream device
> before its upstream wrongly creates narrower groups.
>
> 2) It assumes that if the path from the end point to the root is entirely
> ACS isolated then that end point is isolated. This misses cross-traffic
> in the asymmetric ACS case.
>
> 3) When evaluating a non-isolated DSP it does not check peer DSPs for an
> already established group unless the multi-function feature does it.
>
> 4) It does not understand the aliasing rule for PCIe to PCI bridges
> where the alias is to the subordinate bus. The bridge's RID on the
> primary bus is not aliased. This causes the PCIe to PCI bridge to be
> wrongly joined to the group with the downstream devices.
>
> As grouping is a security property for VFIO creating incorrectly narrowed
> groups is a security problem for the system.
>
> Revise the design to solve these problems.
>
> Explicitly require ordering, or return EPROBE_DEFER if things are out of
> order. This avoids silent errors that created smaller groups and solves
> problem #1.
>
> Work on busses, not devices. Isolation is a property of the bus, and the
> first non-isolated bus should form a group containing all devices
> downstream of that bus. If all busses on the path to an end device are
> isolated then the end device has a chance to make a single-device group.
>
> Use pci_bus_isolated() to compute the bus's isolation status based on the
> ACS flags and technology. pci_bus_isolated() touches a lot of PCI
> internals to get the information in the right format.
>
> Add a new flag in the iommu_group to record that the group contains a
> non-isolated bus. Any downstream pci_device_group() will see
> bus->self->iommu_group is non-isolated and unconditionally join it. This
> makes the first non-isolation apply to all downstream devices and solves
> problem #2.
>
> The bus's non-isolated iommu_group will be stored in either a DSP of a
> PCIe switch or the bus->self upstream device, depending on the situation.
> When storing in the DSP all the DSPs are checked first for a pre-existing
> non-isolated iommu_group. When stored in the upstream the flag forces it
> to all downstreams. This solves problem #3.
>
> Put the handling of end-device aliases and MFD into pci_get_function_group()
> and only call it in cases where we have a fully isolated path. Otherwise
> every downstream device on the bus is going to be joined to the group of
> bus->self.
>
> Finally, replace the initial pci_for_each_dma_alias() with a combination
> of:
>
> - Directly checking pci_real_dma_dev() and enforcing ordering.
> The group should contain both pdev and pci_real_dma_dev(pdev) which is
> only possible if pdev is ordered after real_dma_dev. This solves a case
> of #1.
>
> - Indirectly relying on pci_bus_isolated() to report legacy PCI busses
> as non-isolated, with the enum including the distinction of the PCIe to
> PCI bridge being isolated from the downstream. This solves problem #4.
>
> It is very likely this is going to expand iommu_group membership in
> existing systems. After all, that is the security bug that is being
> fixed. Expanding the iommu_groups risks problems for users using VFIO.
>
> The intention is to have a more accurate reflection of the security
> properties in the system and should be seen as a security fix. However,
> people who have ACS disabled may now need to enable it. Since such users
> may have had good reason for ACS to be disabled, I strongly recommend that
> backports of this also include the new config_acs option so those users
> can minimally enable ACS only where needed.
>
> Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/iommu/iommu.c | 279 ++++++++++++++++++++++++++++++++----------
> include/linux/pci.h | 3 +
> 2 files changed, 217 insertions(+), 65 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 2a47ddb01799c1..1874bbdc73b75e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -65,8 +65,16 @@ struct iommu_group {
> struct list_head entry;
> unsigned int owner_cnt;
> void *owner;
> +
> + /* Used by the device_group() callbacks */
> + u32 bus_data;
> };
>
> +/*
> + * Everything downstream of this group should share it.
> + */
> +#define BUS_DATA_PCI_NON_ISOLATED BIT(0)
> +
> struct group_device {
> struct list_head list;
> struct device *dev;
> @@ -1484,25 +1492,6 @@ static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> return NULL;
> }
>
> -struct group_for_pci_data {
> - struct pci_dev *pdev;
> - struct iommu_group *group;
> -};
> -
> -/*
> - * DMA alias iterator callback, return the last seen device. Stop and return
> - * the IOMMU group if we find one along the way.
> - */
> -static int get_pci_alias_or_group(struct pci_dev *pdev, u16 alias, void *opaque)
> -{
> - struct group_for_pci_data *data = opaque;
> -
> - data->pdev = pdev;
> - data->group = iommu_group_get(&pdev->dev);
> -
> - return data->group != NULL;
> -}
> -
> /*
> * Generic device_group call-back function. It just allocates one
> * iommu-group per device.
> @@ -1534,57 +1523,31 @@ struct iommu_group *generic_single_device_group(struct device *dev)
> }
> EXPORT_SYMBOL_GPL(generic_single_device_group);
>
> -/*
> - * Use standard PCI bus topology, isolation features, and DMA alias quirks
> - * to find or create an IOMMU group for a device.
> - */
> -struct iommu_group *pci_device_group(struct device *dev)
> +static struct iommu_group *pci_group_alloc_non_isolated(void)
Maybe iommu_group_alloc_non_isolated() would be a better name, since that's all it does.
Ideally, iommu_group_alloc() would take a flag for isolated/non-isolated default tagging,
but that would involve more twiddling of other files, which doesn't seem worth the effort.
So the fcn rename will do.
> {
> - struct pci_dev *pdev = to_pci_dev(dev);
> - struct group_for_pci_data data;
> - struct pci_bus *bus;
> - struct iommu_group *group = NULL;
> - u64 devfns[4] = { 0 };
> + struct iommu_group *group;
>
> - if (WARN_ON(!dev_is_pci(dev)))
> - return ERR_PTR(-EINVAL);
> + group = iommu_group_alloc();
> + if (IS_ERR(group))
> + return group;
> + group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
> + return group;
> +}
>
> - /*
> - * Find the upstream DMA alias for the device. A device must not
> - * be aliased due to topology in order to have its own IOMMU group.
> - * If we find an alias along the way that already belongs to a
> - * group, use it.
> - */
> - if (pci_for_each_dma_alias(pdev, get_pci_alias_or_group, &data))
> - return data.group;
> -
> - pdev = data.pdev;
> -
> - /*
> - * Continue upstream from the point of minimum IOMMU granularity
> - * due to aliases to the point where devices are protected from
> - * peer-to-peer DMA by PCI ACS. Again, if we find an existing
> - * group, use it.
> - */
> - for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
> - if (!bus->self)
> - continue;
> -
> - if (pci_acs_path_enabled(bus->self, NULL, PCI_ACS_ISOLATED))
> - break;
> -
> - pdev = bus->self;
> -
> - group = iommu_group_get(&pdev->dev);
> - if (group)
> - return group;
> - }
> +/*
> + * Return a group if the function has isolation restrictions related to
> + * aliases or MFD ACS.
> + */
> +static struct iommu_group *pci_get_function_group(struct pci_dev *pdev)
> +{
> + struct iommu_group *group;
> + DECLARE_BITMAP(devfns, 256) = {};
>
> /*
> * Look for existing groups on device aliases. If we alias another
> * device or another device aliases us, use the same group.
> */
> - group = get_pci_alias_group(pdev, (unsigned long *)devfns);
> + group = get_pci_alias_group(pdev, devfns);
> if (group)
> return group;
>
> @@ -1593,12 +1556,198 @@ struct iommu_group *pci_device_group(struct device *dev)
> * slot and aliases of those funcions, if any. No need to clear
> * the search bitmap, the tested devfns are still valid.
> */
> - group = get_pci_function_alias_group(pdev, (unsigned long *)devfns);
> + group = get_pci_function_alias_group(pdev, devfns);
> if (group)
> return group;
>
> - /* No shared group found, allocate new */
> - return iommu_group_alloc();
> + /*
> + * When MFD's are included in the set due to ACS we assume that if ACS
> + * permits an internal loopback between functions it also permits the
> + * loopback to go downstream if a function is a bridge.
> + *
> + * It is less clear what aliases mean when applied to a bridge. For now
> + * be conservative and also propagate the group downstream.
> + */
> + __clear_bit(pdev->devfn & 0xFF, devfns);
> + if (!bitmap_empty(devfns, sizeof(devfns) * BITS_PER_BYTE))
> + return pci_group_alloc_non_isolated();
> + return NULL;
> +}
> +
> +/* Return a group if the upstream hierarchy has isolation restrictions. */
> +static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
> +{
> + /*
> + * SRIOV functions may reside on a virtual bus, jump directly to the PFs
> + * bus in all cases.
> + */
> + struct pci_bus *bus = pci_physfn(pdev)->bus;
> + struct iommu_group *group;
> +
> + /* Nothing upstream of this */
> + if (pci_is_root_bus(bus))
> + return NULL;
> +
> + /*
> + * !self is only for SRIOV virtual busses which should have been
> + * excluded by pci_physfn()
> + */
> + if (WARN_ON(!bus->self))
> + return ERR_PTR(-EINVAL);
> +
> + group = iommu_group_get(&bus->self->dev);
> + if (!group) {
> + /*
> + * If the upstream bridge needs the same group as pdev then
> + * there is no way for its pci_device_group() to discover it.
> + */
> + dev_err(&pdev->dev,
> + "PCI device is probing out of order, upstream bridge device of %s is not probed yet\n",
> + pci_name(bus->self));
> + return ERR_PTR(-EPROBE_DEFER);
> + }
> + if (group->bus_data & BUS_DATA_PCI_NON_ISOLATED)
> + return group;
> + iommu_group_put(group);
> + return NULL;
> +}
> +
> +/*
> + * For legacy PCI we have two main considerations when forming groups:
> + *
> + * 1) In PCI we can lose the RID inside the fabric, or some devices will use
> + * the wrong RID. The PCI core calls this aliasing, but from an IOMMU
> + * perspective it means that a PCI device may have multiple RIDs and a
> + * single RID may represent many PCI devices. This effectively means all the
> + * aliases must share a translation, thus group, because the IOMMU cannot
> + * tell devices apart.
> + *
> + * 2) PCI permits a bus segment to claim an address even if the transaction
> + * originates from an end point not the CPU. When it happens it is called
> + * peer to peer. Claiming a transaction in the middle of the bus hierarchy
> + * bypasses the IOMMU translation. The IOMMU subsystem rules require these
> + * devices to be placed in the same group because they lack isolation from
> + * each other. In PCI Express the ACS system can be used to inhibit this and
> + * force transactions to go to the IOMMU.
> + *
> + * From a PCI perspective any given PCI bus is either isolating or
> + * non-isolating. Isolating means downstream originated transactions always
> + * progress toward the CPU and do not go to other devices on the bus
> + * segment, while non-isolating means downstream originated transactions can
> + * progress back downstream through another device on the bus segment.
> + *
> + * Beyond buses a multi-function device or bridge can also allow
> + * transactions to loop back internally from one function to another.
> + *
> + * Once a PCI bus becomes non-isolating, the entire downstream hierarchy of
> + * that bus becomes a single group.
> + */
> +struct iommu_group *pci_device_group(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct iommu_group *group;
> + struct pci_dev *real_pdev;
> +
> + if (WARN_ON(!dev_is_pci(dev)))
> + return ERR_PTR(-EINVAL);
> +
> + /*
> + * Arches can supply a completely different PCI device that actually
> + * does DMA.
> + */
> + real_pdev = pci_real_dma_dev(pdev);
> + if (real_pdev != pdev) {
> + group = iommu_group_get(&real_pdev->dev);
> + if (!group) {
> + /*
> + * The real_pdev has not had an iommu probed to it. We
> + * can't create a new group here because there is no way
> + * for pci_device_group(real_pdev) to pick it up.
> + */
> + dev_err(dev,
> + "PCI device is probing out of order, real device of %s is not probed yet\n",
> + pci_name(real_pdev));
> + return ERR_PTR(-EPROBE_DEFER);
> + }
> + return group;
> + }
> +
> + if (pdev->dev_flags & PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT)
> + return iommu_group_alloc();
> +
> + /* Anything upstream of this enforcing non-isolated? */
> + group = pci_hierarchy_group(pdev);
> + if (group)
> + return group;
> +
> + switch (pci_bus_isolated(pci_physfn(pdev)->bus)) {
> + case PCIE_ISOLATED:
> + /* Check multi-function groups and same-bus devfn aliases */
> + group = pci_get_function_group(pdev);
> + if (group)
> + return group;
> +
> + /* No shared group found, allocate new */
> + return iommu_group_alloc();
> +
> + /*
> + * On legacy PCI there is no RID at an electrical level. On PCI-X the
> + * RID of the bridge may be used in some cases instead of the
> + * downstream's RID. This creates aliasing problems. PCI/PCI-X doesn't
> + * provide isolation either. The end result is that as soon as we hit a
> + * PCI/PCI-X bus we switch to non-isolated for the whole downstream for
> + * both aliasing and isolation reasons. The bridge has to be included in
> + * the group because of the aliasing.
> + */
> + case PCI_BRIDGE_NON_ISOLATED:
> + /* A PCIe switch where the USP has MMIO and is not isolated. */
> + case PCIE_NON_ISOLATED:
> + group = iommu_group_get(&pdev->bus->self->dev);
> + if (WARN_ON(!group))
> + return ERR_PTR(-EINVAL);
> + /*
> + * No need to be concerned with aliases here since we are going
> + * to put the entire downstream tree in the bridge/USP's group.
> + */
> + group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
> + return group;
> +
> + /*
> + * It is a PCI bus and the upstream bridge/port does not alias or allow
> + * P2P.
> + */
> + case PCI_BUS_NON_ISOLATED:
> + /*
> + * It is a PCIe switch and the DSP cannot reach the USP. The DSPs
> + * are not isolated from each other and share a group.
> + */
> + case PCIE_SWITCH_DSP_NON_ISOLATED: {
> + struct pci_dev *piter = NULL;
> +
> + /*
> + * All the downstream devices on the bus share a group. If this
> + * is a PCIe switch then they will all be DSPs
> + */
> + for_each_pci_dev(piter) {
> + if (piter->bus != pdev->bus)
> + continue;
> + group = iommu_group_get(&piter->dev);
> + if (group) {
> + pci_dev_put(piter);
> + if (WARN_ON(!(group->bus_data &
> + BUS_DATA_PCI_NON_ISOLATED)))
> + group->bus_data |=
> + BUS_DATA_PCI_NON_ISOLATED;
> + return group;
> + }
> + }
> + return pci_group_alloc_non_isolated();
> + }
> + default:
> + break;
> + }
> + WARN_ON(true);
> + return ERR_PTR(-EINVAL);
> }
> EXPORT_SYMBOL_GPL(pci_device_group);
>
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index c36fff9d2254f8..fb9adf0562f8ef 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -2093,6 +2093,9 @@ static inline int pci_dev_present(const struct pci_device_id *ids)
> #define no_pci_devices() (1)
> #define pci_dev_put(dev) do { } while (0)
>
> +static inline struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
> +{ return dev; }
> +
> static inline void pci_set_master(struct pci_dev *dev) { }
> static inline void pci_clear_master(struct pci_dev *dev) { }
> static inline int pci_enable_device(struct pci_dev *dev) { return -EIO; }
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 04/11] iommu: Organize iommu_group by member size
2025-09-05 18:06 ` [PATCH v3 04/11] iommu: Organize iommu_group by member size Jason Gunthorpe
@ 2025-09-09 4:16 ` Donald Dutile
0 siblings, 0 replies; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 4:16 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 9/5/25 2:06 PM, Jason Gunthorpe wrote:
> To avoid some internal padding.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/iommu/iommu.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 1874bbdc73b75e..543d6347c0e5e3 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -58,13 +58,13 @@ struct iommu_group {
> void *iommu_data;
> void (*iommu_data_release)(void *iommu_data);
> char *name;
> - int id;
> struct iommu_domain *default_domain;
> struct iommu_domain *blocking_domain;
> struct iommu_domain *domain;
> struct list_head entry;
> - unsigned int owner_cnt;
> void *owner;
> + unsigned int owner_cnt;
> + int id;
>
> /* Used by the device_group() callbacks */
> u32 bus_data;
OK, but this still leaves a 32-bit hole at the end of the struct, which would also occur if bus_data were put after id or owner_cnt.
Reviewed-by: Donald Dutile <ddutile@redhat.com>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs
2025-09-05 18:06 ` [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs Jason Gunthorpe
@ 2025-09-09 4:57 ` Donald Dutile
2025-09-09 13:31 ` Jason Gunthorpe
2025-09-09 21:24 ` Bjorn Helgaas
1 sibling, 1 reply; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 4:57 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 9/5/25 2:06 PM, Jason Gunthorpe wrote:
> Like with switches the current MFD algorithm does not consider asymmetric
> ACS within a MFD. If any MFD function has ACS that permits P2P the spec
> says it can reach through the MFD internal loopback any other function in
> the device.
>
> For discussion let's consider a simple MFD topology like the one below:
>
> -- MFD 00:1f.0 ACS != REQ_ACS_FLAGS
> Root 00:00.00 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS
> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
>
> This asymmetric ACS could be created using the config_acs kernel command
> line parameter, from quirks, or from a poorly thought out device that has
> ACS flags only on some functions.
>
> Since ACS is an egress property the asymmetric flags allow for 00:1f.0 to
> do memory accesses into 00:1f.6's BARs, but 00:1f.6 cannot reach any other
> function. Thus we expect an iommu_group to contain all three
> devices. Instead the current algorithm gives a group of [1f.0, 1f.2] and a
> single device group of 1f.6.
>
> The current algorithm sees the good ACS flags on 00:1f.6 and does not
> consider ACS on any other MFD functions.
>
> For path properties the ACS flags say that 00:1f.6 is safe to use with
> PASID and supports SVA as it will not have any portions of its address
> space routed away from the IOMMU, this part of the ACS system is working
> correctly.
>
> Further, if one of the MFD functions is a bridge, eg like 1f.2:
>
> -- MFD 00:1f.0
> Root 00:00.00 --|- MFD 00:1f.2 Root Port --- 01:01.0
> |- MFD 00:1f.6
>
> Then the correct grouping will include 01:01.0, 00:1f.0/2/6 together in a
> group if there is any internal loopback within the MFD 00:1f. The current
> algorithm does not understand this and gives 01:01.0 its own group even
> if it thinks there is an internal loopback in the MFD.
>
> Unfortunately this detail makes it hard to fix. Currently the code assumes
> that any MFD without an ACS cap has an internal loopback which will cause
> a large number of modern real systems to group in a pessimistic way.
>
> However, the PCI spec does not really support this:
>
> PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and Multi-Function
> Devices
>
> ACS P2P Request Redirect: must be implemented by Functions that
> support peer-to-peer traffic with other Functions.
>
> Meaning from a spec perspective the absence of ACS indicates the absence
> of internal loopback. Granted I think we are aware of older real devices
> that ignore this, but it seems to be the only way forward.
>
> So, rely on 6.12.1.2 and assume functions without ACS do not have internal
> loopback. This resolves the common issue with modern systems and MFD root
> ports, but it makes the ACS quirks system less used. Instead we'd want
> quirks that say self-loopback is actually present, not like today's quirks
> that say it is absent. This is surely negative for older hardware, but
> positive for new HW that complies with the spec.
>
> Use pci_reachable_set() in pci_device_group() to make the resulting
> algorithm faster and easier to understand.
>
> Add pci_mfds_are_same_group() which specifically looks pair-wise at all
> functions in the MFDs. Any function with ACS capabilities and non-isolating
> ACS flags is treated as able to reach all other functions.
>
> pci_reachable_set() does the calculations for figuring out the set of
> devices under the pci_bus_sem, which is better than repeatedly searching
> across all PCI devices.
>
> Once the set of devices is determined and the set has more than one device
> use pci_get_slot() to search for any existing groups in the reachable set.
>
> Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/iommu/iommu.c | 189 +++++++++++++++++++-----------------------
> 1 file changed, 87 insertions(+), 102 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 543d6347c0e5e3..fc3c71b243a850 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1413,85 +1413,6 @@ int iommu_group_id(struct iommu_group *group)
> }
> EXPORT_SYMBOL_GPL(iommu_group_id);
>
> -static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> - unsigned long *devfns);
> -
> -/*
> - * For multifunction devices which are not isolated from each other, find
> - * all the other non-isolated functions and look for existing groups. For
> - * each function, we also need to look for aliases to or from other devices
> - * that may already have a group.
> - */
> -static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
> - unsigned long *devfns)
> -{
> - struct pci_dev *tmp = NULL;
> - struct iommu_group *group;
> -
> - if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
> - return NULL;
> -
> - for_each_pci_dev(tmp) {
> - if (tmp == pdev || tmp->bus != pdev->bus ||
> - PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) ||
> - pci_acs_enabled(tmp, PCI_ACS_ISOLATED))
> - continue;
> -
> - group = get_pci_alias_group(tmp, devfns);
> - if (group) {
> - pci_dev_put(tmp);
> - return group;
> - }
> - }
> -
> - return NULL;
> -}
> -
> -/*
> - * Look for aliases to or from the given device for existing groups. DMA
> - * aliases are only supported on the same bus, therefore the search
> - * space is quite small (especially since we're really only looking at pcie
> - * device, and therefore only expect multiple slots on the root complex or
> - * downstream switch ports). It's conceivable though that a pair of
> - * multifunction devices could have aliases between them that would cause a
> - * loop. To prevent this, we use a bitmap to track where we've been.
> - */
> -static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> - unsigned long *devfns)
> -{
> - struct pci_dev *tmp = NULL;
> - struct iommu_group *group;
> -
> - if (test_and_set_bit(pdev->devfn & 0xff, devfns))
> - return NULL;
> -
> - group = iommu_group_get(&pdev->dev);
> - if (group)
> - return group;
> -
> - for_each_pci_dev(tmp) {
> - if (tmp == pdev || tmp->bus != pdev->bus)
> - continue;
> -
> - /* We alias them or they alias us */
> - if (pci_devs_are_dma_aliases(pdev, tmp)) {
> - group = get_pci_alias_group(tmp, devfns);
> - if (group) {
> - pci_dev_put(tmp);
> - return group;
> - }
> -
> - group = get_pci_function_alias_group(tmp, devfns);
> - if (group) {
> - pci_dev_put(tmp);
> - return group;
> - }
> - }
> - }
> -
> - return NULL;
> -}
> -
> /*
> * Generic device_group call-back function. It just allocates one
> * iommu-group per device.
> @@ -1534,44 +1455,108 @@ static struct iommu_group *pci_group_alloc_non_isolated(void)
> return group;
> }
>
> +/*
> + * All functions in the MFD need to be isolated from each other and get their
> + * own groups, otherwise the whole MFD will share a group.
> + */
> +static bool pci_mfds_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
> +{
> + /*
> + * SRIOV VFs will use the group of the PF if it has
> + * BUS_DATA_PCI_NON_ISOLATED. We don't support VFs that also have ACS
> + * that are set to non-isolating.
> + */
> + if (deva->is_virtfn || devb->is_virtfn)
> + return false;
> +
> + /* Are deva/devb functions in the same MFD? */
> + if (PCI_SLOT(deva->devfn) != PCI_SLOT(devb->devfn))
> + return false;
> + /* Don't understand what is happening, be conservative */
> + if (deva->multifunction != devb->multifunction)
> + return true;
> + if (!deva->multifunction)
> + return false;
> +
> + /*
> + * PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and
> + * Multi-Function Devices
> + * ...
> + * ACS P2P Request Redirect: must be implemented by Functions that
> + * support peer-to-peer traffic with other Functions.
> + *
> + * Therefore assume if a MFD has no ACS capability then it does not
> + * support a loopback. This is the reverse of what Linux <= v6.16
> + * assumed - that any MFD was capable of P2P and used quirks to identify
> + * devices that complied with the above.
> + */
> + if (deva->acs_cap && !pci_acs_enabled(deva, PCI_ACS_ISOLATED))
> + return true;
> + if (devb->acs_cap && !pci_acs_enabled(devb, PCI_ACS_ISOLATED))
> + return true;
> + return false;
> +}
> +
> +static bool pci_devs_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
> +{
> + /*
> + * This is allowed to return cycles: a,b -> b,c -> c,a can be aliases.
> + */
> + if (pci_devs_are_dma_aliases(deva, devb))
> + return true;
> +
> + return pci_mfds_are_same_group(deva, devb);
> +}
> +
> /*
> * Return a group if the function has isolation restrictions related to
> * aliases or MFD ACS.
> */
> static struct iommu_group *pci_get_function_group(struct pci_dev *pdev)
> {
> - struct iommu_group *group;
> - DECLARE_BITMAP(devfns, 256) = {};
> + struct pci_reachable_set devfns;
> + const unsigned int NR_DEVFNS = sizeof(devfns.devfns) * BITS_PER_BYTE;
> + unsigned int devfn;
>
> /*
> - * Look for existing groups on device aliases. If we alias another
> - * device or another device aliases us, use the same group.
> + * Look for existing groups on device aliases and multi-function ACS. If
> + * we alias another device or another device aliases us, use the same
> + * group.
> + *
> + * pci_reachable_set() should return the same bitmap if called for any
> + * device in the set and we want all devices in the set to have the same
> + * group.
> */
> - group = get_pci_alias_group(pdev, devfns);
> - if (group)
> - return group;
> + pci_reachable_set(pdev, &devfns, pci_devs_are_same_group);
> + /* start is known to have iommu_group_get() == NULL */
> + __clear_bit(pdev->devfn, devfns.devfns);
>
> /*
> - * Look for existing groups on non-isolated functions on the same
> - * slot and aliases of those funcions, if any. No need to clear
> - * the search bitmap, the tested devfns are still valid.
> - */
> - group = get_pci_function_alias_group(pdev, devfns);
> - if (group)
> - return group;
> -
> - /*
> - * When MFD's are included in the set due to ACS we assume that if ACS
> - * permits an internal loopback between functions it also permits the
> - * loopback to go downstream if a function is a bridge.
> + * When MFD functions are included in the set due to ACS we assume that
> + * if ACS permits an internal loopback between functions it also permits
> + * the loopback to go downstream if any function is a bridge.
> *
> * It is less clear what aliases mean when applied to a bridge. For now
> * be conservative and also propagate the group downstream.
> */
> - __clear_bit(pdev->devfn & 0xFF, devfns);
> - if (!bitmap_empty(devfns, sizeof(devfns) * BITS_PER_BYTE))
> - return pci_group_alloc_non_isolated();
> - return NULL;
... and why was the above code added in patch 3 and then redone here to use
the reachable() support from patch 5, when patch 5 could be moved before
patch 3 so we just get this final implementation, dropping (some of) patch 3?
> + if (bitmap_empty(devfns.devfns, NR_DEVFNS))
> + return NULL;
> +
> + for_each_set_bit(devfn, devfns.devfns, NR_DEVFNS) {
> + struct iommu_group *group;
> + struct pci_dev *pdev_slot;
> +
> + pdev_slot = pci_get_slot(pdev->bus, devfn);
> + group = iommu_group_get(&pdev_slot->dev);
> + pci_dev_put(pdev_slot);
> + if (group) {
> + if (WARN_ON(!(group->bus_data &
> + BUS_DATA_PCI_NON_ISOLATED)))
> + group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
> + return group;
> + }
> + }
> + return pci_group_alloc_non_isolated();
> }
>
> /* Return a group if the upstream hierarchy has isolation restrictions. */
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups
2025-09-05 18:06 ` [PATCH v3 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
@ 2025-09-09 5:00 ` Donald Dutile
2025-09-09 15:35 ` Jason Gunthorpe
0 siblings, 1 reply; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 5:00 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 9/5/25 2:06 PM, Jason Gunthorpe wrote:
> Directly check that the devices touched by pci_for_each_dma_alias() match
> the groups that were built by pci_device_group(). This helps validate that
Do they have to match, as in equal, or be included?
> pci_for_each_dma_alias() and pci_bus_isolated() are consistent.
>
> This should eventually be hidden behind a debug kconfig, but for now it is
> good to get feedback from more diverse systems if there are any problems.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/iommu/iommu.c | 76 ++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 75 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index fc3c71b243a850..2bd43a5a9ad8d8 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1627,7 +1627,7 @@ static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
> * Once a PCI bus becomes non-isolating, the entire downstream hierarchy of
> * that bus becomes a single group.
> */
> -struct iommu_group *pci_device_group(struct device *dev)
> +static struct iommu_group *__pci_device_group(struct device *dev)
> {
> struct pci_dev *pdev = to_pci_dev(dev);
> struct iommu_group *group;
> @@ -1734,6 +1734,80 @@ struct iommu_group *pci_device_group(struct device *dev)
> WARN_ON(true);
> return ERR_PTR(-EINVAL);
> }
> +
> +struct check_group_aliases_data {
> + struct pci_dev *pdev;
> + struct iommu_group *group;
> +};
> +
> +static void pci_check_group(const struct check_group_aliases_data *data,
> + u16 alias, struct pci_dev *pdev)
> +{
> + struct iommu_group *group;
> +
> + group = iommu_group_get(&pdev->dev);
> + if (!group)
> + return;
> +
> + if (group != data->group)
> + dev_err(&data->pdev->dev,
> + "During group construction alias processing needed dev %s alias %x to have the same group but %u != %u\n",
> + pci_name(pdev), alias, data->group->id, group->id);
> + iommu_group_put(group);
> +}
> +
> +static int pci_check_group_aliases(struct pci_dev *pdev, u16 alias,
> + void *opaque)
> +{
> + const struct check_group_aliases_data *data = opaque;
> +
> + /*
> + * Sometimes when a PCIe-PCI bridge is performing transactions on behalf
> + * of its subordinate bus it uses devfn=0 on the subordinate bus as the
> + * alias. This means that 0 will alias with all devfns on the
> + * subordinate bus and so we expect to see those in the same group. pdev
> + * in this case is the bridge itself and pdev->bus is the primary bus of
> + * the bridge.
> + */
> + if (pdev->bus->number != PCI_BUS_NUM(alias)) {
> + struct pci_dev *piter = NULL;
> +
> + for_each_pci_dev(piter) {
> + if (pci_domain_nr(pdev->bus) ==
> + pci_domain_nr(piter->bus) &&
> + PCI_BUS_NUM(alias) == piter->bus->number)
> + pci_check_group(data, alias, piter);
> + }
> + } else {
> + pci_check_group(data, alias, pdev);
> + }
> +
> + return 0;
> +}
> +
> +struct iommu_group *pci_device_group(struct device *dev)
> +{
> + struct check_group_aliases_data data = {
> + .pdev = to_pci_dev(dev),
> + };
> + struct iommu_group *group;
> +
> + if (!IS_ENABLED(CONFIG_PCI))
> + return ERR_PTR(-EINVAL);
> +
> + group = __pci_device_group(dev);
> + if (IS_ERR(group))
> + return group;
> +
> + /*
> + * The IOMMU driver should use pci_for_each_dma_alias() to figure out
> + * what RIDs to program and the core requires all the RIDs to fall
> + * within the same group. Validate that everything worked properly.
> + */
> + data.group = group;
> + pci_for_each_dma_alias(data.pdev, pci_check_group_aliases, &data);
> + return group;
> +}
> EXPORT_SYMBOL_GPL(pci_device_group);
>
> /* Get the IOMMU group for device on fsl-mc bus */
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 08/11] PCI: Add the ACS Enhanced Capability definitions
2025-09-05 18:06 ` [PATCH v3 08/11] PCI: Add the ACS Enhanced Capability definitions Jason Gunthorpe
@ 2025-09-09 5:01 ` Donald Dutile
0 siblings, 0 replies; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 5:01 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 9/5/25 2:06 PM, Jason Gunthorpe wrote:
> This brings the definitions up to PCI Express revision 5.0:
>
> * ACS I/O Request Blocking Enable
> * ACS DSP Memory Target Access Control
> * ACS USP Memory Target Access Control
> * ACS Unclaimed Request Redirect
>
> Support for this entire grouping is advertised by the ACS Enhanced
> Capability bit.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> include/uapi/linux/pci_regs.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 6095e7d7d4cc48..54621e6e83572e 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1005,8 +1005,16 @@
> #define PCI_ACS_UF 0x0010 /* Upstream Forwarding */
> #define PCI_ACS_EC 0x0020 /* P2P Egress Control */
> #define PCI_ACS_DT 0x0040 /* Direct Translated P2P */
> +#define PCI_ACS_ENHANCED 0x0080 /* IORB, DSP_MT_xx, USP_MT_xx. Capability only */
> +#define PCI_ACS_EGRESS_CTL_SZ GENMASK(15, 8) /* Egress Control Vector Size */
> #define PCI_ACS_EGRESS_BITS 0x05 /* ACS Egress Control Vector Size */
> #define PCI_ACS_CTRL 0x06 /* ACS Control Register */
> +#define PCI_ACS_IORB 0x0080 /* I/O Request Blocking */
> +#define PCI_ACS_DSP_MT_RB 0x0100 /* DSP Memory Target Access Control Request Blocking */
> +#define PCI_ACS_DSP_MT_RR 0x0200 /* DSP Memory Target Access Control Request Redirect */
> +#define PCI_ACS_USP_MT_RB 0x0400 /* USP Memory Target Access Control Request Blocking */
> +#define PCI_ACS_USP_MT_RR 0x0800 /* USP Memory Target Access Control Request Redirect */
> +#define PCI_ACS_UNCLAIMED_RR 0x1000 /* Unclaimed Request Redirect Control */
> #define PCI_ACS_EGRESS_CTL_V 0x08 /* ACS Egress Control Vector */
>
> /*
Reviewed-by: Donald Dutile <ddutile@redhat.com>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 09/11] PCI: Enable ACS Enhanced bits for enable_acs and config_acs
2025-09-05 18:06 ` [PATCH v3 09/11] PCI: Enable ACS Enhanced bits for enable_acs and config_acs Jason Gunthorpe
@ 2025-09-09 5:01 ` Donald Dutile
0 siblings, 0 replies; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 5:01 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 9/5/25 2:06 PM, Jason Gunthorpe wrote:
> The ACS Enhanced bits are intended to address a lack of precision in the
> spec about what ACS P2P Request Redirect is supposed to do. While Linux
> has long assumed that PCI_ACS_RR would cover MMIO BARs located in the root
> port and PCIe Switch ports, the spec took the position that it is
> implementation specific.
>
> To get the behavior Linux has long assumed it should be setting:
>
> + PCI_ACS_RR | PCI_ACS_DSP_MT_RR | PCI_ACS_USP_MT_RR | PCI_ACS_UNCLAIMED_RR
>
> Follow this guidance in enable_acs and set the additional bits if ACS
> Enhanced is supported.
>
> Allow config_acs to control these bits if the device has ACS Enhanced.
>
> The spec permits the HW to wire the bits, so after setting them
> pci_acs_flags_enabled() does do a pci_read_config_word() to read the
> actual value in effect.
>
> Note that currently Linux sets these bits to 0, so any new HW that comes
> supporting ACS Enhanced will end up with historical Linux disabling these
> functions. Devices wanting to be compatible with old Linux will need to
> wire the ctrl bits to follow ACS_RR. Devices that implement ACS Enhanced
> and support the ctrl=0 behavior will break PASID SVA support and VFIO
> isolation when ACS_RR is enabled.
>
> Due to the above I strongly encourage backporting this change otherwise
> old kernels may have issues with new generations of PCI switches.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/pci/pci.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b0f4d98036cddd..983f71211f0055 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -957,6 +957,7 @@ static void __pci_config_acs(struct pci_dev *dev, struct pci_acs *caps,
> const char *p, const u16 acs_mask, const u16 acs_flags)
> {
> u16 flags = acs_flags;
> + u16 supported_flags;
> u16 mask = acs_mask;
> char *delimit;
> int ret = 0;
> @@ -1001,8 +1002,14 @@ static void __pci_config_acs(struct pci_dev *dev, struct pci_acs *caps,
> }
> }
>
> - if (mask & ~(PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR |
> - PCI_ACS_UF | PCI_ACS_EC | PCI_ACS_DT)) {
> + supported_flags = PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
> + PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_EC |
> + PCI_ACS_DT;
> + if (caps->cap & PCI_ACS_ENHANCED)
> + supported_flags |= PCI_ACS_USP_MT_RR |
> + PCI_ACS_DSP_MT_RR |
> + PCI_ACS_UNCLAIMED_RR;
> + if (mask & ~supported_flags) {
> pci_err(dev, "Invalid ACS flags specified\n");
> return;
> }
> @@ -1062,6 +1069,14 @@ static void pci_std_enable_acs(struct pci_dev *dev, struct pci_acs *caps)
> /* Upstream Forwarding */
> caps->ctrl |= (caps->cap & PCI_ACS_UF);
>
> + /*
> + * USP/DSP Memory Target Access Control and Unclaimed Request Redirect
> + */
> + if (caps->cap & PCI_ACS_ENHANCED) {
> + caps->ctrl |= PCI_ACS_USP_MT_RR | PCI_ACS_DSP_MT_RR |
> + PCI_ACS_UNCLAIMED_RR;
> + }
> +
> /* Enable Translation Blocking for external devices and noats */
> if (pci_ats_disabled() || dev->external_facing || dev->untrusted)
> caps->ctrl |= (caps->cap & PCI_ACS_TB);
Reviewed-by: Donald Dutile <ddutile@redhat.com>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
2025-09-05 18:06 ` [PATCH v3 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
@ 2025-09-09 5:02 ` Donald Dutile
2025-09-09 21:43 ` Bjorn Helgaas
1 sibling, 0 replies; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 5:02 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 9/5/25 2:06 PM, Jason Gunthorpe wrote:
> Switches ignore the PASID when routing TLPs. This means the path from the
> PASID issuing end point to the IOMMU must be direct with no possibility
> for another device to claim the addresses.
>
> This is done using ACS flags and pci_enable_pasid() checks for this.
>
> The new ACS Enhanced bits clarify some undefined behaviors in the spec
> around what P2P Request Redirect means.
>
> Linux has long assumed that PCI_ACS_RR implies PCI_ACS_DSP_MT_RR |
> PCI_ACS_USP_MT_RR | PCI_ACS_UNCLAIMED_RR.
>
> If the device supports ACS Enhanced then use the information it reports to
> determine if PASID SVA is supported or not.
>
> PCI_ACS_DSP_MT_RR: Prevents Downstream Port BAR's from claiming upstream
> flowing transactions
>
> PCI_ACS_USP_MT_RR: Prevents Upstream Port BAR's from claiming upstream
> flowing transactions
>
> PCI_ACS_UNCLAIMED_RR: Prevents a hole in the USP bridge window compared
> + to all the DSP bridge windows from generating an
> error.
>
> Each of these cases would poke a hole in the PASID address space which is
> not permitted.
>
> Enhance the comments around pci_acs_flags_enabled() to better explain the
> reasoning for its logic. Continue to take the approach of assuming the
> device is doing the "right ACS" if it does not explicitly declare
> otherwise.
>
> Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/pci/ats.c | 4 +++-
> drivers/pci/pci.c | 54 +++++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 50 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
> index ec6c8dbdc5e9c9..00603c2c4ff0ea 100644
> --- a/drivers/pci/ats.c
> +++ b/drivers/pci/ats.c
> @@ -416,7 +416,9 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
> if (!pasid)
> return -EINVAL;
>
> - if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
> + if (!pci_acs_path_enabled(pdev, NULL,
> + PCI_ACS_RR | PCI_ACS_UF | PCI_ACS_USP_MT_RR |
> + PCI_ACS_DSP_MT_RR | PCI_ACS_UNCLAIMED_RR))
> return -EINVAL;
>
> pci_read_config_word(pdev, pasid + PCI_PASID_CAP, &supported);
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 983f71211f0055..620b7f79093854 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3606,6 +3606,52 @@ void pci_configure_ari(struct pci_dev *dev)
> }
> }
>
> +
> +/*
> + * The spec is not clear what it means if the capability bit is 0. One view is
> + * that the device acts as though the ctrl bit is zero, another view is the
> + * device behavior is undefined.
> + *
> + * Historically Linux has taken the position that the capability bit as 0 means
> + * the device supports the most favorable interpretation of the spec - ie that
> + * things like P2P RR are always on. As this is security sensitive we expect
> + * devices that do not follow this rule to be quirked.
> + *
> + * ACS Enhanced eliminated undefined areas of the spec around MMIO in root ports
> + * and switch ports. If those ports have no MMIO then it is not relevant.
> + * PCI_ACS_UNCLAIMED_RR eliminates the undefined area around an upstream switch
> + * window that is not fully decoded by the downstream windows.
> + *
> + * This takes the same approach with ACS Enhanced, if the device does not
> + * support it then we assume the ACS P2P RR has all the enhanced behaviors too.
> + *
> + * Due to ACS Enhanced bits being force set to 0 by older Linux kernels, and
> + * those values would break old kernels on the edge cases they cover, the only
> + * compatible thing for a new device to implement is ACS Enhanced supported with
> + * the control bits (except PCI_ACS_IORB) wired to follow ACS_RR.
> + */
> +static u16 pci_acs_ctrl_mask(struct pci_dev *pdev, u16 hw_cap)
> +{
> + /*
> + * Egress Control enables use of the Egress Control Vector which is not
> + * present without the cap.
> + */
> + u16 mask = PCI_ACS_EC;
> +
> + mask |= hw_cap & (PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
> + PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
> +
> + /*
> + * If ACS Enhanced is supported the device reports what it is doing
> + * through these bits which may not be settable.
> + */
> + if (hw_cap & PCI_ACS_ENHANCED)
> + mask |= PCI_ACS_IORB | PCI_ACS_DSP_MT_RB | PCI_ACS_DSP_MT_RR |
> + PCI_ACS_USP_MT_RB | PCI_ACS_USP_MT_RR |
> + PCI_ACS_UNCLAIMED_RR;
> + return mask;
> +}
> +
> static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
> {
> int pos;
> @@ -3615,15 +3661,9 @@ static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
> if (!pos)
> return false;
>
> - /*
> - * Except for egress control, capabilities are either required
> - * or only required if controllable. Features missing from the
> - * capability field can therefore be assumed as hard-wired enabled.
> - */
> pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
> - acs_flags &= (cap | PCI_ACS_EC);
> -
> pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> + acs_flags &= pci_acs_ctrl_mask(pdev, cap);
> return (ctrl & acs_flags) == acs_flags;
> }
>
Reviewed-by: Donald Dutile <ddutile@redhat.com>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 11/11] PCI: Check ACS Extended flags for pci_bus_isolated()
2025-09-05 18:06 ` [PATCH v3 11/11] PCI: Check ACS Extended flags for pci_bus_isolated() Jason Gunthorpe
@ 2025-09-09 5:04 ` Donald Dutile
0 siblings, 0 replies; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 5:04 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave
On 9/5/25 2:06 PM, Jason Gunthorpe wrote:
> When looking at a PCIe switch we want to see that the USP/DSP MMIO have
> request redirect enabled. Detect the case where the USP is expressly not
> isolated from the DSP and ensure the USP is included in the group.
>
> The DSP Memory Target also applies to the Root Port, check it there
> too. If upstream directed transactions can reach the root port MMIO then
> it is not isolated.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/pci/search.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/search.c b/drivers/pci/search.c
> index dac6b042fd5f5d..cba417cbe3476e 100644
> --- a/drivers/pci/search.c
> +++ b/drivers/pci/search.c
> @@ -127,6 +127,8 @@ static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
> * traffic flowing upstream back downstream through another DSP.
> *
> * Thus any non-permissive DSP spoils the whole bus.
> + * PCI_ACS_UNCLAIMED_RR is not required since rejecting requests with
> + * error is still isolation.
> */
> guard(rwsem_read)(&pci_bus_sem);
> list_for_each_entry(pdev, &bus->devices, bus_list) {
> @@ -136,8 +138,14 @@ static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
> pdev->dma_alias_mask)
> return PCIE_NON_ISOLATED;
>
> - if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
> + if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED |
> + PCI_ACS_DSP_MT_RR |
> + PCI_ACS_USP_MT_RR)) {
> + /* The USP is isolated from the DSP */
> + if (!pci_acs_enabled(pdev, PCI_ACS_USP_MT_RR))
> + return PCIE_NON_ISOLATED;
> return PCIE_SWITCH_DSP_NON_ISOLATED;
> + }
> }
> return PCIE_ISOLATED;
> }
> @@ -232,11 +240,13 @@ enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> /*
> * Since PCIe links are point to point root ports are isolated if there
> * is no internal loopback to the root port's MMIO. Like MFDs assume if
> - * there is no ACS cap then there is no loopback.
> + * there is no ACS cap then there is no loopback. The root port uses
> + * DSP_MT_RR for its own MMIO.
> */
> case PCI_EXP_TYPE_ROOT_PORT:
> if (bridge->acs_cap &&
> - !pci_acs_enabled(bridge, PCI_ACS_ISOLATED))
> + !pci_acs_enabled(bridge,
> + PCI_ACS_ISOLATED | PCI_ACS_DSP_MT_RR))
> return PCIE_NON_ISOLATED;
> return PCIE_ISOLATED;
>
Reviewed-by: Donald Dutile <ddutile@redhat.com>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 03/11] iommu: Compute iommu_groups properly for PCIe switches
2025-09-09 4:14 ` Donald Dutile
@ 2025-09-09 12:18 ` Jason Gunthorpe
2025-09-09 19:33 ` Donald Dutile
0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-09 12:18 UTC (permalink / raw)
To: Donald Dutile
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Tue, Sep 09, 2025 at 12:14:00AM -0400, Donald Dutile wrote:
> > -/*
> > - * Use standard PCI bus topology, isolation features, and DMA alias quirks
> > - * to find or create an IOMMU group for a device.
> > - */
> > -struct iommu_group *pci_device_group(struct device *dev)
> > +static struct iommu_group *pci_group_alloc_non_isolated(void)
> Maybe iommu_group_alloc_non_isolated() would be a better name, since that's all it does.
The way I've organized it makes the bus data a per-bus thing, so
having pci in the name when setting BUS_DATA_PCI_NON_ISOLATED is
correct.
What I did was turn iommu_group_alloc() into
static struct iommu_group *iommu_group_alloc_data(u32 bus_data)
Then
struct iommu_group *iommu_group_alloc(void)
{
return iommu_group_alloc_data(0);
}
And instead of pci_group_alloc_non_isolated() it is just:
return iommu_group_alloc_data(BUS_DATA_PCI_NON_ISOLATED);
So everything is setup generically if someday another bus would like
to have its own data.
Jason
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs
2025-09-09 4:57 ` Donald Dutile
@ 2025-09-09 13:31 ` Jason Gunthorpe
2025-09-09 19:55 ` Donald Dutile
0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-09 13:31 UTC (permalink / raw)
To: Donald Dutile
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Tue, Sep 09, 2025 at 12:57:59AM -0400, Donald Dutile wrote:
> ... and why was the above code done in patch 3 and then undone here to
> use the reachable() support in patch 5 below, when patch 5 could be moved before
> patch 3, and we just get to this final implementation, dropping (some of) patch 3?
If you use that order then the switch stuff has to be done and redone :(
I put it in this order because the switch change seems lower risk to
me. Fewer people have switches in their system. While the MFD change
on top is higher risk, even my simple consumer test systems hit
troubles with it.
Jason
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups
2025-09-09 5:00 ` Donald Dutile
@ 2025-09-09 15:35 ` Jason Gunthorpe
2025-09-09 19:58 ` Donald Dutile
0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-09 15:35 UTC (permalink / raw)
To: Donald Dutile
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Tue, Sep 09, 2025 at 01:00:08AM -0400, Donald Dutile wrote:
>
>
> On 9/5/25 2:06 PM, Jason Gunthorpe wrote:
> > Directly check that the devices touched by pci_for_each_dma_alias() match
> > the groups that were built by pci_device_group(). This helps validate that
> Do they have to match, as in equal, or be included ?
All aliases have to be in the same group, or have no group discovered yet.
Jason
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 03/11] iommu: Compute iommu_groups properly for PCIe switches
2025-09-09 12:18 ` Jason Gunthorpe
@ 2025-09-09 19:33 ` Donald Dutile
0 siblings, 0 replies; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 19:33 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 9/9/25 8:18 AM, Jason Gunthorpe wrote:
> On Tue, Sep 09, 2025 at 12:14:00AM -0400, Donald Dutile wrote:
>
>>> -/*
>>> - * Use standard PCI bus topology, isolation features, and DMA alias quirks
>>> - * to find or create an IOMMU group for a device.
>>> - */
>>> -struct iommu_group *pci_device_group(struct device *dev)
>>> +static struct iommu_group *pci_group_alloc_non_isolated(void)
>
>> Maybe iommu_group_alloc_non_isolated() would be a better name, since that's all it does.
>
> The way I've organized it makes the bus data a per-bus thing, so
> having pci in the name when setting BUS_DATA_PCI_NON_ISOLATED is
> correct.
>
> What I did was turn iommu_group_alloc() into
>
> static struct iommu_group *iommu_group_alloc_data(u32 bus_data)
>
> Then
>
> struct iommu_group *iommu_group_alloc(void)
> {
> return iommu_group_alloc_data(0);
> }
>
> And instead of pci_group_alloc_non_isolated() it is just:
>
> return iommu_group_alloc_data(BUS_DATA_PCI_NON_ISOLATED);
>
> So everything is setup generically if someday another bus would like
> to have its own data.
>
/my bad, I scanned pci_group_alloc_non_isolated() as calling iommu_group_alloc() & not iommu_group_alloc_data() as you pointed out.
Looks good.
Reviewed-by: Donald Dutile <ddutile@redhat.com>
> Jason
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 02/11] PCI: Add pci_bus_isolated()
2025-09-05 18:06 ` [PATCH v3 02/11] PCI: Add pci_bus_isolated() Jason Gunthorpe
2025-09-09 4:09 ` Donald Dutile
@ 2025-09-09 19:54 ` Bjorn Helgaas
2025-09-09 21:21 ` Jason Gunthorpe
1 sibling, 1 reply; 52+ messages in thread
From: Bjorn Helgaas @ 2025-09-09 19:54 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, Donald Dutile, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Fri, Sep 05, 2025 at 03:06:17PM -0300, Jason Gunthorpe wrote:
> Prepare to move the calculation of the bus P2P isolation out of the iommu
> code and into the PCI core. This allows using the faster list iteration
> under the pci_bus_sem, and the code has a kinship with the logic in
> pci_for_each_dma_alias().
>
> Bus isolation is the concept that drives the iommu_groups for the purposes
> of VFIO. Stated simply, if device A can send traffic to device B then they
> must be in the same group.
>
> Only PCIe provides isolation. The multi-drop electrical topology in
> classical PCI allows any bus member to claim the transaction.
>
> In PCIe isolation comes out of ACS. If a PCIe Switch and Root Complex has
> ACS flags that prevent peer to peer traffic and funnel all operations to
> the IOMMU then devices can be isolated.
I guess a device being isolated means that peer-to-peer Requests from
a different bus can't reach it?
Did you mean "Root Port" instead of "Root Complex"? Or are you
assuming an ACS Capability in an RCRB? (I don't think Linux supports
RCRBs, except maybe for CXL)
> Multi-function devices also have an isolation concern with self loopback
> between the functions, though pci_bus_isolated() does not deal with
> devices.
It looks like multi-function devices *can* implement ACS and can
isolate functions from each other (PCIe r7.0, sec 6.12.1.2). But it
sounds like we're ignoring peer-to-peer on the same bus for now and
assuming devices on the same bus can't be isolated from each other?
If we ignore ACS on non-bridge multi-function devices, I think the
only way to isolate things is bridge ACS that controls forwarding
between buses. If everything on the bus must be in the same group, it
makes sense for pci_bus_isolated() to take a pci_bus pointer and not
deal with individual devices.
Below, it seems like sometimes we refer to *buses* being isolated and
other times *devices* (Root Port, Switch Port, Switch, etc), so I'm a
little confused.
> As a property of a bus, there are several positive cases:
>
> - The point to point "bus" on a physical PCIe link is isolated if the
> bridge/root device has something preventing self-access to its own
> MMIO.
>
> - A Root Port is usually isolated
>
> - A PCIe switch can be isolated if all it's Down Stream Ports have good
> ACS flags
I guess this is saying that a switch's internal bus is isolated if all
the DSPs have the ACS flags we need?
s/it's/its/
s/Down Stream Ports/Downstream Ports/
> pci_bus_isolated() implements these rules and returns an enum indicating
> the level of isolation the bus has, with five possibilities:
>
> PCIE_ISOLATED: Traffic on this PCIE bus can not do any P2P.
Is this saying that peer-to-peer Requests can't reach devices on this
bus? Or Requests *from* this bus can only go to the IOMMU?
> PCIE_SWITCH_DSP_NON_ISOLATED: The bus is the internal bus of a PCIE
> switch and the USP is isolated but the DSPs are not.
>
> PCIE_NON_ISOLATED: The PCIe bus has no isolation between the bridge or
> any downstream devices.
>
> PCI_BUS_NON_ISOLATED: It is a PCI/PCI-X but the bridge is PCIe, has no
> aliases and the bridge is isolated from the bus.
s|PCI/PCI-X|PCI/PCI-X bus| to match below?
> PCI_BRIDGE_NON_ISOLATED: It is a PCI/PCI-X bus and has no isolation, the
> bridge is part of the group.
>
> The calculation is done per-bus, so it is possible for a transactions from
> a PCI device to travel through different bus isolation types on its way
> upstream. PCIE_SWITCH_DSP_NON_ISOLATED/PCI_BUS_NON_ISOLATED and
> PCIE_NON_ISOLATED/PCI_BRIDGE_NON_ISOLATED are the same for the purposes of
> creating iommu groups. The distinction between PCIe and PCI allows for
> easier understanding and debugging as to why the groups are chosen.
s/for a transactions/for a transaction/
> For the iommu groups if all busses on the upstream path are PCIE_ISOLATED
> then the end device has a chance to have a single-device iommu_group. Once
> any non-isolated bus segment is found that bus segment will have an
> iommu_group that captures all downstream devices, and sometimes the
> upstream bridge.
>
> pci_bus_isolated() is principally about isolation, but there is an
> overlap with grouping requirements for legacy PCI aliasing. For purely
> legacy PCI environments pci_bus_isolated() returns
> PCI_BRIDGE_NON_ISOLATED for everything and all devices within a hierarchy
> are in one group. No need to worry about bridge aliasing.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/pci/search.c | 174 +++++++++++++++++++++++++++++++++++++++++++
> include/linux/pci.h | 31 ++++++++
> 2 files changed, 205 insertions(+)
>
> diff --git a/drivers/pci/search.c b/drivers/pci/search.c
> index 53840634fbfc2b..fe6c07e67cb8ce 100644
> --- a/drivers/pci/search.c
> +++ b/drivers/pci/search.c
> @@ -113,6 +113,180 @@ int pci_for_each_dma_alias(struct pci_dev *pdev,
> return ret;
> }
>
> +static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
> +{
> + struct pci_dev *pdev;
> +
> + /*
> + * Within a PCIe switch we have an interior bus that has the Upstream
> + * port as the bridge and a set of Downstream port bridging to the
> + * egress ports.
s/interior/internal/ to match commit log and use below
s/Upstream port/Upstream Port/
s/set of Downstream port/set of Downstream Ports/
> + *
> + * Each DSP has an ACS setting which controls where its traffic is
> + * permitted to go. Any DSP with a permissive ACS setting can send
> + * traffic flowing upstream back downstream through another DSP.
> + *
> + * Thus any non-permissive DSP spoils the whole bus.
s/non-permissive/permissive/ ? seems backwards to me
> + guard(rwsem_read)(&pci_bus_sem);
> + list_for_each_entry(pdev, &bus->devices, bus_list) {
> + /* Don't understand what this is, be conservative */
> + if (!pci_is_pcie(pdev) ||
> + pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM ||
> + pdev->dma_alias_mask)
> + return PCIE_NON_ISOLATED;
> +
> + if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
> + return PCIE_SWITCH_DSP_NON_ISOLATED;
> + }
> + return PCIE_ISOLATED;
> +}
> +
> +static bool pci_has_mmio(struct pci_dev *pdev)
> +{
> + unsigned int i;
> +
> + for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
> + struct resource *res = pci_resource_n(pdev, i);
> +
> + if (resource_size(res) && resource_type(res) == IORESOURCE_MEM)
> + return true;
> + }
> + return false;
> +}
> +
> +/**
> + * pci_bus_isolated - Determine how isolated connected devices are
> + * @bus: The bus to check
> + *
> + * Isolation is the ability of devices to talk to each other. Full isolation
> + * means that a device can only communicate with the IOMMU and can not do peer
> + * to peer within the fabric.
I would say "isolation" is something about the ability to *prevent*
devices from talking to each other.
> + * We consider isolation on a bus by bus basis. If the bus will permit a
> + * transaction originated downstream to complete on anything other than the
> + * IOMMU then the bus is not isolated.
> + *
> + * Non-isolation includes all the downstream devices on this bus, and it may
> + * include the upstream bridge or port that is creating this bus.
> + *
> + * The various cases are returned in an enum.
> + *
> + * Broadly speaking this function evaluates the ACS settings in a PCI switch to
> + * determine if a PCI switch is configured to have full isolation.
s/PCI/PCIe/ since other text here is pretty consistent about
distinguishing them
Maybe s/if a PCI switch/if it/, since they must refer to the same
device.
> + * Old PCI/PCI-X busses cannot have isolation due to their physical properties,
> + * but they do have some aliasing properties that effect group creation.
s/effect/affect/
> + * pci_bus_isolated() does not consider loopback internal to devices, like
> + * multi-function devices performing a self-loopback. The caller must check
> + * this separately. It does not considering alasing within the bus.
s/alasing/aliasing/ (I guess this refers to the
PCI_DEV_FLAG_PCIE_BRIDGE_ALIAS thing where a bridge takes ownership?)
> + * It does not currently support the ACS P2P Egress Control Vector, Linux does
> + * not yet have any way to enable this feature. EC will create subsets of the
> + * bus that are isolated from other subsets.
> + */
> +enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> +{
> + struct pci_dev *bridge = bus->self;
> + int type;
> +
> + /*
> + * This bus was created by pci_register_host_bridge(). The spec provides
> + * no way to tell what kind of bus this is, for PCIe we expect this to
> + * be internal to the root complex and not covered by any spec behavior.
> + * Linux has historically been optimistic about this bus and treated it
> + * as isolating. Given that the behavior of the root complex and the ACS
> + * behavior of RCiEP's is explicitly not specified we hope that the
> + * implementation is directing everything that reaches the root bus to
> + * the IOMMU.
> + */
> + if (pci_is_root_bus(bus))
> + return PCIE_ISOLATED;
> +
> + /*
> + * bus->self is only NULL for SRIOV VFs, it represents a "virtual" bus
> + * within Linux to hold any bus numbers consumed by VF RIDs. Caller must
> + * use pci_physfn() to get the bus for calling this function.
s/VF RIDs/VFs/ I think? I think we allocate these virtual bus
numbers when enabling the VFs.
> + if (WARN_ON(!bridge))
> + return PCI_BRIDGE_NON_ISOLATED;
> +
> + /*
> + * The bridge is not a PCIe bridge therefore this bus is PCI/PCI-X.
> + *
> + * PCI does not have anything like ACS. Any down stream device can bus
> + * master an address that any other downstream device can claim. No
> + * isolation is possible.
s/down stream device/downstream device/
I guess this comment applies to the !pci_is_pcie() branch below?
Maybe it should go inside the "if"?
> + if (!pci_is_pcie(bridge)) {
> + if (bridge->dev_flags & PCI_DEV_FLAG_PCIE_BRIDGE_ALIAS)
> + type = PCI_EXP_TYPE_PCI_BRIDGE;
> + else
> + return PCI_BRIDGE_NON_ISOLATED;
> + } else {
> + type = pci_pcie_type(bridge);
> + }
> +
> + switch (type) {
> + /*
> + * Since PCIe links are point to point root ports are isolated if there
> + * is no internal loopback to the root port's MMIO. Like MFDs assume if
> + * there is no ACS cap then there is no loopback.
> + */
> + case PCI_EXP_TYPE_ROOT_PORT:
> + if (bridge->acs_cap &&
> + !pci_acs_enabled(bridge, PCI_ACS_ISOLATED))
> + return PCIE_NON_ISOLATED;
> + return PCIE_ISOLATED;
> +
> + /*
> + * Since PCIe links are point to point a DSP is always considered
> + * isolated. The internal bus of the switch will be non-isolated if the
> + * DSP's have any ACS that allows upstream traffic to flow back
> + * downstream to any DSP, including back to this DSP or its MMIO.
> + */
> + case PCI_EXP_TYPE_DOWNSTREAM:
> + return PCIE_ISOLATED;
> +
> + /*
> + * bus is the interior bus of a PCI-E switch where ACS rules apply.
s/interior/internal/ to match use above
s/PCI-E/PCIe/
I'm not sure what this is saying. A USP can't have an ACS Capability
unless it's part of a multi-function device.
> + */
> + case PCI_EXP_TYPE_UPSTREAM:
> + return pcie_switch_isolated(bus);
> +
> + /*
> + * PCIe to PCI/PCI-X - this bus is PCI.
> + */
> + case PCI_EXP_TYPE_PCI_BRIDGE:
> + /*
> + * A PCIe express bridge will use the subordinate bus number
> + * with a 0 devfn as the RID in some cases. This causes all
> + * subordinate devfns to alias with 0, which is the same
> + * grouping as PCI_BUS_NON_ISOLATED. The RID of the bridge
> + * itself is only used by the bridge.
> + *
> + * However, if the bridge has MMIO then we will assume the MMIO
> + * is not isolated due to no ACS controls on this bridge type.
s/PCIe express/PCIe/
> + */
> + if (pci_has_mmio(bridge))
> + return PCI_BRIDGE_NON_ISOLATED;
> + return PCI_BUS_NON_ISOLATED;
> +
> + /*
> + * PCI/PCI-X to PCIe - this bus is PCIe. We already know there must be a
> + * PCI bus upstream of this bus, so just return non-isolated. If
> + * upstream is PCI-X the PCIe RID should be preserved, but for PCI the
> + * RID will be lost.
> + */
> + case PCI_EXP_TYPE_PCIE_BRIDGE:
> + return PCI_BRIDGE_NON_ISOLATED;
> +
> + default:
> + return PCI_BRIDGE_NON_ISOLATED;
> + }
> +}
> +
> static struct pci_bus *pci_do_find_bus(struct pci_bus *bus, unsigned char busnr)
> {
> struct pci_bus *child;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 59876de13860db..c36fff9d2254f8 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -855,6 +855,32 @@ struct pci_dynids {
> struct list_head list; /* For IDs added at runtime */
> };
>
> +enum pci_bus_isolation {
> + /*
> + * The bus is off a root port and the root port has isolated ACS flags
> + * or the bus is part of a PCIe switch and the switch has isolated ACS
> + * flags.
> + */
> + PCIE_ISOLATED,
> + /*
> + * The switch's DSP's are not isolated from each other but are isolated
> + * from the USP.
> + */
> + PCIE_SWITCH_DSP_NON_ISOLATED,
> + /* The above and the USP's MMIO is not isolated. */
> + PCIE_NON_ISOLATED,
> + /*
> + * A PCI/PCI-X bus, no isolation. This is like
> + * PCIE_SWITCH_DSP_NON_ISOLATED in that the upstream bridge is isolated
> + * from the bus. The bus itself may also have a shared alias of devfn=0.
> + */
> + PCI_BUS_NON_ISOLATED,
> + /*
> + * The above and the bridge's MMIO is not isolated and the bridge's RID
> + * may be an alias.
> + */
> + PCI_BRIDGE_NON_ISOLATED,
> +};
>
> /*
> * PCI Error Recovery System (PCI-ERS). If a PCI device driver provides
> @@ -1243,6 +1269,8 @@ struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
> struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
> struct pci_dev *pci_get_base_class(unsigned int class, struct pci_dev *from);
>
> +enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus);
> +
> int pci_dev_present(const struct pci_device_id *ids);
>
> int pci_bus_read_config_byte(struct pci_bus *bus, unsigned int devfn,
> @@ -2056,6 +2084,9 @@ static inline struct pci_dev *pci_get_base_class(unsigned int class,
> struct pci_dev *from)
> { return NULL; }
>
> +static inline enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> +{ return PCIE_NON_ISOLATED; }
> +
> static inline int pci_dev_present(const struct pci_device_id *ids)
> { return 0; }
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs
2025-09-09 13:31 ` Jason Gunthorpe
@ 2025-09-09 19:55 ` Donald Dutile
0 siblings, 0 replies; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 19:55 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 9/9/25 9:31 AM, Jason Gunthorpe wrote:
> On Tue, Sep 09, 2025 at 12:57:59AM -0400, Donald Dutile wrote:
>
>> ... and why was the above code done in patch 3 and then undone here to
>> use the reachable() support in patch 5 below, when patch 5 could be moved before
>> patch 3, and we just get to this final implementation, dropping (some of) patch 3?
>
> If you use that order then the switch stuff has to be done and redone :(
>
> I put it in this order because the switch change seems lower risk to
> me. Fewer people have switches in their system. While the MFD change
> on top is higher risk, even my simple consumer test systems hit
> troubles with it.
>
In 'my world' I see -lots- of switches in servers.
I don't disagree on the MFD being a higher risk, and more common across all systems.
> Jason
>
poe-tay-toe, poh-tah-toh... It gets to the end point needed.
Thanks for reasoning...
Reviewed-by: Donald Dutile <ddutile@redhat.com>

* Re: [PATCH v3 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups
2025-09-09 15:35 ` Jason Gunthorpe
@ 2025-09-09 19:58 ` Donald Dutile
0 siblings, 0 replies; 52+ messages in thread
From: Donald Dutile @ 2025-09-09 19:58 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 9/9/25 11:35 AM, Jason Gunthorpe wrote:
> On Tue, Sep 09, 2025 at 01:00:08AM -0400, Donald Dutile wrote:
>>
>>
>> On 9/5/25 2:06 PM, Jason Gunthorpe wrote:
>>> Directly check that the devices touched by pci_for_each_dma_alias() match
>>> the groups that were built by pci_device_group(). This helps validate that
>> Do they have to match, as in equal, or be included ?
>
> All aliases have to be in the same group, or have no group discovered yet.
>
I guess I'm not asking correctly, as I think you agreed, but I'm looking for a clearer statement.
You said 'in' the same group; that's not 'equal', or what I think of as a 'match' of the pci_device_group() & dma-alias.
So, is it an equality/match or an inclusion/inclusive check; if the latter, just tweak the verbiage.
> Jason
>
* Re: [PATCH v3 03/11] iommu: Compute iommu_groups properly for PCIe switches
2025-09-05 18:06 ` [PATCH v3 03/11] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
2025-09-09 4:14 ` Donald Dutile
@ 2025-09-09 20:27 ` Bjorn Helgaas
2025-09-09 21:21 ` Jason Gunthorpe
1 sibling, 1 reply; 52+ messages in thread
From: Bjorn Helgaas @ 2025-09-09 20:27 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, Donald Dutile, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Fri, Sep 05, 2025 at 03:06:18PM -0300, Jason Gunthorpe wrote:
> The current algorithm does not work if ACS is turned off, and it is not
> clear how this has been missed for so long. I think it has been avoided
> because the kernel command line options to target specific devices and
> disable ACS are rarely used.
>
> For discussion lets consider a simple topology like the below:
s/lets consider/consider/ (or "let's")
>
> -- DSP 02:00.0 -> End Point A
> Root 00:00.0 -> USP 01:00.0 --|
> -- DSP 02:03.0 -> End Point B
>
> If ACS is fully activated we expect 00:00.0, 01:00.0, 02:00.0, 02:03.0, A,
> B to all have unique single device groups.
>
> If both DSPs have ACS off then we expect 00:00.0 and 01:00.0 to have
> unique single device groups while 02:00.0, 02:03.0, A, B are part of one
> multi-device group.
>
> If the DSPs have asymmetric ACS, with one fully isolating and one
> non-isolating we also expect the above multi-device group result.
>
> Instead the current algorithm always creates unique single device groups
> for this topology. It happens because the pci_device_group(DSP)
> immediately moves to the USP and computes pci_acs_path_enabled(USP) ==
> true and decides the DSP can get a unique group. The pci_device_group(A)
> immediately moves to the DSP, sees pci_acs_path_enabled(DSP) == false and
> then takes the DSPs group.
s/takes the DSPs group/takes the DSP's group/ (I guess?)
> For root-ports a PCIe topology like:
s/root-ports/Root Ports/ (also various "root port" and "root complex"
spellings below that are typically capitalized in drivers/pci/)
> -- Dev 01:00.0
> Root 00:00.00 --- Root Port 00:01.0 --|
> | -- Dev 01:00.1
> |- Dev 00:17.0
>
> Previously would group [00:01.0, 01:00.0, 01:00.1] together if there is no
> ACS capability in the root port.
>
> While ACS on root ports is underspecified in the spec, it should still
> function as an egress control and limit access to either the MMIO of the
> root port itself, or perhaps some other devices upstream of the root
> complex - 00:17.0 perhaps in this example.
Does ACS have some kind of MMIO-specific restriction? Oh, I guess
this must be the "Memory Target Access Control" piece? (Added by the
upcoming patch 08/11).
> Historically the grouping in Linux has assumed the root port routes all
> traffic into the TA/IOMMU and never bypasses the TA to go to other
> functions in the root complex. Following the new understanding that ACS is
> required for internal loopback also treat root ports with no ACS
> capability as lacking internal loopback as well.
>
> The current algorithm has several issues:
>
> 1) It implicitly depends on ordering. Since the existing group discovery
> only goes in the upstream direction discovering a downstream device
> before its upstream will cause the wrong creation of narrower groups.
>
> 2) It assumes that if the path from the end point to the root is entirely
> ACS isolated then that end point is isolated. This misses cross-traffic
> in the asymmetric ACS case.
>
> 3) When evaluating a non-isolated DSP it does not check peer DSPs for an
> already established group unless the multi-function feature does it.
>
> 4) It does not understand the aliasing rule for PCIe to PCI bridges
> where the alias is to the subordinate bus. The bridge's RID on the
> primary bus is not aliased. This causes the PCIe to PCI bridge to be
> wrongly joined to the group with the downstream devices.
>
> As grouping is a security property for VFIO creating incorrectly narrowed
> groups is a security problem for the system.
I.e., we treated devices as being isolated from P2PDMA when they
actually were not isolated, right? More isolation => smaller
(narrower) IOMMU groups?
> Revise the design to solve these problems.
>
> Explicitly require ordering, or return EPROBE_DEFER if things are out of
> order. This avoids silent errors that created smaller groups and solves
> problem #1.
If it's easy to state, would be nice to say what ordering is required.
The issue mentioned above was "discovering a downstream device before
its upstream", so I guess you want to discover upstream devices before
downstream? Obviously PCI enumeration already works that way, so
IOMMU group discovery must be a little different.
> Work on busses, not devices. Isolation is a property of the bus, and the
> first non-isolated bus should form a group containing all devices
> downstream of that bus. If all busses on the path to an end device are
> isolated then the end device has a chance to make a single-device group.
>
> Use pci_bus_isolation() to compute the bus's isolation status based on the
> ACS flags and technology. pci_bus_isolation() touches a lot of PCI
> internals to get the information in the right format.
>
> Add a new flag in the iommu_group to record that the group contains a
> non-isolated bus. Any downstream pci_device_group() will see
> bus->self->iommu_group is non-isolated and unconditionally join it. This
> makes the first non-isolation apply to all downstream devices and solves
> problem #2
>
> The bus's non-isolated iommu_group will be stored in either the DSP of
> PCIe switch or the bus->self upstream device, depending on the situation.
> When storing in the DSP all the DSPs are checked first for a pre-existing
> non-isolated iommu_group. When stored in the upstream the flag forces it
> to all downstreams. This solves problem #3.
>
> Put the handling of end-device aliases and MFD into pci_get_alias_group()
> and only call it in cases where we have a fully isolated path. Otherwise
> every downstream device on the bus is going to be joined to the group of
> bus->self.
>
> Finally, replace the initial pci_for_each_dma_alias() with a combination
> of:
>
> - Directly checking pci_real_dma_dev() and enforcing ordering.
> The group should contain both pdev and pci_real_dma_dev(pdev) which is
> only possible if pdev is ordered after real_dma_dev. This solves a case
> of #1.
>
> - Indirectly relying on pci_bus_isolation() to report legacy PCI busses
> as non-isolated, with the enum including the distinction of the PCIe to
> PCI bridge being isolated from the downstream. This solves problem #4.
>
> It is very likely this is going to expand iommu_group membership in
> existing systems. After all that is the security bug that is being
> fixed. Expanding the iommu_groups risks problems for users using VFIO.
>
> The intention is to have a more accurate reflection of the security
> properties in the system and should be seen as a security fix. However
> people who have ACS disabled may now need to enable it. As such users may
> have had good reason for ACS to be disabled I strongly recommend that
> backporting of this also include the new config_acs option so that such
> users can potentially minimally enable ACS only where needed.
Minor nits below.
> +/* Return a group if the upstream hierarchy has isolation restrictions. */
> +static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
> +{
> + /*
> + * SRIOV functions may reside on a virtual bus, jump directly to the PFs
> + * bus in all cases.
> + */
> + struct pci_bus *bus = pci_physfn(pdev)->bus;
> + struct iommu_group *group;
> +
> + /* Nothing upstream of this */
> + if (pci_is_root_bus(bus))
> + return NULL;
> +
> + /*
> + * !self is only for SRIOV virtual busses which should have been
> + * excluded by pci_physfn()
> + */
> + if (WARN_ON(!bus->self))
> + return ERR_PTR(-EINVAL);
> +
> + group = iommu_group_get(&bus->self->dev);
> + if (!group) {
> + /*
> + * If the upstream bridge needs the same group as pdev then
> + * there is no way for it's pci_device_group() to discover it.
s/it's/its/
> + dev_err(&pdev->dev,
> + "PCI device is probing out of order, upstream bridge device of %s is not probed yet\n",
> + pci_name(bus->self));
> + return ERR_PTR(-EPROBE_DEFER);
> + }
> + if (group->bus_data & BUS_DATA_PCI_NON_ISOLATED)
> + return group;
> + iommu_group_put(group);
> + return NULL;
> +}
> +
> +/*
> + * For legacy PCI we have two main considerations when forming groups:
> + *
> + * 1) In PCI we can loose the RID inside the fabric, or some devices will use
> + * the wrong RID. The PCI core calls this aliasing, but from an IOMMU
> + * perspective it means that a PCI device may have multiple RIDs and a
> + * single RID may represent many PCI devices. This effectively means all the
> + * aliases must share a translation, thus group, because the IOMMU cannot
> + * tell devices apart.
s/loose/lose/
> + * 2) PCI permits a bus segment to claim an address even if the transaction
> + * originates from an end point not the CPU. When it happens it is called
> + * peer to peer. Claiming a transaction in the middle of the bus hierarchy
> + * bypasses the IOMMU translation. The IOMMU subsystem rules require these
> + * devices to be placed in the same group because they lack isolation from
> + * each other. In PCI Express the ACS system can be used to inhibit this and
> + * force transactions to go to the IOMMU.
> + *
> + * From a PCI perspective any given PCI bus is either isolating or
> + * non-isolating. Isolating means downstream originated transactions always
> + * progress toward the CPU and do not go to other devices on the bus
> + * segment, while non-isolating means downstream originated transactions can
> + * progress back downstream through another device on the bus segment.
> + *
> + * Beyond buses a multi-function device or bridge can also allow
> + * transactions to loop back internally from one function to another.
s/PCI Express/PCIe/ to match other usage?
Elsewhere in this series you use "busses". "Buses" is more common in
both drivers/pci and drivers/iommu.
> + *
> + * Once a PCI bus becomes non isolating the entire downstream hierarchy of
> + * that bus becomes a single group.
s/non isolating/non-isolating/ to match usage above
* Re: [PATCH v3 05/11] PCI: Add pci_reachable_set()
2025-09-05 18:06 ` [PATCH v3 05/11] PCI: Add pci_reachable_set() Jason Gunthorpe
@ 2025-09-09 21:03 ` Bjorn Helgaas
2025-09-10 16:13 ` Jason Gunthorpe
2025-09-11 19:56 ` Donald Dutile
0 siblings, 2 replies; 52+ messages in thread
From: Bjorn Helgaas @ 2025-09-09 21:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, Donald Dutile, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Fri, Sep 05, 2025 at 03:06:20PM -0300, Jason Gunthorpe wrote:
> Implement pci_reachable_set() to efficiently compute a set of devices on
> the same bus that are "reachable" from a starting device. The meaning of
> reachability is defined by the caller through a callback function.
>
> This is a faster implementation of the same logic in
> pci_device_group(). Being inside the PCI core allows use of pci_bus_sem so
> it can use list_for_each_entry() on a small list of devices instead of the
> expensive for_each_pci_dev(). Server systems can now have hundreds of PCI
> devices, but typically only a very small number of devices per bus.
>
> An example of a reachability function would be pci_devs_are_dma_aliases()
> which would compute a set of devices on the same bus that are
> aliases. This would also be useful in future support for the ACS P2P
> Egress Vector which has a similar reachability problem.
>
> This is effectively a graph algorithm where the set of devices on the bus
> are vertexes and the reachable() function defines the edges. It returns a
> set of vertexes that form a connected graph.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/pci/search.c | 90 ++++++++++++++++++++++++++++++++++++++++++++
> include/linux/pci.h | 12 ++++++
> 2 files changed, 102 insertions(+)
>
> diff --git a/drivers/pci/search.c b/drivers/pci/search.c
> index fe6c07e67cb8ce..dac6b042fd5f5d 100644
> --- a/drivers/pci/search.c
> +++ b/drivers/pci/search.c
> @@ -595,3 +595,93 @@ int pci_dev_present(const struct pci_device_id *ids)
> return 0;
> }
> EXPORT_SYMBOL(pci_dev_present);
> +
> +/**
> + * pci_reachable_set - Generate a bitmap of devices within a reachability set
> + * @start: First device in the set
> + * @devfns: The set of devices on the bus
@devfns is a return parameter, right? Maybe mention that somewhere?
And the fact that the set only includes the *reachable* devices on the
bus.
> + * @reachable: Callback to tell if two devices can reach each other
> + *
> + * Compute a bitmap where every set bit is a device on the bus that is reachable
> + * from the start device, including the start device. Reachability between two
> + * devices is determined by a callback function.
> + *
> + * This is a non-recursive implementation that invokes the callback once per
> + * pair. The callback must be commutative:
> + * reachable(a, b) == reachable(b, a)
> + * reachable() can form a cyclic graph:
> + * reachable(a,b) == reachable(b,c) == reachable(c,a) == true
> + *
> + * Since this function is limited to a single bus the largest set can be 256
> + * devices large.
> + */
> +void pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
> + bool (*reachable)(struct pci_dev *deva,
> + struct pci_dev *devb))
> +{
> + struct pci_reachable_set todo_devfns = {};
> + struct pci_reachable_set next_devfns = {};
> + struct pci_bus *bus = start->bus;
> + bool again;
> +
> + /* Assume devfn of all PCI devices is bounded by MAX_NR_DEVFNS */
> + static_assert(sizeof(next_devfns.devfns) * BITS_PER_BYTE >=
> + MAX_NR_DEVFNS);
> +
> + memset(devfns, 0, sizeof(devfns->devfns));
> + __set_bit(start->devfn, devfns->devfns);
> + __set_bit(start->devfn, next_devfns.devfns);
> +
> + down_read(&pci_bus_sem);
> + while (true) {
> + unsigned int devfna;
> + unsigned int i;
> +
> + /*
> + * For each device that hasn't been checked compare every
> + * device on the bus against it.
> + */
> + again = false;
> + for_each_set_bit(devfna, next_devfns.devfns, MAX_NR_DEVFNS) {
> + struct pci_dev *deva = NULL;
> + struct pci_dev *devb;
> +
> + list_for_each_entry(devb, &bus->devices, bus_list) {
> + if (devb->devfn == devfna)
> + deva = devb;
> +
> + if (test_bit(devb->devfn, devfns->devfns))
> + continue;
> +
> + if (!deva) {
> + deva = devb;
> + list_for_each_entry_continue(
> + deva, &bus->devices, bus_list)
> + if (deva->devfn == devfna)
> + break;
> + }
> +
> + if (!reachable(deva, devb))
> + continue;
> +
> + __set_bit(devb->devfn, todo_devfns.devfns);
> + again = true;
> + }
> + }
> +
> + if (!again)
> + break;
> +
> + /*
> + * Every new bit adds a new deva to check, reloop the whole
> + * thing. Expect this to be rare.
> + */
> + for (i = 0; i != ARRAY_SIZE(devfns->devfns); i++) {
> + devfns->devfns[i] |= todo_devfns.devfns[i];
> + next_devfns.devfns[i] = todo_devfns.devfns[i];
> + todo_devfns.devfns[i] = 0;
> + }
> + }
> + up_read(&pci_bus_sem);
> +}
> +EXPORT_SYMBOL_GPL(pci_reachable_set);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index fb9adf0562f8ef..21f6b20b487f8d 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -855,6 +855,10 @@ struct pci_dynids {
> struct list_head list; /* For IDs added at runtime */
> };
>
> +struct pci_reachable_set {
> + DECLARE_BITMAP(devfns, 256);
> +};
> +
> enum pci_bus_isolation {
> /*
> * The bus is off a root port and the root port has isolated ACS flags
> @@ -1269,6 +1273,9 @@ struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
> struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
> struct pci_dev *pci_get_base_class(unsigned int class, struct pci_dev *from);
>
> +void pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
> + bool (*reachable)(struct pci_dev *deva,
> + struct pci_dev *devb));
> enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus);
>
> int pci_dev_present(const struct pci_device_id *ids);
> @@ -2084,6 +2091,11 @@ static inline struct pci_dev *pci_get_base_class(unsigned int class,
> struct pci_dev *from)
> { return NULL; }
>
> +static inline void
> +pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
> + bool (*reachable)(struct pci_dev *deva, struct pci_dev *devb))
> +{ }
> +
> static inline enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> { return PCIE_NON_ISOLATED; }
>
> --
> 2.43.0
>
* Re: [PATCH v3 03/11] iommu: Compute iommu_groups properly for PCIe switches
2025-09-09 20:27 ` Bjorn Helgaas
@ 2025-09-09 21:21 ` Jason Gunthorpe
0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-09 21:21 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, Donald Dutile, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Tue, Sep 09, 2025 at 03:27:02PM -0500, Bjorn Helgaas wrote:
> > Instead the current algorithm always creates unique single device groups
> > for this topology. It happens because the pci_device_group(DSP)
> > immediately moves to the USP and computes pci_acs_path_enabled(USP) ==
> > true and decides the DSP can get a unique group. The pci_device_group(A)
> > immediately moves to the DSP, sees pci_acs_path_enabled(DSP) == false and
> > then takes the DSPs group.
>
> s/takes the DSPs group/takes the DSP's group/ (I guess?)
yeah
> > While ACS on root ports is underspecified in the spec, it should still
> > function as an egress control and limit access to either the MMIO of the
> > root port itself, or perhaps some other devices upstream of the root
> > complex - 00:17.0 perhaps in this example.
>
> Does ACS have some kind of MMIO-specific restriction?
I guess no, the text could be more generic here.
> > As grouping is a security property for VFIO creating incorrectly narrowed
> > groups is a security problem for the system.
>
> I.e., we treated devices as being isolated from P2PDMA when they
> actually were not isolated, right? More isolation => smaller
> (narrower) IOMMU groups?
Yes
> > Revise the design to solve these problems.
> >
> > Explicitly require ordering, or return EPROBE_DEFER if things are out of
> > order. This avoids silent errors that created smaller groups and solves
> > problem #1.
>
> If it's easy to state, would be nice to say what ordering is required.
> The issue mentioned above was "discovering a downstream device before
> its upstream", so I guess you want to discover upstream devices before
> downstream?
yes
> Obviously PCI enumeration already works that way, so
> IOMMU group discovery must be a little different.
iommu group discovery is driven off of iommu probing which can happen
in enough different ways that it needs to be checked.
I will fix the other notes
Thanks,
Jason
* Re: [PATCH v3 02/11] PCI: Add pci_bus_isolated()
2025-09-09 19:54 ` Bjorn Helgaas
@ 2025-09-09 21:21 ` Jason Gunthorpe
0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-09 21:21 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, Donald Dutile, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Tue, Sep 09, 2025 at 02:54:09PM -0500, Bjorn Helgaas wrote:
> On Fri, Sep 05, 2025 at 03:06:17PM -0300, Jason Gunthorpe wrote:
> > Prepare to move the calculation of the bus P2P isolation out of the iommu
> > code and into the PCI core. This allows using the faster list iteration
> > under the pci_bus_sem, and the code has a kinship with the logic in
> > pci_for_each_dma_alias().
> >
> > Bus isolation is the concept that drives the iommu_groups for the purposes
> > of VFIO. Stated simply, if device A can send traffic to device B then they
> > must be in the same group.
> >
> > Only PCIe provides isolation. The multi-drop electrical topology in
> > classical PCI allows any bus member to claim the transaction.
> >
> > In PCIe isolation comes out of ACS. If a PCIe Switch and Root Complex has
> > ACS flags that prevent peer to peer traffic and funnel all operations to
> > the IOMMU then devices can be isolated.
>
> I guess a device being isolated means that peer-to-peer Requests from
> a different bus can't reach it?
peer-to-peer requests from a different *device*
> Did you mean "Root Port" instead of "Root Complex"? Or are you
> assuming an ACS Capability in an RCRB? (I don't think Linux supports
> RCRBs, except maybe for CXL)
Can't really say; the interaction of ACS within the Root Port, Root
Complex and so on is not really fully specified. Something within the
Root Complex routes to the TA/IOMMU.
Linux has assumed, and with this series continues to assume, that the
Root Port routes to the TA, because we don't accumulate other Root
Complex devices into shared groups.
> > Multi-function devices also have an isolation concern with self loopback
> > between the functions, though pci_bus_isolated() does not deal with
> > devices.
>
> It looks like multi-function devices *can* implement ACS and can
> isolate functions from each other (PCIe r7.0, sec 6.12.1.2). But it
> sounds like we're ignoring peer-to-peer on the same bus for now and
> assuming devices on the same bus can't be isolated from each other?
Yes, MFDs have ACS, but no, it is not ignored for now. The iommu grouping
code has two parts, one for busses/switches and another for MFDs.
This couple of patches switches the busses/switches to use the new
mechanism and leaves MFD alone. Later patches correct the issues in
MFD as well. The above comment is trying to explain this split in the
patch series.
> If we ignore ACS on non-bridge multi-function devices, I think the
> only way to isolate things is bridge ACS that controls forwarding
> between buses.
Yes
> If everything on the bus must be in the same group, it
> makes sense for pci_bus_isolated() to take a pci_bus pointer and not
> deal with individual devices.
> Below, it seems like sometimes we refer to *buses* being isolated and
> other times *devices* (Root Port, Switch Port, Switch, etc), so I'm a
> little confused.
They are different things, and have different treatment. I've tried to
keep them separated by having code that works on busses and different
code that works on devices.
A bus is isolating if upstream travelling transactions reaching the
bus only go upstream.
A device is isolated if its bus and all upstream busses are isolating,
and the device itself has no internal loopback.
In this code it sometimes talks about the device in terms of a
bridge, USP, DSP, or Root Port. All of these are a bit special because
a upstream travelling transaction is permitted to internal loopback to
the bridge device without touching a bus.
So while the bridge device may be on an isolating bus, the bridge
device itself is not isolated from its downstream bus.
> > As a property of a bus, there are several positive cases:
> >
> > - The point to point "bus" on a physical PCIe link is isolated if the
> > bridge/root device has something preventing self-access to its own
> > MMIO.
> >
> > - A Root Port is usually isolated
> >
> > - A PCIe switch can be isolated if all it's Down Stream Ports have good
> > ACS flags
>
> I guess this is saying that a switch's internal bus is isolated if all
> the DSPs have the ACS flags we need?
Yes
> > pci_bus_isolated() implements these rules and returns an enum indicating
> > the level of isolation the bus has, with five possibilies:
> >
> > PCIE_ISOLATED: Traffic on this PCIE bus can not do any P2P.
>
> Is this saying that peer-to-peer Requests can't reach devices on this
> bus? Or Requests *from* this bus can only go to the IOMMU?
Transactions mastered on this bus travelling in the upstream direction
are only received by the upstream bridge and are never received by any
other device on the bus.
Equivalently: transactions reaching this bus travelling in the upstream
direction continue upstream and never go back downstream.
I would not use the words 'from' or 'to' when talking about busses;
they don't originate or terminate transactions, they are just
transports.
> > + * pci_bus_isolated() does not consider loopback internal to devices, like
> > + * multi-function devices performing a self-loopback. The caller must check
> > + * this separately. It does not considering alasing within the bus.
>
> s/alasing/aliasing/ (I guess this refers to the
> PCI_DEV_FLAG_PCIE_BRIDGE_ALIAS thing where a bridge takes ownership?)
Yes
> > + /*
> > + * bus->self is only NULL for SRIOV VFs, it represents a "virtual" bus
> > + * within Linux to hold any bus numbers consumed by VF RIDs. Caller must
> > + * use pci_physfn() to get the bus for calling this function.
>
> s/VF RIDs/VFs/ I think? I think we allocate these virtual bus
> numbers when enabling the VFs.
Maybe BDF instead of RID
> > + /*
> > + * bus is the interior bus of a PCI-E switch where ACS rules apply.
>
> s/interior/internal/ to match use above
> s/PCI-E/PCIe/
>
> I'm not sure what this is saying. A USP can't have an ACS Capability
> unless it's part of a multi-function device.
So it can have ACS :)
Not sure what is unclear?
I will fix up all the notes
Thanks,
Jason
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs
2025-09-05 18:06 ` [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs Jason Gunthorpe
2025-09-09 4:57 ` Donald Dutile
@ 2025-09-09 21:24 ` Bjorn Helgaas
2025-09-09 23:20 ` Jason Gunthorpe
2025-09-10 1:59 ` Donald Dutile
1 sibling, 2 replies; 52+ messages in thread
From: Bjorn Helgaas @ 2025-09-09 21:24 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, Donald Dutile, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Fri, Sep 05, 2025 at 03:06:21PM -0300, Jason Gunthorpe wrote:
> Like with switches the current MFD algorithm does not consider asymmetric
> ACS within a MFD. If any MFD function has ACS that permits P2P the spec
> says it can reach through the MFD internal loopback any other function in
> the device.
>
> For discussion let's consider a simple MFD topology like the below:
>
> -- MFD 00:1f.0 ACS != REQ_ACS_FLAGS
> Root 00:00.00 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS
> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
REQ_ACS_FLAGS was renamed in an earlier patch.
I don't quite understand the "Root 00:00.00" notation. I guess it
must refer to the root bus 00? It looks sort of like a bridge, and
the ".00" makes it look sort of like a bus/device/function address,
but I don't think it is.
> This asymmetric ACS could be created using the config_acs kernel command
> line parameter, from quirks, or from a poorly thought out device that has
> ACS flags only on some functions.
>
> Since ACS is an egress property the asymmetric flags allow for 00:1f.0 to
> do memory acesses into 00:1f.6's BARs, but 00:1f.6 cannot reach any other
> function. Thus we expect an iommu_group to contain all three
> devices. Instead the current algorithm gives a group of [1f.0, 1f.2] and a
> single device group of 1f.6.
>
> The current algorithm sees the good ACS flags on 00:1f.6 and does not
> consider ACS on any other MFD functions.
>
> For path properties the ACS flags say that 00:1f.6 is safe to use with
> PASID and supports SVA as it will not have any portions of its address
> space routed away from the IOMMU, this part of the ACS system is working
> correctly.
>
> Further, if one of the MFD functions is a bridge, eg like 1f.2:
>
> -- MFD 00:1f.0
> Root 00:00.00 --|- MFD 00:1f.2 Root Port --- 01:01.0
> |- MFD 00:1f.6
Same question.
> Then the correct grouping will include 01:01.0, 00:1f.0/2/6 together in a
> group if there is any internal loopback within the MFD 00:1f. The current
> algorithm does not understand this and gives 01:01.0 it's own group even
> if it thinks there is an internal loopback in the MFD.
s/it's/its/
> Unfortunately this detail makes it hard to fix. Currently the code assumes
> that any MFD without an ACS cap has an internal loopback which will cause
> a large number of modern real systems to group in a pessimistic way.
>
> However, the PCI spec does not really support this:
>
> PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and Multi-Function
> Devices
>
> ACS P2P Request Redirect: must be implemented by Functions that
> support peer-to-peer traffic with other Functions.
I would include the PCIe r7.0 spec revision, even though the PCI SIG
seems to try to preserve section numbers across revisions.
It seems pretty clear that Multi-Function Devices that have an ACS
Capability and support peer-to-peer traffic with other Functions are
required to implement ACS P2P Request Redirect.
> Meaning from a spec perspective the absence of ACS indicates the absence
> of internal loopback. Granted I think we are aware of older real devices
> that ignore this, but it seems to be the only way forward.
It's not as clear to me that Multi-Function Devices that support
peer-to-peer traffic are required to have an ACS Capability at all.
Alex might remember more, but I kind of suspect the current system of
quirks is there because of devices that do internal loopback but have
no ACS Capability.
> So, rely on 6.12.1.2 and assume functions without ACS do not have internal
> loopback. This resolves the common issue with modern systems and MFD root
> ports, but it makes the ACS quirks system less used. Instead we'd want
> quirks that say self-loopback is actually present, not like today's quirks
> that say it is absent. This is surely negative for older hardware, but
> positive for new HW that complies with the spec.
>
> Use pci_reachable_set() in pci_device_group() to make the resulting
> algorithm faster and easier to understand.
>
> Add pci_mfds_are_same_group() which specifically looks pair-wise at all
> functions in the MFDs. Any function with ACS capabilities and non-isolated
> aCS flags will become reachable to all other functions.
s/aCS/ACS/
> pci_reachable_set() does the calculations for figuring out the set of
> devices under the pci_bus_sem, which is better than repeatedly searching
> across all PCI devices.
>
> Once the set of devices is determined and the set has more than one device
> use pci_get_slot() to search for any existing groups in the reachable set.
>
> Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/iommu/iommu.c | 189 +++++++++++++++++++-----------------------
> 1 file changed, 87 insertions(+), 102 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 543d6347c0e5e3..fc3c71b243a850 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1413,85 +1413,6 @@ int iommu_group_id(struct iommu_group *group)
> }
> EXPORT_SYMBOL_GPL(iommu_group_id);
>
> -static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> - unsigned long *devfns);
> -
> -/*
> - * For multifunction devices which are not isolated from each other, find
> - * all the other non-isolated functions and look for existing groups. For
> - * each function, we also need to look for aliases to or from other devices
> - * that may already have a group.
> - */
> -static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
> - unsigned long *devfns)
> -{
> - struct pci_dev *tmp = NULL;
> - struct iommu_group *group;
> -
> - if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
> - return NULL;
> -
> - for_each_pci_dev(tmp) {
> - if (tmp == pdev || tmp->bus != pdev->bus ||
> - PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) ||
> - pci_acs_enabled(tmp, PCI_ACS_ISOLATED))
> - continue;
> -
> - group = get_pci_alias_group(tmp, devfns);
> - if (group) {
> - pci_dev_put(tmp);
> - return group;
> - }
> - }
> -
> - return NULL;
> -}
> -
> -/*
> - * Look for aliases to or from the given device for existing groups. DMA
> - * aliases are only supported on the same bus, therefore the search
> - * space is quite small (especially since we're really only looking at pcie
> - * device, and therefore only expect multiple slots on the root complex or
> - * downstream switch ports). It's conceivable though that a pair of
> - * multifunction devices could have aliases between them that would cause a
> - * loop. To prevent this, we use a bitmap to track where we've been.
> - */
> -static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> - unsigned long *devfns)
> -{
> - struct pci_dev *tmp = NULL;
> - struct iommu_group *group;
> -
> - if (test_and_set_bit(pdev->devfn & 0xff, devfns))
> - return NULL;
> -
> - group = iommu_group_get(&pdev->dev);
> - if (group)
> - return group;
> -
> - for_each_pci_dev(tmp) {
> - if (tmp == pdev || tmp->bus != pdev->bus)
> - continue;
> -
> - /* We alias them or they alias us */
> - if (pci_devs_are_dma_aliases(pdev, tmp)) {
> - group = get_pci_alias_group(tmp, devfns);
> - if (group) {
> - pci_dev_put(tmp);
> - return group;
> - }
> -
> - group = get_pci_function_alias_group(tmp, devfns);
> - if (group) {
> - pci_dev_put(tmp);
> - return group;
> - }
> - }
> - }
> -
> - return NULL;
> -}
> -
> /*
> * Generic device_group call-back function. It just allocates one
> * iommu-group per device.
> @@ -1534,44 +1455,108 @@ static struct iommu_group *pci_group_alloc_non_isolated(void)
> return group;
> }
>
> +/*
> + * All functions in the MFD need to be isolated from each other and get their
> + * own groups, otherwise the whole MFD will share a group.
> + */
> +static bool pci_mfds_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
> +{
> + /*
> + * SRIOV VFs will use the group of the PF if it has
> + * BUS_DATA_PCI_NON_ISOLATED. We don't support VFs that also have ACS
> + * that are set to non-isolating.
"SR-IOV" is more typical in drivers/pci/.
> + */
> + if (deva->is_virtfn || devb->is_virtfn)
> + return false;
> +
> + /* Are deva/devb functions in the same MFD? */
> + if (PCI_SLOT(deva->devfn) != PCI_SLOT(devb->devfn))
> + return false;
> + /* Don't understand what is happening, be conservative */
> + if (deva->multifunction != devb->multifunction)
> + return true;
> + if (!deva->multifunction)
> + return false;
> +
> + /*
> + * PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and
PCIe r7.0, sec 6.12.1.2
> + * Multi-Function Devices
> + * ...
> + * ACS P2P Request Redirect: must be implemented by Functions that
> + * support peer-to-peer traffic with other Functions.
> + *
> + * Therefore assume if a MFD has no ACS capability then it does not
> + * support a loopback. This is the reverse of what Linux <= v6.16
> + * assumed - that any MFD was capable of P2P and used quirks identify
> + * devices that complied with the above.
> + */
> + if (deva->acs_cap && !pci_acs_enabled(deva, PCI_ACS_ISOLATED))
> + return true;
> + if (devb->acs_cap && !pci_acs_enabled(devb, PCI_ACS_ISOLATED))
> + return true;
> + return false;
> +}
> +
> +static bool pci_devs_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
> +{
> + /*
> + * This is allowed to return cycles: a,b -> b,c -> c,a can be aliases.
> + */
> + if (pci_devs_are_dma_aliases(deva, devb))
> + return true;
> +
> + return pci_mfds_are_same_group(deva, devb);
> +}
> +
> /*
> * Return a group if the function has isolation restrictions related to
> * aliases or MFD ACS.
> */
> static struct iommu_group *pci_get_function_group(struct pci_dev *pdev)
> {
> - struct iommu_group *group;
> - DECLARE_BITMAP(devfns, 256) = {};
> + struct pci_reachable_set devfns;
> + const unsigned int NR_DEVFNS = sizeof(devfns.devfns) * BITS_PER_BYTE;
> + unsigned int devfn;
>
> /*
> - * Look for existing groups on device aliases. If we alias another
> - * device or another device aliases us, use the same group.
> + * Look for existing groups on device aliases and multi-function ACS. If
> + * we alias another device or another device aliases us, use the same
> + * group.
> + *
> + * pci_reachable_set() should return the same bitmap if called for any
> + * device in the set and we want all devices in the set to have the same
> + * group.
> */
> - group = get_pci_alias_group(pdev, devfns);
> - if (group)
> - return group;
> + pci_reachable_set(pdev, &devfns, pci_devs_are_same_group);
> + /* start is known to have iommu_group_get() == NULL */
> + __clear_bit(pdev->devfn, devfns.devfns);
>
> /*
> - * Look for existing groups on non-isolated functions on the same
> - * slot and aliases of those funcions, if any. No need to clear
> - * the search bitmap, the tested devfns are still valid.
> - */
> - group = get_pci_function_alias_group(pdev, devfns);
> - if (group)
> - return group;
> -
> - /*
> - * When MFD's are included in the set due to ACS we assume that if ACS
> - * permits an internal loopback between functions it also permits the
> - * loopback to go downstream if a function is a bridge.
> + * When MFD functions are included in the set due to ACS we assume that
> + * if ACS permits an internal loopback between functions it also permits
> + * the loopback to go downstream if any function is a bridge.
> *
> * It is less clear what aliases mean when applied to a bridge. For now
> * be conservative and also propagate the group downstream.
> */
> - __clear_bit(pdev->devfn & 0xFF, devfns);
> - if (!bitmap_empty(devfns, sizeof(devfns) * BITS_PER_BYTE))
> - return pci_group_alloc_non_isolated();
> - return NULL;
> + if (bitmap_empty(devfns.devfns, NR_DEVFNS))
> + return NULL;
> +
> + for_each_set_bit(devfn, devfns.devfns, NR_DEVFNS) {
> + struct iommu_group *group;
> + struct pci_dev *pdev_slot;
> +
> + pdev_slot = pci_get_slot(pdev->bus, devfn);
> + group = iommu_group_get(&pdev_slot->dev);
> + pci_dev_put(pdev_slot);
> + if (group) {
> + if (WARN_ON(!(group->bus_data &
> + BUS_DATA_PCI_NON_ISOLATED)))
> + group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
> + return group;
> + }
> + }
> + return pci_group_alloc_non_isolated();
> }
>
> /* Return a group if the upstream hierarchy has isolation restrictions. */
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
2025-09-05 18:06 ` [PATCH v3 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
2025-09-09 5:02 ` Donald Dutile
@ 2025-09-09 21:43 ` Bjorn Helgaas
2025-09-10 17:34 ` Jason Gunthorpe
1 sibling, 1 reply; 52+ messages in thread
From: Bjorn Helgaas @ 2025-09-09 21:43 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, Donald Dutile, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Fri, Sep 05, 2025 at 03:06:25PM -0300, Jason Gunthorpe wrote:
> Switches ignore the PASID when routing TLPs. This means the path from the
> PASID issuing end point to the IOMMU must be direct with no possibility
> for another device to claim the addresses.
>
> This is done using ACS flags and pci_enable_pasid() checks for this.
>
> The new ACS Enhanced bits clarify some undefined behaviors in the spec
> around what P2P Request Redirect means.
>
> Linux has long assumed that PCI_ACS_RR implies PCI_ACS_DSP_MT_RR |
> PCI_ACS_USP_MT_RR | PCI_ACS_UNCLAIMED_RR.
>
> If the device supports ACS Enhanced then use the information it reports to
> determine if PASID SVA is supported or not.
>
> PCI_ACS_DSP_MT_RR: Prevents Downstream Port BAR's from claiming upstream
> flowing transactions
>
> PCI_ACS_USP_MT_RR: Prevents Upstream Port BAR's from claiming upstream
> flowing transactions
s/BAR's/BARs/ (no possession here)
> PCI_ACS_UNCLAIMED_RR: Prevents a hole in the USP bridge window compared
> to all the DSP bridge windows from generating a
> error.
>
> Each of these cases would poke a hole in the PASID address space which is
> not permitted.
>
> Enhance the comments around pci_acs_flags_enabled() to better explain the
> reasoning for its logic. Continue to take the approach of assuming the
> device is doing the "right ACS" if it does not explicitly declare
> otherwise.
>
> Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/pci/ats.c | 4 +++-
> drivers/pci/pci.c | 54 +++++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 50 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
> index ec6c8dbdc5e9c9..00603c2c4ff0ea 100644
> --- a/drivers/pci/ats.c
> +++ b/drivers/pci/ats.c
> @@ -416,7 +416,9 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
> if (!pasid)
> return -EINVAL;
>
> - if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
> + if (!pci_acs_path_enabled(pdev, NULL,
> + PCI_ACS_RR | PCI_ACS_UF | PCI_ACS_USP_MT_RR |
> + PCI_ACS_DSP_MT_RR | PCI_ACS_UNCLAIMED_RR))
> return -EINVAL;
>
> pci_read_config_word(pdev, pasid + PCI_PASID_CAP, &supported);
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 983f71211f0055..620b7f79093854 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3606,6 +3606,52 @@ void pci_configure_ari(struct pci_dev *dev)
> }
> }
>
> +
> +/*
> + * The spec is not clear what it means if the capability bit is 0. One view is
> + * that the device acts as though the ctrl bit is zero, another view is the
> + * device behavior is undefined.
> + *
> + * Historically Linux has taken the position that the capability bit as 0 means
> + * the device supports the most favorable interpretation of the spec - ie that
> + * things like P2P RR are always on. As this is security sensitive we expect
> + * devices that do not follow this rule to be quirked.
Interpreting a 0 Capability bit, i.e., per spec "the component does
not implement the feature", as "the component behaves as though the
feature is always enabled" sounds like a stretch to me.
The point of ACS is to prevent things that are normally allowed, so if
a component doesn't implement an ACS feature, my first guess would be
that the component doesn't prevent the relevant behavior.
Sounds like a mess and might be worth an ECR to clarify the spec.
"Most favorable interpretation of the spec" feels a little ambiguous
to me since it assumes something about what we consider "favorable".
> + * ACS Enhanced eliminated undefined areas of the spec around MMIO in root ports
> + * and switch ports. If those ports have no MMIO then it is not relavent.
> + * PCI_ACS_UNCLAIMED_RR eliminates the undefined area around an upstream switch
> + * window that is not fully decoded by the downstream windows.
s/relavent/relevant/
> + * This takes the same approach with ACS Enhanced, if the device does not
> + * support it then we assume the ACS P2P RR has all the enhanced behaviors too.
> + *
> + * Due to ACS Enhanced bits being force set to 0 by older Linux kernels, and
> + * those values would break old kernels on the edge cases they cover, the only
> + * compatible thing for a new device to implement is ACS Enhanced supported with
> + * the control bits (except PCI_ACS_IORB) wired to follow ACS_RR.
> + */
> +static u16 pci_acs_ctrl_mask(struct pci_dev *pdev, u16 hw_cap)
> +{
> + /*
> + * Egress Control enables use of the Egress Control Vector which is not
> + * present without the cap.
> + */
> + u16 mask = PCI_ACS_EC;
> +
> + mask = hw_cap & (PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
> + PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
> +
> + /*
> + * If ACS Enhanced is supported the device reports what it is doing
> + * through these bits which may not be settable.
> + */
> + if (hw_cap & PCI_ACS_ENHANCED)
> + mask |= PCI_ACS_IORB | PCI_ACS_DSP_MT_RB | PCI_ACS_DSP_MT_RR |
> + PCI_ACS_USP_MT_RB | PCI_ACS_USP_MT_RR |
> + PCI_ACS_UNCLAIMED_RR;
> + return mask;
> +}
> +
> static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
> {
> int pos;
> @@ -3615,15 +3661,9 @@ static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
> if (!pos)
> return false;
>
> - /*
> - * Except for egress control, capabilities are either required
> - * or only required if controllable. Features missing from the
> - * capability field can therefore be assumed as hard-wired enabled.
> - */
> pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
> - acs_flags &= (cap | PCI_ACS_EC);
> -
> pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> + acs_flags &= pci_acs_ctrl_mask(pdev, cap);
> return (ctrl & acs_flags) == acs_flags;
> }
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs
2025-09-09 21:24 ` Bjorn Helgaas
@ 2025-09-09 23:20 ` Jason Gunthorpe
2025-09-10 1:59 ` Donald Dutile
1 sibling, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-09 23:20 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, Donald Dutile, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Tue, Sep 09, 2025 at 04:24:57PM -0500, Bjorn Helgaas wrote:
> >
> > -- MFD 00:1f.0 ACS != REQ_ACS_FLAGS
> > Root 00:00.00 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS
> > |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
>
> REQ_ACS_FLAGS was renamed in an earlier patch.
>
> I don't quite understand the "Root 00:00.00" notation. I guess it
> must refer to the root bus 00? It looks sort of like a bridge, and
> the ".00" makes it look sort of like a bus/device/function address,
> but I don't think it is.
Call it the host bridge or whatever is creating the bus segment; it
doesn't actually matter for the examples.
> > However, the PCI spec does not really support this:
> >
> > PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and Multi-Function
> > Devices
> >
> > ACS P2P Request Redirect: must be implemented by Functions that
> > support peer-to-peer traffic with other Functions.
>
> I would include the PCIe r7.0 spec revision, even though the PCI SIG
> seems to try to preserve section numbers across revisions.
>
> It seems pretty clear that Multi-Function Devices that have an ACS
> Capability and support peer-to-peer traffic with other Functions are
> required to implement ACS P2P Request Redirect.
>
> > Meaning from a spec perspective the absence of ACS indicates the absence
> > of internal loopback. Granted I think we are aware of older real devices
> > that ignore this, but it seems to be the only way forward.
>
> It's not as clear to me that Multi-Function Devices that support
> peer-to-peer traffic are required to have an ACS Capability at all.
How do you read it that way?
6.12.1.1 is reasonably clear that "This section applies to Root Ports
and Switch Downstream Ports that implement an ACS Extended Capability
structure."
While 6.12.1.2 is less so: "This section applies to Multi-Function
Device ACS Functions"
I don't know what the author's intent was, but I have a hard time
reading the "must be implemented" as optional..
Frankly PCI SIG has made a mess here :(
> Alex might remember more, but I kind of suspect the current system of
> quirks is there because of devices that do internal loopback but have
> no ACS Capability.
This is correct; there are a few cases where it was confirmed that
internal loopback exists with no ACS capability.
But mostly we have haphazardly added ACS quirks on demand whenever
someone was annoyed by what the current algorithm did. Most of the
investigations seem to have determined there is no actual loopback
suggesting people are reading the spec as above.
So, I don't see how to make it workable to default to assuming that
most compliant systems require quirks. Effectively this is a proposal
to invert that and only quirk the devices we know have internal
loopback without ACS.
I will fix the other remarks
Thanks,
Jason
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs
2025-09-09 21:24 ` Bjorn Helgaas
2025-09-09 23:20 ` Jason Gunthorpe
@ 2025-09-10 1:59 ` Donald Dutile
2025-09-10 17:43 ` Jason Gunthorpe
1 sibling, 1 reply; 52+ messages in thread
From: Donald Dutile @ 2025-09-10 1:59 UTC (permalink / raw)
To: Bjorn Helgaas, Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 9/9/25 5:24 PM, Bjorn Helgaas wrote:
> On Fri, Sep 05, 2025 at 03:06:21PM -0300, Jason Gunthorpe wrote:
>> Like with switches the current MFD algorithm does not consider asymmetric
>> ACS within a MFD. If any MFD function has ACS that permits P2P the spec
>> says it can reach through the MFD internal loopback any other function in
>> the device.
>>
>> For discussion let's consider a simple MFD topology like the below:
>>
>> -- MFD 00:1f.0 ACS != REQ_ACS_FLAGS
>> Root 00:00.00 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS
>> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
>
> REQ_ACS_FLAGS was renamed in an earlier patch.
>
> I don't quite understand the "Root 00:00.00" notation. I guess it
> must refer to the root bus 00? It looks sort of like a bridge, and
> the ".00" makes it look sort of like a bus/device/function address,
> but I don't think it is.
>
>> This asymmetric ACS could be created using the config_acs kernel command
>> line parameter, from quirks, or from a poorly thought out device that has
>> ACS flags only on some functions.
>>
>> Since ACS is an egress property the asymmetric flags allow for 00:1f.0 to
>> do memory acesses into 00:1f.6's BARs, but 00:1f.6 cannot reach any other
>> function. Thus we expect an iommu_group to contain all three
>> devices. Instead the current algorithm gives a group of [1f.0, 1f.2] and a
>> single device group of 1f.6.
>>
>> The current algorithm sees the good ACS flags on 00:1f.6 and does not
>> consider ACS on any other MFD functions.
>>
>> For path properties the ACS flags say that 00:1f.6 is safe to use with
>> PASID and supports SVA as it will not have any portions of its address
>> space routed away from the IOMMU, this part of the ACS system is working
>> correctly.
>>
>> Further, if one of the MFD functions is a bridge, eg like 1f.2:
>>
>> -- MFD 00:1f.0
>> Root 00:00.00 --|- MFD 00:1f.2 Root Port --- 01:01.0
>> |- MFD 00:1f.6
>
> Same question.
>
>> Then the correct grouping will include 01:01.0, 00:1f.0/2/6 together in a
>> group if there is any internal loopback within the MFD 00:1f. The current
>> algorithm does not understand this and gives 01:01.0 it's own group even
>> if it thinks there is an internal loopback in the MFD.
>
> s/it's/its/
>
>> Unfortunately this detail makes it hard to fix. Currently the code assumes
>> that any MFD without an ACS cap has an internal loopback which will cause
>> a large number of modern real systems to group in a pessimistic way.
>>
>> However, the PCI spec does not really support this:
>>
>> PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and Multi-Function
>> Devices
>>
>> ACS P2P Request Redirect: must be implemented by Functions that
>> support peer-to-peer traffic with other Functions.
>
> I would include the PCIe r7.0 spec revision, even though the PCI SIG
> seems to try to preserve section numbers across revisions.
>
> It seems pretty clear that Multi-Function Devices that have an ACS
> Capability and support peer-to-peer traffic with other Functions are
> required to implement ACS P2P Request Redirect.
>
>> Meaning from a spec perspective the absence of ACS indicates the absence
>> of internal loopback. Granted I think we are aware of older real devices
>> that ignore this, but it seems to be the only way forward.
>
> It's not as clear to me that Multi-Function Devices that support
> peer-to-peer traffic are required to have an ACS Capability at all.
>
> Alex might remember more, but I kind of suspect the current system of
> quirks is there because of devices that do internal loopback but have
> no ACS Capability.
>
and they are quirks because they violated the spec: they are supposed
to have an ACS Cap if they can do internal loopback P2P DMA.
I'm assuming the current system of quirks impacts the groups and/or
reachability, such that the quirks are accounted for and that
>> So, rely on 6.12.1.2 and assume functions without ACS do not have internal
>> loopback. This resolves the common issue with modern systems and MFD root
>> ports, but it makes the ACS quirks system less used. Instead we'd want
>> quirks that say self-loopback is actually present, not like today's quirks
>> that say it is absent. This is surely negative for older hardware, but
>> positive for new HW that complies with the spec.
>>
>> Use pci_reachable_set() in pci_device_group() to make the resulting
>> algorithm faster and easier to understand.
>>
>> Add pci_mfds_are_same_group() which specifically looks pair-wise at all
>> functions in the MFDs. Any function with ACS capabilities and non-isolated
>> aCS flags will become reachable to all other functions.
>
> s/aCS/ACS/
>
>> pci_reachable_set() does the calculations for figuring out the set of
>> devices under the pci_bus_sem, which is better than repeatedly searching
>> across all PCI devices.
>>
>> Once the set of devices is determined and the set has more than one device
>> use pci_get_slot() to search for any existing groups in the reachable set.
>>
>> Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface")
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> ---
>> drivers/iommu/iommu.c | 189 +++++++++++++++++++-----------------------
>> 1 file changed, 87 insertions(+), 102 deletions(-)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 543d6347c0e5e3..fc3c71b243a850 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -1413,85 +1413,6 @@ int iommu_group_id(struct iommu_group *group)
>> }
>> EXPORT_SYMBOL_GPL(iommu_group_id);
>>
>> -static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
>> - unsigned long *devfns);
>> -
>> -/*
>> - * For multifunction devices which are not isolated from each other, find
>> - * all the other non-isolated functions and look for existing groups. For
>> - * each function, we also need to look for aliases to or from other devices
>> - * that may already have a group.
>> - */
>> -static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
>> - unsigned long *devfns)
>> -{
>> - struct pci_dev *tmp = NULL;
>> - struct iommu_group *group;
>> -
>> - if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
>> - return NULL;
>> -
>> - for_each_pci_dev(tmp) {
>> - if (tmp == pdev || tmp->bus != pdev->bus ||
>> - PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) ||
>> - pci_acs_enabled(tmp, PCI_ACS_ISOLATED))
>> - continue;
>> -
>> - group = get_pci_alias_group(tmp, devfns);
>> - if (group) {
>> - pci_dev_put(tmp);
>> - return group;
>> - }
>> - }
>> -
>> - return NULL;
>> -}
>> -
>> -/*
>> - * Look for aliases to or from the given device for existing groups. DMA
>> - * aliases are only supported on the same bus, therefore the search
>> - * space is quite small (especially since we're really only looking at pcie
>> - * device, and therefore only expect multiple slots on the root complex or
>> - * downstream switch ports). It's conceivable though that a pair of
>> - * multifunction devices could have aliases between them that would cause a
>> - * loop. To prevent this, we use a bitmap to track where we've been.
>> - */
>> -static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
>> - unsigned long *devfns)
>> -{
>> - struct pci_dev *tmp = NULL;
>> - struct iommu_group *group;
>> -
>> - if (test_and_set_bit(pdev->devfn & 0xff, devfns))
>> - return NULL;
>> -
>> - group = iommu_group_get(&pdev->dev);
>> - if (group)
>> - return group;
>> -
>> - for_each_pci_dev(tmp) {
>> - if (tmp == pdev || tmp->bus != pdev->bus)
>> - continue;
>> -
>> - /* We alias them or they alias us */
>> - if (pci_devs_are_dma_aliases(pdev, tmp)) {
>> - group = get_pci_alias_group(tmp, devfns);
>> - if (group) {
>> - pci_dev_put(tmp);
>> - return group;
>> - }
>> -
>> - group = get_pci_function_alias_group(tmp, devfns);
>> - if (group) {
>> - pci_dev_put(tmp);
>> - return group;
>> - }
>> - }
>> - }
>> -
>> - return NULL;
>> -}
>> -
>> /*
>> * Generic device_group call-back function. It just allocates one
>> * iommu-group per device.
>> @@ -1534,44 +1455,108 @@ static struct iommu_group *pci_group_alloc_non_isolated(void)
>> return group;
>> }
>>
>> +/*
>> + * All functions in the MFD need to be isolated from each other and get their
>> + * own groups, otherwise the whole MFD will share a group.
>> + */
>> +static bool pci_mfds_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
>> +{
>> + /*
>> + * SRIOV VFs will use the group of the PF if it has
>> + * BUS_DATA_PCI_NON_ISOLATED. We don't support VFs that also have ACS
>> + * that are set to non-isolating.
>
> "SR-IOV" is more typical in drivers/pci/.
>
>> + */
>> + if (deva->is_virtfn || devb->is_virtfn)
>> + return false;
>> +
>> + /* Are deva/devb functions in the same MFD? */
>> + if (PCI_SLOT(deva->devfn) != PCI_SLOT(devb->devfn))
>> + return false;
>> + /* Don't understand what is happening, be conservative */
>> + if (deva->multifunction != devb->multifunction)
>> + return true;
>> + if (!deva->multifunction)
>> + return false;
>> +
>> + /*
>> + * PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and
>
> PCIe r7.0, sec 6.12.1.2
>
>> + * Multi-Function Devices
>> + * ...
>> + * ACS P2P Request Redirect: must be implemented by Functions that
>> + * support peer-to-peer traffic with other Functions.
>> + *
>> + * Therefore assume if a MFD has no ACS capability then it does not
>> + * support a loopback. This is the reverse of what Linux <= v6.16
>> + * assumed - that any MFD was capable of P2P and used quirks to identify
>> + * devices that complied with the above.
>> + */
>> + if (deva->acs_cap && !pci_acs_enabled(deva, PCI_ACS_ISOLATED))
>> + return true;
>> + if (devb->acs_cap && !pci_acs_enabled(devb, PCI_ACS_ISOLATED))
>> + return true;
>> + return false;
>> +}
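[Editor's aside: the rule above can be sketched in userspace with toy stand-ins for the pci_dev fields. All names and the struct layout here are illustrative, not the kernel's.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch of the MFD grouping decision; struct fields are
 * stand-ins for pci_dev fields, not the real kernel layout.
 */
struct toy_dev {
	bool is_virtfn;		/* SR-IOV VF */
	int slot;
	bool multifunction;
	bool acs_cap;		/* has an ACS capability */
	bool acs_isolated;	/* ACS flags provide isolation */
};

static bool mfds_same_group(const struct toy_dev *a, const struct toy_dev *b)
{
	if (a->is_virtfn || b->is_virtfn)
		return false;
	if (a->slot != b->slot)
		return false;
	if (a->multifunction != b->multifunction)
		return true;	/* confused, be conservative */
	if (!a->multifunction)
		return false;
	/* No ACS capability implies no loopback; non-isolating ACS groups. */
	if (a->acs_cap && !a->acs_isolated)
		return true;
	if (b->acs_cap && !b->acs_isolated)
		return true;
	return false;
}
```

Under this sketch, two functions with no ACS capability at all end up in separate groups, matching the new interpretation that no capability means no internal loopback.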
>> +
>> +static bool pci_devs_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
>> +{
>> + /*
>> + * This is allowed to return cycles: a,b -> b,c -> c,a can be aliases.
>> + */
>> + if (pci_devs_are_dma_aliases(deva, devb))
>> + return true;
>> +
>> + return pci_mfds_are_same_group(deva, devb);
>> +}
>> +
>> /*
>> * Return a group if the function has isolation restrictions related to
>> * aliases or MFD ACS.
>> */
>> static struct iommu_group *pci_get_function_group(struct pci_dev *pdev)
>> {
>> - struct iommu_group *group;
>> - DECLARE_BITMAP(devfns, 256) = {};
>> + struct pci_reachable_set devfns;
>> + const unsigned int NR_DEVFNS = sizeof(devfns.devfns) * BITS_PER_BYTE;
>> + unsigned int devfn;
>>
>> /*
>> - * Look for existing groups on device aliases. If we alias another
>> - * device or another device aliases us, use the same group.
>> + * Look for existing groups on device aliases and multi-function ACS. If
>> + * we alias another device or another device aliases us, use the same
>> + * group.
>> + *
>> + * pci_reachable_set() should return the same bitmap if called for any
>> + * device in the set and we want all devices in the set to have the same
>> + * group.
>> */
>> - group = get_pci_alias_group(pdev, devfns);
>> - if (group)
>> - return group;
>> + pci_reachable_set(pdev, &devfns, pci_devs_are_same_group);
>> + /* start is known to have iommu_group_get() == NULL */
>> + __clear_bit(pdev->devfn, devfns.devfns);
>>
>> /*
>> - * Look for existing groups on non-isolated functions on the same
>> - * slot and aliases of those funcions, if any. No need to clear
>> - * the search bitmap, the tested devfns are still valid.
>> - */
>> - group = get_pci_function_alias_group(pdev, devfns);
>> - if (group)
>> - return group;
>> -
>> - /*
>> - * When MFD's are included in the set due to ACS we assume that if ACS
>> - * permits an internal loopback between functions it also permits the
>> - * loopback to go downstream if a function is a bridge.
>> + * When MFD functions are included in the set due to ACS we assume that
>> + * if ACS permits an internal loopback between functions it also permits
>> + * the loopback to go downstream if any function is a bridge.
>> *
>> * It is less clear what aliases mean when applied to a bridge. For now
>> * be conservative and also propagate the group downstream.
>> */
>> - __clear_bit(pdev->devfn & 0xFF, devfns);
>> - if (!bitmap_empty(devfns, sizeof(devfns) * BITS_PER_BYTE))
>> - return pci_group_alloc_non_isolated();
>> - return NULL;
>> + if (bitmap_empty(devfns.devfns, NR_DEVFNS))
>> + return NULL;
>> +
>> + for_each_set_bit(devfn, devfns.devfns, NR_DEVFNS) {
>> + struct iommu_group *group;
>> + struct pci_dev *pdev_slot;
>> +
>> + pdev_slot = pci_get_slot(pdev->bus, devfn);
>> + group = iommu_group_get(&pdev_slot->dev);
>> + pci_dev_put(pdev_slot);
>> + if (group) {
>> + if (WARN_ON(!(group->bus_data &
>> + BUS_DATA_PCI_NON_ISOLATED)))
>> + group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
>> + return group;
>> + }
>> + }
>> + return pci_group_alloc_non_isolated();
>> }
>>
>> /* Return a group if the upstream hierarchy has isolation restrictions. */
>> --
>> 2.43.0
>>
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 05/11] PCI: Add pci_reachable_set()
2025-09-09 21:03 ` Bjorn Helgaas
@ 2025-09-10 16:13 ` Jason Gunthorpe
2025-09-11 19:56 ` Donald Dutile
1 sibling, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-10 16:13 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, Donald Dutile, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Tue, Sep 09, 2025 at 04:03:36PM -0500, Bjorn Helgaas wrote:
> > +/**
> > + * pci_reachable_set - Generate a bitmap of devices within a reachability set
> > + * @start: First device in the set
> > + * @devfns: The set of devices on the bus
>
> @devfns is a return parameter, right? Maybe mention that somewhere?
> And the fact that the set only includes the *reachable* devices on the
> bus.
done
Jason
* Re: [PATCH v3 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
2025-09-09 21:43 ` Bjorn Helgaas
@ 2025-09-10 17:34 ` Jason Gunthorpe
2025-09-11 19:50 ` Donald Dutile
0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-10 17:34 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, Donald Dutile, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Tue, Sep 09, 2025 at 04:43:50PM -0500, Bjorn Helgaas wrote:
> > +/*
> > + * The spec is not clear what it means if the capability bit is 0. One view is
> > + * that the device acts as though the ctrl bit is zero, another view is the
> > + * device behavior is undefined.
> > + *
> > + * Historically Linux has taken the position that the capability bit as 0 means
> > + * the device supports the most favorable interpretation of the spec - ie that
> > + * things like P2P RR are always on. As this is security sensitive we expect
> > + * devices that do not follow this rule to be quirked.
>
> Interpreting a 0 Capability bit, i.e., per spec "the component does
> not implement the feature", as "the component behaves as though the
> feature is always enabled" sounds like a stretch to me.
I generally agree, but this is how it is implemented today.
I've revised this text, I think it is actually OK and supported by the
spec, but it is subtle:
/*
* The spec has specific language about what bits must be supported in an ACS
* capability. In some cases if the capability does not support the bit then it
* really acts as though the bit is enabled. e.g.:
*
* ACS P2P Request Redirect: must be implemented by Root Ports that support
* peer-to-peer traffic with other Root Ports
*
 * Meaning if RR is not supported then P2P is definitely not supported and the
* device is effectively behaving as if RR is set.
*
* Summarizing the spec requirements:
* DSP Root Port MFD
* SV M M M
* RR M E E
* CR M E E
* UF M E N/A
* TB M M N/A
* DT M E E
* - M=Must Be Implemented
 * - E=If not implemented the behavior is effectively as though it is enabled.
*
* Therefore take the simple approach and assume the above flags are enabled
* if the cap is 0.
*
* ACS Enhanced eliminated undefined areas of the spec around MMIO in root ports
* and switch ports. If those ports have no MMIO then it is not relevant.
* PCI_ACS_UNCLAIMED_RR eliminates the undefined area around an upstream switch
* window that is not fully decoded by the downstream windows.
*
* Though the spec is written on the assumption that existing devices without
* ACS Enhanced can do whatever they want, Linux has historically assumed what
* is now codified as PCI_ACS_DSP_MT_RB | PCI_ACS_DSP_MT_RR | PCI_ACS_USP_MT_RB
* | PCI_ACS_USP_MT_RR | PCI_ACS_UNCLAIMED_RR.
*
* Changing how Linux understands existing ACS prior to ACS Enhanced would break
 * a lot of systems.
*
* Thus continue as historical Linux has always done if ACS Enhanced is not
* supported, while if ACS Enhanced is supported follow it.
*
* Due to ACS Enhanced bits being force set to 0 by older Linux kernels, and
* those values would break old kernels on the edge cases they cover, the only
* compatible thing for a new device to implement is ACS Enhanced supported with
* the control bits (except PCI_ACS_IORB) wired to follow ACS_RR.
*/
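[Editor's aside: to make the table concrete, here is a small userspace sketch. This is not kernel code; the flag values and names are illustrative stand-ins for the PCI_ACS_* defines. It shows how the "E" entries translate into assumed-enabled flags when the capability bit reads 0.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PCI_ACS_* flag defines. */
#define ACS_SV 0x01
#define ACS_TB 0x02
#define ACS_RR 0x04
#define ACS_CR 0x08
#define ACS_UF 0x10
#define ACS_DT 0x20

enum port_type { PORT_DSP, PORT_ROOT, PORT_MFD };

/*
 * Flags that may be assumed enabled when the capability bit reads 0:
 * the "E" entries in the table. "M" entries must be implemented, so a
 * 0 capability bit for those is a spec violation, not an assumption.
 */
static uint32_t acs_assumed_if_unimplemented(enum port_type type)
{
	switch (type) {
	case PORT_DSP:
		return 0;				/* everything is "M" */
	case PORT_ROOT:
		return ACS_RR | ACS_CR | ACS_UF | ACS_DT;
	case PORT_MFD:
		return ACS_RR | ACS_CR | ACS_DT;	/* UF/TB are N/A */
	}
	return 0;
}

/* Effective flags = implemented-and-enabled bits plus assumed bits. */
static uint32_t acs_effective(enum port_type type, uint32_t cap, uint32_t ctrl)
{
	return (cap & ctrl) | (acs_assumed_if_unimplemented(type) & ~cap);
}
```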
> Sounds like a mess and might be worth an ECR to clarify the spec.
IMHO a lot of this is badly designed for an OS. PCI SIG favours not
rendering existing HW incompatible with new revs of the spec, which
generally means the OS has no idea WTF is going on anymore.
For ACS it means the OS cannot accurately predict what the fabric
routing will be..
Jason
* Re: [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs
2025-09-10 1:59 ` Donald Dutile
@ 2025-09-10 17:43 ` Jason Gunthorpe
0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-10 17:43 UTC (permalink / raw)
To: Donald Dutile
Cc: Bjorn Helgaas, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon, Alex Williamson, Lu Baolu, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Tue, Sep 09, 2025 at 09:59:23PM -0400, Donald Dutile wrote:
> > Alex might remember more, but I kind of suspect the current system of
> > quirks is there because of devices that do internal loopback but have
> > no ACS Capability.
> >
> and they are quirks b/c ... they violated the spec.... they are supposed
> to have an ACS Cap if they can do internal loopback p2p dma.
It is the reverse
Linux assumed all devices without ACS capability CAN do internal
loopback.
This captures a huge number of real devices that it seems don't
actually do internal loopback.
When people complained, e.g. for DPDK and the like, quirks saying they
don't do internal loopback were added. But this was never structured or
sensible; I have systems here where the LOM E1000 is quirked and a few
generations later it is not quirked. I doubt it suddenly gained
loopback.
That said in doing so a few cases (AMD sound & GPU MFD comes to mind)
were found where the MFD actually does internal loopback.
So here we have to pick the least bad option:
1) Be pessimistic and assume internal loopback exists without ACS Cap
and expand groups. Quirk devices determined to not have internal
loopback. (as today, except due to bugs we don't expand the groups
enough)
2) Be optimistic and assume no internal loopback exists without ACS Cap
and shrink groups. Quirk devices that are determined to have
internal loopback. (proposed here)
v2 of this series did my best attempt at #1, and there were too many
regressions because if you fix ACS to actually group the way it is
supposed to the internal MFD loopback pessimism breaks alot of real
systems.
Don pointed to the spec and says there is reasonable language to
assume that if the MFD has internal loopback it must have an ACS
capability.
> I'm assuming the current system of quirks impacts the groups and/or
> reachability, such that the quirks are accounted for, and that history
> isn't lost (and we have another regression issue).
In this version the quirks effectively become ignored for iommu
grouping, as we don't call the ACS functions if there is no ACS cap.
Jason
* Re: [PATCH v3 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
2025-09-10 17:34 ` Jason Gunthorpe
@ 2025-09-11 19:50 ` Donald Dutile
0 siblings, 0 replies; 52+ messages in thread
From: Donald Dutile @ 2025-09-11 19:50 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 9/10/25 1:34 PM, Jason Gunthorpe wrote:
> On Tue, Sep 09, 2025 at 04:43:50PM -0500, Bjorn Helgaas wrote:
>>> +/*
>>> + * The spec is not clear what it means if the capability bit is 0. One view is
>>> + * that the device acts as though the ctrl bit is zero, another view is the
>>> + * device behavior is undefined.
>>> + *
>>> + * Historically Linux has taken the position that the capability bit as 0 means
>>> + * the device supports the most favorable interpretation of the spec - ie that
>>> + * things like P2P RR are always on. As this is security sensitive we expect
>>> + * devices that do not follow this rule to be quirked.
>>
>> Interpreting a 0 Capability bit, i.e., per spec "the component does
>> not implement the feature", as "the component behaves as though the
>> feature is always enabled" sounds like a stretch to me.
>
> I generally agree, but this is how it is implemented today.
>
> I've revised this text, I think it is actually OK and supported by the
> spec, but it is subtle:
>
> /*
> * The spec has specific language about what bits must be supported in an ACS
> * capability. In some cases if the capability does not support the bit then it
> * really acts as though the bit is enabled. e.g.:
> *
> * ACS P2P Request Redirect: must be implemented by Root Ports that support
> * peer-to-peer traffic with other Root Ports
> *
> * Meaning if RR is not supported then P2P is definitely not supported and the
> * device is effectively behaving as if RR is set.
> *
> * Summarizing the spec requirements:
> * DSP Root Port MFD
> * SV M M M
> * RR M E E
> * CR M E E
> * UF M E N/A
> * TB M M N/A
> * DT M E E
> * - M=Must Be Implemented
> * - E=If not implemented the behavior is effectively as though it is enabled.
> *
> * Therefore take the simple approach and assume the above flags are enabled
> * if the cap is 0.
> *
> * ACS Enhanced eliminated undefined areas of the spec around MMIO in root ports
> * and switch ports. If those ports have no MMIO then it is not relevant.
> * PCI_ACS_UNCLAIMED_RR eliminates the undefined area around an upstream switch
> * window that is not fully decoded by the downstream windows.
> *
> * Though the spec is written on the assumption that existing devices without
> * ACS Enhanced can do whatever they want, Linux has historically assumed what
> * is now codified as PCI_ACS_DSP_MT_RB | PCI_ACS_DSP_MT_RR | PCI_ACS_USP_MT_RB
> * | PCI_ACS_USP_MT_RR | PCI_ACS_UNCLAIMED_RR.
> *
> * Changing how Linux understands existing ACS prior to ACS Enhanced would break
> * a lot of systems.
> *
> * Thus continue as historical Linux has always done if ACS Enhanced is not
> * supported, while if ACS Enhanced is supported follow it.
> *
> * Due to ACS Enhanced bits being force set to 0 by older Linux kernels, and
> * those values would break old kernels on the edge cases they cover, the only
> * compatible thing for a new device to implement is ACS Enhanced supported with
> * the control bits (except PCI_ACS_IORB) wired to follow ACS_RR.
> */
>
>> Sounds like a mess and might be worth an ECR to clarify the spec.
>
> IMHO alot of this is badly designed for an OS. PCI SIG favours not
> rendering existing HW incompatible with new revs of the spec, which
> generally means the OS has no idea WTF is going on anymore.
>
> For ACS it means the OS cannot accurately predict what the fabric
> routing will be..
>
> Jason
>
Exec summary: the spec is clear as mud wrt RP/RCs. ;-p
The above summary captures the conclusions of the proposed spec update,
and makes a good reference if a future conclusion requires another change
in this area. Thanks for the added verbiage for future reference.
- Don
* Re: [PATCH v3 05/11] PCI: Add pci_reachable_set()
2025-09-09 21:03 ` Bjorn Helgaas
2025-09-10 16:13 ` Jason Gunthorpe
@ 2025-09-11 19:56 ` Donald Dutile
2025-09-15 13:38 ` Jason Gunthorpe
1 sibling, 1 reply; 52+ messages in thread
From: Donald Dutile @ 2025-09-11 19:56 UTC (permalink / raw)
To: Bjorn Helgaas, Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 9/9/25 5:03 PM, Bjorn Helgaas wrote:
> On Fri, Sep 05, 2025 at 03:06:20PM -0300, Jason Gunthorpe wrote:
>> Implement pci_reachable_set() to efficiently compute a set of devices on
>> the same bus that are "reachable" from a starting device. The meaning of
>> reachability is defined by the caller through a callback function.
>>
>> This is a faster implementation of the same logic in
>> pci_device_group(). Being inside the PCI core allows use of pci_bus_sem so
>> it can use list_for_each_entry() on a small list of devices instead of the
>> expensive for_each_pci_dev(). Server systems can now have hundreds of PCI
>> devices, but typically only a very small number of devices per bus.
>>
>> An example of a reachability function would be pci_devs_are_dma_aliases()
>> which would compute a set of devices on the same bus that are
>> aliases. This would also be useful in future support for the ACS P2P
>> Egress Vector which has a similar reachability problem.
>>
>> This is effectively a graph algorithm where the set of devices on the bus
>> are vertexes and the reachable() function defines the edges. It returns a
>> set of vertexes that form a connected graph.
>>
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> ---
>> drivers/pci/search.c | 90 ++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/pci.h | 12 ++++++
>> 2 files changed, 102 insertions(+)
>>
>> diff --git a/drivers/pci/search.c b/drivers/pci/search.c
>> index fe6c07e67cb8ce..dac6b042fd5f5d 100644
>> --- a/drivers/pci/search.c
>> +++ b/drivers/pci/search.c
>> @@ -595,3 +595,93 @@ int pci_dev_present(const struct pci_device_id *ids)
>> return 0;
>> }
>> EXPORT_SYMBOL(pci_dev_present);
>> +
>> +/**
>> + * pci_reachable_set - Generate a bitmap of devices within a reachability set
>> + * @start: First device in the set
>> + * @devfns: The set of devices on the bus
>
> @devfns is a return parameter, right? Maybe mention that somewhere?
> And the fact that the set only includes the *reachable* devices on the
> bus.
>
Yes, and for clarity, I'd prefer the fcn name to be 'pci_reachable_bus_set()' so
it's clear it (or its callers) compute an intra-bus reachability result
rather than doing inter-bus reachability checking, although returning a 256-bit
devfns without a domain prefix indirectly indicates it.
>> + * @reachable: Callback to tell if two devices can reach each other
>> + *
>> + * Compute a bitmap where every set bit is a device on the bus that is reachable
>> + * from the start device, including the start device. Reachability between two
>> + * devices is determined by a callback function.
>> + *
>> + * This is a non-recursive implementation that invokes the callback once per
>> + * pair. The callback must be commutative:
>> + * reachable(a, b) == reachable(b, a)
>> + * reachable() can form a cyclic graph:
>> + * reachable(a,b) == reachable(b,c) == reachable(c,a) == true
>> + *
>> + * Since this function is limited to a single bus the largest set can be 256
>> + * devices large.
>> + */
>> +void pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
>> + bool (*reachable)(struct pci_dev *deva,
>> + struct pci_dev *devb))
>> +{
>> + struct pci_reachable_set todo_devfns = {};
>> + struct pci_reachable_set next_devfns = {};
>> + struct pci_bus *bus = start->bus;
>> + bool again;
>> +
>> + /* Assume devfn of all PCI devices is bounded by MAX_NR_DEVFNS */
>> + static_assert(sizeof(next_devfns.devfns) * BITS_PER_BYTE >=
>> + MAX_NR_DEVFNS);
>> +
>> + memset(devfns, 0, sizeof(devfns->devfns));
>> + __set_bit(start->devfn, devfns->devfns);
>> + __set_bit(start->devfn, next_devfns.devfns);
>> +
>> + down_read(&pci_bus_sem);
>> + while (true) {
>> + unsigned int devfna;
>> + unsigned int i;
>> +
>> + /*
>> + * For each device that hasn't been checked compare every
>> + * device on the bus against it.
>> + */
>> + again = false;
>> + for_each_set_bit(devfna, next_devfns.devfns, MAX_NR_DEVFNS) {
>> + struct pci_dev *deva = NULL;
>> + struct pci_dev *devb;
>> +
>> + list_for_each_entry(devb, &bus->devices, bus_list) {
>> + if (devb->devfn == devfna)
>> + deva = devb;
>> +
>> + if (test_bit(devb->devfn, devfns->devfns))
>> + continue;
>> +
>> + if (!deva) {
>> + deva = devb;
>> + list_for_each_entry_continue(
>> + deva, &bus->devices, bus_list)
>> + if (deva->devfn == devfna)
>> + break;
>> + }
>> +
>> + if (!reachable(deva, devb))
>> + continue;
>> +
>> + __set_bit(devb->devfn, todo_devfns.devfns);
>> + again = true;
>> + }
>> + }
>> +
>> + if (!again)
>> + break;
>> +
>> + /*
>> + * Every new bit adds a new deva to check, reloop the whole
>> + * thing. Expect this to be rare.
>> + */
>> + for (i = 0; i != ARRAY_SIZE(devfns->devfns); i++) {
>> + devfns->devfns[i] |= todo_devfns.devfns[i];
>> + next_devfns.devfns[i] = todo_devfns.devfns[i];
>> + todo_devfns.devfns[i] = 0;
>> + }
>> + }
>> + up_read(&pci_bus_sem);
>> +}
>> +EXPORT_SYMBOL_GPL(pci_reachable_set);
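[Editor's aside: the fixed-point loop above may be easier to follow outside the kernel. Below is a hedged userspace sketch of the same expansion, with the bus device list reduced to a toy adjacency matrix; all names here are illustrative, not kernel API.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch of the fixed-point expansion: the bus is a fixed
 * number of devices and reachable[a][b] plays the role of the
 * reachable() callback.
 */
#define NR_DEVS 8

static void reachable_set(bool reachable[NR_DEVS][NR_DEVS],
			  int start, bool set[NR_DEVS])
{
	bool next[NR_DEVS] = { false };
	bool again = true;

	for (int i = 0; i < NR_DEVS; i++)
		set[i] = false;
	set[start] = next[start] = true;

	while (again) {
		bool todo[NR_DEVS] = { false };

		again = false;
		/* Compare every unvisited device against the frontier. */
		for (int a = 0; a < NR_DEVS; a++) {
			if (!next[a])
				continue;
			for (int b = 0; b < NR_DEVS; b++) {
				if (set[b] || !reachable[a][b])
					continue;
				todo[b] = true;
				again = true;
			}
		}
		/* Every new member may reach further: reloop on it. */
		for (int i = 0; i < NR_DEVS; i++) {
			set[i] |= todo[i];
			next[i] = todo[i];
		}
	}
}
```

With edges 0-1, 1-2 and 4-5, starting from device 0 yields the connected set {0, 1, 2} and leaves the 4-5 pair untouched, mirroring how transitively reachable devices end up in one iommu_group.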
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index fb9adf0562f8ef..21f6b20b487f8d 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -855,6 +855,10 @@ struct pci_dynids {
>> struct list_head list; /* For IDs added at runtime */
>> };
>>
>> +struct pci_reachable_set {
>> + DECLARE_BITMAP(devfns, 256);
>> +};
>> +
>> enum pci_bus_isolation {
>> /*
>> * The bus is off a root port and the root port has isolated ACS flags
>> @@ -1269,6 +1273,9 @@ struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
>> struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
>> struct pci_dev *pci_get_base_class(unsigned int class, struct pci_dev *from);
>>
>> +void pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
>> + bool (*reachable)(struct pci_dev *deva,
>> + struct pci_dev *devb));
>> enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus);
>>
>> int pci_dev_present(const struct pci_device_id *ids);
>> @@ -2084,6 +2091,11 @@ static inline struct pci_dev *pci_get_base_class(unsigned int class,
>> struct pci_dev *from)
>> { return NULL; }
>>
>> +static inline void
>> +pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
>> + bool (*reachable)(struct pci_dev *deva, struct pci_dev *devb))
>> +{ }
>> +
>> static inline enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
>> { return PCIE_NON_ISOLATED; }
>>
>> --
>> 2.43.0
>>
>
For the rest...
Reviewed-by: Donald Dutile <ddutile@redhat.com>
* Re: [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (10 preceding siblings ...)
2025-09-05 18:06 ` [PATCH v3 11/11] PCI: Check ACS Extended flags for pci_bus_isolated() Jason Gunthorpe
@ 2025-09-15 9:41 ` Cédric Le Goater
2025-09-22 22:39 ` Alex Williamson
12 siblings, 0 replies; 52+ messages in thread
From: Cédric Le Goater @ 2025-09-15 9:41 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 9/5/25 20:06, Jason Gunthorpe wrote:
> The series patches have extensive descriptions as to the problem and
> solution, but in short the ACS flags are not analyzed according to the
> spec to form the iommu_groups that VFIO is expecting for security.
>
> ACS is an egress control only. For a path the ACS flags on each hop only
> effect what other devices the TLP is allowed to reach. It does not prevent
> other devices from reaching into this path.
>
> For VFIO if device A is permitted to access device B's MMIO then A and B
> must be grouped together. This says that even if a path has isolating ACS
> flags on each hop, off-path devices with non-isolating ACS can still reach
> into that path and must be grouped together.
>
> For switches, a PCIe topology like:
>
> -- DSP 02:00.0 -> End Point A
> Root 00:00.0 -> USP 01:00.0 --|
> -- DSP 02:03.0 -> End Point B
>
> Will generate unique single device groups for every device even if ACS is
> not enabled on the two DSP ports. It should at least group A/B together
> because no ACS means A can reach the MMIO of B. This is a serious failure
> for the VFIO security model.
>
> For multi-function-devices, a PCIe topology like:
>
> -- MFD 00:1f.0 ACS not supported
> Root 00:00.00 --|- MFD 00:1f.2 ACS not supported
> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
>
> Will group [1f.0, 1f.2] and 1f.6 gets a single device group. However from
> a spec perspective each device should get its own group, because ACS not
> supported can assume no loopback is possible by spec.
>
> For root-ports a PCIe topology like:
> -- Dev 01:00.0
> Root 00:00.00 --- Root Port 00:01.0 --|
> | -- Dev 01:00.1
> |- Dev 00:17.0
>
> Previously would group [00:01.0, 01:00.0, 01:00.1] together if there is no
> ACS capability in the root port.
>
> While ACS on root ports is underspecified in the spec, it should still
> function as an egress control and limit access to either the MMIO of the
> root port itself, or perhaps some other devices upstream of the root
> complex - 00:17.0 perhaps in this example.
>
> Historically the grouping in Linux has assumed the root port routes all
> traffic into the TA/IOMMU and never bypasses the TA to go to other
> functions in the root complex. Following the new understanding that ACS is
> required for internal loopback also treat root ports with no ACS
> capability as lacking internal loopback as well.
>
> There is also some confusing spec language about how ACS and SRIOV works
> which this series does not address.
>
>
> This entire series goes further and makes some additional improvements to
> the ACS validation found while studying this problem. The groups around a
> PCIe to PCI bridge are shrunk to not include the PCIe bridge.
>
> The last patches implement "ACS Enhanced" on top of it. Due to how ACS
> Enhanced was defined as a non-backward compatible feature it is important
> to get SW support out there.
>
> Due to the potential of iommu_groups becoming wider and thus non-usable
> for VFIO this should go to a linux-next tree to give it some more
> exposure.
>
> I have now tested this a few systems I could get:
>
> - Various Intel client systems:
> * Raptor Lake, with VMD enabled and using the real_dev mechanism
> * 6/7th generation 100 Series/C320
> * 5/6th generation 100 Series/C320 with a NIC MFD quirk
> * Tiger Lake
> * 5/6th generation Sunrise Point
FWIW, I have tested this series on some of the systems I use
for upstream VFIO :
Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz
Intel(R) Xeon(R) Silver 4514Y
Intel(R) 12th Gen Core(TM) i7-12700K
Neoverse-N1
I didn't see any of the IOMMU grouping regressions I saw on v2.
Please ping me if you need more info on the PCI topology.
I also booted an IBM/S390 z16 LPAR with VFs to complete the
experiment. All good.
> The 6/7th gen system has a root port without an ACS capability and it
> becomes ungrouped as described above.
>
> All systems have changes, the MFDs in the root complex all become ungrouped.
>
> - NVIDIA Grace system with 5 different PCI switches from two vendors
> Bug fix widening the iommu_groups works as expected here
>
> This is on github: https://github.com/jgunthorpe/linux/commits/pcie_switch_groups
>
> v3:
> - Rebase to v6.17-rc4
> - Drop the quirks related patches
> - Change the MFD logic to process no ACS cap as meaning no internal
> loopback. This avoids creating non-isolated groups for MFD root ports in
> common AMD and Intel systems
> - Fix matching MFDs to ignore SRIOV VFs
> - Fix some kbuild splats
> v2: https://patch.msgid.link/r/0-v2-4a9b9c983431+10e2-pcie_switch_groups_jgg@nvidia.com
> - Revise comments and commit messages
> - Rename struct pci_alias_set to pci_reachable_set
> - Make more sense of the special bus->self = NULL case for SRIOV
> - Add pci_group_alloc_non_isolated() for readability
> - Rename BUS_DATA_PCI_UNISOLATED to BUS_DATA_PCI_NON_ISOLATED
> - Propagate BUS_DATA_PCI_NON_ISOLATED downstream from a MFD in case a MFD
> function is a bridge
> - New patches to add pci_mfd_isolation() to retain more cases of narrow
> groups on MFDs with missing ACS.
> - Redescribe the MFD related change as a bug fix. For a MFD to be
> isolated all functions must have egress control on their P2P.
> v1: https://patch.msgid.link/r/0-v1-74184c5043c6+195-pcie_switch_groups_jgg@nvidia.com
>
> Cc: galshalom@nvidia.com
> Cc: tdave@nvidia.com
> Cc: maorg@nvidia.com
> Cc: kvm@vger.kernel.org
> Cc: Ceric Le Goater" <clg@redhat.com>
Curiously, I didn't get the email. weird.
Cheers,
C.
> Cc: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>
> Jason Gunthorpe (11):
> PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED
> PCI: Add pci_bus_isolated()
> iommu: Compute iommu_groups properly for PCIe switches
> iommu: Organize iommu_group by member size
> PCI: Add pci_reachable_set()
> iommu: Compute iommu_groups properly for PCIe MFDs
> iommu: Validate that pci_for_each_dma_alias() matches the groups
> PCI: Add the ACS Enhanced Capability definitions
> PCI: Enable ACS Enhanced bits for enable_acs and config_acs
> PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
> PCI: Check ACS Extended flags for pci_bus_isolated()
>
> drivers/iommu/iommu.c | 510 +++++++++++++++++++++++-----------
> drivers/pci/ats.c | 4 +-
> drivers/pci/pci.c | 73 ++++-
> drivers/pci/search.c | 274 ++++++++++++++++++
> include/linux/pci.h | 46 +++
> include/uapi/linux/pci_regs.h | 18 ++
> 6 files changed, 759 insertions(+), 166 deletions(-)
>
>
> base-commit: b320789d6883cc00ac78ce83bccbfe7ed58afcf0
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 05/11] PCI: Add pci_reachable_set()
2025-09-11 19:56 ` Donald Dutile
@ 2025-09-15 13:38 ` Jason Gunthorpe
2025-09-15 14:32 ` Donald Dutile
0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-15 13:38 UTC (permalink / raw)
To: Donald Dutile
Cc: Bjorn Helgaas, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon, Alex Williamson, Lu Baolu, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Thu, Sep 11, 2025 at 03:56:50PM -0400, Donald Dutile wrote:
> Yes, and for clarity, I'd prefer the fcn name to be 'pci_reachable_bus_set()' so
> it's clear it (or its callers) is computing an intra-bus reachability result,
> and not doing inter-bus reachability checking, although returning a 256-bit
> devfns without a domain prefix indirectly indicates it.
Sure:
/**
* pci_reachable_bus_set - Generate a bitmap of devices within a reachability set
* @start: First device in the set
* @devfns: Output set of devices on the bus reachable from start
* @reachable: Callback to tell if two devices can reach each other
*
* Compute a bitmap @devfns where every set bit is a device on the bus of
* @start that is reachable from the @start device, including the start device.
* Reachability between two devices is determined by a callback function.
*
* This is a non-recursive implementation that invokes the callback once per
* pair. The callback must be commutative::
*
* reachable(a, b) == reachable(b, a)
*
* reachable() can form a cyclic graph::
*
* reachable(a,b) == reachable(b,c) == reachable(c,a) == true
*
* Since this function is limited to a single bus the largest set can be 256
* devices large.
*/
void pci_reachable_bus_set(struct pci_dev *start,
struct pci_reachable_set *devfns,
bool (*reachable)(struct pci_dev *deva,
struct pci_dev *devb))
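To make the fixed-point behavior concrete, here is a toy userspace model of the same set-growing idea (hypothetical code, not the kernel implementation; the real version is more careful to invoke the callback only once per pair):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BUS_SLOTS 256 /* a single bus has an 8-bit devfn space */

/* Toy stand-in for struct pci_dev, userspace only. */
struct toy_dev {
	uint8_t devfn;
	bool present;
};

static struct toy_dev bus[BUS_SLOTS];

/*
 * Grow the set from @start: keep sweeping the bus, pulling in any
 * device that the callback says can reach a current member, until a
 * full pass adds nothing.  Non-recursive, and correct for cyclic
 * reachable() relations because it computes a fixed point.
 */
static void toy_reachable_bus_set(uint8_t start, bool devfns[BUS_SLOTS],
				  bool (*reachable)(uint8_t a, uint8_t b))
{
	bool changed = true;

	memset(devfns, 0, BUS_SLOTS * sizeof(bool));
	devfns[start] = true;

	while (changed) {
		changed = false;
		for (int i = 0; i < BUS_SLOTS; i++) {
			if (!bus[i].present || devfns[i])
				continue;
			for (int j = 0; j < BUS_SLOTS; j++) {
				if (devfns[j] &&
				    reachable(bus[i].devfn, bus[j].devfn)) {
					devfns[i] = true;
					changed = true;
					break;
				}
			}
		}
	}
}

/* Demo topology: devfns 0..3 present; 0<->1 and 1<->2 can reach each other. */
static bool demo_reachable(uint8_t a, uint8_t b)
{
	if (a > b) {
		uint8_t t = a;
		a = b;
		b = t;
	}
	return (a == 0 && b == 1) || (a == 1 && b == 2);
}

static bool demo_in_set(uint8_t start, uint8_t query)
{
	bool set[BUS_SLOTS];

	for (int i = 0; i < 4; i++) {
		bus[i].devfn = (uint8_t)i;
		bus[i].present = true;
	}
	toy_reachable_bus_set(start, set, demo_reachable);
	return set[query];
}
```

Starting from devfn 0 this pulls in 1 directly and 2 transitively while 3 stays out, which is the closure the iommu_group computation needs.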
Thanks,
Jason
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 05/11] PCI: Add pci_reachable_set()
2025-09-15 13:38 ` Jason Gunthorpe
@ 2025-09-15 14:32 ` Donald Dutile
0 siblings, 0 replies; 52+ messages in thread
From: Donald Dutile @ 2025-09-15 14:32 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon, Alex Williamson, Lu Baolu, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 9/15/25 9:38 AM, Jason Gunthorpe wrote:
> On Thu, Sep 11, 2025 at 03:56:50PM -0400, Donald Dutile wrote:
>
>> Yes, and for clarity, I'd prefer the fcn name to be 'pci_reachable_bus_set()' so
>> it's clear it (or its callers) is computing an intra-bus reachability result,
>> and not doing inter-bus reachability checking, although returning a 256-bit
>> devfns without a domain prefix indirectly indicates it.
>
> Sure:
>
> /**
> * pci_reachable_bus_set - Generate a bitmap of devices within a reachability set
> * @start: First device in the set
> * @devfns: Output set of devices on the bus reachable from start
> * @reachable: Callback to tell if two devices can reach each other
> *
> * Compute a bitmap @devfns where every set bit is a device on the bus of
> * @start that is reachable from the @start device, including the start device.
> * Reachability between two devices is determined by a callback function.
> *
> * This is a non-recursive implementation that invokes the callback once per
> * pair. The callback must be commutative::
> *
> * reachable(a, b) == reachable(b, a)
> *
> * reachable() can form a cyclic graph::
> *
> * reachable(a,b) == reachable(b,c) == reachable(c,a) == true
> *
> * Since this function is limited to a single bus the largest set can be 256
> * devices large.
> */
> void pci_reachable_bus_set(struct pci_dev *start,
> struct pci_reachable_set *devfns,
> bool (*reachable)(struct pci_dev *deva,
> struct pci_dev *devb))
>
> Thanks,
> Jason
>
Thanks... Don
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (11 preceding siblings ...)
2025-09-15 9:41 ` [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Cédric Le Goater
@ 2025-09-22 22:39 ` Alex Williamson
2025-09-23 1:44 ` Donald Dutile
12 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2025-09-22 22:39 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Lu Baolu, Donald Dutile, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Fri, 5 Sep 2025 15:06:15 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> The series patches have extensive descriptions as to the problem and
> solution, but in short the ACS flags are not analyzed according to the
> spec to form the iommu_groups that VFIO is expecting for security.
>
> ACS is an egress control only. For a path the ACS flags on each hop only
> affect what other devices the TLP is allowed to reach. It does not prevent
> other devices from reaching into this path.
>
> For VFIO if device A is permitted to access device B's MMIO then A and B
> must be grouped together. This says that even if a path has isolating ACS
> flags on each hop, off-path devices with non-isolating ACS can still reach
> into that path and must be grouped together.
>
> For switches, a PCIe topology like:
>
> -- DSP 02:00.0 -> End Point A
> Root 00:00.0 -> USP 01:00.0 --|
> -- DSP 02:03.0 -> End Point B
>
> Will generate unique single device groups for every device even if ACS is
> not enabled on the two DSP ports. It should at least group A/B together
> because no ACS means A can reach the MMIO of B. This is a serious failure
> for the VFIO security model.
>
> For multi-function-devices, a PCIe topology like:
>
> -- MFD 00:1f.0 ACS not supported
> Root 00:00.00 --|- MFD 00:1f.2 ACS not supported
> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
>
> Will group [1f.0, 1f.2] together, while 1f.6 gets a single device group.
> However, from a spec perspective each device should get its own group,
> because a function with no ACS capability can, per spec, be assumed to
> have no internal loopback.
I just dug through the thread with Don that I think tries to justify
this, but I have a lot of concerns about this. I think the "must be
implemented by Functions that support peer-to-peer traffic with other
Functions" language is specifying that IF the device implements an ACS
capability AND does not implement the specific ACS P2P flag being
described, then and only then can we assume that form of P2P is not
supported. OTOH, we cannot assume anything regarding internal P2P of an
MFD that does not implement an ACS capability at all.
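Stated as a predicate, the reading is that only an implemented ACS capability is informative. A hypothetical userspace sketch (PCI_ACS_RR is the P2P Request Redirect bit value as defined in pci_regs.h):

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_ACS_RR 0x0004 /* P2P Request Redirect, as in pci_regs.h */

/*
 * Model of the reading above: the absence of internal P2P may be
 * inferred only when an ACS capability is implemented AND its
 * capability register leaves the P2P Request Redirect bit clear.
 * A device with no ACS capability at all tells us nothing.
 */
static bool p2p_known_absent(bool has_acs_cap, uint16_t acs_cap)
{
	return has_acs_cap && !(acs_cap & PCI_ACS_RR);
}
```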
I believe we even reached agreement with some NIC vendors in the early
days of IOMMU groups that they needed to implement an "empty" ACS
capability on their multifunction NICs such that they could describe in
this way that internal P2P is not supported by the device. Thanks,
Alex
>
> For root-ports a PCIe topology like:
> -- Dev 01:00.0
> Root 00:00.00 --- Root Port 00:01.0 --|
> | -- Dev 01:00.1
> |- Dev 00:17.0
>
> Previously would group [00:01.0, 01:00.0, 01:00.1] together if there is no
> ACS capability in the root port.
>
> While ACS on root ports is underspecified in the spec, it should still
> function as an egress control and limit access to either the MMIO of the
> root port itself, or perhaps some other devices upstream of the root
> complex - 00:17.0 perhaps in this example.
>
> Historically the grouping in Linux has assumed the root port routes all
> traffic into the TA/IOMMU and never bypasses the TA to go to other
> functions in the root complex. Following the new understanding that ACS is
> required for internal loopback also treat root ports with no ACS
> capability as lacking internal loopback as well.
>
> There is also some confusing spec language about how ACS and SRIOV works
> which this series does not address.
>
>
> This entire series goes further and makes some additional improvements to
> the ACS validation found while studying this problem. The groups around a
> PCIe to PCI bridge are shrunk to not include the PCIe bridge.
>
> The last patches implement "ACS Enhanced" on top of it. Due to how ACS
> Enhanced was defined as a non-backward compatible feature it is important
> to get SW support out there.
>
> Due to the potential of iommu_groups becoming wider and thus non-usable
> for VFIO this should go to a linux-next tree to give it some more
> exposure.
>
> I have now tested this on a few systems I could get:
>
> - Various Intel client systems:
> * Raptor Lake, with VMD enabled and using the real_dev mechanism
> * 6/7th generation 100 Series/C320
> * 5/6th generation 100 Series/C320 with a NIC MFD quirk
> * Tiger Lake
> * 5/6th generation Sunrise Point
>
> The 6/7th gen system has a root port without an ACS capability and it
> becomes ungrouped as described above.
>
> All systems have changes; the MFDs in the root complex all become ungrouped.
>
> - NVIDIA Grace system with 5 different PCI switches from two vendors
> Bug fix widening the iommu_groups works as expected here
>
> This is on github: https://github.com/jgunthorpe/linux/commits/pcie_switch_groups
>
> v3:
> - Rebase to v6.17-rc4
> - Drop the quirks related patches
> - Change the MFD logic to process no ACS cap as meaning no internal
> loopback. This avoids creating non-isolated groups for MFD root ports in
> common AMD and Intel systems
> - Fix matching MFDs to ignore SRIOV VFs
> - Fix some kbuild splats
> v2: https://patch.msgid.link/r/0-v2-4a9b9c983431+10e2-pcie_switch_groups_jgg@nvidia.com
> - Revise comments and commit messages
> - Rename struct pci_alias_set to pci_reachable_set
> - Make more sense of the special bus->self = NULL case for SRIOV
> - Add pci_group_alloc_non_isolated() for readability
> - Rename BUS_DATA_PCI_UNISOLATED to BUS_DATA_PCI_NON_ISOLATED
> - Propagate BUS_DATA_PCI_NON_ISOLATED downstream from a MFD in case a MFD
> function is a bridge
> - New patches to add pci_mfd_isolation() to retain more cases of narrow
> groups on MFDs with missing ACS.
> - Redescribe the MFD related change as a bug fix. For a MFD to be
> isolated all functions must have egress control on their P2P.
> v1: https://patch.msgid.link/r/0-v1-74184c5043c6+195-pcie_switch_groups_jgg@nvidia.com
>
> Cc: galshalom@nvidia.com
> Cc: tdave@nvidia.com
> Cc: maorg@nvidia.com
> Cc: kvm@vger.kernel.org
> Cc: Ceric Le Goater" <clg@redhat.com>
> Cc: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>
> Jason Gunthorpe (11):
> PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED
> PCI: Add pci_bus_isolated()
> iommu: Compute iommu_groups properly for PCIe switches
> iommu: Organize iommu_group by member size
> PCI: Add pci_reachable_set()
> iommu: Compute iommu_groups properly for PCIe MFDs
> iommu: Validate that pci_for_each_dma_alias() matches the groups
> PCI: Add the ACS Enhanced Capability definitions
> PCI: Enable ACS Enhanced bits for enable_acs and config_acs
> PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
> PCI: Check ACS Extended flags for pci_bus_isolated()
>
> drivers/iommu/iommu.c | 510 +++++++++++++++++++++++-----------
> drivers/pci/ats.c | 4 +-
> drivers/pci/pci.c | 73 ++++-
> drivers/pci/search.c | 274 ++++++++++++++++++
> include/linux/pci.h | 46 +++
> include/uapi/linux/pci_regs.h | 18 ++
> 6 files changed, 759 insertions(+), 166 deletions(-)
>
>
> base-commit: b320789d6883cc00ac78ce83bccbfe7ed58afcf0
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS
2025-09-22 22:39 ` Alex Williamson
@ 2025-09-23 1:44 ` Donald Dutile
2025-09-23 2:06 ` Alex Williamson
0 siblings, 1 reply; 52+ messages in thread
From: Donald Dutile @ 2025-09-23 1:44 UTC (permalink / raw)
To: Alex Williamson, Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
maorg, patches, tdave, Tony Zhu
On 9/22/25 6:39 PM, Alex Williamson wrote:
> On Fri, 5 Sep 2025 15:06:15 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> The series patches have extensive descriptions as to the problem and
>> solution, but in short the ACS flags are not analyzed according to the
>> spec to form the iommu_groups that VFIO is expecting for security.
>>
>> ACS is an egress control only. For a path the ACS flags on each hop only
>> affect what other devices the TLP is allowed to reach. It does not prevent
>> other devices from reaching into this path.
>>
>> For VFIO if device A is permitted to access device B's MMIO then A and B
>> must be grouped together. This says that even if a path has isolating ACS
>> flags on each hop, off-path devices with non-isolating ACS can still reach
>> into that path and must be grouped together.
>>
>> For switches, a PCIe topology like:
>>
>> -- DSP 02:00.0 -> End Point A
>> Root 00:00.0 -> USP 01:00.0 --|
>> -- DSP 02:03.0 -> End Point B
>>
>> Will generate unique single device groups for every device even if ACS is
>> not enabled on the two DSP ports. It should at least group A/B together
>> because no ACS means A can reach the MMIO of B. This is a serious failure
>> for the VFIO security model.
>>
>> For multi-function-devices, a PCIe topology like:
>>
>> -- MFD 00:1f.0 ACS not supported
>> Root 00:00.00 --|- MFD 00:1f.2 ACS not supported
>> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
>>
>> Will group [1f.0, 1f.2] together, while 1f.6 gets a single device group.
>> However, from a spec perspective each device should get its own group,
>> because a function with no ACS capability can, per spec, be assumed to
>> have no internal loopback.
>
> I just dug through the thread with Don that I think tries to justify
> this, but I have a lot of concerns about this. I think the "must be
> implemented by Functions that support peer-to-peer traffic with other
> Functions" language is specifying that IF the device implements an ACS
> capability AND does not implement the specific ACS P2P flag being
> described, then and only then can we assume that form of P2P is not
> supported. OTOH, we cannot assume anything regarding internal P2P of an
> MFD that does not implement an ACS capability at all.
>
The first, non-IF'd, non-AND'd req in PCIe spec 7.0, section 6.12.1.2 is:
"ACS P2P Request Redirect: must be implemented by Functions that support peer-to-peer traffic with other
Functions. This includes SR-IOV Virtual Functions (VFs)."
There is not further statement about control of peer-to-peer traffic, just the ability to do so, or not.
Note: ACS P2P Request Redirect.
Later in that section it says:
ACS P2P Completion Redirect: must be implemented by Functions that implement ACS P2P Request Redirect.
That can be read as: if Request Redirect is implemented, then ACS Completion Redirect must be implemented.
IOW, the Completion Redirect control is required if Request Redirect is implemented, and not necessary if
Request Redirect is omitted.
If ACS P2P Request Redirect isn't implemented, then per the first requirement for MFDs,
the PCIe device does not support peer-to-peer traffic amongst its functions or virtual functions.
It goes on...
ACS Direct Translated P2P: must be implemented if the Function supports Address Translation Services (ATS)
and also peer-to-peer traffic with other Functions.
If an MFD does not do peer-to-peer, and P2P Request Redirect would be implemented if it did,
then this ACS control does not have to be implemented either.
Egress control structures are either optional or dependent on Request Redirect &/or Direct Translated P2P control,
which have been addressed above as not needed if there is no peer-to-peer between functions in an MFD (and their VFs).
Now, if previous PCIe spec versions (which I didn't read & re-read & re-read like the 6.12 section of PCIe spec 7.0)
had more IFs and ANDs, then that could be cause for less than clear specmanship enabling vendors of MFDs
to yield a non-PCIe-7.0 conformant MFD wrt ACS structures.
I searched section 6.12.1.2 for if/IF and AND/and, and the search did not yield any conditions not stated above.
> I believe we even reached agreement with some NIC vendors in the early
> days of IOMMU groups that they needed to implement an "empty" ACS
> capability on their multifunction NICs such that they could describe in
> this way that internal P2P is not supported by the device. Thanks,
>
In the early days -- gen1->gen3 (2009->2015) I could see that happening.
I think time (a decade) has closed those defaults to less-common quirks.
If 'empty ACS' is how they liked to do it back then, sure.
[A definition of empty ACS may be needed to fully appreciate that statement, though.]
If this patch series needs to support an 'empty ACS' for this older case, let's add it now,
or follow-up with another fix.
In summary, I still haven't found the IF and AND you refer to in section 6.12.1.2 for MFDs,
so if you want to quote those sections I mis-read, or mis-interpreted their (subtle?) existence,
then I'm not immovable on the spec interpretation.
- Don
> Alex
>
>>
>> For root-ports a PCIe topology like:
>> -- Dev 01:00.0
>> Root 00:00.00 --- Root Port 00:01.0 --|
>> | -- Dev 01:00.1
>> |- Dev 00:17.0
>>
>> Previously would group [00:01.0, 01:00.0, 01:00.1] together if there is no
>> ACS capability in the root port.
>>
>> While ACS on root ports is underspecified in the spec, it should still
>> function as an egress control and limit access to either the MMIO of the
>> root port itself, or perhaps some other devices upstream of the root
>> complex - 00:17.0 perhaps in this example.
>>
>> Historically the grouping in Linux has assumed the root port routes all
>> traffic into the TA/IOMMU and never bypasses the TA to go to other
>> functions in the root complex. Following the new understanding that ACS is
>> required for internal loopback also treat root ports with no ACS
>> capability as lacking internal loopback as well.
>>
>> There is also some confusing spec language about how ACS and SRIOV works
>> which this series does not address.
>>
>>
>> This entire series goes further and makes some additional improvements to
>> the ACS validation found while studying this problem. The groups around a
>> PCIe to PCI bridge are shrunk to not include the PCIe bridge.
>>
>> The last patches implement "ACS Enhanced" on top of it. Due to how ACS
>> Enhanced was defined as a non-backward compatible feature it is important
>> to get SW support out there.
>>
>> Due to the potential of iommu_groups becoming wider and thus non-usable
>> for VFIO this should go to a linux-next tree to give it some more
>> exposure.
>>
>> I have now tested this on a few systems I could get:
>>
>> - Various Intel client systems:
>> * Raptor Lake, with VMD enabled and using the real_dev mechanism
>> * 6/7th generation 100 Series/C320
>> * 5/6th generation 100 Series/C320 with a NIC MFD quirk
>> * Tiger Lake
>> * 5/6th generation Sunrise Point
>>
>> The 6/7th gen system has a root port without an ACS capability and it
>> becomes ungrouped as described above.
>>
>> All systems have changes; the MFDs in the root complex all become ungrouped.
>>
>> - NVIDIA Grace system with 5 different PCI switches from two vendors
>> Bug fix widening the iommu_groups works as expected here
>>
>> This is on github: https://github.com/jgunthorpe/linux/commits/pcie_switch_groups
>>
>> v3:
>> - Rebase to v6.17-rc4
>> - Drop the quirks related patches
>> - Change the MFD logic to process no ACS cap as meaning no internal
>> loopback. This avoids creating non-isolated groups for MFD root ports in
>> common AMD and Intel systems
>> - Fix matching MFDs to ignore SRIOV VFs
>> - Fix some kbuild splats
>> v2: https://patch.msgid.link/r/0-v2-4a9b9c983431+10e2-pcie_switch_groups_jgg@nvidia.com
>> - Revise comments and commit messages
>> - Rename struct pci_alias_set to pci_reachable_set
>> - Make more sense of the special bus->self = NULL case for SRIOV
>> - Add pci_group_alloc_non_isolated() for readability
>> - Rename BUS_DATA_PCI_UNISOLATED to BUS_DATA_PCI_NON_ISOLATED
>> - Propagate BUS_DATA_PCI_NON_ISOLATED downstream from a MFD in case a MFD
>> function is a bridge
>> - New patches to add pci_mfd_isolation() to retain more cases of narrow
>> groups on MFDs with missing ACS.
>> - Redescribe the MFD related change as a bug fix. For a MFD to be
>> isolated all functions must have egress control on their P2P.
>> v1: https://patch.msgid.link/r/0-v1-74184c5043c6+195-pcie_switch_groups_jgg@nvidia.com
>>
>> Cc: galshalom@nvidia.com
>> Cc: tdave@nvidia.com
>> Cc: maorg@nvidia.com
>> Cc: kvm@vger.kernel.org
>> Cc: Ceric Le Goater" <clg@redhat.com>
>> Cc: Donald Dutile <ddutile@redhat.com>
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>
>> Jason Gunthorpe (11):
>> PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED
>> PCI: Add pci_bus_isolated()
>> iommu: Compute iommu_groups properly for PCIe switches
>> iommu: Organize iommu_group by member size
>> PCI: Add pci_reachable_set()
>> iommu: Compute iommu_groups properly for PCIe MFDs
>> iommu: Validate that pci_for_each_dma_alias() matches the groups
>> PCI: Add the ACS Enhanced Capability definitions
>> PCI: Enable ACS Enhanced bits for enable_acs and config_acs
>> PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
>> PCI: Check ACS Extended flags for pci_bus_isolated()
>>
>> drivers/iommu/iommu.c | 510 +++++++++++++++++++++++-----------
>> drivers/pci/ats.c | 4 +-
>> drivers/pci/pci.c | 73 ++++-
>> drivers/pci/search.c | 274 ++++++++++++++++++
>> include/linux/pci.h | 46 +++
>> include/uapi/linux/pci_regs.h | 18 ++
>> 6 files changed, 759 insertions(+), 166 deletions(-)
>>
>>
>> base-commit: b320789d6883cc00ac78ce83bccbfe7ed58afcf0
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS
2025-09-23 1:44 ` Donald Dutile
@ 2025-09-23 2:06 ` Alex Williamson
2025-09-23 2:42 ` Donald Dutile
0 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2025-09-23 2:06 UTC (permalink / raw)
To: Donald Dutile
Cc: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Mon, 22 Sep 2025 21:44:27 -0400
Donald Dutile <ddutile@redhat.com> wrote:
> On 9/22/25 6:39 PM, Alex Williamson wrote:
> > On Fri, 5 Sep 2025 15:06:15 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> >> The series patches have extensive descriptions as to the problem and
> >> solution, but in short the ACS flags are not analyzed according to the
> >> spec to form the iommu_groups that VFIO is expecting for security.
> >>
> >> ACS is an egress control only. For a path the ACS flags on each hop only
> >> affect what other devices the TLP is allowed to reach. It does not prevent
> >> other devices from reaching into this path.
> >>
> >> For VFIO if device A is permitted to access device B's MMIO then A and B
> >> must be grouped together. This says that even if a path has isolating ACS
> >> flags on each hop, off-path devices with non-isolating ACS can still reach
> >> into that path and must be grouped together.
> >>
> >> For switches, a PCIe topology like:
> >>
> >> -- DSP 02:00.0 -> End Point A
> >> Root 00:00.0 -> USP 01:00.0 --|
> >> -- DSP 02:03.0 -> End Point B
> >>
> >> Will generate unique single device groups for every device even if ACS is
> >> not enabled on the two DSP ports. It should at least group A/B together
> >> because no ACS means A can reach the MMIO of B. This is a serious failure
> >> for the VFIO security model.
> >>
> >> For multi-function-devices, a PCIe topology like:
> >>
> >> -- MFD 00:1f.0 ACS not supported
> >> Root 00:00.00 --|- MFD 00:1f.2 ACS not supported
> >> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
> >>
> >> Will group [1f.0, 1f.2] together, while 1f.6 gets a single device group.
> >> However, from a spec perspective each device should get its own group,
> >> because a function with no ACS capability can, per spec, be assumed to
> >> have no internal loopback.
> >
> > I just dug through the thread with Don that I think tries to justify
> > this, but I have a lot of concerns about this. I think the "must be
> > implemented by Functions that support peer-to-peer traffic with other
> > Functions" language is specifying that IF the device implements an ACS
> > capability AND does not implement the specific ACS P2P flag being
> > described, then and only then can we assume that form of P2P is not
> > supported. OTOH, we cannot assume anything regarding internal P2P of an
> > MFD that does not implement an ACS capability at all.
> >
> The first, non-IF'd, non-AND'd req in PCIe spec 7.0, section 6.12.1.2 is:
> "ACS P2P Request Redirect: must be implemented by Functions that
> support peer-to-peer traffic with other Functions. This includes
> SR-IOV Virtual Functions (VFs)." There is not further statement about
> control of peer-to-peer traffic, just the ability to do so, or not.
>
> Note: ACS P2P Request Redirect.
>
> Later in that section it says:
> ACS P2P Completion Redirect: must be implemented by Functions that
> implement ACS P2P Request Redirect.
>
> That can be read as: if Request Redirect is implemented, then ACS
> Completion Redirect must be implemented. IOW, the Completion Redirect
> control is required if Request Redirect is implemented, and not
> necessary if Request Redirect is omitted.
>
> If ACS P2P Request Redirect isn't implemented, then per the first
> requirement for MFDs, the PCIe device does not support peer-to-peer
> traffic amongst its functions or virtual functions.
>
> It goes on...
> ACS Direct Translated P2P: must be implemented if the Function
> supports Address Translation Services (ATS) and also peer-to-peer
> traffic with other Functions.
>
> If an MFD does not do peer-to-peer, and P2P Request Redirect would be
> implemented if it did, then this ACS control does not have to be
> implemented either.
>
> Egress control structures are either optional or dependent on Request
> Redirect &/or Direct Translated P2P control, which have been
> addressed above as not needed if there is no peer-to-peer between functions in an
> MFD (and their VFs).
>
>
> Now, if previous PCIe spec versions (which I didn't read & re-read &
> re-read like the 6.12 section of PCIe spec 7.0) had more IFs and ANDs,
> then that could be cause for less than clear specmanship enabling
> vendors of MFDs to yield a non-PCIe-7.0 conformant MFD wrt ACS
> structures. I searched section 6.12.1.2 for if/IF and AND/and, and the
> search did not yield any conditions not stated above.
Back up to 6.12.1:
ACS functionality is reported and managed via ACS Extended Capability
structures. PCI Express components are permitted to implement ACS
Extended Capability structures in some, none, or all of their
applicable Functions. The extent of what is implemented is
communicated through capability bits in each ACS Extended Capability
structure. A given Function with an ACS Extended Capability structure
may be required or forbidden to implement certain capabilities,
depending upon the specific type of the Function and whether it is
part of a Multi-Function Device.
What you're quoting are the requirements for the individual p2p
capabilities IF the ACS extended capability is implemented.
Section 6.12.1.1 describing ACS for downstream ports begins:
This section applies to Root Ports and Switch Downstream Ports that
implement an ACS Extended Capability structure.
Section 6.12.1.2 for SR-IOV, SIOV and MFDs begins:
This section applies to Multi-Function Device ACS Functions, with the
exception of Downstream Port Functions, which are covered in the
preceding section.
While not as explicit, what is a Multi-Function Device ACS Function if
not a function of an MFD that implements ACS?
> > I believe we even reached agreement with some NIC vendors in the
> > early days of IOMMU groups that they needed to implement an "empty"
> > ACS capability on their multifunction NICs such that they could
> > describe in this way that internal P2P is not supported by the
> > device. Thanks,
> In the early days -- gen1->gen3 (2009->2015) I could see that
> happening. I think time (a decade) has closed those defaults to
> less-common quirks. If 'empty ACS' is how they liked to do it back
> then, sure. [A definition of empty ACS may be needed to fully
> appreciate that statement, though.] If this patch series needs to
> support an 'empty ACS' for this older case, let's add it now, or
> follow-up with another fix.
An "empty" ACS capability is an ACS extended capability where the ACS
capability register reads as zero, precisely to match the spec in
indicating that the device does not support p2p. Again, I don't see
how time passing plays a role here. A MFD must implement ACS to infer
anything about internal p2p behavior.
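In code terms, that convention is distinguishable from a missing capability (hypothetical userspace sketch, not kernel code):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the convention described above: an "empty"
 * ACS capability is present but its capability register reads as zero,
 * advertising that the function supports none of the P2P mechanisms.
 * A missing capability is a different, uninformative state.
 */
struct toy_acs {
	bool cap_present;
	uint16_t cap_reg;
};

static bool is_empty_acs(const struct toy_acs *acs)
{
	return acs->cap_present && acs->cap_reg == 0;
}
```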
> In summary, I still haven't found the IF and AND you refer to in
> section 6.12.1.2 for MFDs, so if you want to quote those sections I
> mis-read, or mis-interpreted their (subtle?) existence, then I'm not
> immovable on the spec interpretation.
As above, I think it's covered by 6.12.1 and the introductory sentence
of 6.12.1.2 defining the requirements for ACS functions. Thanks,
Alex
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS
2025-09-23 2:06 ` Alex Williamson
@ 2025-09-23 2:42 ` Donald Dutile
2025-09-23 22:23 ` Alex Williamson
0 siblings, 1 reply; 52+ messages in thread
From: Donald Dutile @ 2025-09-23 2:42 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 9/22/25 10:06 PM, Alex Williamson wrote:
> On Mon, 22 Sep 2025 21:44:27 -0400
> Donald Dutile <ddutile@redhat.com> wrote:
>
>> On 9/22/25 6:39 PM, Alex Williamson wrote:
>>> On Fri, 5 Sep 2025 15:06:15 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>
>>>> The series patches have extensive descriptions as to the problem and
>>>> solution, but in short the ACS flags are not analyzed according to the
>>>> spec to form the iommu_groups that VFIO is expecting for security.
>>>>
>>>> ACS is an egress control only. For a path the ACS flags on each hop only
>>>> affect what other devices the TLP is allowed to reach. It does not prevent
>>>> other devices from reaching into this path.
>>>>
>>>> For VFIO if device A is permitted to access device B's MMIO then A and B
>>>> must be grouped together. This says that even if a path has isolating ACS
>>>> flags on each hop, off-path devices with non-isolating ACS can still reach
>>>> into that path and must be grouped together.
>>>>
>>>> For switches, a PCIe topology like:
>>>>
>>>> -- DSP 02:00.0 -> End Point A
>>>> Root 00:00.0 -> USP 01:00.0 --|
>>>> -- DSP 02:03.0 -> End Point B
>>>>
>>>> Will generate unique single device groups for every device even if ACS is
>>>> not enabled on the two DSP ports. It should at least group A/B together
>>>> because no ACS means A can reach the MMIO of B. This is a serious failure
>>>> for the VFIO security model.
>>>>
>>>> For multi-function-devices, a PCIe topology like:
>>>>
>>>> -- MFD 00:1f.0 ACS not supported
>>>> Root 00:00.00 --|- MFD 00:1f.2 ACS not supported
>>>> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
>>>>
>>>> Will group [1f.0, 1f.2] together, while 1f.6 gets a single device group.
>>>> However, from a spec perspective each device should get its own group,
>>>> because a function with no ACS capability can, per spec, be assumed to
>>>> have no internal loopback.
>>>
>>> I just dug through the thread with Don that I think tries to justify
>>> this, but I have a lot of concerns about this. I think the "must be
>>> implemented by Functions that support peer-to-peer traffic with other
>>> Functions" language is specifying that IF the device implements an ACS
>>> capability AND does not implement the specific ACS P2P flag being
>>> described, then and only then can we assume that form of P2P is not
>>> supported. OTOH, we cannot assume anything regarding internal P2P of an
>>> MFD that does not implement an ACS capability at all.
>>>
>> The first, non-IF'd, non-AND'd req in PCIe spec 7.0, section 6.12.1.2 is:
>> "ACS P2P Request Redirect: must be implemented by Functions that
>> support peer-to-peer traffic with other Functions. This includes
>> SR-IOV Virtual Functions (VFs)." There is no further statement about
>> control of peer-to-peer traffic, just the ability to do so, or not.
>>
>> Note: ACS P2P Request Redirect.
>>
>> Later in that section it says:
>> ACS P2P Completion Redirect: must be implemented by Functions that
>> implement ACS P2P Request Redirect.
>>
>> That can be read as 'IF Request Redirect is implemented, then ACS P2P
>> Completion Redirect must be implemented.' IOW, the Completion Redirect
>> control is required if Request Redirect is implemented, and not
>> necessary if Request Redirect is omitted.
>>
>> If ACS P2P Request Redirect isn't implemented, then per the first
>> requirement for MFDs, the PCIe device does not support peer-to-peer
>> traffic amongst its functions or virtual functions.
>>
>> It goes on...
>> ACS Direct Translated P2P: must be implemented if the Function
>> supports Address Translation Services (ATS) and also peer-to-peer
>> traffic with other Functions.
>>
>> If an MFD does not do peer-to-peer, and P2P Request Redirect would be
>> implemented if it did, then this ACS control does not have to be
>> implemented either.
>>
>> Egress control structures are either optional or dependent on Request
>> Redirect &/or Direct Translated P2P control, which have been
>> addressed above as not needed if there is no peer-to-peer btwn functions in an
>> MFD (and their VFs).
>>
>>
>> Now, if previous PCIe spec versions (which I didn't read & re-read &
>> re-read like the 6.12 section of PCIe spec 7.0) had more IF and ANDs,
>> then that could be cause for less-than-clear specmanship enabling
>> vendors of MFDs to yield a non-PCIe-7.0 conformant MFD wrt ACS
>> structures. I searched section 6.12.1.2 for if/IF and AND/and, and
>> did not find any conditions not stated above.
>
> Back up to 6.12.1:
>
> ACS functionality is reported and managed via ACS Extended Capability
> structures. PCI Express components are permitted to implement ACS
> Extended Capability structures in some, none, or all of their
> applicable Functions. The extent of what is implemented is
> communicated through capability bits in each ACS Extended Capability
> structure. A given Function with an ACS Extended Capability structure
> may be required or forbidden to implement certain capabilities,
> depending upon the specific type of the Function and whether it is
> part of a Multi-Function Device.
>
Right, depending on type of function or part of MFD.
Maybe I mis-understood your point, or vice-versa:
section 6.12.1.2 is for MFDs, and I was only discussing MFD ACS structs.
I did not mean to imply the sections I was quoting were for anything but an MFD.
> What you're quoting are the requirements for the individual p2p
> capabilities IF the ACS extended capability is implemented.
>
No, I'm not. I'm quoting 6.12.1.2, which is MFD-specific.
> Section 6.12.1.1 describing ACS for downstream ports begins:
>
> This section applies to Root Ports and Switch Downstream Ports that
> implement an ACS Extended Capability structure.
>
> Section 6.12.1.2 for SR-IOV, SIOV and MFDs begins:
>
> This section applies to Multi-Function Device ACS Functions, with the
> exception of Downstream Port Functions, which are covered in the
> preceding section.
>
Right. I wasn't discussing Downstream port functions.
> While not as explicit, what is a Multi-Function Device ACS Function if
> not a function of a MFD that implements ACS?
>
I think you are reading too much into that less-than-optimally worded section.
>>> I believe we even reached agreement with some NIC vendors in the
>>> early days of IOMMU groups that they needed to implement an "empty"
>>> ACS capability on their multifunction NICs such that they could
>>> describe in this way that internal P2P is not supported by the
>>> device. Thanks,
>> In the early days -- gen1->gen3 (2009->2015) I could see that
>> happening. I think time (a decade) has closed those defaults to
>> less-common quirks. If 'empty ACS' is how they liked to do it back
>> then, sure. [A definition of empty ACS may be needed to fully
>> appreciate that statement, though.] If this patch series needs to
>> support an 'empty ACS' for this older case, let's add it now, or
>> follow-up with another fix.
>
> An "empty" ACS capability is an ACS extended capability where the ACS
> capability register reads as zero, precisely to match the spec in
> indicating that the device does not support p2p. Again, I don't see
> how time passing plays a role here. A MFD must implement ACS to infer
> anything about internal p2p behavior.
>
Again, I don't read the 'must' in the spec.
Although I'll agree that your definition of an empty ACS makes it unambiguous.
>> In summary, I still haven't found the IF and AND you refer to in
>> section 6.12.1.2 for MFDs, so if you want to quote those sections I
>> mis-read, or mis-interpreted their (subtle?) existence, then I'm not
>> immovable on the spec interpretation.
>
> As above, I think it's covered by 6.12.1 and the introductory sentence
> of 6.12.1.2 defining the requirements for ACS functions. Thanks,
>
6.12.1 is not specific enough about what MFDs must or must not support;
it's a broad description of ACS in different PCIe functions.
As for 6.12.1.2, I stand by the statement that ACS P2P Request Redirect
must be implemented if peer-to-peer is implemented in an MFD.
It's not inferred, and it's not ambiguous.
You are interpreting the first sentence in 6.12.1.2 as indirectly saying
that the reqs only apply to an MFD with ACS. The title of the section is
"ACS Functions in SR-IOV, SIOV, and Multi-Function Devices", not
"ACS requirements for ACS-controlled SR-IOV, SIOV, and Multi-Function Devices",
in which case, I could agree with the interpretation you gave of that first sentence.
I think it's time to reach out to the PCI-SIG, and the authors of this section
to dissect these interpretations and get some clarity.
- Don
> Alex
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS
2025-09-23 2:42 ` Donald Dutile
@ 2025-09-23 22:23 ` Alex Williamson
2025-09-30 15:23 ` Donald Dutile
0 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2025-09-23 22:23 UTC (permalink / raw)
To: Donald Dutile
Cc: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Mon, 22 Sep 2025 22:42:37 -0400
Donald Dutile <ddutile@redhat.com> wrote:
> On 9/22/25 10:06 PM, Alex Williamson wrote:
> > On Mon, 22 Sep 2025 21:44:27 -0400
> > Donald Dutile <ddutile@redhat.com> wrote:
> >
> >> On 9/22/25 6:39 PM, Alex Williamson wrote:
> >>> On Fri, 5 Sep 2025 15:06:15 -0300
> >>> Jason Gunthorpe <jgg@nvidia.com> wrote:
> >>>
> >>>> The series patches have extensive descriptions as to the problem and
> >>>> solution, but in short the ACS flags are not analyzed according to the
> >>>> spec to form the iommu_groups that VFIO is expecting for security.
> >>>>
> >>>> ACS is an egress control only. For a path the ACS flags on each hop only
> >>>> affect what other devices the TLP is allowed to reach. It does not prevent
> >>>> other devices from reaching into this path.
> >>>>
> >>>> For VFIO if device A is permitted to access device B's MMIO then A and B
> >>>> must be grouped together. This says that even if a path has isolating ACS
> >>>> flags on each hop, off-path devices with non-isolating ACS can still reach
> >>>> into that path and must be grouped together.
> >>>>
> >>>> For switches, a PCIe topology like:
> >>>>
> >>>> -- DSP 02:00.0 -> End Point A
> >>>> Root 00:00.0 -> USP 01:00.0 --|
> >>>> -- DSP 02:03.0 -> End Point B
> >>>>
> >>>> Will generate unique single device groups for every device even if ACS is
> >>>> not enabled on the two DSP ports. It should at least group A/B together
> >>>> because no ACS means A can reach the MMIO of B. This is a serious failure
> >>>> for the VFIO security model.
> >>>>
> >>>> For multi-function-devices, a PCIe topology like:
> >>>>
> >>>> -- MFD 00:1f.0 ACS not supported
> >>>> Root 00:00.00 --|- MFD 00:1f.2 ACS not supported
> >>>> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
> >>>>
> >>>> Will group [1f.0, 1f.2] together while 1f.6 gets a single-device group. However,
> >>>> from a spec perspective each device should get its own group, because when ACS
> >>>> is not supported the spec lets us assume no loopback is possible.
> >>>
> >>> I just dug through the thread with Don that I think tries to justify
> >>> this, but I have a lot of concerns about this. I think the "must be
> >>> implemented by Functions that support peer-to-peer traffic with other
> >>> Functions" language is specifying that IF the device implements an ACS
> >>> capability AND does not implement the specific ACS P2P flag being
> >>> described, then and only then can we assume that form of P2P is not
> >>> supported. OTOH, we cannot assume anything regarding internal P2P of an
> >>> MFD that does not implement an ACS capability at all.
> >>>
> >> The first, non-IF'd, non-AND'd req in PCIe spec 7.0, section 6.12.1.2 is:
> >> "ACS P2P Request Redirect: must be implemented by Functions that
> >> support peer-to-peer traffic with other Functions. This includes
> >> SR-IOV Virtual Functions (VFs)." There is no further statement about
> >> control of peer-to-peer traffic, just the ability to do so, or not.
> >>
> >> Note: ACS P2P Request Redirect.
> >>
> >> Later in that section it says:
> >> ACS P2P Completion Redirect: must be implemented by Functions that
> >> implement ACS P2P Request Redirect.
> >>
> >> That can be read as 'IF Request Redirect is implemented, then ACS P2P
> >> Completion Redirect must be implemented.' IOW, the Completion Redirect
> >> control is required if Request Redirect is implemented, and not
> >> necessary if Request Redirect is omitted.
> >>
> >> If ACS P2P Request Redirect isn't implemented, then per the first
> >> requirement for MFDs, the PCIe device does not support peer-to-peer
> >> traffic amongst its functions or virtual functions.
> >>
> >> It goes on...
> >> ACS Direct Translated P2P: must be implemented if the Function
> >> supports Address Translation Services (ATS) and also peer-to-peer
> >> traffic with other Functions.
> >>
> >> If an MFD does not do peer-to-peer, and P2P Request Redirect would be
> >> implemented if it did, then this ACS control does not have to be
> >> implemented either.
> >>
> >> Egress control structures are either optional or dependent on Request
> >> Redirect &/or Direct Translated P2P control, which have been
> >> addressed above as not needed if there is no peer-to-peer btwn functions in an
> >> MFD (and their VFs).
> >>
> >>
> >> Now, if previous PCIe spec versions (which I didn't read & re-read &
> >> re-read like the 6.12 section of PCIe spec 7.0) had more IF and ANDs,
> >> then that could be cause for less-than-clear specmanship enabling
> >> vendors of MFDs to yield a non-PCIe-7.0 conformant MFD wrt ACS
> >> structures. I searched section 6.12.1.2 for if/IF and AND/and, and
> >> did not find any conditions not stated above.
> >
> > Back up to 6.12.1:
> >
> > ACS functionality is reported and managed via ACS Extended Capability
> > structures. PCI Express components are permitted to implement ACS
> > Extended Capability structures in some, none, or all of their
> > applicable Functions. The extent of what is implemented is
> > communicated through capability bits in each ACS Extended Capability
> > structure. A given Function with an ACS Extended Capability structure
> > may be required or forbidden to implement certain capabilities,
> > depending upon the specific type of the Function and whether it is
> > part of a Multi-Function Device.
> >
> Right, depending on type of function or part of MFD.
> Maybe I mis-understood your point, or vice-versa:
> section 6.12.1.2 is for MFDs, and I was only discussing MFD ACS structs.
> I did not mean to imply the sections I was quoting were for anything but an MFD.
I'm really going after the first half of that last sentence rather than
any specific device type:
A given Function with an ACS Extended Capability structure may be
required or forbidden to implement certain capabilities...
"...[WITH] an ACS Extended Capability structure..."
"implement certain capabilities" is referring to the capabilities
exposed within the capability register of the overall ACS extended
capability structure.
Therefore when section 6.12.1.2 goes on to say:
ACS P2P Request Redirect: must be implemented by Functions that
support peer-to-peer traffic with other Functions.
This is saying this type of function _with_ an ACS extended capability
(carrying forward from 6.12.1) must implement the p2p RR bit of the ACS
capability register (a specific bit within the register, not the ACS
extended capability) if it is capable of p2p traffic with other
functions. We can only infer the function is not capable of p2p traffic
with other functions if it both implements an ACS extended capability
and the p2p RR bit of the capability register is zero.
> > What you're quoting are the requirements for the individual p2p
> > capabilities IF the ACS extended capability is implemented.
> >
> No, I'm not. I'm quoting 6.12.1.2, which is MFD-specific.
>
> > Section 6.12.1.1 describing ACS for downstream ports begins:
> >
> > This section applies to Root Ports and Switch Downstream Ports
> > that implement an ACS Extended Capability structure.
> >
> > Section 6.12.1.2 for SR-IOV, SIOV and MFDs begins:
> >
> > This section applies to Multi-Function Device ACS Functions,
> > with the exception of Downstream Port Functions, which are
> > covered in the preceding section.
> >
> Right. I wasn't discussing Downstream port functions.
>
> > While not as explicit, what is a Multi-Function Device ACS Function
> > if not a function of a MFD that implements ACS?
> >
> I think you are reading too much into that less-than-optimally
> worded section.
>
> >>> I believe we even reached agreement with some NIC vendors in the
> >>> early days of IOMMU groups that they needed to implement an
> >>> "empty" ACS capability on their multifunction NICs such that
> >>> they could describe in this way that internal P2P is not
> >>> supported by the device. Thanks,
> >> In the early days -- gen1->gen3 (2009->2015) I could see that
> >> happening. I think time (a decade) has closed those defaults to
> >> less-common quirks. If 'empty ACS' is how they liked to do it back
> >> then, sure. [A definition of empty ACS may be needed to fully
> >> appreciate that statement, though.] If this patch series needs to
> >> support an 'empty ACS' for this older case, let's add it now, or
> >> follow-up with another fix.
> >
> > An "empty" ACS capability is an ACS extended capability where the
> > ACS capability register reads as zero, precisely to match the
> > spec in indicating that the device does not support p2p. Again,
> > I don't see how time passing plays a role here. A MFD must
> > implement ACS to infer anything about internal p2p behavior.
> >
> Again, I don't read the 'must' in the spec.
> Although I'll agree that your definition of an empty ACS makes it
> unambiguous.
>
> >> In summary, I still haven't found the IF and AND you refer to in
> >> section 6.12.1.2 for MFDs, so if you want to quote those sections I
> >> mis-read, or mis-interpreted their (subtle?) existence, then I'm
> >> not immovable on the spec interpretation.
> >
> > As above, I think it's covered by 6.12.1 and the introductory
> > sentence of 6.12.1.2 defining the requirements for ACS functions.
> > Thanks,
> 6.12.1 is not specific enough about what MFDs must or must not
> support; it's a broad description of ACS in different PCIe
> functions. As for 6.12.1.2, I stand by the statement that ACS P2P
> Request Redirect must be implemented if peer-to-peer is implemented
> in an MFD. It's not inferred, and it's not ambiguous.
> You are interpreting the first sentence in 6.12.1.2 as indirectly
> saying that the reqs only apply to an MFD with ACS. The title of
> the section is "ACS Functions in SR-IOV, SIOV, and Multi-Function
> Devices", not "ACS requirements for ACS-controlled SR-IOV, SIOV, and
> Multi-Function Devices", in which case, I could agree with the
> interpretation you gave of that first sentence.
>
> I think it's time to reach out to the PCI-SIG, and the authors of
> this section to dissect these interpretations and get some clarity.
You're welcome to. I think it's sufficiently clear.
The specification is stating that if a function exposes an ACS extended
capability and the function supports p2p with other functions, it must
implement that specific bit in the ACS capability register of the ACS
extended capability.
Therefore if the function implements an ACS extended capability and
does not implement this bit in the ACS capability register, we can
infer that the device is not capable of p2p with other functions.
It would violate the spec otherwise.
However, if the function does not implement an ACS extended capability,
we can infer nothing.
It's logically impossible for the specification to add an optional
extended capability where the lack of a function implementing the
extended capability implies some specific behavior. It's not backwards
compatible, which is a fundamental requirement of the PCI spec. Thanks,
Alex
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS
2025-09-23 22:23 ` Alex Williamson
@ 2025-09-30 15:23 ` Donald Dutile
2025-09-30 16:21 ` Jason Gunthorpe
0 siblings, 1 reply; 52+ messages in thread
From: Donald Dutile @ 2025-09-30 15:23 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
note: I removed Joerg's suse email address & Tony's intel address as I keep getting numerous undeliverable emails when those are included in cc:
On 9/23/25 6:23 PM, Alex Williamson wrote:
> On Mon, 22 Sep 2025 22:42:37 -0400
> Donald Dutile <ddutile@redhat.com> wrote:
>
>> On 9/22/25 10:06 PM, Alex Williamson wrote:
>>> On Mon, 22 Sep 2025 21:44:27 -0400
>>> Donald Dutile <ddutile@redhat.com> wrote:
>>>
>>>> On 9/22/25 6:39 PM, Alex Williamson wrote:
>>>>> On Fri, 5 Sep 2025 15:06:15 -0300
>>>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>>>
>>>>>> The series patches have extensive descriptions as to the problem and
>>>>>> solution, but in short the ACS flags are not analyzed according to the
>>>>>> spec to form the iommu_groups that VFIO is expecting for security.
>>>>>>
>>>>>> ACS is an egress control only. For a path the ACS flags on each hop only
>>>>>> affect what other devices the TLP is allowed to reach. It does not prevent
>>>>>> other devices from reaching into this path.
>>>>>>
>>>>>> For VFIO if device A is permitted to access device B's MMIO then A and B
>>>>>> must be grouped together. This says that even if a path has isolating ACS
>>>>>> flags on each hop, off-path devices with non-isolating ACS can still reach
>>>>>> into that path and must be grouped together.
>>>>>>
>>>>>> For switches, a PCIe topology like:
>>>>>>
>>>>>> -- DSP 02:00.0 -> End Point A
>>>>>> Root 00:00.0 -> USP 01:00.0 --|
>>>>>> -- DSP 02:03.0 -> End Point B
>>>>>>
>>>>>> Will generate unique single device groups for every device even if ACS is
>>>>>> not enabled on the two DSP ports. It should at least group A/B together
>>>>>> because no ACS means A can reach the MMIO of B. This is a serious failure
>>>>>> for the VFIO security model.
>>>>>>
>>>>>> For multi-function-devices, a PCIe topology like:
>>>>>>
>>>>>> -- MFD 00:1f.0 ACS not supported
>>>>>> Root 00:00.00 --|- MFD 00:1f.2 ACS not supported
>>>>>> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
>>>>>>
>>>>>> Will group [1f.0, 1f.2] together while 1f.6 gets a single-device group. However,
>>>>>> from a spec perspective each device should get its own group, because when ACS
>>>>>> is not supported the spec lets us assume no loopback is possible.
>>>>>
>>>>> I just dug through the thread with Don that I think tries to justify
>>>>> this, but I have a lot of concerns about this. I think the "must be
>>>>> implemented by Functions that support peer-to-peer traffic with other
>>>>> Functions" language is specifying that IF the device implements an ACS
>>>>> capability AND does not implement the specific ACS P2P flag being
>>>>> described, then and only then can we assume that form of P2P is not
>>>>> supported. OTOH, we cannot assume anything regarding internal P2P of an
>>>>> MFD that does not implement an ACS capability at all.
>>>>>
>>>> The first, non-IF'd, non-AND'd req in PCIe spec 7.0, section 6.12.1.2 is:
>>>> "ACS P2P Request Redirect: must be implemented by Functions that
>>>> support peer-to-peer traffic with other Functions. This includes
>>>> SR-IOV Virtual Functions (VFs)." There is no further statement about
>>>> control of peer-to-peer traffic, just the ability to do so, or not.
>>>>
>>>> Note: ACS P2P Request Redirect.
>>>>
>>>> Later in that section it says:
>>>> ACS P2P Completion Redirect: must be implemented by Functions that
>>>> implement ACS P2P Request Redirect.
>>>>
>>>> That can be read as 'IF Request Redirect is implemented, then ACS P2P
>>>> Completion Redirect must be implemented.' IOW, the Completion Redirect
>>>> control is required if Request Redirect is implemented, and not
>>>> necessary if Request Redirect is omitted.
>>>>
>>>> If ACS P2P Request Redirect isn't implemented, then per the first
>>>> requirement for MFDs, the PCIe device does not support peer-to-peer
>>>> traffic amongst its functions or virtual functions.
>>>>
>>>> It goes on...
>>>> ACS Direct Translated P2P: must be implemented if the Function
>>>> supports Address Translation Services (ATS) and also peer-to-peer
>>>> traffic with other Functions.
>>>>
>>>> If an MFD does not do peer-to-peer, and P2P Request Redirect would be
>>>> implemented if it did, then this ACS control does not have to be
>>>> implemented either.
>>>>
>>>> Egress control structures are either optional or dependent on Request
>>>> Redirect &/or Direct Translated P2P control, which have been
>>>> addressed above as not needed if there is no peer-to-peer btwn functions in an
>>>> MFD (and their VFs).
>>>>
>>>>
>>>> Now, if previous PCIe spec versions (which I didn't read & re-read &
>>>> re-read like the 6.12 section of PCIe spec 7.0) had more IF and ANDs,
>>>> then that could be cause for less-than-clear specmanship enabling
>>>> vendors of MFDs to yield a non-PCIe-7.0 conformant MFD wrt ACS
>>>> structures. I searched section 6.12.1.2 for if/IF and AND/and, and
>>>> did not find any conditions not stated above.
>>>
>>> Back up to 6.12.1:
>>>
>>> ACS functionality is reported and managed via ACS Extended Capability
>>> structures. PCI Express components are permitted to implement ACS
>>> Extended Capability structures in some, none, or all of their
>>> applicable Functions. The extent of what is implemented is
>>> communicated through capability bits in each ACS Extended Capability
>>> structure. A given Function with an ACS Extended Capability structure
>>> may be required or forbidden to implement certain capabilities,
>>> depending upon the specific type of the Function and whether it is
>>> part of a Multi-Function Device.
>>>
>> Right, depending on type of function or part of MFD.
>> Maybe I mis-understood your point, or vice-versa:
>> section 6.12.1.2 is for MFDs, and I was only discussing MFD ACS structs.
>> I did not mean to imply the sections I was quoting were for anything but an MFD.
>
> I'm really going after the first half of that last sentence rather than
> any specific device type:
>
> A given Function with an ACS Extended Capability structure may be
> required or forbidden to implement certain capabilities...
>
> "...[WITH] an ACS Extended Capability structure..."
>
> "implement certain capabilities" is referring to the capabilities
> exposed within the capability register of the overall ACS extended
> capability structure.
>
> Therefore when section 6.12.1.2 goes on to say:
>
> ACS P2P Request Redirect: must be implemented by Functions that
> support peer-to-peer traffic with other Functions.
>
> This is saying this type of function _with_ an ACS extended capability
> (carrying forward from 6.12.1) must implement the p2p RR bit of the ACS
> capability register (a specific bit within the register, not the ACS
> extended capability) if it is capable of p2p traffic with other
> functions. We can only infer the function is not capable of p2p traffic
> with other functions if it both implements an ACS extended capability
> and the p2p RR bit of the capability register is zero.
>
>>> What you're quoting are the requirements for the individual p2p
>>> capabilities IF the ACS extended capability is implemented.
>>>
>> No, I'm not. I'm quoting 6.12.1.2, which is MFD-specific.
>>
>>> Section 6.12.1.1 describing ACS for downstream ports begins:
>>>
>>> This section applies to Root Ports and Switch Downstream Ports
>>> that implement an ACS Extended Capability structure.
>>>
>>> Section 6.12.1.2 for SR-IOV, SIOV and MFDs begins:
>>>
>>> This section applies to Multi-Function Device ACS Functions,
>>> with the exception of Downstream Port Functions, which are
>>> covered in the preceding section.
>>>
>> Right. I wasn't discussing Downstream port functions.
>>
>>> While not as explicit, what is a Multi-Function Device ACS Function
>>> if not a function of a MFD that implements ACS?
>>>
>> I think you are reading too much into that less-than-optimally
>> worded section.
>>
>>>>> I believe we even reached agreement with some NIC vendors in the
>>>>> early days of IOMMU groups that they needed to implement an
>>>>> "empty" ACS capability on their multifunction NICs such that
>>>>> they could describe in this way that internal P2P is not
>>>>> supported by the device. Thanks,
>>>> In the early days -- gen1->gen3 (2009->2015) I could see that
>>>> happening. I think time (a decade) has closed those defaults to
>>>> less-common quirks. If 'empty ACS' is how they liked to do it back
>>>> then, sure. [A definition of empty ACS may be needed to fully
>>>> appreciate that statement, though.] If this patch series needs to
>>>> support an 'empty ACS' for this older case, let's add it now, or
>>>> follow-up with another fix.
>>>
>>> An "empty" ACS capability is an ACS extended capability where the
>>> ACS capability register reads as zero, precisely to match the
>>> spec in indicating that the device does not support p2p. Again,
>>> I don't see how time passing plays a role here. A MFD must
>>> implement ACS to infer anything about internal p2p behavior.
>>>
>> Again, I don't read the 'must' in the spec.
>> Although I'll agree that your definition of an empty ACS makes it
>> unambiguous.
>>
>>>> In summary, I still haven't found the IF and AND you refer to in
>>>> section 6.12.1.2 for MFDs, so if you want to quote those sections I
>>>> mis-read, or mis-interpreted their (subtle?) existence, then I'm
>>>> not immovable on the spec interpretation.
>>>
>>> As above, I think it's covered by 6.12.1 and the introductory
>>> sentence of 6.12.1.2 defining the requirements for ACS functions.
>>> Thanks,
>> 6.12.1 is not specific enough about what MFDs must or must not
>> support; it's a broad description of ACS in different PCIe
>> functions. As for 6.12.1.2, I stand by the statement that ACS P2P
>> Request Redirect must be implemented if peer-to-peer is implemented
>> in an MFD. It's not inferred, and it's not ambiguous.
>> You are interpreting the first sentence in 6.12.1.2 as indirectly
>> saying that the reqs only apply to an MFD with ACS. The title of
>> the section is "ACS Functions in SR-IOV, SIOV, and Multi-Function
>> Devices", not "ACS requirements for ACS-controlled SR-IOV, SIOV, and
>> Multi-Function Devices", in which case, I could agree with the
>> interpretation you gave of that first sentence.
>>
>> I think it's time to reach out to the PCI-SIG, and the authors of
>> this section to dissect these interpretations and get some clarity.
>
> You're welcome to. I think it's sufficiently clear.
>
> The specification is stating that if a function exposes an ACS extended
> capability and the function supports p2p with other functions, it must
> implement that specific bit in the ACS capability register of the ACS
> extended capability.
>
> Therefore if the function implements an ACS extended capability and
> does not implement this bit in the ACS capability register, we can
> infer that the device is not capable of p2p with other functions.
> It would violate the spec otherwise.
>
> However, if the function does not implement an ACS extended capability,
> we can infer nothing.
>
> It's logically impossible for the specification to add an optional
> extended capability where the lack of a function implementing the
> extended capability implies some specific behavior. It's not backwards
> compatible, which is a fundamental requirement of the PCI spec. Thanks,
>
> Alex
>
This is boiling down to an interpretation of the spec.
If the latest PCI-v7 spec is not backward compatible, i.e., a function within an MFD
that lacks an ACS struct must not be assumed isolated from other functions within the
MFD that lack an ACS struct, then the current Linux implementation/interpretation,
and the need for the ACS quirks when hw vendors improperly omit the ACS structure,
is the safest/most secure route to go.
Historical/legacy implementations of MFDs without ACS structs have bolstered that
position/interpretation/agreed-to-required-ACS-quirks implementation.
The small number of ACS quirks also seems to support that past interpretation.
I believe a definitive answer from the PCI-SIG would be best, especially wrt
backward compatibility. Such a review & feedback is likely to take quite some
time. Thus, taking the current, conservative approach for omitted ACS structs on
MFD functions would allow this series to progress with the numerous other bug fixes
that are needed, with a minor change to the MFD iommu-group check/creation function
to use ACS quirks to create better isolation groups when a hw vendor interprets and
implements no ACS as having no p2p connectivity.
- Don
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS
2025-09-30 15:23 ` Donald Dutile
@ 2025-09-30 16:21 ` Jason Gunthorpe
0 siblings, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2025-09-30 16:21 UTC (permalink / raw)
To: Donald Dutile
Cc: Alex Williamson, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Tue, Sep 30, 2025 at 11:23:06AM -0400, Donald Dutile wrote:
> This is boiling down to an interpretation of the spec.
I think we all agree a reasonable interpretation of the spec is that no
ACS means *undefined*.
But Linux cannot work on undefined. It needs to assume a definition
because it needs to make security decisions.
So I think this is entirely a Linux discussion about how we should
respond to this undefined behavior in the spec.
Every heuristic option should be considered even if not directly
supported by the spec. Our current behavior isn't strongly spec
supported either :\
> I believe a definitive answer from the PCI-SIG would be best, especially wrt
> backward compatibility.
IMHO they will say it is undefined. Devices are allowed to implement
internal loopback and are also allowed to block internal loopback.
This position is useless for the OS, but it is how PCI SIG has acted
historically. They don't retroactively make existing devices
non-compliant.
Jason
end of thread, other threads:[~2025-09-30 16:21 UTC | newest]
Thread overview: 52+ messages
2025-09-05 18:06 [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
2025-09-05 18:06 ` [PATCH v3 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
2025-09-09 4:08 ` Donald Dutile
2025-09-05 18:06 ` [PATCH v3 02/11] PCI: Add pci_bus_isolated() Jason Gunthorpe
2025-09-09 4:09 ` Donald Dutile
2025-09-09 19:54 ` Bjorn Helgaas
2025-09-09 21:21 ` Jason Gunthorpe
2025-09-05 18:06 ` [PATCH v3 03/11] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
2025-09-09 4:14 ` Donald Dutile
2025-09-09 12:18 ` Jason Gunthorpe
2025-09-09 19:33 ` Donald Dutile
2025-09-09 20:27 ` Bjorn Helgaas
2025-09-09 21:21 ` Jason Gunthorpe
2025-09-05 18:06 ` [PATCH v3 04/11] iommu: Organize iommu_group by member size Jason Gunthorpe
2025-09-09 4:16 ` Donald Dutile
2025-09-05 18:06 ` [PATCH v3 05/11] PCI: Add pci_reachable_set() Jason Gunthorpe
2025-09-09 21:03 ` Bjorn Helgaas
2025-09-10 16:13 ` Jason Gunthorpe
2025-09-11 19:56 ` Donald Dutile
2025-09-15 13:38 ` Jason Gunthorpe
2025-09-15 14:32 ` Donald Dutile
2025-09-05 18:06 ` [PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs Jason Gunthorpe
2025-09-09 4:57 ` Donald Dutile
2025-09-09 13:31 ` Jason Gunthorpe
2025-09-09 19:55 ` Donald Dutile
2025-09-09 21:24 ` Bjorn Helgaas
2025-09-09 23:20 ` Jason Gunthorpe
2025-09-10 1:59 ` Donald Dutile
2025-09-10 17:43 ` Jason Gunthorpe
2025-09-05 18:06 ` [PATCH v3 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
2025-09-09 5:00 ` Donald Dutile
2025-09-09 15:35 ` Jason Gunthorpe
2025-09-09 19:58 ` Donald Dutile
2025-09-05 18:06 ` [PATCH v3 08/11] PCI: Add the ACS Enhanced Capability definitions Jason Gunthorpe
2025-09-09 5:01 ` Donald Dutile
2025-09-05 18:06 ` [PATCH v3 09/11] PCI: Enable ACS Enhanced bits for enable_acs and config_acs Jason Gunthorpe
2025-09-09 5:01 ` Donald Dutile
2025-09-05 18:06 ` [PATCH v3 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
2025-09-09 5:02 ` Donald Dutile
2025-09-09 21:43 ` Bjorn Helgaas
2025-09-10 17:34 ` Jason Gunthorpe
2025-09-11 19:50 ` Donald Dutile
2025-09-05 18:06 ` [PATCH v3 11/11] PCI: Check ACS Extended flags for pci_bus_isolated() Jason Gunthorpe
2025-09-09 5:04 ` Donald Dutile
2025-09-15 9:41 ` [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Cédric Le Goater
2025-09-22 22:39 ` Alex Williamson
2025-09-23 1:44 ` Donald Dutile
2025-09-23 2:06 ` Alex Williamson
2025-09-23 2:42 ` Donald Dutile
2025-09-23 22:23 ` Alex Williamson
2025-09-30 15:23 ` Donald Dutile
2025-09-30 16:21 ` Jason Gunthorpe