* [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
@ 2025-07-09 14:52 Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 01/16] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
` (17 more replies)
0 siblings, 18 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
The series patches have extensive descriptions as to the problem and
solution, but in short the ACS flags are not analyzed according to the
spec to form the iommu_groups that VFIO is expecting for security.
ACS is an egress control only. For a path, the ACS flags on each hop only
affect what other devices a TLP is allowed to reach. They do not prevent
other devices from reaching into this path.
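As a concrete illustration (a standalone sketch, not the kernel's code; the bit values match include/uapi/linux/pci_regs.h), a single hop only isolates its egress traffic when all four required ACS controls are enabled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ACS capability bits as defined by the PCIe spec */
#define PCI_ACS_SV 0x0001 /* Source Validation */
#define PCI_ACS_RR 0x0004 /* P2P Request Redirect */
#define PCI_ACS_CR 0x0008 /* P2P Completion Redirect */
#define PCI_ACS_UF 0x0010 /* Upstream Forwarding */
#define PCI_ACS_ISOLATED (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)

/*
 * ACS is an egress control: a hop isolates traffic leaving through it
 * only if every required redirect/forwarding bit is set. It says
 * nothing about what other devices may send into this hop's path.
 */
static bool acs_hop_isolating(uint16_t acs_ctrl)
{
	return (acs_ctrl & PCI_ACS_ISOLATED) == PCI_ACS_ISOLATED;
}
```

Note this is a per-hop, egress-only statement; deciding whether a whole path or bus is isolated requires looking at the peers as well, which is the point of this series.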
For VFIO, if device A is permitted to access device B's MMIO then A and B
must be grouped together. This means that even if a path has isolating ACS
flags on each hop, off-path devices with non-isolating ACS can still reach
into that path and must be grouped together.
For switches, a PCIe topology like:
-- DSP 02:00.0 -> End Point A
Root 00:00.0 -> USP 01:00.0 --|
-- DSP 02:03.0 -> End Point B
Will generate unique single device groups for every device even if ACS is
not enabled on the two DSP ports. It should at least group A/B together
because no ACS means A can reach the MMIO of B. This is a serious failure
for the VFIO security model.
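The rule implied by the switch topology above can be sketched as follows (a standalone illustration mirroring the rule stated here, not the kernel's implementation): one DSP without full ACS means every device below the switch's internal bus must share a group:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_ACS_ISOLATED 0x001d /* SV | RR | CR | UF */

/*
 * A switch's internal bus only isolates if *every* DSP has full ACS.
 * A single permissive DSP can let traffic flowing upstream go back
 * down through itself, so it spoils the whole bus.
 */
static bool switch_bus_isolating(const uint16_t *dsp_acs_ctrl, size_t ndsp)
{
	for (size_t i = 0; i < ndsp; i++)
		if ((dsp_acs_ctrl[i] & PCI_ACS_ISOLATED) != PCI_ACS_ISOLATED)
			return false;
	return true;
}
```

With both DSPs permissive (as in the no-ACS example above) the bus is not isolating, so End Point A and End Point B must land in one multi-device group.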
For multi-function-devices, a PCIe topology like:
-- MFD 00:1f.0 ACS != REQ_ACS_FLAGS
Root 00:00.0 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS
|- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
Will group [1f.0, 1f.2] and 1f.6 gets a single device group. In many cases
we suspect that the MFD actually doesn't need ACS, so this is probably not
as important a security failure, but from a spec perspective the correct
answer is one group of [1f.0, 1f.2, 1f.6] because 1f.0/2 have no ACS
preventing them from reaching the MMIO of 1f.6.
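The spec-correct MFD rule can be sketched as a toy model (hypothetical helper, not kernel code): one function lacking full ACS egress control means every function of the MFD belongs in one group, so the 1f.0/1f.2/1f.6 example yields one group, not two:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_ACS_ISOLATED 0x001d /* SV | RR | CR | UF */

/*
 * Number of iommu_groups a multi-function device should produce. If any
 * function lacks full ACS egress control it can reach the MMIO of every
 * sibling, so the spec-correct answer is one group covering all the
 * functions, even the ones that do have good ACS.
 */
static size_t mfd_group_count(const uint16_t *fn_acs_ctrl, size_t nfns)
{
	for (size_t i = 0; i < nfns; i++)
		if ((fn_acs_ctrl[i] & PCI_ACS_ISOLATED) != PCI_ACS_ISOLATED)
			return 1;
	return nfns; /* every function fully isolated: one group each */
}
```

With { 1f.0 = no ACS, 1f.2 = no ACS, 1f.6 = full ACS } this gives a single group of all three functions, matching the spec reading above rather than the current [1f.0, 1f.2] + [1f.6] split.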
There is also some confusing spec language about how ACS and SRIOV works
which this series does not address.
This entire series goes further and makes some additional improvements to
the ACS validation found while studying this problem. The groups around a
PCIe to PCI bridge are shrunk to not include the PCIe bridge.
The last patches implement "ACS Enhanced" on top of this. Because ACS
Enhanced was defined as a non-backward-compatible feature it is important
to get SW support out there.
Due to the potential of iommu_groups becoming wider and thus non-usable
for VFIO this should go to a linux-next tree to give it some more
exposure.
I have now tested this on a few systems I could get:
- Various Intel client systems:
* Raptor Lake, with VMD enabled and using the real_dev mechanism
* 6/7th generation 100 Series/C320
* 5/6th generation 100 Series/C320 with a NIC MFD quirk
* Tiger Lake
* 5/6th generation Sunrise Point
No change in grouping on any of these systems
- NVIDIA Grace system with 5 different PCI switches from two vendors
Bug fix widening the iommu_groups works as expected here
- AMD Milan Starship/Matisse
* Groups are similar, this series generates narrow groups because the
dummy host bridges always get their own groups. Something forcibly
disables ACS SV on one bridge which correctly causes one larger
group.
This is on github: https://github.com/jgunthorpe/linux/commits/pcie_switch_groups
v2:
- Revise comments and commit messages
- Rename struct pci_alias_set to pci_reachable_set
- Make more sense of the special bus->self = NULL case for SRIOV
- Add pci_group_alloc_non_isolated() for readability
- Rename BUS_DATA_PCI_UNISOLATED to BUS_DATA_PCI_NON_ISOLATED
- Propagate BUS_DATA_PCI_NON_ISOLATED downstream from an MFD in case an MFD
function is a bridge
- New patches to add pci_mfd_isolation() to retain more cases of narrow
groups on MFDs with missing ACS.
- Redescribe the MFD-related change as a bug fix. For an MFD to be
isolated all functions must have egress control on their P2P.
v1: https://patch.msgid.link/r/0-v1-74184c5043c6+195-pcie_switch_groups_jgg@nvidia.com
Jason Gunthorpe (16):
PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED
PCI: Add pci_bus_isolation()
iommu: Compute iommu_groups properly for PCIe switches
iommu: Organize iommu_group by member size
PCI: Add pci_reachable_set()
PCI: Remove duplication in calling pci_acs_ctrl_enabled()
PCI: Use pci_quirk_mf_endpoint_acs() for pci_quirk_amd_sb_acs()
PCI: Use pci_acs_ctrl_isolated() for pci_quirk_al_acs()
PCI: Widen the acs_flags to u32 within the quirk callback
PCI: Add pci_mfd_isolation()
iommu: Compute iommu_groups properly for PCIe MFDs
iommu: Validate that pci_for_each_dma_alias() matches the groups
PCI: Add the ACS Enhanced Capability definitions
PCI: Enable ACS Enhanced bits for enable_acs and config_acs
PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
PCI: Check ACS Extended flags for pci_bus_isolated()
drivers/iommu/iommu.c | 486 +++++++++++++++++++++++-----------
drivers/pci/ats.c | 4 +-
drivers/pci/pci.c | 73 ++++-
drivers/pci/pci.h | 5 +
drivers/pci/quirks.c | 137 ++++++----
drivers/pci/search.c | 294 ++++++++++++++++++++
include/linux/pci.h | 50 ++++
include/uapi/linux/pci_regs.h | 18 ++
8 files changed, 846 insertions(+), 221 deletions(-)
base-commit: e04c78d86a9699d136910cfc0bdcf01087e3267e
--
2.43.0
^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH v2 01/16] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 02/16] PCI: Add pci_bus_isolation() Jason Gunthorpe
` (16 subsequent siblings)
17 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
The next patch wants to use this constant, share it.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommu.c | 16 +++-------------
include/uapi/linux/pci_regs.h | 10 ++++++++++
2 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a4b606c591da66..d265de874b14b6 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1408,16 +1408,6 @@ EXPORT_SYMBOL_GPL(iommu_group_id);
static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
unsigned long *devfns);
-/*
- * To consider a PCI device isolated, we require ACS to support Source
- * Validation, Request Redirection, Completer Redirection, and Upstream
- * Forwarding. This effectively means that devices cannot spoof their
- * requester ID, requests and completions cannot be redirected, and all
- * transactions are forwarded upstream, even as it passes through a
- * bridge where the target device is downstream.
- */
-#define REQ_ACS_FLAGS (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
-
/*
* For multifunction devices which are not isolated from each other, find
* all the other non-isolated functions and look for existing groups. For
@@ -1430,13 +1420,13 @@ static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
struct pci_dev *tmp = NULL;
struct iommu_group *group;
- if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
+ if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
return NULL;
for_each_pci_dev(tmp) {
if (tmp == pdev || tmp->bus != pdev->bus ||
PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) ||
- pci_acs_enabled(tmp, REQ_ACS_FLAGS))
+ pci_acs_enabled(tmp, PCI_ACS_ISOLATED))
continue;
group = get_pci_alias_group(tmp, devfns);
@@ -1580,7 +1570,7 @@ struct iommu_group *pci_device_group(struct device *dev)
if (!bus->self)
continue;
- if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
+ if (pci_acs_path_enabled(bus->self, NULL, PCI_ACS_ISOLATED))
break;
pdev = bus->self;
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index a3a3e942dedffc..6edffd31cd8e2c 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1008,6 +1008,16 @@
#define PCI_ACS_CTRL 0x06 /* ACS Control Register */
#define PCI_ACS_EGRESS_CTL_V 0x08 /* ACS Egress Control Vector */
+/*
+ * To consider a PCI device isolated, we require ACS to support Source
+ * Validation, Request Redirection, Completer Redirection, and Upstream
+ * Forwarding. This effectively means that devices cannot spoof their
+ * requester ID, requests and completions cannot be redirected, and all
+ * transactions are forwarded upstream, even as it passes through a
+ * bridge where the target device is downstream.
+ */
+#define PCI_ACS_ISOLATED (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
+
/* SATA capability */
#define PCI_SATA_REGS 4 /* SATA REGs specifier */
#define PCI_SATA_REGS_MASK 0xF /* location - BAR#/inline */
--
2.43.0
* [PATCH v2 02/16] PCI: Add pci_bus_isolation()
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 01/16] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 03/16] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
` (15 subsequent siblings)
17 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
Prepare to move the calculation of the bus P2P isolation out of the iommu
code and into the PCI core. This allows using the faster list iteration
under the pci_bus_sem, and the code has a kinship with the logic in
pci_for_each_dma_alias().
Bus isolation is the concept that drives the iommu_groups for the purposes
of VFIO. Stated simply, if device A can send traffic to device B then they
must be in the same group.
Only PCIe provides isolation. The multi-drop electrical topology in
classical PCI allows any bus member to claim the transaction.
In PCIe, isolation comes from ACS. If a PCIe Switch or Root Complex has
ACS flags that prevent peer to peer traffic and funnel all operations to
the IOMMU then devices can be isolated.
If a multi-function device has ACS flags preventing self loop back then
that device is isolated, though pci_bus_isolation() does not deal with
devices.
As a property of a bus, there are several positive cases:
- The point to point "bus" on a physical PCIe link is isolated if the
bridge/root device has something preventing self-access to its own
MMIO.
- A Root Port is usually isolated
- A PCIe switch can be isolated if all its Downstream Ports have good
ACS flags
pci_bus_isolation() implements these rules and returns an enum indicating
the level of isolation the bus has, with five possibilities:
PCIE_ISOLATED: Traffic on this PCIe bus cannot do any P2P.
PCIE_SWITCH_DSP_NON_ISOLATED: The bus is the internal bus of a PCIe
switch and the USP is isolated but the DSPs are not.
PCIE_NON_ISOLATED: The PCIe bus has no isolation between the bridge or
any downstream devices.
PCI_BUS_NON_ISOLATED: It is a PCI/PCI-X bus but the bridge is PCIe, has no
aliases and is isolated.
PCI_BRIDGE_NON_ISOLATED: It is a PCI/PCI-X bus and has no isolation; the
bridge is part of the group.
The calculation is done per-bus, so it is possible for a transaction from
a PCI device to travel through different bus isolation types on its way
upstream. PCIE_SWITCH_DSP_NON_ISOLATED/PCI_BUS_NON_ISOLATED and
PCIE_NON_ISOLATED/PCI_BRIDGE_NON_ISOLATED are the same for the purposes of
creating iommu groups. The distinction between PCIe and PCI allows for
easier understanding and debugging as to why the groups are chosen.
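For illustration, the two properties the iommu code later derives from this enum can be sketched as follows (the enum values are copied from the patch; the helper names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

enum pci_bus_isolation {
	PCIE_ISOLATED,
	PCIE_SWITCH_DSP_NON_ISOLATED,
	PCIE_NON_ISOLATED,
	PCI_BUS_NON_ISOLATED,
	PCI_BRIDGE_NON_ISOLATED,
};

/* Must all downstream devices on this bus share one iommu_group? */
static bool bus_needs_shared_group(enum pci_bus_isolation iso)
{
	return iso != PCIE_ISOLATED;
}

/* Does the upstream bridge itself belong to that shared group? */
static bool bridge_joins_group(enum pci_bus_isolation iso)
{
	return iso == PCIE_NON_ISOLATED || iso == PCI_BRIDGE_NON_ISOLATED;
}
```

This is why PCIE_SWITCH_DSP_NON_ISOLATED/PCI_BUS_NON_ISOLATED and PCIE_NON_ISOLATED/PCI_BRIDGE_NON_ISOLATED pair up for group creation: the PCIe/PCI distinction only matters for understanding and debugging.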
For the iommu groups if all busses on the upstream path are PCIE_ISOLATED
then the end device has a chance to have a single-device iommu_group. Once
any non-isolated bus segment is found that bus segment will have an
iommu_group that captures all downstream devices, and sometimes the
upstream bridge.
pci_bus_isolation() is principally about isolation, but there is an
overlap with grouping requirements for legacy PCI aliasing. For purely
legacy PCI environments pci_bus_isolation() returns
PCI_BRIDGE_NON_ISOLATED for everything and all devices within a hierarchy
are in one group. No need to worry about bridge aliasing.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/search.c | 164 +++++++++++++++++++++++++++++++++++++++++++
include/linux/pci.h | 31 ++++++++
2 files changed, 195 insertions(+)
diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index 53840634fbfc2b..a13fad53e44df9 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -113,6 +113,170 @@ int pci_for_each_dma_alias(struct pci_dev *pdev,
return ret;
}
+static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
+{
+ struct pci_dev *pdev;
+
+ /*
+ * Within a PCIe switch we have an interior bus that has the Upstream
+ * port as the bridge and a set of Downstream port bridging to the
+ * egress ports.
+ *
+ * Each DSP has an ACS setting which controls where its traffic is
+ * permitted to go. Any DSP with a permissive ACS setting can send
+ * traffic flowing upstream back downstream through another DSP.
+ *
+ * Thus any permissive DSP spoils the whole bus.
+ */
+ guard(rwsem_read)(&pci_bus_sem);
+ list_for_each_entry(pdev, &bus->devices, bus_list) {
+ /* Don't understand what this is, be conservative */
+ if (!pci_is_pcie(pdev) ||
+ pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM ||
+ pdev->dma_alias_mask)
+ return PCIE_NON_ISOLATED;
+
+ if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
+ return PCIE_SWITCH_DSP_NON_ISOLATED;
+ }
+ return PCIE_ISOLATED;
+}
+
+static bool pci_has_mmio(struct pci_dev *pdev)
+{
+ unsigned int i;
+
+ for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+ struct resource *res = pci_resource_n(pdev, i);
+
+ if (resource_size(res) && resource_type(res) == IORESOURCE_MEM)
+ return true;
+ }
+ return false;
+}
+
+/**
+ * pci_bus_isolated - Determine how isolated connected devices are
+ * @bus: The bus to check
+ *
+ * Isolation is the ability of devices to talk to each other. Full isolation
+ * means that a device can only communicate with the IOMMU and can not do peer
+ * to peer within the fabric.
+ *
+ * We consider isolation on a bus by bus basis. If the bus will permit a
+ * transaction originated downstream to complete on anything other than the
+ * IOMMU then the bus is not isolated.
+ *
+ * Non-isolation includes all the downstream devices on this bus, and it may
+ * include the upstream bridge or port that is creating this bus.
+ *
+ * The various cases are returned in an enum.
+ *
+ * Broadly speaking this function evaluates the ACS settings in a PCI switch to
+ * determine if a PCI switch is configured to have full isolation.
+ *
+ * Old PCI/PCI-X busses cannot have isolation due to their physical properties,
+ * but they do have some aliasing properties that affect group creation.
+ *
+ * pci_bus_isolated() does not consider loopback internal to devices, like
+ * multi-function devices performing a self-loopback. The caller must check
+ * this separately. It does not consider aliasing within the bus.
+ *
+ * It does not currently support the ACS P2P Egress Control Vector, Linux does
+ * not yet have any way to enable this feature. EC will create subsets of the
+ * bus that are isolated from other subsets.
+ */
+enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
+{
+ struct pci_dev *bridge = bus->self;
+ int type;
+
+ /*
+ * This bus was created by pci_register_host_bridge(). The spec provides
+ * no way to tell what kind of bus this is, for PCIe we expect this to
+ * be internal to the root complex and not covered by any spec behavior.
+ * Linux has historically been optimistic about this bus and treated it
+ * as isolating. Given that the behavior of the root complex and the ACS
+ * behavior of RCiEP's is explicitly not specified we hope that the
+ * implementation is directing everything that reaches the root bus to
+ * the IOMMU.
+ */
+ if (pci_is_root_bus(bus))
+ return PCIE_ISOLATED;
+
+ /*
+ * bus->self is only NULL for SRIOV VFs, it represents a "virtual" bus
+ * within Linux to hold any bus numbers consumed by VF RIDs. Caller must
+ * use pci_physfn() to get the bus for calling this function.
+ */
+ if (WARN_ON(!bridge))
+ return PCI_BRIDGE_NON_ISOLATED;
+
+ /*
+ * The bridge is not a PCIe bridge therefore this bus is PCI/PCI-X.
+ *
+ * PCI does not have anything like ACS. Any downstream device can bus
+ * master an address that any other downstream device can claim. No
+ * isolation is possible.
+ */
+ if (!pci_is_pcie(bridge)) {
+ if (bridge->dev_flags & PCI_DEV_FLAG_PCIE_BRIDGE_ALIAS)
+ type = PCI_EXP_TYPE_PCI_BRIDGE;
+ else
+ return PCI_BRIDGE_NON_ISOLATED;
+ } else {
+ type = pci_pcie_type(bridge);
+ }
+
+ switch (type) {
+ /*
+ * Since PCIe links are point to point, root and downstream ports are
+ * isolated if their own MMIO cannot be reached.
+ */
+ case PCI_EXP_TYPE_ROOT_PORT:
+ case PCI_EXP_TYPE_DOWNSTREAM:
+ if (!pci_acs_enabled(bridge, PCI_ACS_ISOLATED))
+ return PCIE_NON_ISOLATED;
+ return PCIE_ISOLATED;
+
+ /*
+ * bus is the interior bus of a PCI-E switch where ACS rules apply.
+ */
+ case PCI_EXP_TYPE_UPSTREAM:
+ return pcie_switch_isolated(bus);
+
+ /*
+ * PCIe to PCI/PCI-X - this bus is PCI.
+ */
+ case PCI_EXP_TYPE_PCI_BRIDGE:
+ /*
+ * A PCIe bridge will use the subordinate bus number
+ * with a 0 devfn as the RID in some cases. This causes all
+ * subordinate devfns to alias with 0, which is the same
+ * grouping as PCI_BUS_NON_ISOLATED. The RID of the bridge
+ * itself is only used by the bridge.
+ *
+ * However, if the bridge has MMIO then we will assume the MMIO
+ * is not isolated due to no ACS controls on this bridge type.
+ */
+ if (pci_has_mmio(bridge))
+ return PCI_BRIDGE_NON_ISOLATED;
+ return PCI_BUS_NON_ISOLATED;
+
+ /*
+ * PCI/PCI-X to PCIe - this bus is PCIe. We already know there must be a
+ * PCI bus upstream of this bus, so just return non-isolated. If
+ * upstream is PCI-X the PCIe RID should be preserved, but for PCI the
+ * RID will be lost.
+ */
+ case PCI_EXP_TYPE_PCIE_BRIDGE:
+ return PCI_BRIDGE_NON_ISOLATED;
+
+ default:
+ return PCI_BRIDGE_NON_ISOLATED;
+ }
+}
+
static struct pci_bus *pci_do_find_bus(struct pci_bus *bus, unsigned char busnr)
{
struct pci_bus *child;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 05e68f35f39238..0b1e28dcf9187d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -834,6 +834,32 @@ struct pci_dynids {
struct list_head list; /* For IDs added at runtime */
};
+enum pci_bus_isolation {
+ /*
+ * The bus is off a root port and the root port has isolated ACS flags
+ * or the bus is part of a PCIe switch and the switch has isolated ACS
+ * flags.
+ */
+ PCIE_ISOLATED,
+ /*
+ * The switch's DSPs are not isolated from each other but are isolated
+ * from the USP.
+ */
+ PCIE_SWITCH_DSP_NON_ISOLATED,
+ /* The above and the USP's MMIO is not isolated. */
+ PCIE_NON_ISOLATED,
+ /*
+ * A PCI/PCI-X bus, no isolation. This is like
+ * PCIE_SWITCH_DSP_NON_ISOLATED in that the upstream bridge is isolated
+ * from the bus. The bus itself may also have a shared alias of devfn=0.
+ */
+ PCI_BUS_NON_ISOLATED,
+ /*
+ * The above and the bridge's MMIO is not isolated and the bridge's RID
+ * may be an alias.
+ */
+ PCI_BRIDGE_NON_ISOLATED,
+};
/*
* PCI Error Recovery System (PCI-ERS). If a PCI device driver provides
@@ -1222,6 +1248,8 @@ struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
struct pci_dev *pci_get_base_class(unsigned int class, struct pci_dev *from);
+enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus);
+
int pci_dev_present(const struct pci_device_id *ids);
int pci_bus_read_config_byte(struct pci_bus *bus, unsigned int devfn,
@@ -2035,6 +2063,9 @@ static inline struct pci_dev *pci_get_base_class(unsigned int class,
struct pci_dev *from)
{ return NULL; }
+static inline enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
+{ return PCIE_NON_ISOLATED; }
+
static inline int pci_dev_present(const struct pci_device_id *ids)
{ return 0; }
--
2.43.0
* [PATCH v2 03/16] iommu: Compute iommu_groups properly for PCIe switches
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 01/16] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 02/16] PCI: Add pci_bus_isolation() Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-17 22:03 ` Donald Dutile
2025-07-09 14:52 ` [PATCH v2 04/16] iommu: Organize iommu_group by member size Jason Gunthorpe
` (14 subsequent siblings)
17 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
The current algorithm does not work if ACS is turned off, and it is not
clear how this has been missed for so long. Alex suspects a multi-function
modeling of the DSPs may have hidden it, but I could not find a switch
using that scheme.
Most likely people were simply happy they had smaller groups, oblivious
to the security problems.
For discussion let's consider a simple topology like the below:
-- DSP 02:00.0 -> End Point A
Root 00:00.0 -> USP 01:00.0 --|
-- DSP 02:03.0 -> End Point B
If ACS is fully activated we expect 00:00.0, 01:00.0, 02:00.0, 02:03.0, A,
B to all have unique single device groups.
If both DSPs have ACS off then we expect 00:00.0 and 01:00.0 to have
unique single device groups while 02:00.0, 02:03.0, A, B are part of one
multi-device group.
If the DSPs have asymmetric ACS, with one fully isolating and one
non-isolating we also expect the above multi-device group result.
Instead the current algorithm always creates unique single device groups
for this topology. It happens because the pci_device_group(DSP)
immediately moves to the USP and computes pci_acs_path_enabled(USP) ==
true and decides the DSP can get a unique group. The pci_device_group(A)
immediately moves to the DSP, sees pci_acs_path_enabled(DSP) == false and
then takes the DSPs group.
The current algorithm has several issues:
1) It implicitly depends on ordering. Since the existing group discovery
only goes in the upstream direction discovering a downstream device
before its upstream will cause the wrong creation of narrower groups.
2) It assumes that if the path from the end point to the root is entirely
ACS isolated then that end point is isolated. This misses cross-traffic
in the asymmetric ACS case.
3) When evaluating a non-isolated DSP it does not check peer DSPs for an
already established group unless the multi-function feature does it.
4) It does not understand the aliasing rule for PCIe to PCI bridges
where the alias is to the subordinate bus. The bridge's RID on the
primary bus is not aliased. This causes the PCIe to PCI bridge to be
wrongly joined to the group with the downstream devices.
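Problem #2 can be demonstrated with a toy model (hypothetical helpers, a simplification of the real pci_acs_path_enabled() semantics): a path-only check declares endpoint A isolated even when A's peer DSP is permissive, while a bus-level check catches it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ACS_ISOLATED 0x001d /* SV | RR | CR | UF */

/* Old model: only examine the hops on the endpoint's own upstream path */
static bool path_isolated(const uint16_t *path_acs, size_t nhops)
{
	for (size_t i = 0; i < nhops; i++)
		if ((path_acs[i] & ACS_ISOLATED) != ACS_ISOLATED)
			return false;
	return true;
}

/* Revised model: every peer DSP on the switch bus must also isolate */
static bool bus_isolated(const uint16_t *all_dsp_acs, size_t ndsp)
{
	for (size_t i = 0; i < ndsp; i++)
		if ((all_dsp_acs[i] & ACS_ISOLATED) != ACS_ISOLATED)
			return false;
	return true;
}
```

With DSP 02:00.0 fully isolating and DSP 02:03.0 permissive, the path check for End Point A only sees the isolating 02:00.0 and returns true, while the bus check over both DSPs correctly returns false: the cross-traffic from 02:03.0 breaks A's isolation.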
As grouping is a security property for VFIO creating incorrectly narrowed
groups is a security problem for the system.
Revise the design to solve these problems.
Explicitly require ordering, or return EPROBE_DEFER if things are out of
order. This avoids silent errors that created smaller groups and solves
problem #1.
Work on busses, not devices. Isolation is a property of the bus, and the
first non-isolated bus should form a group containing all devices
downstream of that bus. If all busses on the path to an end device are
isolated then the end device has a chance to make a single-device group.
Use pci_bus_isolation() to compute the bus's isolation status based on the
ACS flags and technology. pci_bus_isolation() touches a lot of PCI
internals to get the information in the right format.
Add a new flag in the iommu_group to record that the group contains a
non-isolated bus. Any downstream pci_device_group() will see
bus->self->iommu_group is non-isolated and unconditionally join it. This
makes the first non-isolation apply to all downstream devices and solves
problem #2.
The bus's non-isolated iommu_group will be stored in either the DSP of a
PCIe switch or the bus->self upstream device, depending on the situation.
When storing in the DSP all the DSPs are checked first for a pre-existing
non-isolated iommu_group. When stored in the upstream the flag forces it
to all downstreams. This solves problem #3.
Put the handling of end-device aliases and MFD into pci_get_alias_group()
and only call it in cases where we have a fully isolated path. Otherwise
every downstream device on the bus is going to be joined to the group of
bus->self.
Finally, replace the initial pci_for_each_dma_alias() with a combination
of:
- Directly checking pci_real_dma_dev() and enforcing ordering.
The group should contain both pdev and pci_real_dma_dev(pdev) which is
only possible if pdev is ordered after real_dma_dev. This solves a case
of #1.
- Indirectly relying on pci_bus_isolation() to report legacy PCI busses
as non-isolated, with the enum including the distinction of the PCIe to
PCI bridge being isolated from the downstream. This solves problem #4.
It is very likely this is going to expand iommu_group membership on
existing systems; after all, that is the security bug being fixed.
Expanding the iommu_groups risks problems for users of VFIO.
The intention is to have a more accurate reflection of the security
properties in the system and should be seen as a security fix. However
people who have ACS disabled may now need to enable it. Since such users
may have had good reason for ACS to be disabled, I strongly recommend that
backports of this also include the new config_acs option so that those
users can minimally enable ACS only where needed.
Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface")
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommu.c | 274 ++++++++++++++++++++++++++++++++----------
include/linux/pci.h | 3 +
2 files changed, 212 insertions(+), 65 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d265de874b14b6..8b152266f20104 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -65,8 +65,16 @@ struct iommu_group {
struct list_head entry;
unsigned int owner_cnt;
void *owner;
+
+ /* Used by the device_group() callbacks */
+ u32 bus_data;
};
+/*
+ * Everything downstream of this group should share it.
+ */
+#define BUS_DATA_PCI_NON_ISOLATED BIT(0)
+
struct group_device {
struct list_head list;
struct device *dev;
@@ -1484,25 +1492,6 @@ static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
return NULL;
}
-struct group_for_pci_data {
- struct pci_dev *pdev;
- struct iommu_group *group;
-};
-
-/*
- * DMA alias iterator callback, return the last seen device. Stop and return
- * the IOMMU group if we find one along the way.
- */
-static int get_pci_alias_or_group(struct pci_dev *pdev, u16 alias, void *opaque)
-{
- struct group_for_pci_data *data = opaque;
-
- data->pdev = pdev;
- data->group = iommu_group_get(&pdev->dev);
-
- return data->group != NULL;
-}
-
/*
* Generic device_group call-back function. It just allocates one
* iommu-group per device.
@@ -1534,57 +1523,27 @@ struct iommu_group *generic_single_device_group(struct device *dev)
}
EXPORT_SYMBOL_GPL(generic_single_device_group);
-/*
- * Use standard PCI bus topology, isolation features, and DMA alias quirks
- * to find or create an IOMMU group for a device.
- */
-struct iommu_group *pci_device_group(struct device *dev)
+static struct iommu_group *pci_group_alloc_non_isolated(void)
{
- struct pci_dev *pdev = to_pci_dev(dev);
- struct group_for_pci_data data;
- struct pci_bus *bus;
- struct iommu_group *group = NULL;
- u64 devfns[4] = { 0 };
+ struct iommu_group *group;
- if (WARN_ON(!dev_is_pci(dev)))
- return ERR_PTR(-EINVAL);
+ group = iommu_group_alloc();
+ if (IS_ERR(group))
+ return group;
+ group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
+ return group;
+}
- /*
- * Find the upstream DMA alias for the device. A device must not
- * be aliased due to topology in order to have its own IOMMU group.
- * If we find an alias along the way that already belongs to a
- * group, use it.
- */
- if (pci_for_each_dma_alias(pdev, get_pci_alias_or_group, &data))
- return data.group;
-
- pdev = data.pdev;
-
- /*
- * Continue upstream from the point of minimum IOMMU granularity
- * due to aliases to the point where devices are protected from
- * peer-to-peer DMA by PCI ACS. Again, if we find an existing
- * group, use it.
- */
- for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
- if (!bus->self)
- continue;
-
- if (pci_acs_path_enabled(bus->self, NULL, PCI_ACS_ISOLATED))
- break;
-
- pdev = bus->self;
-
- group = iommu_group_get(&pdev->dev);
- if (group)
- return group;
- }
+static struct iommu_group *pci_get_alias_group(struct pci_dev *pdev)
+{
+ struct iommu_group *group;
+ DECLARE_BITMAP(devfns, 256) = {};
/*
* Look for existing groups on device aliases. If we alias another
* device or another device aliases us, use the same group.
*/
- group = get_pci_alias_group(pdev, (unsigned long *)devfns);
+ group = get_pci_alias_group(pdev, devfns);
if (group)
return group;
@@ -1593,12 +1552,197 @@ struct iommu_group *pci_device_group(struct device *dev)
* slot and aliases of those funcions, if any. No need to clear
* the search bitmap, the tested devfns are still valid.
*/
- group = get_pci_function_alias_group(pdev, (unsigned long *)devfns);
+ group = get_pci_function_alias_group(pdev, devfns);
if (group)
return group;
- /* No shared group found, allocate new */
- return iommu_group_alloc();
+ /*
+ * When MFD's are included in the set due to ACS we assume that if ACS
+ * permits an internal loopback between functions it also permits the
+ * loopback to go downstream if a function is a bridge.
+ *
+ * It is less clear what aliases mean when applied to a bridge. For now
+ * be conservative and also propagate the group downstream.
+ */
+ __clear_bit(pdev->devfn & 0xFF, devfns);
+ if (!bitmap_empty(devfns, sizeof(devfns) * BITS_PER_BYTE))
+ return pci_group_alloc_non_isolated();
+ return NULL;
+}
+
+static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
+{
+ /*
+ * SRIOV functions may reside on a virtual bus, jump directly to the PF's
+ * bus in all cases.
+ */
+ struct pci_bus *bus = pci_physfn(pdev)->bus;
+ struct iommu_group *group;
+
+ /* Nothing upstream of this */
+ if (pci_is_root_bus(bus))
+ return NULL;
+
+ /*
+ * !self is only for SRIOV virtual busses which should have been
+ * excluded above.
+ */
+ if (WARN_ON(!bus->self))
+ return ERR_PTR(-EINVAL);
+
+ group = iommu_group_get(&bus->self->dev);
+ if (!group) {
+ /*
+ * If the upstream bridge needs the same group as pdev then
+ * there is no way for its pci_device_group() to discover it.
+ */
+ dev_err(&pdev->dev,
+ "PCI device is probing out of order, upstream bridge device of %s is not probed yet\n",
+ pci_name(bus->self));
+ return ERR_PTR(-EPROBE_DEFER);
+ }
+ if (group->bus_data & BUS_DATA_PCI_NON_ISOLATED)
+ return group;
+ iommu_group_put(group);
+ return NULL;
+}
+
+/*
+ * For legacy PCI we have two main considerations when forming groups:
+ *
+ * 1) In PCI we can lose the RID inside the fabric, or some devices will use
+ * the wrong RID. The PCI core calls this aliasing, but from an IOMMU
+ * perspective it means that a PCI device may have multiple RIDs and a
+ * single RID may represent many PCI devices. This effectively means all the
+ * aliases must share a translation, thus group, because the IOMMU cannot
+ * tell devices apart.
+ *
+ * 2) PCI permits a bus segment to claim an address even if the transaction
+ * originates from an endpoint, not the CPU. When this happens it is called
+ * peer to peer. Claiming a transaction in the middle of the bus hierarchy
+ * bypasses the IOMMU translation. The IOMMU subsystem rules require these
+ * devices to be placed in the same group because they lack isolation from
+ * each other. In PCI Express the ACS system can be used to inhibit this and
+ * force transactions to go to the IOMMU.
+ *
+ * From a PCI perspective any given PCI bus is either isolating or
+ * non-isolating. Isolating means downstream originated transactions always
+ * progress toward the CPU and do not go to other devices on the bus
+ * segment, while non-isolating means downstream originated transactions can
+ * progress back downstream through another device on the bus segment.
+ *
+ * Beyond buses a multi-function device or bridge can also allow
+ * transactions to loop back internally from one function to another.
+ *
+ * Once a PCI bus becomes non-isolating the entire downstream hierarchy of
+ * that bus becomes a single group.
+ */
+struct iommu_group *pci_device_group(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct iommu_group *group;
+ struct pci_dev *real_pdev;
+
+ if (WARN_ON(!dev_is_pci(dev)))
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * Arches can supply a completely different PCI device that actually
+ * does DMA.
+ */
+ real_pdev = pci_real_dma_dev(pdev);
+ if (real_pdev != pdev) {
+ group = iommu_group_get(&real_pdev->dev);
+ if (!group) {
+ /*
+ * The real_pdev has not been probed by an IOMMU yet. We
+ * can't create a new group here because there is no way
+ * for pci_device_group(real_pdev) to pick it up.
+ */
+ dev_err(dev,
+ "PCI device is probing out of order, real device of %s is not probed yet\n",
+ pci_name(real_pdev));
+ return ERR_PTR(-EPROBE_DEFER);
+ }
+ return group;
+ }
+
+ if (pdev->dev_flags & PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT)
+ return iommu_group_alloc();
+
+ /* Anything upstream of this enforcing non-isolated? */
+ group = pci_hierarchy_group(pdev);
+ if (group)
+ return group;
+
+ switch (pci_bus_isolated(pci_physfn(pdev)->bus)) {
+ case PCIE_ISOLATED:
+ /* Check multi-function groups and same-bus devfn aliases */
+ group = pci_get_alias_group(pdev);
+ if (group)
+ return group;
+
+ /* No shared group found, allocate new */
+ return iommu_group_alloc();
+
+ /*
+ * On legacy PCI there is no RID at an electrical level. On PCI-X the
+ * RID of the bridge may be used in some cases instead of the
+ * downstream's RID. This creates aliasing problems. PCI/PCI-X doesn't
+ * provide isolation either. The end result is that as soon as we hit a
+ * PCI/PCI-X bus we switch to non-isolated for the whole downstream for
+ * both aliasing and isolation reasons. The bridge has to be included in
+ * the group because of the aliasing.
+ */
+ case PCI_BRIDGE_NON_ISOLATED:
+ /* A PCIe switch where the USP has MMIO and is not isolated. */
+ case PCIE_NON_ISOLATED:
+ group = iommu_group_get(&pdev->bus->self->dev);
+ if (WARN_ON(!group))
+ return ERR_PTR(-EINVAL);
+ /*
+ * No need to be concerned with aliases here since we are going
+ * to put the entire downstream tree in the bridge/USP's group.
+ */
+ group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
+ return group;
+
+ /*
+ * It is a PCI bus and the upstream bridge/port does not alias or allow
+ * P2P.
+ */
+ case PCI_BUS_NON_ISOLATED:
+ /*
+ * It is a PCIe switch and the DSP cannot reach the USP. The DSPs
+ * are not isolated from each other and share a group.
+ */
+ case PCIE_SWITCH_DSP_NON_ISOLATED: {
+ struct pci_dev *piter = NULL;
+
+ /*
+ * All the downstream devices on the bus share a group. If this
+ * is a PCIe switch then they will all be DSPs.
+ */
+ for_each_pci_dev(piter) {
+ if (piter->bus != pdev->bus)
+ continue;
+ group = iommu_group_get(&piter->dev);
+ if (group) {
+ pci_dev_put(piter);
+ if (WARN_ON(!(group->bus_data &
+ BUS_DATA_PCI_NON_ISOLATED)))
+ group->bus_data |=
+ BUS_DATA_PCI_NON_ISOLATED;
+ return group;
+ }
+ }
+ return pci_group_alloc_non_isolated();
+ }
+ default:
+ break;
+ }
+ WARN_ON(true);
+ return ERR_PTR(-EINVAL);
}
EXPORT_SYMBOL_GPL(pci_device_group);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 0b1e28dcf9187d..517800206208b5 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2072,6 +2072,9 @@ static inline int pci_dev_present(const struct pci_device_id *ids)
#define no_pci_devices() (1)
#define pci_dev_put(dev) do { } while (0)
+static inline struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
+{ return dev; }
+
static inline void pci_set_master(struct pci_dev *dev) { }
static inline void pci_clear_master(struct pci_dev *dev) { }
static inline int pci_enable_device(struct pci_dev *dev) { return -EIO; }
--
2.43.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v2 04/16] iommu: Organize iommu_group by member size
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (2 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 03/16] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 05/16] PCI: Add pci_reachable_set() Jason Gunthorpe
` (13 subsequent siblings)
17 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
Reorder the struct iommu_group members by size to avoid some internal padding.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8b152266f20104..7b407065488296 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -58,13 +58,13 @@ struct iommu_group {
void *iommu_data;
void (*iommu_data_release)(void *iommu_data);
char *name;
- int id;
struct iommu_domain *default_domain;
struct iommu_domain *blocking_domain;
struct iommu_domain *domain;
struct list_head entry;
- unsigned int owner_cnt;
void *owner;
+ unsigned int owner_cnt;
+ int id;
/* Used by the device_group() callbacks */
u32 bus_data;
--
2.43.0
* [PATCH v2 05/16] PCI: Add pci_reachable_set()
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (3 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 04/16] iommu: Organize iommu_group by member size Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-17 22:04 ` Donald Dutile
2025-07-09 14:52 ` [PATCH v2 06/16] PCI: Remove duplication in calling pci_acs_ctrl_enabled() Jason Gunthorpe
` (12 subsequent siblings)
17 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
Implement pci_reachable_set() to efficiently compute a set of devices on
the same bus that are "reachable" from a starting device. The meaning of
reachability is defined by the caller through a callback function.
This is a faster implementation of the same logic in
pci_device_group(). Being inside the PCI core allows use of pci_bus_sem so
it can use list_for_each_entry() on a small list of devices instead of the
expensive for_each_pci_dev(). Server systems can now have hundreds of PCI
devices, but typically only a very small number of devices per bus.
An example of a reachability function would be pci_devs_are_dma_aliases()
which would compute a set of devices on the same bus that are
aliases. This would also be useful in future support for the ACS P2P
Egress Vector which has a similar reachability problem.
This is effectively a graph algorithm where the set of devices on the bus
are the vertices and the reachable() function defines the edges. It returns
the set of vertices forming the connected component that contains the start
device.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/search.c | 90 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/pci.h | 12 ++++++
2 files changed, 102 insertions(+)
diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index a13fad53e44df9..dc816dc4505c6d 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -585,3 +585,93 @@ int pci_dev_present(const struct pci_device_id *ids)
return 0;
}
EXPORT_SYMBOL(pci_dev_present);
+
+/**
+ * pci_reachable_set - Generate a bitmap of devices within a reachability set
+ * @start: First device in the set
+ * @devfns: The set of devices on the bus
+ * @reachable: Callback to tell if two devices can reach each other
+ *
+ * Compute a bitmap where every set bit is a device on the bus that is reachable
+ * from the start device, including the start device. Reachability between two
+ * devices is determined by a callback function.
+ *
+ * This is a non-recursive implementation that invokes the callback once per
+ * pair. The callback must be commutative:
+ * reachable(a, b) == reachable(b, a)
+ * reachable() can form a cyclic graph:
+ * reachable(a,b) == reachable(b,c) == reachable(c,a) == true
+ *
+ * Since this function is limited to a single bus the largest set can be 256
+ * devices large.
+ */
+void pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
+ bool (*reachable)(struct pci_dev *deva,
+ struct pci_dev *devb))
+{
+ struct pci_reachable_set todo_devfns = {};
+ struct pci_reachable_set next_devfns = {};
+ struct pci_bus *bus = start->bus;
+ bool again;
+
+ /* Assume devfn of all PCI devices is bounded by MAX_NR_DEVFNS */
+ static_assert(sizeof(next_devfns.devfns) * BITS_PER_BYTE >=
+ MAX_NR_DEVFNS);
+
+ memset(devfns, 0, sizeof(devfns->devfns));
+ __set_bit(start->devfn, devfns->devfns);
+ __set_bit(start->devfn, next_devfns.devfns);
+
+ down_read(&pci_bus_sem);
+ while (true) {
+ unsigned int devfna;
+ unsigned int i;
+
+ /*
+ * For each device that hasn't been checked compare every
+ * device on the bus against it.
+ */
+ again = false;
+ for_each_set_bit(devfna, next_devfns.devfns, MAX_NR_DEVFNS) {
+ struct pci_dev *deva = NULL;
+ struct pci_dev *devb;
+
+ list_for_each_entry(devb, &bus->devices, bus_list) {
+ if (devb->devfn == devfna)
+ deva = devb;
+
+ if (test_bit(devb->devfn, devfns->devfns))
+ continue;
+
+ if (!deva) {
+ deva = devb;
+ list_for_each_entry_continue(
+ deva, &bus->devices, bus_list)
+ if (deva->devfn == devfna)
+ break;
+ }
+
+ if (!reachable(deva, devb))
+ continue;
+
+ __set_bit(devb->devfn, todo_devfns.devfns);
+ again = true;
+ }
+ }
+
+ if (!again)
+ break;
+
+ /*
+ * Every new bit adds a new deva to check, reloop the whole
+ * thing. Expect this to be rare.
+ */
+ for (i = 0; i != ARRAY_SIZE(devfns->devfns); i++) {
+ devfns->devfns[i] |= todo_devfns.devfns[i];
+ next_devfns.devfns[i] = todo_devfns.devfns[i];
+ todo_devfns.devfns[i] = 0;
+ }
+ }
+ up_read(&pci_bus_sem);
+}
+EXPORT_SYMBOL_GPL(pci_reachable_set);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 517800206208b5..2e629087539101 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -834,6 +834,10 @@ struct pci_dynids {
struct list_head list; /* For IDs added at runtime */
};
+struct pci_reachable_set {
+ DECLARE_BITMAP(devfns, 256);
+};
+
enum pci_bus_isolation {
/*
* The bus is off a root port and the root port has isolated ACS flags
@@ -1248,6 +1252,9 @@ struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
struct pci_dev *pci_get_base_class(unsigned int class, struct pci_dev *from);
+void pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
+ bool (*reachable)(struct pci_dev *deva,
+ struct pci_dev *devb));
enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus);
int pci_dev_present(const struct pci_device_id *ids);
@@ -2063,6 +2070,11 @@ static inline struct pci_dev *pci_get_base_class(unsigned int class,
struct pci_dev *from)
{ return NULL; }
+static inline void
+pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
+ bool (*reachable)(struct pci_dev *deva, struct pci_dev *devb))
+{ }
+
static inline enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
{ return PCIE_NON_ISOLATED; }
--
2.43.0
* [PATCH v2 06/16] PCI: Remove duplication in calling pci_acs_ctrl_enabled()
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (4 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 05/16] PCI: Add pci_reachable_set() Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 07/16] PCI: Use pci_quirk_mf_endpoint_acs() for pci_quirk_amd_sb_acs() Jason Gunthorpe
` (11 subsequent siblings)
17 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
Many places use PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF
as the flags, consolidate this into a little helper.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/quirks.c | 36 ++++++++++++++++--------------------
1 file changed, 16 insertions(+), 20 deletions(-)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index d7f4ee634263c2..8a9ab76dd81494 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4634,6 +4634,12 @@ static int pci_acs_ctrl_enabled(u16 acs_ctrl_req, u16 acs_ctrl_ena)
return 0;
}
+static int pci_acs_ctrl_isolated(u16 acs_flags)
+{
+ return pci_acs_ctrl_enabled(acs_flags,
+ PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+}
+
/*
* AMD has indicated that the devices below do not support peer-to-peer
* in any system where they are found in the southbridge with an AMD
@@ -4717,8 +4723,7 @@ static int pci_quirk_cavium_acs(struct pci_dev *dev, u16 acs_flags)
* hardware implements and enables equivalent ACS functionality for
* these flags.
*/
- return pci_acs_ctrl_enabled(acs_flags,
- PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+ return pci_acs_ctrl_isolated(acs_flags);
}
static int pci_quirk_xgene_acs(struct pci_dev *dev, u16 acs_flags)
@@ -4728,8 +4733,7 @@ static int pci_quirk_xgene_acs(struct pci_dev *dev, u16 acs_flags)
* transactions with others, allowing masking out these bits as if they
* were unimplemented in the ACS capability.
*/
- return pci_acs_ctrl_enabled(acs_flags,
- PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+ return pci_acs_ctrl_isolated(acs_flags);
}
/*
@@ -4752,8 +4756,7 @@ static int pci_quirk_zhaoxin_pcie_ports_acs(struct pci_dev *dev, u16 acs_flags)
case 0x0710 ... 0x071e:
case 0x0721:
case 0x0723 ... 0x0752:
- return pci_acs_ctrl_enabled(acs_flags,
- PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+ return pci_acs_ctrl_isolated(acs_flags);
}
return false;
@@ -4814,8 +4817,7 @@ static int pci_quirk_intel_pch_acs(struct pci_dev *dev, u16 acs_flags)
return -ENOTTY;
if (dev->dev_flags & PCI_DEV_FLAGS_ACS_ENABLED_QUIRK)
- return pci_acs_ctrl_enabled(acs_flags,
- PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+ return pci_acs_ctrl_isolated(acs_flags);
return pci_acs_ctrl_enabled(acs_flags, 0);
}
@@ -4832,8 +4834,7 @@ static int pci_quirk_intel_pch_acs(struct pci_dev *dev, u16 acs_flags)
*/
static int pci_quirk_qcom_rp_acs(struct pci_dev *dev, u16 acs_flags)
{
- return pci_acs_ctrl_enabled(acs_flags,
- PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+ return pci_acs_ctrl_isolated(acs_flags);
}
/*
@@ -4844,8 +4845,7 @@ static int pci_quirk_qcom_rp_acs(struct pci_dev *dev, u16 acs_flags)
*/
static int pci_quirk_nxp_rp_acs(struct pci_dev *dev, u16 acs_flags)
{
- return pci_acs_ctrl_enabled(acs_flags,
- PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+ return pci_acs_ctrl_isolated(acs_flags);
}
static int pci_quirk_al_acs(struct pci_dev *dev, u16 acs_flags)
@@ -4975,8 +4975,7 @@ static int pci_quirk_rciep_acs(struct pci_dev *dev, u16 acs_flags)
if (pci_pcie_type(dev) != PCI_EXP_TYPE_RC_END)
return -ENOTTY;
- return pci_acs_ctrl_enabled(acs_flags,
- PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+ return pci_acs_ctrl_isolated(acs_flags);
}
static int pci_quirk_brcm_acs(struct pci_dev *dev, u16 acs_flags)
@@ -4987,8 +4986,7 @@ static int pci_quirk_brcm_acs(struct pci_dev *dev, u16 acs_flags)
* Allow each Root Port to be in a separate IOMMU group by masking
* SV/RR/CR/UF bits.
*/
- return pci_acs_ctrl_enabled(acs_flags,
- PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+ return pci_acs_ctrl_isolated(acs_flags);
}
static int pci_quirk_loongson_acs(struct pci_dev *dev, u16 acs_flags)
@@ -4999,8 +4997,7 @@ static int pci_quirk_loongson_acs(struct pci_dev *dev, u16 acs_flags)
* Allow each Root Port to be in a separate IOMMU group by masking
* SV/RR/CR/UF bits.
*/
- return pci_acs_ctrl_enabled(acs_flags,
- PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+ return pci_acs_ctrl_isolated(acs_flags);
}
/*
@@ -5019,8 +5016,7 @@ static int pci_quirk_wangxun_nic_acs(struct pci_dev *dev, u16 acs_flags)
case 0x1001: case 0x2001: /* SP */
case 0x5010: case 0x5025: case 0x5040: /* AML */
case 0x5110: case 0x5125: case 0x5140: /* AML */
- return pci_acs_ctrl_enabled(acs_flags,
- PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+ return pci_acs_ctrl_isolated(acs_flags);
}
return false;
--
2.43.0
* [PATCH v2 07/16] PCI: Use pci_quirk_mf_endpoint_acs() for pci_quirk_amd_sb_acs()
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (5 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 06/16] PCI: Remove duplication in calling pci_acs_ctrl_enabled() Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 08/16] PCI: Use pci_acs_ctrl_isolated() for pci_quirk_al_acs() Jason Gunthorpe
` (10 subsequent siblings)
17 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
AMD is simply trying to say that its MFD doesn't do P2P; use the
existing common helper for this.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/quirks.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 8a9ab76dd81494..963074286cc2a9 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4640,6 +4640,8 @@ static int pci_acs_ctrl_isolated(u16 acs_flags)
PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
}
+static int pci_quirk_mf_endpoint_acs(struct pci_dev *dev, u16 acs_flags);
+
/*
* AMD has indicated that the devices below do not support peer-to-peer
* in any system where they are found in the southbridge with an AMD
@@ -4682,10 +4684,7 @@ static int pci_quirk_amd_sb_acs(struct pci_dev *dev, u16 acs_flags)
acpi_put_table(header);
- /* Filter out flags not applicable to multifunction */
- acs_flags &= (PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC | PCI_ACS_DT);
-
- return pci_acs_ctrl_enabled(acs_flags, PCI_ACS_RR | PCI_ACS_CR);
+ return pci_quirk_mf_endpoint_acs(dev, acs_flags);
#else
return -ENODEV;
#endif
--
2.43.0
* [PATCH v2 08/16] PCI: Use pci_acs_ctrl_isolated() for pci_quirk_al_acs()
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (6 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 07/16] PCI: Use pci_quirk_mf_endpoint_acs() for pci_quirk_amd_sb_acs() Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 09/16] PCI: Widen the acs_flags to u32 within the quirk callback Jason Gunthorpe
` (9 subsequent siblings)
17 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
This is the same logic.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/quirks.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 963074286cc2a9..d78876b4a2b106 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4860,9 +4860,7 @@ static int pci_quirk_al_acs(struct pci_dev *dev, u16 acs_flags)
*
* Additionally, the root ports cannot send traffic to each other.
*/
- acs_flags &= ~(PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
-
- return acs_flags ? 0 : 1;
+ return pci_acs_ctrl_isolated(acs_flags);
}
/*
--
2.43.0
* [PATCH v2 09/16] PCI: Widen the acs_flags to u32 within the quirk callback
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (7 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 08/16] PCI: Use pci_acs_ctrl_isolated() for pci_quirk_al_acs() Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 10/16] PCI: Add pci_mfd_isolation() Jason Gunthorpe
` (8 subsequent siblings)
17 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
This will allow passing some software flags in the upper 16 bits in the
next patch.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/quirks.c | 36 ++++++++++++++++++------------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index d78876b4a2b106..71b9550e46eb06 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4627,20 +4627,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,
* in @acs_ctrl_ena, i.e., the device provides all the access controls the
* caller desires. Return 0 otherwise.
*/
-static int pci_acs_ctrl_enabled(u16 acs_ctrl_req, u16 acs_ctrl_ena)
+static int pci_acs_ctrl_enabled(u32 acs_ctrl_req, u32 acs_ctrl_ena)
{
if ((acs_ctrl_req & acs_ctrl_ena) == acs_ctrl_req)
return 1;
return 0;
}
-static int pci_acs_ctrl_isolated(u16 acs_flags)
+static int pci_acs_ctrl_isolated(u32 acs_flags)
{
return pci_acs_ctrl_enabled(acs_flags,
PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
}
-static int pci_quirk_mf_endpoint_acs(struct pci_dev *dev, u16 acs_flags);
+static int pci_quirk_mf_endpoint_acs(struct pci_dev *dev, u32 acs_flags);
/*
* AMD has indicated that the devices below do not support peer-to-peer
@@ -4667,7 +4667,7 @@ static int pci_quirk_mf_endpoint_acs(struct pci_dev *dev, u16 acs_flags);
* 1022:780f [AMD] FCH PCI Bridge
* 1022:7809 [AMD] FCH USB OHCI Controller
*/
-static int pci_quirk_amd_sb_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_amd_sb_acs(struct pci_dev *dev, u32 acs_flags)
{
#ifdef CONFIG_ACPI
struct acpi_table_header *header = NULL;
@@ -4709,7 +4709,7 @@ static bool pci_quirk_cavium_acs_match(struct pci_dev *dev)
}
}
-static int pci_quirk_cavium_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_cavium_acs(struct pci_dev *dev, u32 acs_flags)
{
if (!pci_quirk_cavium_acs_match(dev))
return -ENOTTY;
@@ -4725,7 +4725,7 @@ static int pci_quirk_cavium_acs(struct pci_dev *dev, u16 acs_flags)
return pci_acs_ctrl_isolated(acs_flags);
}
-static int pci_quirk_xgene_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_xgene_acs(struct pci_dev *dev, u32 acs_flags)
{
/*
* X-Gene Root Ports matching this quirk do not allow peer-to-peer
@@ -4740,7 +4740,7 @@ static int pci_quirk_xgene_acs(struct pci_dev *dev, u16 acs_flags)
* But the implementation could block peer-to-peer transactions between them
* and provide ACS-like functionality.
*/
-static int pci_quirk_zhaoxin_pcie_ports_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_zhaoxin_pcie_ports_acs(struct pci_dev *dev, u32 acs_flags)
{
if (!pci_is_pcie(dev) ||
((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
@@ -4810,7 +4810,7 @@ static bool pci_quirk_intel_pch_acs_match(struct pci_dev *dev)
return false;
}
-static int pci_quirk_intel_pch_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_intel_pch_acs(struct pci_dev *dev, u32 acs_flags)
{
if (!pci_quirk_intel_pch_acs_match(dev))
return -ENOTTY;
@@ -4831,7 +4831,7 @@ static int pci_quirk_intel_pch_acs(struct pci_dev *dev, u16 acs_flags)
* Port to pass traffic to another Root Port. All PCIe transactions are
* terminated inside the Root Port.
*/
-static int pci_quirk_qcom_rp_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_qcom_rp_acs(struct pci_dev *dev, u32 acs_flags)
{
return pci_acs_ctrl_isolated(acs_flags);
}
@@ -4842,12 +4842,12 @@ static int pci_quirk_qcom_rp_acs(struct pci_dev *dev, u16 acs_flags)
* and validate bus numbers in requests, but does not provide an ACS
* capability.
*/
-static int pci_quirk_nxp_rp_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_nxp_rp_acs(struct pci_dev *dev, u32 acs_flags)
{
return pci_acs_ctrl_isolated(acs_flags);
}
-static int pci_quirk_al_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_al_acs(struct pci_dev *dev, u32 acs_flags)
{
if (pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)
return -ENOTTY;
@@ -4925,7 +4925,7 @@ static bool pci_quirk_intel_spt_pch_acs_match(struct pci_dev *dev)
#define INTEL_SPT_ACS_CTRL (PCI_ACS_CAP + 4)
-static int pci_quirk_intel_spt_pch_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_intel_spt_pch_acs(struct pci_dev *dev, u32 acs_flags)
{
int pos;
u32 cap, ctrl;
@@ -4946,7 +4946,7 @@ static int pci_quirk_intel_spt_pch_acs(struct pci_dev *dev, u16 acs_flags)
return pci_acs_ctrl_enabled(acs_flags, ctrl);
}
-static int pci_quirk_mf_endpoint_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_mf_endpoint_acs(struct pci_dev *dev, u32 acs_flags)
{
/*
* SV, TB, and UF are not relevant to multifunction endpoints.
@@ -4962,7 +4962,7 @@ static int pci_quirk_mf_endpoint_acs(struct pci_dev *dev, u16 acs_flags)
PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
}
-static int pci_quirk_rciep_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_rciep_acs(struct pci_dev *dev, u32 acs_flags)
{
/*
* Intel RCiEP's are required to allow p2p only on translated
@@ -4975,7 +4975,7 @@ static int pci_quirk_rciep_acs(struct pci_dev *dev, u16 acs_flags)
return pci_acs_ctrl_isolated(acs_flags);
}
-static int pci_quirk_brcm_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_brcm_acs(struct pci_dev *dev, u32 acs_flags)
{
/*
* iProc PAXB Root Ports don't advertise an ACS capability, but
@@ -4986,7 +4986,7 @@ static int pci_quirk_brcm_acs(struct pci_dev *dev, u16 acs_flags)
return pci_acs_ctrl_isolated(acs_flags);
}
-static int pci_quirk_loongson_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_loongson_acs(struct pci_dev *dev, u32 acs_flags)
{
/*
* Loongson PCIe Root Ports don't advertise an ACS capability, but
@@ -5006,7 +5006,7 @@ static int pci_quirk_loongson_acs(struct pci_dev *dev, u16 acs_flags)
* RP1000/RP2000 10G NICs(sp).
* FF5xxx 40G/25G/10G NICs(aml).
*/
-static int pci_quirk_wangxun_nic_acs(struct pci_dev *dev, u16 acs_flags)
+static int pci_quirk_wangxun_nic_acs(struct pci_dev *dev, u32 acs_flags)
{
switch (dev->device) {
case 0x0100 ... 0x010F: /* EM */
@@ -5022,7 +5022,7 @@ static int pci_quirk_wangxun_nic_acs(struct pci_dev *dev, u16 acs_flags)
static const struct pci_dev_acs_enabled {
u16 vendor;
u16 device;
- int (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
+ int (*acs_enabled)(struct pci_dev *dev, u32 acs_flags);
} pci_dev_acs_enabled[] = {
{ PCI_VENDOR_ID_ATI, 0x4385, pci_quirk_amd_sb_acs },
{ PCI_VENDOR_ID_ATI, 0x439c, pci_quirk_amd_sb_acs },
--
2.43.0
* [PATCH v2 10/16] PCI: Add pci_mfd_isolation()
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (8 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 09/16] PCI: Widen the acs_flags to u32 within the quirk callback Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-08-20 17:21 ` Keith Busch
2025-07-09 14:52 ` [PATCH v2 11/16] iommu: Compute iommu_groups properly for PCIe MFDs Jason Gunthorpe
` (7 subsequent siblings)
17 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
Introduce a new ACS-related quirk that indicates a single function in an
MFD is isolated from the other functions in the MFD. In other words, the
function has no internal loopback to other functions.
This is different from the existing ACS quirk since ACS only indicates if
the function's egress can reach other functions in the MFD for P2P, while
leaving the ingress question up to the ACS of the other
functions. pci_mfd_isolation() reports both the ingress and egress
behavior together, if both are blocked then it can return true.
This quirk matches the historical behavior of the
pci_dev_specific_acs_enabled() quirk that would allow 'adding' ACS to a
single function to force that function into its own iommu_group. Many
quirks were added on this assumption to give specific functions their own
IOMMU groups. For example, some Intel systems have a NIC in an MFD with
other functions and no ACS capabilities. The NIC has been quirked, while
the rest of the MFD functions are unquirked.
Add an internal flag, PCI_ACS_QUIRK_ACS_ISOLATED, in the upper 16 bits of
the acs_flags. If the quirk implementation sees the flag and decides the
function should be isolated, then it can return success. The additional
flag automatically makes the existing quirks all return failure as it is
never masked off.
Have pci_quirk_mf_endpoint_acs() support the flag, with some expectation
this will grow down the road.
Apply the MFD isolation override to a general case of host bridges inside
MFDs that match some AMD systems:
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
NUMA node: 3
IOMMU group: 52
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] (prog-if 00 [Normal decode])
[..]
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
[..]
Capabilities: [2a0 v1] Access Control Services
The logic being that a host bridge with no MMIO has no downstream devices
and cannot source or sink any P2P MMIO operations; thus it does not
contribute to the ACS isolation calculation for the MFD.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/pci.h | 5 ++++
drivers/pci/quirks.c | 58 +++++++++++++++++++++++++++++++++++---------
drivers/pci/search.c | 30 +++++++++++++++++++++++
include/linux/pci.h | 4 +++
4 files changed, 85 insertions(+), 12 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 12215ee72afb68..78651025096bcd 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -786,6 +786,7 @@ int pci_dev_specific_acs_enabled(struct pci_dev *dev, u16 acs_flags);
int pci_dev_specific_enable_acs(struct pci_dev *dev);
int pci_dev_specific_disable_acs_redir(struct pci_dev *dev);
int pcie_failed_link_retrain(struct pci_dev *dev);
+bool pci_dev_specific_mfd_isolated(struct pci_dev *dev);
#else
static inline int pci_dev_specific_acs_enabled(struct pci_dev *dev,
u16 acs_flags)
@@ -804,6 +805,10 @@ static inline int pcie_failed_link_retrain(struct pci_dev *dev)
{
return -ENOTTY;
}
+static inline bool pci_dev_specific_mfd_isolated(struct pci_dev *dev)
+{
+ return false;
+}
#endif
/* PCI error reporting and recovery */
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 71b9550e46eb06..85c786d66646a8 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4616,6 +4616,8 @@ static void quirk_chelsio_T5_disable_root_port_attributes(struct pci_dev *pdev)
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,
quirk_chelsio_T5_disable_root_port_attributes);
+#define PCI_ACS_QUIRK_ACS_ISOLATED BIT(16)
+
/*
* pci_acs_ctrl_enabled - compare desired ACS controls with those provided
* by a device
@@ -4948,6 +4950,13 @@ static int pci_quirk_intel_spt_pch_acs(struct pci_dev *dev, u32 acs_flags)
static int pci_quirk_mf_endpoint_acs(struct pci_dev *dev, u32 acs_flags)
{
+ /*
+	 * The function cannot get P2P MMIO from the other functions in the MFD,
+	 * even if those functions do not have ACS or ACS quirks.
+ */
+ if (acs_flags & PCI_ACS_QUIRK_ACS_ISOLATED)
+ return 1;
+
/*
* SV, TB, and UF are not relevant to multifunction endpoints.
*
@@ -5186,18 +5195,7 @@ static const struct pci_dev_acs_enabled {
{ 0 }
};
-/*
- * pci_dev_specific_acs_enabled - check whether device provides ACS controls
- * @dev: PCI device
- * @acs_flags: Bitmask of desired ACS controls
- *
- * Returns:
- * -ENOTTY: No quirk applies to this device; we can't tell whether the
- * device provides the desired controls
- * 0: Device does not provide all the desired controls
- * >0: Device provides all the controls in @acs_flags
- */
-int pci_dev_specific_acs_enabled(struct pci_dev *dev, u16 acs_flags)
+static int pci_dev_call_acs_enabled(struct pci_dev *dev, u32 acs_flags)
{
const struct pci_dev_acs_enabled *i;
int ret;
@@ -5222,6 +5220,42 @@ int pci_dev_specific_acs_enabled(struct pci_dev *dev, u16 acs_flags)
return -ENOTTY;
}
+/*
+ * pci_dev_specific_acs_enabled - check whether device provides ACS controls
+ * @dev: PCI device
+ * @acs_flags: Bitmask of desired ACS controls
+ *
+ * Returns:
+ * -ENOTTY: No quirk applies to this device; we can't tell whether the
+ * device provides the desired controls
+ * 0: Device does not provide all the desired controls
+ * >0: Device provides all the controls in @acs_flags
+ */
+int pci_dev_specific_acs_enabled(struct pci_dev *dev, u16 acs_flags)
+{
+ return pci_dev_call_acs_enabled(dev, acs_flags);
+}
+
+/*
+ * pci_dev_specific_mfd_isolated - check whether a MFD function is isolated
+ * @dev: PCI device
+ *
+ * pci_dev_specific_acs_enabled() emulates the ACS flags using a quirk;
+ * however, historically Linux has not quirked every function in a MFD.
+ * pci_dev_specific_mfd_isolated() overrides the other function MFD checks and
+ * can consider a single function fully isolated from all other functions both
+ * for egress and ingress directions.
+ *
+ * Returns:
+ * false: No override, use normal PCI defined mechanisms
+ * true: Function is isolated from P2P to other functions in the device
+ */
+bool pci_dev_specific_mfd_isolated(struct pci_dev *dev)
+{
+ return pci_dev_call_acs_enabled(dev, PCI_ACS_QUIRK_ACS_ISOLATED |
+ PCI_ACS_ISOLATED) > 0;
+}
+
/* Config space offset of Root Complex Base Address register */
#define INTEL_LPC_RCBA_REG 0xf0
/* 31:14 RCBA address */
diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index dc816dc4505c6d..2be6881087b335 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -277,6 +277,36 @@ enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
}
}
+/*
+ * pci_mfd_isolated - check whether a MFD function is isolated
+ * @dev: PCI device
+ *
+ * True if the dev function on a MFD should be considered isolated from all
+ * other functions in the MFD. This is used to override ACS checks that might
+ * otherwise indicate the MFD function participates in an internal loopback.
+ *
+ * Returns:
+ * false: No override, use normal PCI defined mechanisms
+ * true: Function is isolated from P2P to other functions in the device
+ */
+bool pci_mfd_isolated(struct pci_dev *dev)
+{
+ /*
+ * For some reason AMD likes to put "dummy functions" in their PCI
+ * hierarchy as part of a multi function device. These are notable
+ * because they can't do anything. No BARs and no downstream bus. Since
+ * they cannot accept P2P or initiate any MMIO we consider them to be
+	 * isolated from the rest of the MFD. Since they often accompany a real PCI
+ * bridge with downstream devices it is important that the MFD be
+ * considered isolated. Annoyingly there is no ACS capability reported
+ * so we assume that a host bridge in a MFD with no MMIO has the special
+ * property of never accepting or initiating P2P operations.
+ */
+ if (dev->class >> 8 == PCI_CLASS_BRIDGE_HOST && !pci_has_mmio(dev))
+ return true;
+ return pci_dev_specific_mfd_isolated(dev);
+}
+
static struct pci_bus *pci_do_find_bus(struct pci_bus *bus, unsigned char busnr)
{
struct pci_bus *child;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2e629087539101..d95a983c835666 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1256,6 +1256,7 @@ void pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
bool (*reachable)(struct pci_dev *deva,
struct pci_dev *devb));
enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus);
+bool pci_mfd_isolated(struct pci_dev *dev);
int pci_dev_present(const struct pci_device_id *ids);
@@ -2078,6 +2079,9 @@ pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
static inline enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
{ return PCIE_NON_ISOLATED; }
+static inline bool pci_mfd_isolated(struct pci_dev *dev)
+{ return false; }
+
static inline int pci_dev_present(const struct pci_device_id *ids)
{ return 0; }
--
2.43.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v2 11/16] iommu: Compute iommu_groups properly for PCIe MFDs
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (9 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 10/16] PCI: Add pci_mfd_isolation() Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-28 9:47 ` Cédric Le Goater
2025-07-09 14:52 ` [PATCH v2 12/16] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
` (6 subsequent siblings)
17 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
The current algorithm does not work if the ACS is not uniform across the
entire MFD. This seems to mostly happen in the real world because of
Linux's quirk system, which sometimes quirks only one function in a MFD,
creating an asymmetric situation.
For discussion let's consider a simple MFD topology like the below:
-- MFD 00:1f.0 ACS != REQ_ACS_FLAGS
Root 00:00.00 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS
|- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
This asymmetric ACS could be created using the config_acs kernel command
line parameter, from quirks, or from a poorly thought out device that has
ACS flags only on some functions.
Since ACS is an egress property the asymmetric flags allow for 00:1f.0 to
do memory accesses into 00:1f.6's BARs, but 00:1f.6 cannot reach any other
function. Thus we expect an iommu_group to contain all three
devices. Instead the current algorithm gives a group of [1f.0, 1f.2] and a
single device group of 1f.6.
The current algorithm sees the good ACS flags on 00:1f.6 and does not
consider ACS on any other MFD functions.
For path properties the ACS flags say that 00:1f.6 is safe to use with
PASID and supports SVA as it will not have any portions of its address
space routed away from the IOMMU; this part of the ACS system is working
correctly.
This is a problematic fix because this historical mistake has created an
ecosystem around it. We now have quirks that assume a single function is
enough to quirk, and it seems there are PCI root complexes that make the
same non-compliant assumption.
The new helper pci_mfd_isolated() retains the existing quirks and we
will probably need to add additional HW quirks for PCI root complexes that
have not followed the spec but have been silently working today.
Use pci_reachable_set() in pci_device_group() to make the resulting
algorithm faster and easier to understand.
Add pci_mfds_are_same_group() which specifically looks pair-wise at all
functions in the MFDs. Any function without ACS isolation will become
reachable to all other functions.
pci_reachable_set() does the calculations for figuring out the set of
devices under the pci_bus_sem, which is better than repeatedly searching
across all PCI devices.
Once the set of devices is determined and the set has more than one device
use pci_get_slot() to search for any existing groups in the reachable set.
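The grouping rule amounts to a transitive closure of the pairwise "same group" relation. Here is a simplified, self-contained model (not kernel code: the struct, the slot extraction, and the O(n²) fixed-point loop are stand-ins for pci_reachable_set() and pci_mfds_are_same_group()) that reproduces the 1f.0/1f.2/1f.6 example from the commit message:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_DEVFNS 256

struct dev {
	int devfn;
	bool multifunction;
	bool acs_isolated;	/* has the full PCI_ACS_ISOLATED set */
};

/*
 * Simplified pci_mfds_are_same_group(): any function lacking ACS
 * isolation links the whole slot together.
 */
static bool same_group(const struct dev *a, const struct dev *b)
{
	if ((a->devfn >> 3) != (b->devfn >> 3))	/* different slot */
		return false;
	if (!a->multifunction || !b->multifunction)
		return false;
	return !a->acs_isolated || !b->acs_isolated;
}

/* Transitive closure: keep adding reachable devices until stable. */
static void reachable_set(const struct dev *devs, int ndevs,
			  const struct dev *start, bool set[NR_DEVFNS])
{
	bool changed = true;

	set[start->devfn] = true;
	while (changed) {
		changed = false;
		for (int i = 0; i < ndevs; i++) {
			if (set[devs[i].devfn])
				continue;
			for (int j = 0; j < ndevs; j++) {
				if (set[devs[j].devfn] &&
				    same_group(&devs[i], &devs[j])) {
					set[devs[i].devfn] = true;
					changed = true;
				}
			}
		}
	}
}
```

Starting from 00:1f.6 (the isolated function), the closure still pulls in 1f.0 and 1f.2 because the relation is evaluated in both directions, yielding the expected single group of all three functions.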
Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommu.c | 173 +++++++++++++++++-------------------------
1 file changed, 71 insertions(+), 102 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7b407065488296..cd26b43916e8be 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1413,85 +1413,6 @@ int iommu_group_id(struct iommu_group *group)
}
EXPORT_SYMBOL_GPL(iommu_group_id);
-static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
- unsigned long *devfns);
-
-/*
- * For multifunction devices which are not isolated from each other, find
- * all the other non-isolated functions and look for existing groups. For
- * each function, we also need to look for aliases to or from other devices
- * that may already have a group.
- */
-static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
- unsigned long *devfns)
-{
- struct pci_dev *tmp = NULL;
- struct iommu_group *group;
-
- if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
- return NULL;
-
- for_each_pci_dev(tmp) {
- if (tmp == pdev || tmp->bus != pdev->bus ||
- PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) ||
- pci_acs_enabled(tmp, PCI_ACS_ISOLATED))
- continue;
-
- group = get_pci_alias_group(tmp, devfns);
- if (group) {
- pci_dev_put(tmp);
- return group;
- }
- }
-
- return NULL;
-}
-
-/*
- * Look for aliases to or from the given device for existing groups. DMA
- * aliases are only supported on the same bus, therefore the search
- * space is quite small (especially since we're really only looking at pcie
- * device, and therefore only expect multiple slots on the root complex or
- * downstream switch ports). It's conceivable though that a pair of
- * multifunction devices could have aliases between them that would cause a
- * loop. To prevent this, we use a bitmap to track where we've been.
- */
-static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
- unsigned long *devfns)
-{
- struct pci_dev *tmp = NULL;
- struct iommu_group *group;
-
- if (test_and_set_bit(pdev->devfn & 0xff, devfns))
- return NULL;
-
- group = iommu_group_get(&pdev->dev);
- if (group)
- return group;
-
- for_each_pci_dev(tmp) {
- if (tmp == pdev || tmp->bus != pdev->bus)
- continue;
-
- /* We alias them or they alias us */
- if (pci_devs_are_dma_aliases(pdev, tmp)) {
- group = get_pci_alias_group(tmp, devfns);
- if (group) {
- pci_dev_put(tmp);
- return group;
- }
-
- group = get_pci_function_alias_group(tmp, devfns);
- if (group) {
- pci_dev_put(tmp);
- return group;
- }
- }
- }
-
- return NULL;
-}
-
/*
* Generic device_group call-back function. It just allocates one
* iommu-group per device.
@@ -1534,40 +1455,88 @@ static struct iommu_group *pci_group_alloc_non_isolated(void)
return group;
}
+/*
+ * Ignoring quirks, all functions in the MFD need to be isolated from each other
+ * and get their own groups, otherwise the whole MFD will share a group. Any
+ * function that lacks explicit ACS isolation is assumed to be able to P2P
+ * access any other function in the MFD.
+ */
+static bool pci_mfds_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
+{
+ /* Are deva/devb functions in the same MFD? */
+ if (PCI_SLOT(deva->devfn) != PCI_SLOT(devb->devfn))
+ return false;
+ /* Don't understand what is happening, be conservative */
+ if (deva->multifunction != devb->multifunction)
+ return true;
+ if (!deva->multifunction)
+ return false;
+
+ /* Quirks can inhibit single MFD functions from combining into groups */
+ if (pci_mfd_isolated(deva) || pci_mfd_isolated(devb))
+ return false;
+
+ /* Can they reach each other's MMIO through P2P? */
+ return !pci_acs_enabled(deva, PCI_ACS_ISOLATED) ||
+ !pci_acs_enabled(devb, PCI_ACS_ISOLATED);
+}
+
+static bool pci_devs_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
+{
+ /*
+ * This is allowed to return cycles: a,b -> b,c -> c,a can be aliases.
+ */
+ if (pci_devs_are_dma_aliases(deva, devb))
+ return true;
+
+ return pci_mfds_are_same_group(deva, devb);
+}
+
static struct iommu_group *pci_get_alias_group(struct pci_dev *pdev)
{
- struct iommu_group *group;
- DECLARE_BITMAP(devfns, 256) = {};
+ struct pci_reachable_set devfns;
+ const unsigned int NR_DEVFNS = sizeof(devfns.devfns) * BITS_PER_BYTE;
+ unsigned int devfn;
/*
- * Look for existing groups on device aliases. If we alias another
- * device or another device aliases us, use the same group.
+ * Look for existing groups on device aliases and multi-function ACS. If
+ * we alias another device or another device aliases us, use the same
+ * group.
+ *
+ * pci_reachable_set() should return the same bitmap if called for any
+ * device in the set and we want all devices in the set to have the same
+ * group.
*/
- group = get_pci_alias_group(pdev, devfns);
- if (group)
- return group;
+ pci_reachable_set(pdev, &devfns, pci_devs_are_same_group);
+ /* start is known to have iommu_group_get() == NULL */
+ __clear_bit(pdev->devfn, devfns.devfns);
/*
- * Look for existing groups on non-isolated functions on the same
- * slot and aliases of those funcions, if any. No need to clear
- * the search bitmap, the tested devfns are still valid.
- */
- group = get_pci_function_alias_group(pdev, devfns);
- if (group)
- return group;
-
- /*
- * When MFD's are included in the set due to ACS we assume that if ACS
- * permits an internal loopback between functions it also permits the
- * loopback to go downstream if a function is a bridge.
+ * When MFD functions are included in the set due to ACS we assume that
+ * if ACS permits an internal loopback between functions it also permits
+ * the loopback to go downstream if any function is a bridge.
*
* It is less clear what aliases mean when applied to a bridge. For now
* be conservative and also propagate the group downstream.
*/
- __clear_bit(pdev->devfn & 0xFF, devfns);
- if (!bitmap_empty(devfns, sizeof(devfns) * BITS_PER_BYTE))
- return pci_group_alloc_non_isolated();
- return NULL;
+ if (bitmap_empty(devfns.devfns, NR_DEVFNS))
+ return NULL;
+
+ for_each_set_bit(devfn, devfns.devfns, NR_DEVFNS) {
+ struct iommu_group *group;
+ struct pci_dev *pdev_slot;
+
+ pdev_slot = pci_get_slot(pdev->bus, devfn);
+ group = iommu_group_get(&pdev_slot->dev);
+ pci_dev_put(pdev_slot);
+ if (group) {
+ if (WARN_ON(!(group->bus_data &
+ BUS_DATA_PCI_NON_ISOLATED)))
+ group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
+ return group;
+ }
+ }
+ return pci_group_alloc_non_isolated();
}
static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
--
2.43.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v2 12/16] iommu: Validate that pci_for_each_dma_alias() matches the groups
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (10 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 11/16] iommu: Compute iommu_groups properly for PCIe MFDs Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-17 22:07 ` Donald Dutile
2025-07-09 14:52 ` [PATCH v2 13/16] PCI: Add the ACS Enhanced Capability definitions Jason Gunthorpe
` (5 subsequent siblings)
17 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
Directly check that the devices touched by pci_for_each_dma_alias() match
the groups that were built by pci_device_group(). This helps validate that
pci_for_each_dma_alias() and pci_bus_isolated() are consistent.
This should eventually be hidden behind a debug kconfig, but for now it is
useful to get feedback from more diverse systems about any problems.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommu.c | 73 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 72 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index cd26b43916e8be..e4ae1d064554e4 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1606,7 +1606,7 @@ static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
* Once a PCI bus becomes non isolating the entire downstream hierarchy of
* that bus becomes a single group.
*/
-struct iommu_group *pci_device_group(struct device *dev)
+static struct iommu_group *__pci_device_group(struct device *dev)
{
struct pci_dev *pdev = to_pci_dev(dev);
struct iommu_group *group;
@@ -1713,6 +1713,77 @@ struct iommu_group *pci_device_group(struct device *dev)
WARN_ON(true);
return ERR_PTR(-EINVAL);
}
+
+struct check_group_aliases_data {
+ struct pci_dev *pdev;
+ struct iommu_group *group;
+};
+
+static void pci_check_group(const struct check_group_aliases_data *data,
+ u16 alias, struct pci_dev *pdev)
+{
+ struct iommu_group *group;
+
+ group = iommu_group_get(&pdev->dev);
+ if (!group)
+ return;
+
+ if (group != data->group)
+ dev_err(&data->pdev->dev,
+ "During group construction alias processing needed dev %s alias %x to have the same group but %u != %u\n",
+ pci_name(pdev), alias, data->group->id, group->id);
+ iommu_group_put(group);
+}
+
+static int pci_check_group_aliases(struct pci_dev *pdev, u16 alias,
+ void *opaque)
+{
+ const struct check_group_aliases_data *data = opaque;
+
+ /*
+ * Sometimes when a PCIe-PCI bridge is performing transactions on behalf
+ * of its subordinate bus it uses devfn=0 on the subordinate bus as the
+ * alias. This means that 0 will alias with all devfns on the
+ * subordinate bus and so we expect to see those in the same group. pdev
+ * in this case is the bridge itself and pdev->bus is the primary bus of
+ * the bridge.
+ */
+ if (pdev->bus->number != PCI_BUS_NUM(alias)) {
+ struct pci_dev *piter = NULL;
+
+ for_each_pci_dev(piter) {
+ if (pci_domain_nr(pdev->bus) ==
+ pci_domain_nr(piter->bus) &&
+ PCI_BUS_NUM(alias) == piter->bus->number)
+ pci_check_group(data, alias, piter);
+ }
+ } else {
+ pci_check_group(data, alias, pdev);
+ }
+
+ return 0;
+}
+
+struct iommu_group *pci_device_group(struct device *dev)
+{
+ struct iommu_group *group = __pci_device_group(dev);
+
+ if (!IS_ERR(group)) {
+ struct check_group_aliases_data data = {
+ .pdev = to_pci_dev(dev), .group = group
+ };
+
+ /*
+ * The IOMMU driver should use pci_for_each_dma_alias() to
+ * figure out what RIDs to program and the core requires all the
+ * RIDs to fall within the same group. Validate that everything
+ * worked properly.
+ */
+ pci_for_each_dma_alias(data.pdev, pci_check_group_aliases,
+ &data);
+ }
+ return group;
+}
EXPORT_SYMBOL_GPL(pci_device_group);
/* Get the IOMMU group for device on fsl-mc bus */
--
2.43.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v2 13/16] PCI: Add the ACS Enhanced Capability definitions
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (11 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 12/16] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 14/16] PCI: Enable ACS Enhanced bits for enable_acs and config_acs Jason Gunthorpe
` (4 subsequent siblings)
17 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
This brings the definitions up to PCI Express revision 5.0:
* ACS I/O Request Blocking Enable
* ACS DSP Memory Target Access Control
* ACS USP Memory Target Access Control
* ACS Unclaimed Request Redirect
Support for this entire grouping is advertised by the ACS Enhanced
Capability bit.
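As a sketch of how a consumer might decode these definitions (constants mirror the hunk below; the helper name is made up): bit 0x0080 means ACS Enhanced in the capability register but I/O Request Blocking in the control register, and the enhanced control bits are only meaningful when the capability bit is set:

```c
#include <assert.h>
#include <stdint.h>

/* Mirror the new pci_regs.h values (PCIe r5.0). */
#define ACS_ENHANCED     0x0080	/* capability register only */
#define ACS_IORB         0x0080	/* control register: I/O Request Blocking */
#define ACS_DSP_MT_RB    0x0100
#define ACS_DSP_MT_RR    0x0200
#define ACS_USP_MT_RB    0x0400
#define ACS_USP_MT_RR    0x0800
#define ACS_UNCLAIMED_RR 0x1000

/*
 * The enhanced control bits only carry meaning when the ACS Enhanced
 * capability bit is advertised in the capability register.
 */
static uint16_t enhanced_ctrl_bits(uint16_t cap)
{
	if (!(cap & ACS_ENHANCED))
		return 0;
	return ACS_IORB | ACS_DSP_MT_RB | ACS_DSP_MT_RR |
	       ACS_USP_MT_RB | ACS_USP_MT_RR | ACS_UNCLAIMED_RR;
}
```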
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
include/uapi/linux/pci_regs.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 6edffd31cd8e2c..79d8d317b3c17e 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1004,8 +1004,16 @@
#define PCI_ACS_UF 0x0010 /* Upstream Forwarding */
#define PCI_ACS_EC 0x0020 /* P2P Egress Control */
#define PCI_ACS_DT 0x0040 /* Direct Translated P2P */
+#define  PCI_ACS_ENHANCED	0x0080	/* IORB, DSP_MT_xx, USP_MT_xx. Capability only */
+#define PCI_ACS_EGRESS_CTL_SZ GENMASK(15, 8) /* Egress Control Vector Size */
#define PCI_ACS_EGRESS_BITS 0x05 /* ACS Egress Control Vector Size */
#define PCI_ACS_CTRL 0x06 /* ACS Control Register */
+#define PCI_ACS_IORB 0x0080 /* I/O Request Blocking */
+#define PCI_ACS_DSP_MT_RB 0x0100 /* DSP Memory Target Access Control Request Blocking */
+#define PCI_ACS_DSP_MT_RR 0x0200 /* DSP Memory Target Access Control Request Redirect */
+#define PCI_ACS_USP_MT_RB 0x0400 /* USP Memory Target Access Control Request Blocking */
+#define PCI_ACS_USP_MT_RR 0x0800 /* USP Memory Target Access Control Request Redirect */
+#define PCI_ACS_UNCLAIMED_RR 0x1000 /* Unclaimed Request Redirect Control */
#define PCI_ACS_EGRESS_CTL_V 0x08 /* ACS Egress Control Vector */
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v2 14/16] PCI: Enable ACS Enhanced bits for enable_acs and config_acs
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (12 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 13/16] PCI: Add the ACS Enhanced Capability definitions Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 15/16] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
` (3 subsequent siblings)
17 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
The ACS Enhanced bits are intended to address a lack of precision in the
spec about what ACS P2P Request Redirect is supposed to do. While Linux
has long assumed that PCI_ACS_RR would cover MMIO BARs located in the root
port and PCIe Switch ports, the spec took the position that it is
implementation specific.
To get the behavior Linux has long assumed it should be setting:
PCI_ACS_RR | PCI_ACS_DSP_MT_RR | PCI_ACS_USP_MT_RR | PCI_ACS_UNCLAIMED_RR
Follow this guidance in enable_acs and set the additional bits if ACS
Enhanced is supported.
Allow config_acs to control these bits if the device has ACS Enhanced.
The spec permits the HW to hardwire the bits, so after setting them
pci_acs_flags_enabled() does a pci_read_config_word() to read the
actual value in effect.
Note that currently Linux sets these bits to 0, so any new HW that
supports ACS Enhanced will find historical Linux disabling these
functions. Devices wanting to be compatible with old Linux will need to
wire the ctrl bits to follow ACS_RR. Devices that implement ACS Enhanced
and honor the ctrl=0 behavior will break PASID SVA support and VFIO
isolation when ACS_RR is enabled.
Due to the above I strongly encourage backporting this change; otherwise
old kernels may have issues with new generations of PCI switches.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/pci.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e9448d55113bdf..d16b92f3a0c881 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -957,6 +957,7 @@ static void __pci_config_acs(struct pci_dev *dev, struct pci_acs *caps,
const char *p, const u16 acs_mask, const u16 acs_flags)
{
u16 flags = acs_flags;
+ u16 supported_flags;
u16 mask = acs_mask;
char *delimit;
int ret = 0;
@@ -1001,8 +1002,14 @@ static void __pci_config_acs(struct pci_dev *dev, struct pci_acs *caps,
}
}
- if (mask & ~(PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR |
- PCI_ACS_UF | PCI_ACS_EC | PCI_ACS_DT)) {
+ supported_flags = PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
+ PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_EC |
+ PCI_ACS_DT;
+ if (caps->cap & PCI_ACS_ENHANCED)
+ supported_flags |= PCI_ACS_USP_MT_RR |
+ PCI_ACS_DSP_MT_RR |
+ PCI_ACS_UNCLAIMED_RR;
+ if (mask & ~supported_flags) {
pci_err(dev, "Invalid ACS flags specified\n");
return;
}
@@ -1062,6 +1069,14 @@ static void pci_std_enable_acs(struct pci_dev *dev, struct pci_acs *caps)
/* Upstream Forwarding */
caps->ctrl |= (caps->cap & PCI_ACS_UF);
+ /*
+ * USP/DSP Memory Target Access Control and Unclaimed Request Redirect
+ */
+ if (caps->cap & PCI_ACS_ENHANCED) {
+ caps->ctrl |= PCI_ACS_USP_MT_RR | PCI_ACS_DSP_MT_RR |
+ PCI_ACS_UNCLAIMED_RR;
+ }
+
/* Enable Translation Blocking for external devices and noats */
if (pci_ats_disabled() || dev->external_facing || dev->untrusted)
caps->ctrl |= (caps->cap & PCI_ACS_TB);
--
2.43.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v2 15/16] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (13 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 14/16] PCI: Enable ACS Enhanced bits for enable_acs and config_acs Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-17 22:17 ` Donald Dutile
2025-08-05 4:39 ` Askar Safin
2025-07-09 14:52 ` [PATCH v2 16/16] PCI: Check ACS Extended flags for pci_bus_isolated() Jason Gunthorpe
` (2 subsequent siblings)
17 siblings, 2 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
Switches ignore the PASID when routing TLPs. This means the path from the
PASID issuing end point to the IOMMU must be direct with no possibility
for another device to claim the addresses.
This is done using ACS flags and pci_enable_pasid() checks for this.
The new ACS Enhanced bits clarify some undefined behaviors in the spec
around what P2P Request Redirect means.
Linux has long assumed that PCI_ACS_RR implies PCI_ACS_DSP_MT_RR |
PCI_ACS_USP_MT_RR | PCI_ACS_UNCLAIMED_RR.
If the device supports ACS Enhanced then use the information it reports to
determine if PASID SVA is supported or not.
PCI_ACS_DSP_MT_RR: Prevents Downstream Port BARs from claiming upstream
flowing transactions
PCI_ACS_USP_MT_RR: Prevents Upstream Port BARs from claiming upstream
flowing transactions
PCI_ACS_UNCLAIMED_RR: Prevents a hole in the USP bridge window compared
to all the DSP bridge windows from generating an
error.
Each of these cases would poke a hole in the PASID address space which is
not permitted.
Enhance the comments around pci_acs_flags_enabled() to better explain the
reasoning for its logic. Continue to take the approach of assuming the
device is doing the "right ACS" if it does not explicitly declare
otherwise.
Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/ats.c | 4 +++-
drivers/pci/pci.c | 54 +++++++++++++++++++++++++++++++++++++++++------
2 files changed, 50 insertions(+), 8 deletions(-)
diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index ec6c8dbdc5e9c9..00603c2c4ff0ea 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -416,7 +416,9 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
if (!pasid)
return -EINVAL;
- if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
+ if (!pci_acs_path_enabled(pdev, NULL,
+ PCI_ACS_RR | PCI_ACS_UF | PCI_ACS_USP_MT_RR |
+ PCI_ACS_DSP_MT_RR | PCI_ACS_UNCLAIMED_RR))
return -EINVAL;
pci_read_config_word(pdev, pasid + PCI_PASID_CAP, &supported);
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d16b92f3a0c881..e49370c90ec890 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3601,6 +3601,52 @@ void pci_configure_ari(struct pci_dev *dev)
}
}
+
+/*
+ * The spec is not clear what it means if the capability bit is 0. One view is
+ * that the device acts as though the ctrl bit is zero; another view is that
+ * the device behavior is undefined.
+ *
+ * Historically Linux has taken the position that a capability bit of 0 means
+ * the device supports the most favorable interpretation of the spec, i.e. that
+ * things like P2P RR are always on. As this is security sensitive we expect
+ * devices that do not follow this rule to be quirked.
+ *
+ * ACS Enhanced eliminates undefined areas of the spec around MMIO in root
+ * ports and switch ports. If those ports have no MMIO then it is not relevant.
+ * PCI_ACS_UNCLAIMED_RR eliminates the undefined area around an upstream switch
+ * window that is not fully decoded by the downstream windows.
+ *
+ * This takes the same approach with ACS Enhanced: if the device does not
+ * support it then we assume the ACS P2P RR has all the enhanced behaviors too.
+ *
+ * Because older Linux kernels force the ACS Enhanced ctrl bits to 0, and
+ * those values would break old kernels on the edge cases they cover, the only
+ * compatible thing for a new device to implement is ACS Enhanced supported
+ * with the control bits (except PCI_ACS_IORB) wired to follow ACS_RR.
+ */
+static u16 pci_acs_ctrl_mask(struct pci_dev *pdev, u16 hw_cap)
+{
+ /*
+ * Egress Control enables use of the Egress Control Vector which is not
+ * present without the cap.
+ */
+ u16 mask = PCI_ACS_EC;
+
+	mask |= hw_cap & (PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
+ PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
+
+ /*
+ * If ACS Enhanced is supported the device reports what it is doing
+ * through these bits which may not be settable.
+ */
+ if (hw_cap & PCI_ACS_ENHANCED)
+ mask |= PCI_ACS_IORB | PCI_ACS_DSP_MT_RB | PCI_ACS_DSP_MT_RR |
+ PCI_ACS_USP_MT_RB | PCI_ACS_USP_MT_RR |
+ PCI_ACS_UNCLAIMED_RR;
+ return mask;
+}
+
static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
{
int pos;
@@ -3610,15 +3656,9 @@ static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
if (!pos)
return false;
- /*
- * Except for egress control, capabilities are either required
- * or only required if controllable. Features missing from the
- * capability field can therefore be assumed as hard-wired enabled.
- */
pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
- acs_flags &= (cap | PCI_ACS_EC);
-
pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+ acs_flags &= pci_acs_ctrl_mask(pdev, cap);
return (ctrl & acs_flags) == acs_flags;
}
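As an editorial aside, the masking rule described in the comment above can be sketched as a standalone program. The classic PCI_ACS_* values below match pci_regs.h, but the ACS Enhanced bit values are placeholders for illustration, not the spec encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Classic ACS bits, as in include/uapi/linux/pci_regs.h */
#define PCI_ACS_SV  0x0001  /* Source Validation */
#define PCI_ACS_TB  0x0002  /* Translation Blocking */
#define PCI_ACS_RR  0x0004  /* P2P Request Redirect */
#define PCI_ACS_CR  0x0008  /* P2P Completion Redirect */
#define PCI_ACS_UF  0x0010  /* Upstream Forwarding */
#define PCI_ACS_EC  0x0020  /* P2P Egress Control */
#define PCI_ACS_DT  0x0040  /* Direct Translated P2P */

/* ACS Enhanced bits -- placeholder values, NOT the spec encoding */
#define PCI_ACS_IORB         0x0080
#define PCI_ACS_DSP_MT_RB    0x0100
#define PCI_ACS_DSP_MT_RR    0x0200
#define PCI_ACS_USP_MT_RB    0x0400
#define PCI_ACS_USP_MT_RR    0x0800
#define PCI_ACS_UNCLAIMED_RR 0x1000
#define PCI_ACS_ENHANCED     0x2000

/*
 * Mirrors the intended rule: EC must always be explicitly enabled so it
 * stays in the mask unconditionally; classic bits missing from the
 * capability are assumed hard-wired on, so they are masked by hw_cap;
 * the Enhanced reporting bits only count when PCI_ACS_ENHANCED is set.
 */
static uint16_t acs_ctrl_mask(uint16_t hw_cap)
{
	uint16_t mask = PCI_ACS_EC;

	mask |= hw_cap & (PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
			  PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
	if (hw_cap & PCI_ACS_ENHANCED)
		mask |= PCI_ACS_IORB | PCI_ACS_DSP_MT_RB | PCI_ACS_DSP_MT_RR |
			PCI_ACS_USP_MT_RB | PCI_ACS_USP_MT_RR |
			PCI_ACS_UNCLAIMED_RR;
	return mask;
}
```

A device with no ACS Enhanced support thus never has the Enhanced bits in its required mask, matching the assumption that such devices behave as if those protections are hard-wired on.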
--
2.43.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v2 16/16] PCI: Check ACS Extended flags for pci_bus_isolated()
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (14 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 15/16] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
@ 2025-07-09 14:52 ` Jason Gunthorpe
2025-07-18 21:29 ` [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Alex Williamson
2025-08-02 1:45 ` Ethan Zhao
17 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-09 14:52 UTC (permalink / raw)
To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
When looking at a PCIe switch we want to see that the USP/DSP MMIO have
request redirect enabled. Detect the case where the USP is expressly not
isolated from the DSP and ensure the USP is included in the group.
The DSP Memory Target bit also applies to the Root Port, so check it there
too. If upstream-directed transactions can reach the Root Port's MMIO then
it is not isolated.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/pci/search.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index 2be6881087b335..7425680fe92d60 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -127,6 +127,8 @@ static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
* traffic flowing upstream back downstream through another DSP.
*
* Thus any non-permissive DSP spoils the whole bus.
+ * PCI_ACS_UNCLAIMED_RR is not required since rejecting requests with
+ * error is still isolation.
*/
guard(rwsem_read)(&pci_bus_sem);
list_for_each_entry(pdev, &bus->devices, bus_list) {
@@ -136,8 +138,14 @@ static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
pdev->dma_alias_mask)
return PCIE_NON_ISOLATED;
- if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
+ if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED |
+ PCI_ACS_DSP_MT_RR |
+ PCI_ACS_USP_MT_RR)) {
+ /* The USP is isolated from the DSP */
+ if (!pci_acs_enabled(pdev, PCI_ACS_USP_MT_RR))
+ return PCIE_NON_ISOLATED;
return PCIE_SWITCH_DSP_NON_ISOLATED;
+ }
}
return PCIE_ISOLATED;
}
@@ -231,11 +239,13 @@ enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
switch (type) {
/*
* Since PCIe links are point to point root and downstream ports are
- * isolated if their own MMIO cannot be reached.
+ * isolated if their own MMIO cannot be reached. The root port
+ * uses DSP_MT_RR for its own MMIO.
*/
case PCI_EXP_TYPE_ROOT_PORT:
case PCI_EXP_TYPE_DOWNSTREAM:
- if (!pci_acs_enabled(bridge, PCI_ACS_ISOLATED))
+ if (!pci_acs_enabled(bridge,
+ PCI_ACS_ISOLATED | PCI_ACS_DSP_MT_RR))
return PCIE_NON_ISOLATED;
return PCIE_ISOLATED;
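The per-DSP decision in the pcie_switch_isolated() hunk above can be sketched standalone. The flag values and the two_dsp() helper are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

enum switch_iso { SW_ISOLATED, SW_DSP_NON_ISOLATED, SW_NON_ISOLATED };

/* Stand-ins for the ACS flag groups (values illustrative) */
#define ACS_ISOLATED  0x1  /* models PCI_ACS_ISOLATED (SV | RR | CR | UF) */
#define ACS_DSP_MT_RR 0x2
#define ACS_USP_MT_RR 0x4

/*
 * Mirrors the per-DSP loop: any DSP missing the full flag set spoils the
 * whole bus, and if the USP's MMIO is reachable from that DSP (no
 * USP_MT_RR) the USP must be included in the group as well.
 */
static enum switch_iso switch_isolated(const uint8_t *dsp_flags, int ndsp)
{
	const uint8_t want = ACS_ISOLATED | ACS_DSP_MT_RR | ACS_USP_MT_RR;
	int i;

	for (i = 0; i < ndsp; i++) {
		if ((dsp_flags[i] & want) != want) {
			/* The USP is isolated from the DSP? */
			if (!(dsp_flags[i] & ACS_USP_MT_RR))
				return SW_NON_ISOLATED;
			return SW_DSP_NON_ISOLATED;
		}
	}
	return SW_ISOLATED;
}

/* Convenience helper for a two-DSP switch */
static enum switch_iso two_dsp(uint8_t a, uint8_t b)
{
	uint8_t flags[2] = { a, b };

	return switch_isolated(flags, 2);
}
```

For example, one DSP with full flags and one missing only DSP_MT_RR (0x5) yields the DSP-non-isolated result, while a DSP with no flags at all pulls the USP in too.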
--
2.43.0
* Re: [PATCH v2 03/16] iommu: Compute iommu_groups properly for PCIe switches
2025-07-09 14:52 ` [PATCH v2 03/16] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
@ 2025-07-17 22:03 ` Donald Dutile
2025-07-18 18:09 ` Jason Gunthorpe
0 siblings, 1 reply; 46+ messages in thread
From: Donald Dutile @ 2025-07-17 22:03 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
Jason,
Hey!
Function naming nit below...
On 7/9/25 10:52 AM, Jason Gunthorpe wrote:
> The current algorithm does not work if ACS is turned off, and it is not
> clear how this has been missed for so long. Alex suspects a multi-function
> modeling of the DSPs may have hidden it, but I could not find a switch
> using that scheme.
>
> Most likely people were simply happy they had smaller groups oblivious to
> the security problems.
>
> For discussion let's consider a simple topology like the one below:
>
> -- DSP 02:00.0 -> End Point A
> Root 00:00.0 -> USP 01:00.0 --|
> -- DSP 02:03.0 -> End Point B
>
> If ACS is fully activated we expect 00:00.0, 01:00.0, 02:00.0, 02:03.0, A,
> B to all have unique single device groups.
>
> If both DSPs have ACS off then we expect 00:00.0 and 01:00.0 to have
> unique single device groups while 02:00.0, 02:03.0, A, B are part of one
> multi-device group.
>
> If the DSPs have asymmetric ACS, with one fully isolating and one
> non-isolating we also expect the above multi-device group result.
>
> Instead the current algorithm always creates unique single device groups
> for this topology. It happens because the pci_device_group(DSP)
> immediately moves to the USP and computes pci_acs_path_enabled(USP) ==
> true and decides the DSP can get a unique group. The pci_device_group(A)
> immediately moves to the DSP, sees pci_acs_path_enabled(DSP) == false and
> then takes the DSPs group.
>
> The current algorithm has several issues:
>
> 1) It implicitly depends on ordering. Since the existing group discovery
> only goes in the upstream direction discovering a downstream device
> before its upstream will cause the wrong creation of narrower groups.
>
> 2) It assumes that if the path from the end point to the root is entirely
> ACS isolated then that end point is isolated. This misses cross-traffic
> in the asymmetric ACS case.
>
> 3) When evaluating a non-isolated DSP it does not check peer DSPs for an
> already established group unless the multi-function feature does it.
>
> 4) It does not understand the aliasing rule for PCIe to PCI bridges
> where the alias is to the subordinate bus. The bridge's RID on the
> primary bus is not aliased. This causes the PCIe to PCI bridge to be
> wrongly joined to the group with the downstream devices.
>
> As grouping is a security property for VFIO creating incorrectly narrowed
> groups is a security problem for the system.
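[Editorial aside: the grouping rule the quoted commit message expects can be modeled as a small toy. It assumes the rule stated later in the message, that the highest non-isolated bus on a device's path pulls the device into one group shared by everything below that bus; the topology indices and the count_groups() helper are hypothetical.]

```c
#include <assert.h>

#define NDEV 6
/* Toy topology: Root(0) -> USP(1) -> { DSP1(2) -> EP_A(4), DSP2(3) -> EP_B(5) } */
static const int parent[NDEV] = { -1, 0, 1, 1, 2, 3 };

/*
 * bus_isolated[b]: the bus below bridge b has full, symmetric ACS
 * isolation (every downstream port isolating). Walk each device's path
 * to the root; the highest non-isolated bus pulls the device into a
 * group shared by everything below that bus.
 */
static void compute_groups(const int *bus_isolated, int *group)
{
	int d, b;

	for (d = 0; d < NDEV; d++) {
		group[d] = d; /* optimistic: unique group */
		for (b = parent[d]; b >= 0; b = parent[b])
			if (!bus_isolated[b])
				group[d] = 100 + b; /* shared, keyed by bus */
	}
}

/* Helper: how many distinct groups result from the four bridges' status */
static int count_groups(int root, int usp, int dsp1, int dsp2)
{
	int iso[NDEV] = { root, usp, dsp1, dsp2, 1, 1 };
	int g[NDEV], d, e, n = 0;

	compute_groups(iso, g);
	for (d = 0; d < NDEV; d++) {
		int seen = 0;

		for (e = 0; e < d; e++)
			if (g[e] == g[d])
				seen = 1;
		n += !seen;
	}
	return n;
}
```

With every bus isolated all six devices get unique groups; marking the switch's internal bus non-isolated (ACS off or asymmetric on the DSPs) collapses 02:00.0, 02:03.0, A and B into one group while the root port and USP keep their own, matching the expectation in the commit message.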
>
> Revise the design to solve these problems.
>
> Explicitly require ordering, or return EPROBE_DEFER if things are out of
> order. This avoids silent errors that created smaller groups and solves
> problem #1.
>
> Work on busses, not devices. Isolation is a property of the bus, and the
> first non-isolated bus should form a group containing all devices
> downstream of that bus. If all busses on the path to an end device are
> isolated then the end device has a chance to make a single-device group.
>
> Use pci_bus_isolation() to compute the bus's isolation status based on the
> ACS flags and technology. pci_bus_isolation() touches a lot of PCI
> internals to get the information in the right format.
>
> Add a new flag in the iommu_group to record that the group contains a
> non-isolated bus. Any downstream pci_device_group() will see
> bus->self->iommu_group is non-isolated and unconditionally join it. This
> makes the first non-isolation apply to all downstream devices and solves
> problem #2.
>
> The bus's non-isolated iommu_group will be stored in either the DSP of a
> PCIe switch or the bus->self upstream device, depending on the situation.
> When storing in the DSP all the DSPs are checked first for a pre-existing
> non-isolated iommu_group. When stored in the upstream the flag forces it
> to all downstreams. This solves problem #3.
>
> Put the handling of end-device aliases and MFD into pci_get_alias_group()
> and only call it in cases where we have a fully isolated path. Otherwise
> every downstream device on the bus is going to be joined to the group of
> bus->self.
>
> Finally, replace the initial pci_for_each_dma_alias() with a combination
> of:
>
> - Directly checking pci_real_dma_dev() and enforcing ordering.
> The group should contain both pdev and pci_real_dma_dev(pdev) which is
> only possible if pdev is ordered after real_dma_dev. This solves a case
> of #1.
>
> - Indirectly relying on pci_bus_isolation() to report legacy PCI busses
> as non-isolated, with the enum including the distinction of the PCIe to
> PCI bridge being isolated from the downstream. This solves problem #4.
>
> It is very likely this is going to expand iommu_group membership in
> existing systems. After all that is the security bug that is being
> fixed. Expanding the iommu_groups risks problems for users using VFIO.
>
> The intention is to have a more accurate reflection of the security
> properties in the system and should be seen as a security fix. However
> people who have ACS disabled may now need to enable it. As such users may
> have had good reason for ACS to be disabled I strongly recommend that
> backporting of this also include the new config_acs option so that such
> users can potentially minimally enable ACS only where needed.
>
> Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/iommu/iommu.c | 274 ++++++++++++++++++++++++++++++++----------
> include/linux/pci.h | 3 +
> 2 files changed, 212 insertions(+), 65 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index d265de874b14b6..8b152266f20104 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -65,8 +65,16 @@ struct iommu_group {
> struct list_head entry;
> unsigned int owner_cnt;
> void *owner;
> +
> + /* Used by the device_group() callbacks */
> + u32 bus_data;
> };
>
> +/*
> + * Everything downstream of this group should share it.
> + */
> +#define BUS_DATA_PCI_NON_ISOLATED BIT(0)
> +
> struct group_device {
> struct list_head list;
> struct device *dev;
> @@ -1484,25 +1492,6 @@ static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> return NULL;
> }
>
> -struct group_for_pci_data {
> - struct pci_dev *pdev;
> - struct iommu_group *group;
> -};
> -
> -/*
> - * DMA alias iterator callback, return the last seen device. Stop and return
> - * the IOMMU group if we find one along the way.
> - */
> -static int get_pci_alias_or_group(struct pci_dev *pdev, u16 alias, void *opaque)
> -{
> - struct group_for_pci_data *data = opaque;
> -
> - data->pdev = pdev;
> - data->group = iommu_group_get(&pdev->dev);
> -
> - return data->group != NULL;
> -}
> -
> /*
> * Generic device_group call-back function. It just allocates one
> * iommu-group per device.
> @@ -1534,57 +1523,27 @@ struct iommu_group *generic_single_device_group(struct device *dev)
> }
> EXPORT_SYMBOL_GPL(generic_single_device_group);
>
> -/*
> - * Use standard PCI bus topology, isolation features, and DMA alias quirks
> - * to find or create an IOMMU group for a device.
> - */
> -struct iommu_group *pci_device_group(struct device *dev)
> +static struct iommu_group *pci_group_alloc_non_isolated(void)
> {
> - struct pci_dev *pdev = to_pci_dev(dev);
> - struct group_for_pci_data data;
> - struct pci_bus *bus;
> - struct iommu_group *group = NULL;
> - u64 devfns[4] = { 0 };
> + struct iommu_group *group;
>
> - if (WARN_ON(!dev_is_pci(dev)))
> - return ERR_PTR(-EINVAL);
> + group = iommu_group_alloc();
> + if (IS_ERR(group))
> + return group;
> + group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
> + return group;
> +}
>
> - /*
> - * Find the upstream DMA alias for the device. A device must not
> - * be aliased due to topology in order to have its own IOMMU group.
> - * If we find an alias along the way that already belongs to a
> - * group, use it.
> - */
> - if (pci_for_each_dma_alias(pdev, get_pci_alias_or_group, &data))
> - return data.group;
> -
> - pdev = data.pdev;
> -
> - /*
> - * Continue upstream from the point of minimum IOMMU granularity
> - * due to aliases to the point where devices are protected from
> - * peer-to-peer DMA by PCI ACS. Again, if we find an existing
> - * group, use it.
> - */
> - for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
> - if (!bus->self)
> - continue;
> -
> - if (pci_acs_path_enabled(bus->self, NULL, PCI_ACS_ISOLATED))
> - break;
> -
> - pdev = bus->self;
> -
> - group = iommu_group_get(&pdev->dev);
> - if (group)
> - return group;
> - }
> +static struct iommu_group *pci_get_alias_group(struct pci_dev *pdev)
So, the former pci_device_group() is completely re-written below,
and what it use to do is renamed pci_get_alias_group(), which shouldn't be
(easily) confused with ...
> +{
> + struct iommu_group *group;
> + DECLARE_BITMAP(devfns, 256) = {};
>
> /*
> * Look for existing groups on device aliases. If we alias another
> * device or another device aliases us, use the same group.
> */
> - group = get_pci_alias_group(pdev, (unsigned long *)devfns);
> + group = get_pci_alias_group(pdev, devfns);
... get_pci_alias_group() ?
... and it's only used for PCIe case below (in pci_device_group), so
should it be named 'pcie_get_alias_group()' ?
> if (group)
> return group;
>
> @@ -1593,12 +1552,197 @@ struct iommu_group *pci_device_group(struct device *dev)
> * slot and aliases of those funcions, if any. No need to clear
> * the search bitmap, the tested devfns are still valid.
> */
> - group = get_pci_function_alias_group(pdev, (unsigned long *)devfns);
> + group = get_pci_function_alias_group(pdev, devfns);
> if (group)
> return group;
>
> - /* No shared group found, allocate new */
> - return iommu_group_alloc();
> + /*
> + * When MFD's are included in the set due to ACS we assume that if ACS
> + * permits an internal loopback between functions it also permits the
> + * loopback to go downstream if a function is a bridge.
> + *
> + * It is less clear what aliases mean when applied to a bridge. For now
> + * be conservative and also propagate the group downstream.
> + */
> + __clear_bit(pdev->devfn & 0xFF, devfns);
> + if (!bitmap_empty(devfns, sizeof(devfns) * BITS_PER_BYTE))
> + return pci_group_alloc_non_isolated();
> + return NULL;
> +}
> +
> +static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
although static, could you provide a function description for its purpose ?
> +{
> + /*
> + * SRIOV functions may resid on a virtual bus, jump directly to the PFs
nit: resid -> reside
> + * bus in all cases.
> + */
> + struct pci_bus *bus = pci_physfn(pdev)->bus;
> + struct iommu_group *group;
> +
> + /* Nothing upstream of this */
> + if (pci_is_root_bus(bus))
> + return NULL;
> +
> + /*
> + * !self is only for SRIOV virtual busses which should have been
> + * excluded above.
by pci_is_root_bus() ?? -- that checks if bus->parent exists...
not sure how that excludes the case of !bus->self ...
> + */
> + if (WARN_ON(!bus->self))
> + return ERR_PTR(-EINVAL);
> +
> + group = iommu_group_get(&bus->self->dev);
> + if (!group) {
> + /*
> + * If the upstream bridge needs the same group as pdev then
> + * there is no way for its pci_device_group() to discover it.
> + */
> + dev_err(&pdev->dev,
> + "PCI device is probing out of order, upstream bridge device of %s is not probed yet\n",
> + pci_name(bus->self));
> + return ERR_PTR(-EPROBE_DEFER);
> + }
> + if (group->bus_data & BUS_DATA_PCI_NON_ISOLATED)
> + return group;
> + iommu_group_put(group);
> + return NULL;
... and w/o the function description, I don't follow:
-- rtn an iommu-group if it has NON_ISOLATED property ... but rtn null if all devices below it are isolated?
> +}
> +
> +/*
> + * For legacy PCI we have two main considerations when forming groups:
> + *
> + * 1) In PCI we can lose the RID inside the fabric, or some devices will use
> + * the wrong RID. The PCI core calls this aliasing, but from an IOMMU
> + * perspective it means that a PCI device may have multiple RIDs and a
> + * single RID may represent many PCI devices. This effectively means all the
> + * aliases must share a translation, thus group, because the IOMMU cannot
> + * tell devices apart.
> + *
> + * 2) PCI permits a bus segment to claim an address even if the transaction
> + * originates from an end point not the CPU. When it happens it is called
> + * peer to peer. Claiming a transaction in the middle of the bus hierarchy
> + * bypasses the IOMMU translation. The IOMMU subsystem rules require these
> + * devices to be placed in the same group because they lack isolation from
> + * each other. In PCI Express the ACS system can be used to inhibit this and
> + * force transactions to go to the IOMMU.
> + *
> + * From a PCI perspective any given PCI bus is either isolating or
> + * non-isolating. Isolating means downstream originated transactions always
> + * progress toward the CPU and do not go to other devices on the bus
> + * segment, while non-isolating means downstream originated transactions can
> + * progress back downstream through another device on the bus segment.
> + *
> + * Beyond buses a multi-function device or bridge can also allow
> + * transactions to loop back internally from one function to another.
> + *
> + * Once a PCI bus becomes non-isolating the entire downstream hierarchy of
> + * that bus becomes a single group.
> + */
> +struct iommu_group *pci_device_group(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct iommu_group *group;
> + struct pci_dev *real_pdev;
> +
> + if (WARN_ON(!dev_is_pci(dev)))
> + return ERR_PTR(-EINVAL);
> +
> + /*
> + * Arches can supply a completely different PCI device that actually
> + * does DMA.
> + */
> + real_pdev = pci_real_dma_dev(pdev);
> + if (real_pdev != pdev) {
> + group = iommu_group_get(&real_pdev->dev);
> + if (!group) {
> + /*
> + * The real_pdev has not had an iommu probed to it. We
> + * can't create a new group here because there is no way
> + * for pci_device_group(real_pdev) to pick it up.
> + */
> + dev_err(dev,
> + "PCI device is probing out of order, real device of %s is not probed yet\n",
> + pci_name(real_pdev));
> + return ERR_PTR(-EPROBE_DEFER);
> + }
> + return group;
> + }
> +
> + if (pdev->dev_flags & PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT)
> + return iommu_group_alloc();
> +
> + /* Anything upstream of this enforcing non-isolated? */
> + group = pci_hierarchy_group(pdev);
> + if (group)
> + return group;
> +
> + switch (pci_bus_isolated(pci_physfn(pdev)->bus)) {
> + case PCIE_ISOLATED:
> + /* Check multi-function groups and same-bus devfn aliases */
> + group = pci_get_alias_group(pdev);
> + if (group)
> + return group;
> +
> + /* No shared group found, allocate new */
> + return iommu_group_alloc();
> +
> + /*
> + * On legacy PCI there is no RID at an electrical level. On PCI-X the
> + * RID of the bridge may be used in some cases instead of the
> + * downstream's RID. This creates aliasing problems. PCI/PCI-X doesn't
> + * provide isolation either. The end result is that as soon as we hit a
> + * PCI/PCI-X bus we switch to non-isolated for the whole downstream for
> + * both aliasing and isolation reasons. The bridge has to be included in
> + * the group because of the aliasing.
> + */
> + case PCI_BRIDGE_NON_ISOLATED:
> + /* A PCIe switch where the USP has MMIO and is not isolated. */
> + case PCIE_NON_ISOLATED:
> + group = iommu_group_get(&pdev->bus->self->dev);
> + if (WARN_ON(!group))
> + return ERR_PTR(-EINVAL);
> + /*
> + * No need to be concerned with aliases here since we are going
> + * to put the entire downstream tree in the bridge/USP's group.
> + */
> + group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
> + return group;
> +
> + /*
> + * It is a PCI bus and the upstream bridge/port does not alias or allow
> + * P2P.
> + */
> + case PCI_BUS_NON_ISOLATED:
> + /*
> + * It is a PCIe switch and the DSP cannot reach the USP. The DSP's
> + * are not isolated from each other and share a group.
> + */
> + case PCIE_SWITCH_DSP_NON_ISOLATED: {
> + struct pci_dev *piter = NULL;
> +
> + /*
> + * All the downstream devices on the bus share a group. If this
> + * is a PCIe switch then they will all be DSPs
> + */
> + for_each_pci_dev(piter) {
> + if (piter->bus != pdev->bus)
> + continue;
> + group = iommu_group_get(&piter->dev);
> + if (group) {
> + pci_dev_put(piter);
> + if (WARN_ON(!(group->bus_data &
> + BUS_DATA_PCI_NON_ISOLATED)))
> + group->bus_data |=
> + BUS_DATA_PCI_NON_ISOLATED;
> + return group;
> + }
> + }
> + return pci_group_alloc_non_isolated();
> + }
> + default:
> + break;
> + }
> + WARN_ON(true);
> + return ERR_PTR(-EINVAL);
> }
> EXPORT_SYMBOL_GPL(pci_device_group);
>
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 0b1e28dcf9187d..517800206208b5 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -2072,6 +2072,9 @@ static inline int pci_dev_present(const struct pci_device_id *ids)
> #define no_pci_devices() (1)
> #define pci_dev_put(dev) do { } while (0)
>
> +static inline struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
> +{ return dev; }
> +
> static inline void pci_set_master(struct pci_dev *dev) { }
> static inline void pci_clear_master(struct pci_dev *dev) { }
> static inline int pci_enable_device(struct pci_dev *dev) { return -EIO; }
* Re: [PATCH v2 05/16] PCI: Add pci_reachable_set()
2025-07-09 14:52 ` [PATCH v2 05/16] PCI: Add pci_reachable_set() Jason Gunthorpe
@ 2025-07-17 22:04 ` Donald Dutile
2025-07-18 17:49 ` Jason Gunthorpe
0 siblings, 1 reply; 46+ messages in thread
From: Donald Dutile @ 2025-07-17 22:04 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
Jason,
Hi!
general question below, one you may know the answer to given your efficient compute logic below.
On 7/9/25 10:52 AM, Jason Gunthorpe wrote:
> Implement pci_reachable_set() to efficiently compute a set of devices on
> the same bus that are "reachable" from a starting device. The meaning of
> reachability is defined by the caller through a callback function.
>
This comment made me review get_pci_alias_group(), which states in its description:
* Look for aliases to or from the given device for existing groups. DMA
* aliases are only supported on the same bus, therefore the search
* space is quite small
So why does it do the for loop:
for_each_pci_dev(tmp) {
vs getting the pdev->bus->devices -- list of devices on that bus, and only
scan that smaller list, vs all pci devices on the system?
> This is a faster implementation of the same logic in
> pci_device_group(). Being inside the PCI core allows use of pci_bus_sem so
> it can use list_for_each_entry() on a small list of devices instead of the
> expensive for_each_pci_dev(). Server systems can now have hundreds of PCI
> devices, but typically only a very small number of devices per bus.
>
> An example of a reachability function would be pci_devs_are_dma_aliases()
> which would compute a set of devices on the same bus that are
> aliases. This would also be useful in future support for the ACS P2P
> Egress Vector which has a similar reachability problem.
>
> This is effectively a graph algorithm where the set of devices on the bus
> are vertexes and the reachable() function defines the edges. It returns a
> set of vertexes that form a connected graph.
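[Editorial aside: the fixed-point algorithm described above can be sketched with an adjacency matrix standing in for the reachable() callback; the names and bounds here are illustrative.]

```c
#include <assert.h>
#include <string.h>

#define MAX_DEV 8

/*
 * Compute the connected component containing `start`. edges[a][b] plays
 * the role of the commutative reachable() callback; each pair is only
 * consulted while one side is newly discovered.
 */
static void reachable_set(int start, int ndev,
			  int edges[MAX_DEV][MAX_DEV], int set[MAX_DEV])
{
	int next[MAX_DEV] = { 0 }, todo[MAX_DEV];
	int a, b, i;

	memset(set, 0, sizeof(int) * MAX_DEV);
	set[start] = next[start] = 1;

	for (;;) {
		int again = 0;

		memset(todo, 0, sizeof(todo));
		/* Compare each newly added vertex against every vertex
		 * not already in the set. */
		for (a = 0; a < ndev; a++) {
			if (!next[a])
				continue;
			for (b = 0; b < ndev; b++) {
				if (set[b] || !edges[a][b])
					continue;
				todo[b] = 1;
				again = 1;
			}
		}
		if (!again)
			break;
		/* Every new vertex may reach further: reloop. */
		for (i = 0; i < ndev; i++) {
			set[i] |= todo[i];
			next[i] = todo[i];
		}
	}
}

/* Demo: a 0-1-2 ring plus a disconnected 3-4 pair */
static int component_ok(void)
{
	int edges[MAX_DEV][MAX_DEV] = { { 0 } };
	int set[MAX_DEV];

	edges[0][1] = edges[1][0] = 1;
	edges[1][2] = edges[2][1] = 1;
	edges[2][0] = edges[0][2] = 1;
	edges[3][4] = edges[4][3] = 1;

	reachable_set(0, 5, edges, set);
	return set[0] && set[1] && set[2] && !set[3] && !set[4];
}
```

The cyclic ring shows why the "already in set" check matters: without it the cycle 0-1-2 would loop forever instead of converging in two passes.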
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Could we move this to just before patch 11 where it is used?
or could this be used to improve get_pci_alias_group() and get_pci_function_alias_group() ?
> ---
> drivers/pci/search.c | 90 ++++++++++++++++++++++++++++++++++++++++++++
> include/linux/pci.h | 12 ++++++
> 2 files changed, 102 insertions(+)
>
> diff --git a/drivers/pci/search.c b/drivers/pci/search.c
> index a13fad53e44df9..dc816dc4505c6d 100644
> --- a/drivers/pci/search.c
> +++ b/drivers/pci/search.c
> @@ -585,3 +585,93 @@ int pci_dev_present(const struct pci_device_id *ids)
> return 0;
> }
> EXPORT_SYMBOL(pci_dev_present);
> +
> +/**
> + * pci_reachable_set - Generate a bitmap of devices within a reachability set
> + * @start: First device in the set
> + * @devfns: The set of devices on the bus
> + * @reachable: Callback to tell if two devices can reach each other
> + *
> + * Compute a bitmap where every set bit is a device on the bus that is reachable
> + * from the start device, including the start device. Reachability between two
> + * devices is determined by a callback function.
> + *
> + * This is a non-recursive implementation that invokes the callback once per
> + * pair. The callback must be commutative:
> + * reachable(a, b) == reachable(b, a)
> + * reachable() can form a cyclic graph:
> + * reachable(a,b) == reachable(b,c) == reachable(c,a) == true
> + *
> + * Since this function is limited to a single bus the largest set can be 256
> + * devices large.
> + */
> +void pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
> + bool (*reachable)(struct pci_dev *deva,
> + struct pci_dev *devb))
> +{
> + struct pci_reachable_set todo_devfns = {};
> + struct pci_reachable_set next_devfns = {};
> + struct pci_bus *bus = start->bus;
> + bool again;
> +
> + /* Assume devfn of all PCI devices is bounded by MAX_NR_DEVFNS */
> + static_assert(sizeof(next_devfns.devfns) * BITS_PER_BYTE >=
> + MAX_NR_DEVFNS);
> +
> + memset(devfns, 0, sizeof(devfns->devfns));
> + __set_bit(start->devfn, devfns->devfns);
> + __set_bit(start->devfn, next_devfns.devfns);
> +
> + down_read(&pci_bus_sem);
> + while (true) {
> + unsigned int devfna;
> + unsigned int i;
> +
> + /*
> + * For each device that hasn't been checked compare every
> + * device on the bus against it.
> + */
> + again = false;
> + for_each_set_bit(devfna, next_devfns.devfns, MAX_NR_DEVFNS) {
> + struct pci_dev *deva = NULL;
> + struct pci_dev *devb;
> +
> + list_for_each_entry(devb, &bus->devices, bus_list) {
> + if (devb->devfn == devfna)
> + deva = devb;
> +
> + if (test_bit(devb->devfn, devfns->devfns))
> + continue;
> +
> + if (!deva) {
> + deva = devb;
> + list_for_each_entry_continue(
> + deva, &bus->devices, bus_list)
> + if (deva->devfn == devfna)
> + break;
> + }
> +
> + if (!reachable(deva, devb))
> + continue;
> +
> + __set_bit(devb->devfn, todo_devfns.devfns);
> + again = true;
> + }
> + }
> +
> + if (!again)
> + break;
> +
> + /*
> + * Every new bit adds a new deva to check, reloop the whole
> + * thing. Expect this to be rare.
> + */
> + for (i = 0; i != ARRAY_SIZE(devfns->devfns); i++) {
> + devfns->devfns[i] |= todo_devfns.devfns[i];
> + next_devfns.devfns[i] = todo_devfns.devfns[i];
> + todo_devfns.devfns[i] = 0;
> + }
> + }
> + up_read(&pci_bus_sem);
> +}
> +EXPORT_SYMBOL_GPL(pci_reachable_set);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 517800206208b5..2e629087539101 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -834,6 +834,10 @@ struct pci_dynids {
> struct list_head list; /* For IDs added at runtime */
> };
>
> +struct pci_reachable_set {
> + DECLARE_BITMAP(devfns, 256);
> +};
> +
> enum pci_bus_isolation {
> /*
> * The bus is off a root port and the root port has isolated ACS flags
> @@ -1248,6 +1252,9 @@ struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
> struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
> struct pci_dev *pci_get_base_class(unsigned int class, struct pci_dev *from);
>
> +void pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
> + bool (*reachable)(struct pci_dev *deva,
> + struct pci_dev *devb));
> enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus);
>
> int pci_dev_present(const struct pci_device_id *ids);
> @@ -2063,6 +2070,11 @@ static inline struct pci_dev *pci_get_base_class(unsigned int class,
> struct pci_dev *from)
> { return NULL; }
>
> +static inline void
> +pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
> + bool (*reachable)(struct pci_dev *deva, struct pci_dev *devb))
> +{ }
> +
> static inline enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> { return PCIE_NON_ISOLATED; }
>
* Re: [PATCH v2 12/16] iommu: Validate that pci_for_each_dma_alias() matches the groups
2025-07-09 14:52 ` [PATCH v2 12/16] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
@ 2025-07-17 22:07 ` Donald Dutile
0 siblings, 0 replies; 46+ messages in thread
From: Donald Dutile @ 2025-07-17 22:07 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 7/9/25 10:52 AM, Jason Gunthorpe wrote:
> Directly check that the devices touched by pci_for_each_dma_alias() match
> the groups that were built by pci_device_group(). This helps validate that
> pci_for_each_dma_alias() and pci_bus_isolated() are consistent.
>
> This should eventually be hidden behind a debug kconfig, but for now it is
> good to get feedback from more diverse systems if there are any problems.
>
+1 ... good idea!
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/iommu/iommu.c | 73 ++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 72 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index cd26b43916e8be..e4ae1d064554e4 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1606,7 +1606,7 @@ static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
> * Once a PCI bus becomes non isolating the entire downstream hierarchy of
> * that bus becomes a single group.
> */
> -struct iommu_group *pci_device_group(struct device *dev)
> +static struct iommu_group *__pci_device_group(struct device *dev)
> {
> struct pci_dev *pdev = to_pci_dev(dev);
> struct iommu_group *group;
> @@ -1713,6 +1713,77 @@ struct iommu_group *pci_device_group(struct device *dev)
> WARN_ON(true);
> return ERR_PTR(-EINVAL);
> }
> +
> +struct check_group_aliases_data {
> + struct pci_dev *pdev;
> + struct iommu_group *group;
> +};
> +
> +static void pci_check_group(const struct check_group_aliases_data *data,
> + u16 alias, struct pci_dev *pdev)
> +{
> + struct iommu_group *group;
> +
> + group = iommu_group_get(&pdev->dev);
> + if (!group)
> + return;
> +
> + if (group != data->group)
> + dev_err(&data->pdev->dev,
> + "During group construction alias processing needed dev %s alias %x to have the same group but %u != %u\n",
> + pci_name(pdev), alias, data->group->id, group->id);
> + iommu_group_put(group);
> +}
> +
> +static int pci_check_group_aliases(struct pci_dev *pdev, u16 alias,
> + void *opaque)
> +{
> + const struct check_group_aliases_data *data = opaque;
> +
> + /*
> + * Sometimes when a PCIe-PCI bridge is performing transactions on behalf
> + * of its subordinate bus it uses devfn=0 on the subordinate bus as the
> + * alias. This means that 0 will alias with all devfns on the
> + * subordinate bus and so we expect to see those in the same group. pdev
> + * in this case is the bridge itself and pdev->bus is the primary bus of
> + * the bridge.
> + */
> + if (pdev->bus->number != PCI_BUS_NUM(alias)) {
> + struct pci_dev *piter = NULL;
> +
> + for_each_pci_dev(piter) {
> + if (pci_domain_nr(pdev->bus) ==
> + pci_domain_nr(piter->bus) &&
> + PCI_BUS_NUM(alias) == pdev->bus->number)
> + pci_check_group(data, alias, piter);
> + }
> + } else {
> + pci_check_group(data, alias, pdev);
> + }
> +
> + return 0;
> +}
> +
> +struct iommu_group *pci_device_group(struct device *dev)
> +{
> + struct iommu_group *group = __pci_device_group(dev);
> +
> + if (!IS_ERR(group)) {
> + struct check_group_aliases_data data = {
> + .pdev = to_pci_dev(dev), .group = group
> + };
> +
> + /*
> + * The IOMMU driver should use pci_for_each_dma_alias() to
> + * figure out what RIDs to program and the core requires all the
> + * RIDs to fall within the same group. Validate that everything
> + * worked properly.
> + */
> + pci_for_each_dma_alias(data.pdev, pci_check_group_aliases,
> + &data);
> + }
> + return group;
> +}
> EXPORT_SYMBOL_GPL(pci_device_group);
>
> /* Get the IOMMU group for device on fsl-mc bus */
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 15/16] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
2025-07-09 14:52 ` [PATCH v2 15/16] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
@ 2025-07-17 22:17 ` Donald Dutile
2025-07-18 17:52 ` Jason Gunthorpe
2025-08-05 4:39 ` Askar Safin
1 sibling, 1 reply; 46+ messages in thread
From: Donald Dutile @ 2025-07-17 22:17 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 7/9/25 10:52 AM, Jason Gunthorpe wrote:
> Switches ignore the PASID when routing TLPs. This means the path from the
> PASID issuing end point to the IOMMU must be direct with no possibility
> for another device to claim the addresses.
>
> This is done using ACS flags and pci_enable_pasid() checks for this.
>
So a PASID device is a MFD, and it has ACS caps & bits to check?
> The new ACS Enhanced bits clarify some undefined behaviors in the spec
> around what P2P Request Redirect means.
>
> Linux has long assumed that PCI_ACS_RR implies PCI_ACS_DSP_MT_RR |
> PCI_ACS_USP_MT_RR | PCI_ACS_UNCLAIMED_RR.
>
> If the device supports ACS Enhanced then use the information it reports to
> determine if PASID SVA is supported or not.
>
> PCI_ACS_DSP_MT_RR: Prevents Downstream Port BARs from claiming upstream
> flowing transactions
>
> PCI_ACS_USP_MT_RR: Prevents Upstream Port BARs from claiming upstream
> flowing transactions
>
> PCI_ACS_UNCLAIMED_RR: Prevents a hole in the USP bridge window compared
> to all the DSP bridge windows from generating an
> error.
>
> Each of these cases would poke a hole in the PASID address space which is
> not permitted.
>
> Enhance the comments around pci_acs_flags_enabled() to better explain the
> reasoning for its logic. Continue to take the approach of assuming the
> device is doing the "right ACS" if it does not explicitly declare
> otherwise.
>
> Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/pci/ats.c | 4 +++-
> drivers/pci/pci.c | 54 +++++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 50 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
> index ec6c8dbdc5e9c9..00603c2c4ff0ea 100644
> --- a/drivers/pci/ats.c
> +++ b/drivers/pci/ats.c
> @@ -416,7 +416,9 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
> if (!pasid)
> return -EINVAL;
>
> - if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
> + if (!pci_acs_path_enabled(pdev, NULL,
> + PCI_ACS_RR | PCI_ACS_UF | PCI_ACS_USP_MT_RR |
> + PCI_ACS_DSP_MT_RR | PCI_ACS_UNCLAIMED_RR))
> return -EINVAL;
>
> pci_read_config_word(pdev, pasid + PCI_PASID_CAP, &supported);
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index d16b92f3a0c881..e49370c90ec890 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3601,6 +3601,52 @@ void pci_configure_ari(struct pci_dev *dev)
> }
> }
>
> +
> +/*
> + * The spec is not clear what it means if the capability bit is 0. One view is
> + * that the device acts as though the ctrl bit is zero; another view is that
> + * the device behavior is undefined.
> + *
> + * Historically Linux has taken the position that a capability bit of 0 means
> + * the device supports the most favorable interpretation of the spec - ie that
> + * things like P2P RR are always on. As this is security sensitive we expect
> + * devices that do not follow this rule to be quirked.
> + *
> + * ACS Enhanced eliminated undefined areas of the spec around MMIO in root ports
> + * and switch ports. If those ports have no MMIO then it is not relevant.
> + * PCI_ACS_UNCLAIMED_RR eliminates the undefined area around an upstream switch
> + * window that is not fully decoded by the downstream windows.
> + *
> + * This takes the same approach with ACS Enhanced: if the device does not
> + * support it then we assume the ACS P2P RR has all the enhanced behaviors too.
> + *
> + * Because older Linux kernels force the ACS Enhanced bits to 0, and those
> + * values would break old kernels on the edge cases they cover, the only
> + * compatible thing for a new device to implement is ACS Enhanced supported with
> + * the control bits (except PCI_ACS_IORB) wired to follow ACS_RR.
> + */
> +static u16 pci_acs_ctrl_mask(struct pci_dev *pdev, u16 hw_cap)
> +{
> + /*
> + * Egress Control enables use of the Egress Control Vector which is not
> + * present without the cap.
> + */
> + u16 mask = PCI_ACS_EC;
> +
> + mask |= hw_cap & (PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
> + PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
> +
> + /*
> + * If ACS Enhanced is supported the device reports what it is doing
> + * through these bits which may not be settable.
> + */
> + if (hw_cap & PCI_ACS_ENHANCED)
> + mask |= PCI_ACS_IORB | PCI_ACS_DSP_MT_RB | PCI_ACS_DSP_MT_RR |
> + PCI_ACS_USP_MT_RB | PCI_ACS_USP_MT_RR |
> + PCI_ACS_UNCLAIMED_RR;
> + return mask;
> +}
> +
> static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
> {
> int pos;
> @@ -3610,15 +3656,9 @@ static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
> if (!pos)
> return false;
>
> - /*
> - * Except for egress control, capabilities are either required
> - * or only required if controllable. Features missing from the
> - * capability field can therefore be assumed as hard-wired enabled.
> - */
> pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
> - acs_flags &= (cap | PCI_ACS_EC);
> -
> pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> + acs_flags &= pci_acs_ctrl_mask(pdev, cap);
> return (ctrl & acs_flags) == acs_flags;
> }
>
* Re: [PATCH v2 05/16] PCI: Add pci_reachable_set()
2025-07-17 22:04 ` Donald Dutile
@ 2025-07-18 17:49 ` Jason Gunthorpe
2025-07-18 19:10 ` Donald Dutile
0 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-18 17:49 UTC (permalink / raw)
To: Donald Dutile
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Thu, Jul 17, 2025 at 06:04:04PM -0400, Donald Dutile wrote:
> > Implement pci_reachable_set() to efficiently compute a set of devices on
> > the same bus that are "reachable" from a starting device. The meaning of
> > reachability is defined by the caller through a callback function.
> >
> This comment made me review get_pci_alias_group(), which states in its description:
> * Look for aliases to or from the given device for existing groups. DMA
> * aliases are only supported on the same bus, therefore the search
> * space is quite small
>
> So why does it do the for loop:
> for_each_pci_dev(tmp) {
>
> vs getting the pdev->bus->devices -- list of devices on that bus, and only
> scan that smaller list, vs all pci devices on the system?
Because it can't access the required lock pci_bus_sem to use that
list.
The lock is only available within the PCI core itself which is why I
moved a few functions over there so they can use the lock.
> Could we move this to just before patch 11 where it is used?
Yes
> or could this be used to improve get_pci_alias_group() and get_pci_function_alias_group() ?
IMHO it is not really worth the churn
Jason
* Re: [PATCH v2 15/16] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
2025-07-17 22:17 ` Donald Dutile
@ 2025-07-18 17:52 ` Jason Gunthorpe
0 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-18 17:52 UTC (permalink / raw)
To: Donald Dutile
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Thu, Jul 17, 2025 at 06:17:11PM -0400, Donald Dutile wrote:
>
>
> On 7/9/25 10:52 AM, Jason Gunthorpe wrote:
> > Switches ignore the PASID when routing TLPs. This means the path from the
> > PASID issuing end point to the IOMMU must be direct with no possibility
> > for another device to claim the addresses.
> >
> > This is done using ACS flags and pci_enable_pasid() checks for this.
>
> So a PASID device is a MFD, and it has ACS caps & bits to check?
Not a MFD, but the MFD loopback would follow the same rules for
routing PASID TLPs. For a MFD none of the switch specific ACS
Enhanced flags would make sense:
> > PCI_ACS_DSP_MT_RR: Prevents Downstream Port BARs from claiming upstream
> > flowing transactions
> >
> > PCI_ACS_USP_MT_RR: Prevents Upstream Port BARs from claiming upstream
> > flowing transactions
> >
> > PCI_ACS_UNCLAIMED_RR: Prevents a hole in the USP bridge window compared
> > to all the DSP bridge windows from generating an
> > error.
Jason
* Re: [PATCH v2 03/16] iommu: Compute iommu_groups properly for PCIe switches
2025-07-17 22:03 ` Donald Dutile
@ 2025-07-18 18:09 ` Jason Gunthorpe
2025-07-18 19:00 ` Donald Dutile
0 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-18 18:09 UTC (permalink / raw)
To: Donald Dutile
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Thu, Jul 17, 2025 at 06:03:42PM -0400, Donald Dutile wrote:
> > +static struct iommu_group *pci_get_alias_group(struct pci_dev *pdev)
> So, the former pci_device_group() is completely re-written below,
> and what it used to do is renamed pci_get_alias_group(), which shouldn't be
> (easily) confused with ...
>
> > +{
> > + struct iommu_group *group;
> > + DECLARE_BITMAP(devfns, 256) = {};
> > /*
> > * Look for existing groups on device aliases. If we alias another
> > * device or another device aliases us, use the same group.
> > */
> > - group = get_pci_alias_group(pdev, (unsigned long *)devfns);
> > + group = get_pci_alias_group(pdev, devfns);
> ... get_pci_alias_group() ?
>
> ... and it's only used for PCIe case below (in pci_device_group), so
> should it be named 'pcie_get_alias_group()' ?
Well, the naming is a lot better after this is reworked with the
reachable set patch and these two functions are removed.
But even then I guess it is not a great name.
How about:
/*
* Return a group if the function has isolation restrictions related to
* aliases or MFD ACS.
*/
static struct iommu_group *pci_get_function_group(struct pci_dev *pdev)
{
> > +static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
> although static, could you provide a function description for its purpose ?
/* Return a group if the upstream hierarchy has isolation restrictions. */
> > + /*
> > + * !self is only for SRIOV virtual busses which should have been
> > + * excluded above.
> by pci_is_root_bus() ?? -- that checks if bus->parent exists...
> not sure how that excludes the case of !bus->self ...
Should be this:
/*
* !self is only for SRIOV virtual busses which should have been
* excluded by pci_physfn()
*/
if (WARN_ON(!bus->self))
> > + */
> > + if (WARN_ON(!bus->self))
> > + return ERR_PTR(-EINVAL);
> > +
> > + group = iommu_group_get(&bus->self->dev);
> > + if (!group) {
> > + /*
> > + * If the upstream bridge needs the same group as pdev then
> > + * there is no way for its pci_device_group() to discover it.
> > + */
> > + dev_err(&pdev->dev,
> > + "PCI device is probing out of order, upstream bridge device of %s is not probed yet\n",
> > + pci_name(bus->self));
> > + return ERR_PTR(-EPROBE_DEFER);
> > + }
> > + if (group->bus_data & BUS_DATA_PCI_NON_ISOLATED)
> > + return group;
> > + iommu_group_put(group);
> > + return NULL;
> ... and w/o the function description, I don't follow:
> -- rtn an iommu-group if it has NON_ISOLATED property ... but rtn null if all devices below it are isolated?
Yes. For all these internal functions, non-NULL means we found a group
to join and NULL means to keep checking isolation rules.
Thanks,
Jason
* Re: [PATCH v2 03/16] iommu: Compute iommu_groups properly for PCIe switches
2025-07-18 18:09 ` Jason Gunthorpe
@ 2025-07-18 19:00 ` Donald Dutile
2025-07-18 20:19 ` Jason Gunthorpe
0 siblings, 1 reply; 46+ messages in thread
From: Donald Dutile @ 2025-07-18 19:00 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
Jason,
Thanks for replies, clarifications...
Couple questions below.
On 7/18/25 2:09 PM, Jason Gunthorpe wrote:
> On Thu, Jul 17, 2025 at 06:03:42PM -0400, Donald Dutile wrote:
>>> +static struct iommu_group *pci_get_alias_group(struct pci_dev *pdev)
>> So, the former pci_device_group() is completely re-written below,
>> and what it used to do is renamed pci_get_alias_group(), which shouldn't be
>> (easily) confused with ...
>>
>>> +{
>>> + struct iommu_group *group;
>>> + DECLARE_BITMAP(devfns, 256) = {};
>>> /*
>>> * Look for existing groups on device aliases. If we alias another
>>> * device or another device aliases us, use the same group.
>>> */
>>> - group = get_pci_alias_group(pdev, (unsigned long *)devfns);
>>> + group = get_pci_alias_group(pdev, devfns);
>> ... get_pci_alias_group() ?
>>
>> ... and it's only used for PCIe case below (in pci_device_group), so
>> should it be named 'pcie_get_alias_group()' ?
>
> Well, the naming is a lot better after this is reworked with the
> reachable set patch and these two functions are removed.
>
Didn't notice that... will re-look.
> But even then I guess it is not a great name.
>
> How about:
>
> /*
> * Return a group if the function has isolation restrictions related to
> * aliases or MFD ACS.
> */
> static struct iommu_group *pci_get_function_group(struct pci_dev *pdev)
> {
>
sure...
>>> +static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
>> although static, could you provide a function description for its purpose ?
>
> /* Return a group if the upstream hierarchy has isolation restrictions. */
>
>>> + /*
>>> + * !self is only for SRIOV virtual busses which should have been
>>> + * excluded above.
>> by pci_is_root_bus() ?? -- that checks if bus->parent exists...
>> not sure how that excludes the case of !bus->self ...
>
> Should be this:
>
> /*
> * !self is only for SRIOV virtual busses which should have been
> * excluded by pci_physfn()
> */
> if (WARN_ON(!bus->self))
>
my Linux tree says it's this:
static inline bool pci_is_root_bus(struct pci_bus *pbus)
{
return !(pbus->parent);
}
is there a change to pci_is_root_bus() in a -next branch?
>>> + */
>>> + if (WARN_ON(!bus->self))
>>> + return ERR_PTR(-EINVAL);
>>> +
>>> + group = iommu_group_get(&bus->self->dev);
>>> + if (!group) {
>>> + /*
>>> + * If the upstream bridge needs the same group as pdev then
>>> + * there is no way for its pci_device_group() to discover it.
>>> + */
>>> + dev_err(&pdev->dev,
>>> + "PCI device is probing out of order, upstream bridge device of %s is not probed yet\n",
>>> + pci_name(bus->self));
>>> + return ERR_PTR(-EPROBE_DEFER);
>>> + }
>>> + if (group->bus_data & BUS_DATA_PCI_NON_ISOLATED)
>>> + return group;
>>> + iommu_group_put(group);
>>> + return NULL;
>> ... and w/o the function description, I don't follow:
>> -- rtn an iommu-group if it has NON_ISOLATED property ... but rtn null if all devices below it are isolated?
>
> Yes. For all these internal functions non null means we found a group
> to join, NULL means to keep checking isolation rules.
>
ah, so !group == keep looking for non-isolated conditions... got it.
Could that lead to two iommu-groups being created that could/should be one larger one?
> Thanks,
> Jason
>
* Re: [PATCH v2 05/16] PCI: Add pci_reachable_set()
2025-07-18 17:49 ` Jason Gunthorpe
@ 2025-07-18 19:10 ` Donald Dutile
0 siblings, 0 replies; 46+ messages in thread
From: Donald Dutile @ 2025-07-18 19:10 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 7/18/25 1:49 PM, Jason Gunthorpe wrote:
> On Thu, Jul 17, 2025 at 06:04:04PM -0400, Donald Dutile wrote:
>>> Implement pci_reachable_set() to efficiently compute a set of devices on
>>> the same bus that are "reachable" from a starting device. The meaning of
>>> reachability is defined by the caller through a callback function.
>>>
>> This comment made me review get_pci_alias_group(), which states in its description:
>> * Look for aliases to or from the given device for existing groups. DMA
>> * aliases are only supported on the same bus, therefore the search
>> * space is quite small
>>
>> So why does it do the for loop:
>> for_each_pci_dev(tmp) {
>>
>> vs getting the pdev->bus->devices -- list of devices on that bus, and only
>> scan that smaller list, vs all pci devices on the system?
>
> Because it can't access the required lock pci_bus_sem to use that
> list.
Ah, I see; it's only declared in drivers/pci/search.c, and it isn't
a semaphore in a per-bus struct. :-/
... so move the function to search.c ? /me runs...
I know, not worth the churn; already have 'polluted' iommu w/pci, but not vice-versa.
(although iommu-groups is really a bus-op (it would be a different op for, say, platform devices going through another iommu)).
>
> The lock is only available within the PCI core itself which is why I
> moved a few functions over there so they can use the lock.
>
>> Could we move this to just before patch 11 where it is used?
>
> Yes
>
>> or could this be used to improve get_pci_alias_group() and get_pci_function_alias_group() ?
>
> IMHO it is not really worth the churn
>
Hey, you're churning at the moment, so I figured you may want to churn some more! ;-)
just a comment/suggestion; not required.
> Jason
>
* Re: [PATCH v2 03/16] iommu: Compute iommu_groups properly for PCIe switches
2025-07-18 19:00 ` Donald Dutile
@ 2025-07-18 20:19 ` Jason Gunthorpe
2025-07-18 21:41 ` Donald Dutile
0 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-18 20:19 UTC (permalink / raw)
To: Donald Dutile
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Fri, Jul 18, 2025 at 03:00:28PM -0400, Donald Dutile wrote:
> > > > + /*
> > > > + * !self is only for SRIOV virtual busses which should have been
> > > > + * excluded above.
> > > by pci_is_root_bus() ?? -- that checks if bus->parent exists...
> > > not sure how that excludes the case of !bus->self ...
> >
> > Should be this:
> >
> > /*
> > * !self is only for SRIOV virtual busses which should have been
> > * excluded by pci_physfn()
> > */
> > if (WARN_ON(!bus->self))
> >
> my Linux tree says it's this:
> static inline bool pci_is_root_bus(struct pci_bus *pbus)
> {
> return !(pbus->parent);
> }
>
> is there a change to pci_is_root_bus() in a -next branch?
Not that; at the start of the function there is a pci_physfn(), so the
entire function never works on a VF and bus is never a VF's bus.
> > > > + */
> > > > + if (WARN_ON(!bus->self))
> > > > + return ERR_PTR(-EINVAL);
> > > > +
> > > > + group = iommu_group_get(&bus->self->dev);
> > > > + if (!group) {
> > > > + /*
> > > > + * If the upstream bridge needs the same group as pdev then
> > > > + * there is no way for its pci_device_group() to discover it.
> > > > + */
> > > > + dev_err(&pdev->dev,
> > > > + "PCI device is probing out of order, upstream bridge device of %s is not probed yet\n",
> > > > + pci_name(bus->self));
> > > > + return ERR_PTR(-EPROBE_DEFER);
> > > > + }
> > > > + if (group->bus_data & BUS_DATA_PCI_NON_ISOLATED)
> > > > + return group;
> > > > + iommu_group_put(group);
> > > > + return NULL;
> > > ... and w/o the function description, I don't follow:
> > > -- rtn an iommu-group if it has NON_ISOLATED property ... but rtn null if all devices below it are isolated?
> >
> > Yes. For all these internal functions non null means we found a group
> > to join, NULL means to keep checking isolation rules.
> >
> ah, so !group == keep looking for for non-isolated conditions.. got it.
> Could that lead to two iommu-groups being created that could/should be one larger one?
The insistence on doing things in order should prevent that from
happening. So long as the larger group is present in the upstream
direction, or within the current bus, it can be joined up.
This doesn't work if it randomly applies to PCI devices, it is why the
above has added the "PCI device is probing out of order" detection.
Jason
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (15 preceding siblings ...)
2025-07-09 14:52 ` [PATCH v2 16/16] PCI: Check ACS Extended flags for pci_bus_isolated() Jason Gunthorpe
@ 2025-07-18 21:29 ` Alex Williamson
2025-07-18 22:59 ` Jason Gunthorpe
2025-08-02 1:45 ` Ethan Zhao
17 siblings, 1 reply; 46+ messages in thread
From: Alex Williamson @ 2025-07-18 21:29 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
maorg, patches, tdave, Tony Zhu
On Wed, 9 Jul 2025 11:52:03 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> The series patches have extensive descriptions as to the problem and
> solution, but in short the ACS flags are not analyzed according to the
> spec to form the iommu_groups that VFIO is expecting for security.
>
> ACS is an egress control only. For a path the ACS flags on each hop only
> affect what other devices the TLP is allowed to reach. It does not prevent
> other devices from reaching into this path.
>
> For VFIO if device A is permitted to access device B's MMIO then A and B
> must be grouped together. This says that even if a path has isolating ACS
> flags on each hop, off-path devices with non-isolating ACS can still reach
> into that path and must be grouped together.
>
> For switches, a PCIe topology like:
>
> -- DSP 02:00.0 -> End Point A
> Root 00:00.0 -> USP 01:00.0 --|
> -- DSP 02:03.0 -> End Point B
>
> Will generate unique single device groups for every device even if ACS is
> not enabled on the two DSP ports. It should at least group A/B together
> because no ACS means A can reach the MMIO of B. This is a serious failure
> for the VFIO security model.
>
> For multi-function-devices, a PCIe topology like:
>
> -- MFD 00:1f.0 ACS != REQ_ACS_FLAGS
> Root 00:00.00 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS
> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
>
> Will group [1f.0, 1f.2] and 1f.6 gets a single device group. In many cases
> we suspect that the MFD actually doesn't need ACS, so this is probably not
> as important a security failure, but from a spec perspective the correct
> answer is one group of [1f.0, 1f.2, 1f.6] because 1f.0/2 have no ACS
> preventing them from reaching the MMIO of 1f.6.
This will break various LOM configurations where the NIC is a function
within a MFD RCiEP which has or quirks ACS while the other functions
have no ACS.
> There is also some confusing spec language about how ACS and SRIOV works
> which this series does not address.
>
> This entire series goes further and makes some additional improvements to
> the ACS validation found while studying this problem. The groups around a
> PCIe to PCI bridge are shrunk to not include the PCIe bridge.
>
> The last patches implement "ACS Enhanced" on top of it. Due to how ACS
> Enhanced was defined as a non-backward compatible feature it is important
> to get SW support out there.
>
> Due to the potential of iommu_groups becoming wider and thus non-usable
> for VFIO this should go to a linux-next tree to give it some more
> exposure.
>
> I have now tested this on a few systems I could get:
>
> - Various Intel client systems:
> * Raptor Lake, with VMD enabled and using the real_dev mechanism
> * 6/7th generation 100 Series/C320
> * 5/6th generation 100 Series/C320 with a NIC MFD quirk
> * Tiger Lake
> * 5/6th generation Sunrise Point
> No change in grouping on any of these systems
Sorry I haven't had much time to look at this, but it would still cause
a regression on my AlderLake system. I get the following new
mega-group:
IOMMU Group 12:
0000:00:1c.0 PCI bridge [0604]: Intel Corporation Raptor Lake PCI Express Root Port #1 [8086:7a38] (rev 11)
Express Root Port (Slot+)
0000:00:1c.1 PCI bridge [0604]: Intel Corporation Device [8086:7a39] (rev 11)
Express Root Port (Slot+)
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
0000:00:1c.2 PCI bridge [0604]: Intel Corporation Raptor Point-S PCH - PCI Express Root Port 3 [8086:7a3a] (rev 11)
Express Root Port (Slot+)
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
0000:00:1c.3 PCI bridge [0604]: Intel Corporation Raptor Lake PCI Express Root Port #4 [8086:7a3b] (rev 11)
Express Root Port (Slot+)
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
0000:00:1c.1/06:00.0 USB controller [0c03]: Fresco Logic FL1100 USB 3.0 Host Controller [1b73:1100] (rev 10)
Express Endpoint
0000:00:1c.2/07:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125] (rev 05)
Express Endpoint
0000:00:1c.3/08:00.0 PCI bridge [0604]: Pericom Semiconductor PI7C9X2G404 EL/SL PCIe2 4-Port/4-Lane Packet Switch [12d8:2404] (rev 05)
Express Upstream Port
0000:00:1c.3/08:00.0/09:01.0 PCI bridge [0604]: Pericom Semiconductor PI7C9X2G404 EL/SL PCIe2 4-Port/4-Lane Packet Switch [12d8:2404] (rev 05)
Express Downstream Port (Slot-)
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
0000:00:1c.3/08:00.0/09:02.0 PCI bridge [0604]: Pericom Semiconductor PI7C9X2G404 EL/SL PCIe2 4-Port/4-Lane Packet Switch [12d8:2404] (rev 05)
Express Downstream Port (Slot+)
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
0000:00:1c.3/08:00.0/09:03.0 PCI bridge [0604]: Pericom Semiconductor PI7C9X2G404 EL/SL PCIe2 4-Port/4-Lane Packet Switch [12d8:2404] (rev 05)
Express Downstream Port (Slot-)
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
0000:00:1c.3/08:00.0/09:01.0/0a:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 07)
Express Endpoint
0000:00:1c.3/08:00.0/09:03.0/0c:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 07)
Express Endpoint
The source of the issue is the root port at 00:1c.0, which does not
have ACS support, claims that it has a slot but there is none, and
therefore has no subordinate DMA capable devices, nor does the root
port itself have an MMIO BAR. I don't know if there's something we can
key on for the root port to mark it isolated.
00:1c.0 PCI bridge [0604]: Intel Corporation Raptor Lake PCI Express Root Port #1 [8086:7a38] (rev 11) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin ? routed to IRQ 125
IOMMU group: 12
Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
I/O behind bridge: 6000-6fff [size=4K] [16-bit]
Memory behind bridge: 40800000-411fffff [size=10M] [32-bit]
Prefetchable memory behind bridge: 60e0000000-60e09fffff [size=10M] [32-bit]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Express (v2) Root Port (Slot+), IntMsgNum 0
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag- RBE+ TEE-IO-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #1, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, LnkDisable- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt+
LnkSta: Speed 2.5GT/s, Width x0
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+
Slot #0, PowerLimit 0W; Interlock- NoCompl+
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt- HPIrq+ LinkChg+
Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
Changed: MRL- PresDet- LinkState-
RootCap: CRSVisible-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 2
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
AtomicOpsCtl: ReqEn- EgressBlck-
IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee002b8 Data: 0000
Capabilities: [90] Null
Kernel driver in use: pcieport
This is seen on a Gigabyte B760M DS3H DDR4 motherboard. There's a
version of this board with wifi whereas this one has empty pads where
that m.2 slot might go. I'd guess wifi might sit downstream of this
port if it were present, but I don't know how it'd change the feature
set of the root port. The populated root ports show a more reasonable
set of capabilities:
00:1c.1 PCI bridge [0604]: Intel Corporation Device [8086:7a39] (rev 11) (prog-if 00 [Normal decode])
Subsystem: Gigabyte Technology Co., Ltd Device [1458:5001]
Flags: bus master, fast devsel, latency 0, IRQ 126, IOMMU group 12
Bus: primary=00, secondary=06, subordinate=06, sec-latency=0
I/O behind bridge: [disabled] [16-bit]
Memory behind bridge: 41800000-419fffff [size=2M] [32-bit]
Prefetchable memory behind bridge: [disabled] [64-bit]
Capabilities: [40] Express Root Port (Slot+), IntMsgNum 0
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [98] Subsystem: Gigabyte Technology Co., Ltd Device [1458:5001]
Capabilities: [a0] Power Management version 3
Capabilities: [100] Advanced Error Reporting
Capabilities: [220] Access Control Services
Capabilities: [200] L1 PM Substates
Capabilities: [150] Precision Time Measurement
Capabilities: [a30] Secondary PCI Express
Capabilities: [a90] Data Link Feature <?>
Kernel driver in use: pcieport
I can't say that the proposed code here is doing the wrong thing by
propagating the lack of isolation, but it's gratuitous when there is no
DMA initiator on the non-isolated branch and it causes a significant
usage problem for vfio. Thanks,
Alex
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 03/16] iommu: Compute iommu_groups properly for PCIe switches
2025-07-18 20:19 ` Jason Gunthorpe
@ 2025-07-18 21:41 ` Donald Dutile
2025-07-18 22:52 ` Jason Gunthorpe
0 siblings, 1 reply; 46+ messages in thread
From: Donald Dutile @ 2025-07-18 21:41 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 7/18/25 4:19 PM, Jason Gunthorpe wrote:
> On Fri, Jul 18, 2025 at 03:00:28PM -0400, Donald Dutile wrote:
>>>>> + /*
>>>>> + * !self is only for SRIOV virtual busses which should have been
>>>>> + * excluded above.
>>>> by pci_is_root_bus() ?? -- that checks if bus->parent exists...
>>>> not sure how that excludes the case of !bus->self ...
>>>
>>> Should be this:
>>>
>>> /*
>>> * !self is only for SRIOV virtual busses which should have been
>>> * excluded by pci_physfn()
>>> */
>>> if (WARN_ON(!bus->self))
>>>
>> My Linux tree says it's this:
>> static inline bool pci_is_root_bus(struct pci_bus *pbus)
>> {
>> return !(pbus->parent);
>> }
>>
>> is there a change to pci_is_root_bus() in a -next branch?
>
> Not that, at the start of the function there is a pci_physfn(), the
> entire function never works on a VF, so bus is never a VF's bus.
>
Well, I guess it depends on what you call 'a VF's bus' -- it returns the VF's->PF(pdev)->bus if virt-fn,
which I would call the VF's bus.
Thanks for pointing further up... now I get your added edit above (which I didn't read carefully, my bad).
>>>>> + */
>>>>> + if (WARN_ON(!bus->self))
>>>>> + return ERR_PTR(-EINVAL);
>>>>> +
>>>>> + group = iommu_group_get(&bus->self->dev);
>>>>> + if (!group) {
>>>>> + /*
>>>>> + * If the upstream bridge needs the same group as pdev then
>>>>> + * there is no way for its pci_device_group() to discover it.
>>>>> + */
>>>>> + dev_err(&pdev->dev,
>>>>> + "PCI device is probing out of order, upstream bridge device of %s is not probed yet\n",
>>>>> + pci_name(bus->self));
>>>>> + return ERR_PTR(-EPROBE_DEFER);
>>>>> + }
>>>>> + if (group->bus_data & BUS_DATA_PCI_NON_ISOLATED)
>>>>> + return group;
>>>>> + iommu_group_put(group);
>>>>> + return NULL;
>>>> ... and w/o the function description, I don't follow:
>>>> -- rtn an iommu-group if it has NON_ISOLATED property ... but rtn null if all devices below it are isolated?
>>>
>>> Yes. For all these internal functions non null means we found a group
>>> to join, NULL means to keep checking isolation rules.
>>>
>> Ah, so !group == keep looking for non-isolated conditions... got it.
>> Could that lead to two iommu-groups being created that could/should be one larger one?
>
> The insistence on doing things in order should prevent that from
> happening. So long as the larger group is present in the upstream
> direction, or within the current bus, then it can be joined up.
>
> This doesn't work if it randomly applies to PCI devices, it is why the
> above has added the "PCI device is probing out of order" detection.
>
ok, will keep that concept in mind when reviewing.
> Jason
>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 03/16] iommu: Compute iommu_groups properly for PCIe switches
2025-07-18 21:41 ` Donald Dutile
@ 2025-07-18 22:52 ` Jason Gunthorpe
0 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-18 22:52 UTC (permalink / raw)
To: Donald Dutile
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Fri, Jul 18, 2025 at 05:41:47PM -0400, Donald Dutile wrote:
> > Not that, at the start of the function there is a pci_physfn(), the
> > entire function never works on a VF, so bus is never a VF's bus.
> >
> Well, I guess it depends on what you call 'a VF's bus' -- it returns the VF's->PF(pdev)->bus if virt-fn,
> which I would call the VF's bus.
There is a "VF bus" that the PCI core creates if the VF's RID reaches
outside the PF's bus. This bus - the "SRIOV virtual bus" - has a NULL self.
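To make the distinction concrete, here is a tiny self-contained model (hypothetical stand-in structures and names, not the kernel's actual definitions) of why normalizing a device through pci_physfn() first means the code never operates on a bus whose self is NULL:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel's pci_bus/pci_dev relationship. */
struct pci_bus {
	struct pci_dev *self; /* NULL on an SR-IOV virtual bus */
};

struct pci_dev {
	struct pci_bus *bus;
	bool is_virtfn;
	struct pci_dev *physfn; /* PF backing this VF, when is_virtfn */
};

/* Models pci_physfn(): a VF is always replaced by its PF up front. */
static struct pci_dev *normalize(struct pci_dev *dev)
{
	return dev->is_virtfn ? dev->physfn : dev;
}

/* After normalization the device's bus always has a bridge (self). */
static bool bus_has_self(struct pci_dev *dev)
{
	return normalize(dev)->bus->self != NULL;
}
```

Under this model a VF sitting on a virtual bus (self == NULL) is mapped back to its PF on a real bus before any bus->self dereference, which is why the WARN_ON(!bus->self) in the patch should be unreachable.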
Jason
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-07-18 21:29 ` [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Alex Williamson
@ 2025-07-18 22:59 ` Jason Gunthorpe
0 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-18 22:59 UTC (permalink / raw)
To: Alex Williamson
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
maorg, patches, tdave, Tony Zhu
On Fri, Jul 18, 2025 at 03:29:34PM -0600, Alex Williamson wrote:
> I can't say that the proposed code here is doing the wrong thing by
> propagating the lack of isolation, but it's gratuitous when there is no
> DMA initiator on the non-isolated branch and it causes a significant
> usage problem for vfio.
I think the answer is the quote Don found: if the MFD function has no
ACS cap then we assume it cannot do P2P:
e.g., Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and Multi-Function Devices
...
ACS P2P Request Redirect: **must** be implemented by Functions that
support peer-to-peer traffic with other Functions.
So I propose we lean into that. It should resolve everything you noted
here.
Instead the logic will be: if any function with ACS has non-isolated
ACS then the whole MFD is in a group. This is a lot closer to what it
does today, with the addition that explicit ACS non-isolation now
groups correctly.
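As a sketch of that proposed rule (hypothetical helper and field names, not the series' actual code): a function without an ACS capability is assumed incapable of P2P per Section 6.12.1.2, so only a function that exposes ACS but leaves it non-isolating pulls the whole MFD into one group:

```c
#include <assert.h>
#include <stdbool.h>

struct fn {
	bool has_acs;      /* ACS capability present on this function */
	bool acs_isolated; /* ACS flags meet the PCI_ACS_ISOLATED set */
};

/*
 * Functions lacking ACS are assumed unable to do P2P (the spec
 * requires P2P-capable functions to implement Request Redirect), so
 * the MFD collapses into one group only when some function exposes
 * ACS with non-isolating settings.
 */
static bool mfd_shares_one_group(const struct fn *fns, int nr)
{
	for (int i = 0; i < nr; i++)
		if (fns[i].has_acs && !fns[i].acs_isolated)
			return true;
	return false;
}
```

Under this rule the ACS-less root ports Alex reported keep their single-device groups, while explicit non-isolating ACS still widens the group.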
Jason
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 11/16] iommu: Compute iommu_groups properly for PCIe MFDs
2025-07-09 14:52 ` [PATCH v2 11/16] iommu: Compute iommu_groups properly for PCIe MFDs Jason Gunthorpe
@ 2025-07-28 9:47 ` Cédric Le Goater
2025-07-28 13:58 ` Jason Gunthorpe
0 siblings, 1 reply; 46+ messages in thread
From: Cédric Le Goater @ 2025-07-28 9:47 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
Hello Jason,
On 7/9/25 16:52, Jason Gunthorpe wrote:
> The current algorithm does not work if the ACS is not uniform across the
> entire MFD. This seems to mostly happen in the real world because of
> Linux's quirk systems that sometimes quirk only one function in an MFD,
> creating an asymmetric situation.
>
> For discussion let's consider a simple MFD topology like the below:
>
> -- MFD 00:1f.0 ACS != REQ_ACS_FLAGS
> Root 00:00.00 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS
> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
>
> This asymmetric ACS could be created using the config_acs kernel command
> line parameter, from quirks, or from a poorly thought out device that has
> ACS flags only on some functions.
>
> Since ACS is an egress property the asymmetric flags allow for 00:1f.0 to
> do memory accesses into 00:1f.6's BARs, but 00:1f.6 cannot reach any other
> function. Thus we expect an iommu_group to contain all three
> devices. Instead the current algorithm gives a group of [1f.0, 1f.2] and a
> single device group of 1f.6.
>
> The current algorithm sees the good ACS flags on 00:1f.6 and does not
> consider ACS on any other MFD functions.
>
> For path properties the ACS flags say that 00:1f.6 is safe to use with
> PASID and supports SVA as it will not have any portions of its address
> space routed away from the IOMMU, this part of the ACS system is working
> correctly.
>
> This is a problematic fix because this historical mistake has created an
> ecosystem around it. We now have quirks that assume single function is
> enough to quirk and it seems there are PCI root complexes that make the
> same non-compliant assumption.
>
> The new helper pci_mfd_isolation() retains the existing quirks and we
> will probably need to add additional HW quirks for PCI root complexes that
> have not followed the spec but have been silently working today.
>
> Use pci_reachable_set() in pci_device_group() to make the resulting
> algorithm faster and easier to understand.
>
> Add pci_mfds_are_same_group() which specifically looks pair-wise at all
> functions in the MFDs. Any function without ACS isolation will become
> reachable to all other functions.
>
> pci_reachable_set() does the calculations for figuring out the set of
> devices under the pci_bus_sem, which is better than repeatedly searching
> across all PCI devices.
>
> Once the set of devices is determined and the set has more than one device
> use pci_get_slot() to search for any existing groups in the reachable set.
>
> Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/iommu/iommu.c | 173 +++++++++++++++++-------------------------
> 1 file changed, 71 insertions(+), 102 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 7b407065488296..cd26b43916e8be 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1413,85 +1413,6 @@ int iommu_group_id(struct iommu_group *group)
> }
> EXPORT_SYMBOL_GPL(iommu_group_id);
>
> -static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> - unsigned long *devfns);
> -
> -/*
> - * For multifunction devices which are not isolated from each other, find
> - * all the other non-isolated functions and look for existing groups. For
> - * each function, we also need to look for aliases to or from other devices
> - * that may already have a group.
> - */
> -static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
> - unsigned long *devfns)
> -{
> - struct pci_dev *tmp = NULL;
> - struct iommu_group *group;
> -
> - if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
> - return NULL;
> -
> - for_each_pci_dev(tmp) {
> - if (tmp == pdev || tmp->bus != pdev->bus ||
> - PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) ||
> - pci_acs_enabled(tmp, PCI_ACS_ISOLATED))
> - continue;
> -
> - group = get_pci_alias_group(tmp, devfns);
> - if (group) {
> - pci_dev_put(tmp);
> - return group;
> - }
> - }
> -
> - return NULL;
> -}
> -
> -/*
> - * Look for aliases to or from the given device for existing groups. DMA
> - * aliases are only supported on the same bus, therefore the search
> - * space is quite small (especially since we're really only looking at pcie
> - * device, and therefore only expect multiple slots on the root complex or
> - * downstream switch ports). It's conceivable though that a pair of
> - * multifunction devices could have aliases between them that would cause a
> - * loop. To prevent this, we use a bitmap to track where we've been.
> - */
> -static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> - unsigned long *devfns)
> -{
> - struct pci_dev *tmp = NULL;
> - struct iommu_group *group;
> -
> - if (test_and_set_bit(pdev->devfn & 0xff, devfns))
> - return NULL;
> -
> - group = iommu_group_get(&pdev->dev);
> - if (group)
> - return group;
> -
> - for_each_pci_dev(tmp) {
> - if (tmp == pdev || tmp->bus != pdev->bus)
> - continue;
> -
> - /* We alias them or they alias us */
> - if (pci_devs_are_dma_aliases(pdev, tmp)) {
> - group = get_pci_alias_group(tmp, devfns);
> - if (group) {
> - pci_dev_put(tmp);
> - return group;
> - }
> -
> - group = get_pci_function_alias_group(tmp, devfns);
> - if (group) {
> - pci_dev_put(tmp);
> - return group;
> - }
> - }
> - }
> -
> - return NULL;
> -}
> -
> /*
> * Generic device_group call-back function. It just allocates one
> * iommu-group per device.
> @@ -1534,40 +1455,88 @@ static struct iommu_group *pci_group_alloc_non_isolated(void)
> return group;
> }
>
> +/*
> + * Ignoring quirks, all functions in the MFD need to be isolated from each other
> + * and get their own groups, otherwise the whole MFD will share a group. Any
> + * function that lacks explicit ACS isolation is assumed to be able to P2P
> + * access any other function in the MFD.
> + */
> +static bool pci_mfds_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
> +{
> + /* Are deva/devb functions in the same MFD? */
> + if (PCI_SLOT(deva->devfn) != PCI_SLOT(devb->devfn))
> + return false;
> + /* Don't understand what is happening, be conservative */
> + if (deva->multifunction != devb->multifunction)
> + return true;
> + if (!deva->multifunction)
> + return false;
> +
> + /* Quirks can inhibit single MFD functions from combining into groups */
> + if (pci_mfd_isolated(deva) || pci_mfd_isolated(devb))
> + return false;
> +
> + /* Can they reach each other's MMIO through P2P? */
> + return !pci_acs_enabled(deva, PCI_ACS_ISOLATED) ||
> + !pci_acs_enabled(devb, PCI_ACS_ISOLATED);
> +}
> +
> +static bool pci_devs_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
> +{
> + /*
> + * This is allowed to return cycles: a,b -> b,c -> c,a can be aliases.
> + */
> + if (pci_devs_are_dma_aliases(deva, devb))
> + return true;
> +
> + return pci_mfds_are_same_group(deva, devb);
> +}
> +
> static struct iommu_group *pci_get_alias_group(struct pci_dev *pdev)
> {
> - struct iommu_group *group;
> - DECLARE_BITMAP(devfns, 256) = {};
> + struct pci_reachable_set devfns;
> + const unsigned int NR_DEVFNS = sizeof(devfns.devfns) * BITS_PER_BYTE;
> + unsigned int devfn;
>
> /*
> - * Look for existing groups on device aliases. If we alias another
> - * device or another device aliases us, use the same group.
> + * Look for existing groups on device aliases and multi-function ACS. If
> + * we alias another device or another device aliases us, use the same
> + * group.
> + *
> + * pci_reachable_set() should return the same bitmap if called for any
> + * device in the set and we want all devices in the set to have the same
> + * group.
> */
> - group = get_pci_alias_group(pdev, devfns);
> - if (group)
> - return group;
> + pci_reachable_set(pdev, &devfns, pci_devs_are_same_group);
> + /* start is known to have iommu_group_get() == NULL */
> + __clear_bit(pdev->devfn, devfns.devfns);
>
> /*
> - * Look for existing groups on non-isolated functions on the same
> - * slot and aliases of those funcions, if any. No need to clear
> - * the search bitmap, the tested devfns are still valid.
> - */
> - group = get_pci_function_alias_group(pdev, devfns);
> - if (group)
> - return group;
> -
> - /*
> - * When MFD's are included in the set due to ACS we assume that if ACS
> - * permits an internal loopback between functions it also permits the
> - * loopback to go downstream if a function is a bridge.
> + * When MFD functions are included in the set due to ACS we assume that
> + * if ACS permits an internal loopback between functions it also permits
> + * the loopback to go downstream if any function is a bridge.
> *
> * It is less clear what aliases mean when applied to a bridge. For now
> * be conservative and also propagate the group downstream.
> */
> - __clear_bit(pdev->devfn & 0xFF, devfns);
> - if (!bitmap_empty(devfns, sizeof(devfns) * BITS_PER_BYTE))
> - return pci_group_alloc_non_isolated();
> - return NULL;
> + if (bitmap_empty(devfns.devfns, NR_DEVFNS))
> + return NULL;
> +
> + for_each_set_bit(devfn, devfns.devfns, NR_DEVFNS) {
> + struct iommu_group *group;
> + struct pci_dev *pdev_slot;
> +
> + pdev_slot = pci_get_slot(pdev->bus, devfn);
> + group = iommu_group_get(&pdev_slot->dev);
> + pci_dev_put(pdev_slot);
> + if (group) {
> + if (WARN_ON(!(group->bus_data &
> + BUS_DATA_PCI_NON_ISOLATED)))
> + group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
> + return group;
> + }
> + }
> + return pci_group_alloc_non_isolated();
> }
>
> static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
I am seeing this WARN_ON when creating VFs of a CX-7 adapter :
[ 31.436294] pci 0000:b1:00.2: enabling Extended Tags
[ 31.448767] ------------[ cut here ]------------
[ 31.453392] WARNING: CPU: 47 PID: 1673 at drivers/iommu/iommu.c:1533 pci_device_group+0x307/0x3b0
....
[ 31.869174] </TASK>
[ 31.871373] ---[ end trace 0000000000000000 ]---
[ 31.876020] pci 0000:b1:00.2: Adding to iommu group 11
[ 31.883572] mlx5_core 0000:b1:00.2: enabling device (0000 -> 0002)
[ 31.889846] mlx5_core 0000:b1:00.2: firmware version: 28.40.1000
Some more info on the system below,
b1:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
Subsystem: Mellanox Technologies Device 0026
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin B routed to IRQ 17
NUMA node: 1
IOMMU group: 12
Region 0: Memory at dc000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at dbd00000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 512 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (downgraded), Width x8 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn+
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [48] Vital Product Data
Product Name: NVIDIA ConnectX-7 Ethernet adapter card, 200 GbE , Dual-port QSFP112, PCIe 5.0 x16, Crypto and Secure Boot
Read-only fields:
[PN] Part number: MCX713106AC-VEAT
[EC] Engineering changes: A5
[V2] Vendor specific: MCX713106AC-VEAT
[SN] Serial number: MT2304XZ02VD
[V3] Vendor specific: 6e988f87fb9ced118000946dae47a24a
[VA] Vendor specific: MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:MODL=CX713106A
[V0] Vendor specific: PCIeGen5 x16
[VU] Vendor specific: MT2304XZ02VDMLNXS0D0F1
[RV] Reserved: checksum good, 1 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
AERCap: First Error Pointer: 04, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 01
VF offset: 9, stride: 1, Device ID: 101e
Supported Page Size: 000007ff, System Page Size: 00000001
Region 0: Memory at 00000000e0000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [230 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [420 v1] Data Link Feature <?>
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
b1:00.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
Subsystem: Mellanox Technologies Device 0026
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
NUMA node: 1
IOMMU group: 11
Region 0: Memory at e0800000 (64-bit, prefetchable) [virtual] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed unknown (downgraded), Width x0 (downgraded)
TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [9c] MSI-X: Enable- Count=12 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [100 v1] Vendor Specific Information: ID=0000 Rev=0 Len=00c <?>
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Kernel driver in use: mlx5_vfio_pci
Kernel modules: mlx5_core
+-[0000:b0]-+-00.0 Intel Corporation Ice Lake Memory Map/VT-d
| +-00.1 Intel Corporation Ice Lake Mesh 2 PCIe
| +-00.2 Intel Corporation Ice Lake RAS
| +-00.4 Intel Corporation Ice Lake IEH
| \-04.0-[b1-b2]--+-00.0 Mellanox Technologies MT2910 Family [ConnectX-7]
| +-00.1 Mellanox Technologies MT2910 Family [ConnectX-7]
| +-00.2 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| +-00.3 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| +-00.4 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| +-00.5 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| +-00.6 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| +-00.7 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| +-01.0 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
| \-01.1 Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
IOMMU groups are baroque :
*IOMMU Group 11 :
b1:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
b1:00.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
b1:00.3 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
b1:00.4 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
b1:00.5 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
b1:00.6 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
b1:00.7 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
*IOMMU Group 12 :
b1:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
*IOMMU Group 184 :
b1:01.0 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
*IOMMU Group 185 :
b1:01.1 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
Other differences are on the onboard graphic card:
*IOMMU Group 26 :
02:00.0 PCI bridge: PLDA PCI Express Bridge (rev 02)
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
becomes :
*IOMMU Group 26 :
02:00.0 PCI bridge: PLDA PCI Express Bridge (rev 02)
*IOMMU Group 27 :
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
I guess that's unrelated.
Thanks,
C.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 11/16] iommu: Compute iommu_groups properly for PCIe MFDs
2025-07-28 9:47 ` Cédric Le Goater
@ 2025-07-28 13:58 ` Jason Gunthorpe
0 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-07-28 13:58 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Mon, Jul 28, 2025 at 11:47:44AM +0200, Cédric Le Goater wrote:
> > + for_each_set_bit(devfn, devfns.devfns, NR_DEVFNS) {
> > + struct iommu_group *group;
> > + struct pci_dev *pdev_slot;
> > +
> > + pdev_slot = pci_get_slot(pdev->bus, devfn);
> > + group = iommu_group_get(&pdev_slot->dev);
> > + pci_dev_put(pdev_slot);
> > + if (group) {
> > + if (WARN_ON(!(group->bus_data &
> > + BUS_DATA_PCI_NON_ISOLATED)))
> > + group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
> > + return group;
> > + }
> > + }
> > + return pci_group_alloc_non_isolated();
> > }
> > static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
>
>
> I am seeing this WARN_ON when creating VFs of a CX-7 adapter :
>
> [ 31.436294] pci 0000:b1:00.2: enabling Extended Tags
> [ 31.448767] ------------[ cut here ]------------
> [ 31.453392] WARNING: CPU: 47 PID: 1673 at drivers/iommu/iommu.c:1533 pci_device_group+0x307/0x3b0
> ....
Ah, yeah, it thinks the SRIOV VFs are part of the MFD because they
match the slot. I guess the old code had this same issue but it was
more harmless.
> *IOMMU Group 11 :
> b1:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
> b1:00.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
> b1:00.3 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
> b1:00.4 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
> b1:00.5 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
> b1:00.6 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
> b1:00.7 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
These are all on the MFD's slot...
> *IOMMU Group 12 :
> b1:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
That one is basically what triggered the WARN_ON; its group was created
before SRIOV "changed" the slot.
> *IOMMU Group 184 :
> b1:01.0 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
> *IOMMU Group 185 :
> b1:01.1 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
These two are on another slot so they no longer matched.
It should be fixed by ignoring VFs when doing MFD matching.
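A minimal sketch of that fix (hypothetical predicate loosely following pci_mfds_are_same_group(), not the actual patch): a VF that numerically lands in slot 0 next to the PF must not be treated as an MFD sibling:

```c
#include <assert.h>
#include <stdbool.h>

struct dev {
	unsigned int devfn;
	bool multifunction;
	bool is_virtfn; /* SR-IOV virtual function */
};

/* Slot is the upper 5 bits of devfn, function the lower 3. */
#define SLOT(devfn) ((devfn) >> 3)

static bool mfd_siblings(const struct dev *a, const struct dev *b)
{
	/* VFs only share the slot numerically; they are not MFD functions. */
	if (a->is_virtfn || b->is_virtfn)
		return false;
	if (SLOT(a->devfn) != SLOT(b->devfn))
		return false;
	return a->multifunction && b->multifunction;
}
```

With a check like this, b1:00.2-b1:00.7 in the report above would never be matched against b1:00.0/b1:00.1 in the first place, avoiding both the WARN_ON and the widened group 11.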
> Other differences are on the onboard graphic card:
>
> *IOMMU Group 26 :
> 02:00.0 PCI bridge: PLDA PCI Express Bridge (rev 02)
> 03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
>
> becomes :
>
> *IOMMU Group 26 :
> 02:00.0 PCI bridge: PLDA PCI Express Bridge (rev 02)
> *IOMMU Group 27 :
> 03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
Yes, this is a deliberate improvement. PCIe to PCI bridges like this
do keep the bridge isolated from the VGA. Only the downstream PCI
devices alias with each other and are non-isolated.
Thanks,
Jason
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
` (16 preceding siblings ...)
2025-07-18 21:29 ` [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Alex Williamson
@ 2025-08-02 1:45 ` Ethan Zhao
2025-08-02 15:18 ` Jason Gunthorpe
17 siblings, 1 reply; 46+ messages in thread
From: Ethan Zhao @ 2025-08-02 1:45 UTC (permalink / raw)
To: Jason Gunthorpe, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon
Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 7/9/2025 10:52 PM, Jason Gunthorpe wrote:
> The series patches have extensive descriptions as to the problem and
> solution, but in short the ACS flags are not analyzed according to the
> spec to form the iommu_groups that VFIO is expecting for security.
>
> ACS is an egress control only. For a path the ACS flags on each hop only
> effect what other devices the TLP is allowed to reach. It does not prevent
> other devices from reaching into this path.
Perhaps I was a little confused here: the egress control vector on the
switch port could prevent the downstream EP device from being reached by
P2P TLPs, while the EP itself has no knob to say whether it is isolated.
> For VFIO if device A is permitted to access device B's MMIO then A and B
> must be grouped together. This says that even if a path has isolating ACS
> flags on each hop, off-path devices with non-isolating ACS can still reach
> into that path and must be grouped together.
>
> For switches, a PCIe topology like:
>
> -- DSP 02:00.0 -> End Point A
> Root 00:00.0 -> USP 01:00.0 --|
> -- DSP 02:03.0 -> End Point B
>
> Will generate unique single device groups for every device even if ACS is
> not enabled on the two DSP ports. It should at least group A/B together
> because no ACS means A can reach the MMIO of B. This is a serious failure
> for the VFIO security model.
Yup, whether EP A / EP B is isolated depends on the egress ACS setting
on their DSPs.
> For multi-function-devices, a PCIe topology like:
>
> -- MFD 00:1f.0 ACS != REQ_ACS_FLAGS
> Root 00:00.00 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS
> |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
>
> Will group [1f.0, 1f.2] and 1f.6 gets a single device group. In many cases
> we suspect that the MFD actually doesn't need ACS, so this is probably not
> as important a security failure, but from a spec perspective the correct
> answer is one group of [1f.0, 1f.2, 1f.6] beacuse 1f.0/2 have no ACS
> preventing them from reaching the MMIO of 1f.6.
I wonder if an MFD/SRIOV device has egress control like a switch port.
Thanks,
Ethan
>
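Each MFD function does carry its own ACS egress controls, which is what the series keys off. A simplified sketch of the grouping rule (loosely modelled on what pci_mfd_isolation() checks; not the actual kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ACS control bits, as in include/uapi/linux/pci_regs.h */
#define PCI_ACS_SV 0x0001
#define PCI_ACS_RR 0x0004
#define PCI_ACS_CR 0x0008
#define PCI_ACS_UF 0x0010
#define PCI_ACS_ISOLATED (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)

/*
 * For an MFD to be isolated, *every* function must apply egress
 * control on its P2P traffic. A single function without ACS can
 * reach the MMIO of its siblings, so all functions must then share
 * one iommu_group - including the functions that do have ACS.
 */
static bool mfd_isolated(const uint16_t *fn_acs_ctrl, int nfuncs)
{
	for (int i = 0; i < nfuncs; i++)
		if ((fn_acs_ctrl[i] & PCI_ACS_ISOLATED) != PCI_ACS_ISOLATED)
			return false;
	return true;
}
```

Applied to the 00:1f example above: since 1f.0 and 1f.2 lack ACS, the spec-correct result is one group of [1f.0, 1f.2, 1f.6], even though 1f.6 itself is configured correctly.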
> There is also some confusing spec language about how ACS and SRIOV works
> which this series does not address.
>
> This entire series goes further and makes some additional improvements to
> the ACS validation found while studying this problem. The groups around a
> PCIe to PCI bridge are shrunk to not include the PCIe bridge.
>
> The last patches implement "ACS Enhanced" on top of it. Due to how ACS
> Enhanced was defined as a non-backward compatible feature it is important
> to get SW support out there.
>
> Due to the potential of iommu_groups becoming wider and thus non-usable
> for VFIO, this should go to a linux-next tree to give it some more
> exposure.
>
> I have now tested this on a few systems I could get:
>
> - Various Intel client systems:
> * Raptor Lake, with VMD enabled and using the real_dev mechanism
> * 6/7th generation 100 Series/C320
> * 5/6th generation 100 Series/C320 with a NIC MFD quirk
> * Tiger Lake
> * 5/6th generation Sunrise Point
> No change in grouping on any of these systems
>
> - NVIDIA Grace system with 5 different PCI switches from two vendors
> Bug fix widening the iommu_groups works as expected here
>
> - AMD Milan Starship/Matisse
> * Groups are similar, this series generates narrow groups because the
> dummy host bridges always get their own groups. Something forcibly
> disables ACS SV on one bridge which correctly causes one larger
> group.
>
> This is on github: https://github.com/jgunthorpe/linux/commits/pcie_switch_groups
>
> v2:
> - Revise comments and commit messages
> - Rename struct pci_alias_set to pci_reachable_set
> - Make more sense of the special bus->self = NULL case for SRIOV
> - Add pci_group_alloc_non_isolated() for readability
> - Rename BUS_DATA_PCI_UNISOLATED to BUS_DATA_PCI_NON_ISOLATED
> - Propagate BUS_DATA_PCI_NON_ISOLATED downstream from a MFD in case a MFD
> function is a bridge
> - New patches to add pci_mfd_isolation() to retain more cases of narrow
> groups on MFDs with missing ACS.
> - Redescribe the MFD related change as a bug fix. For a MFD to be
> isolated all functions must have egress control on their P2P.
> v1: https://patch.msgid.link/r/0-v1-74184c5043c6+195-pcie_switch_groups_jgg@nvidia.com
>
> Jason Gunthorpe (16):
> PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED
> PCI: Add pci_bus_isolation()
> iommu: Compute iommu_groups properly for PCIe switches
> iommu: Organize iommu_group by member size
> PCI: Add pci_reachable_set()
> PCI: Remove duplication in calling pci_acs_ctrl_enabled()
> PCI: Use pci_quirk_mf_endpoint_acs() for pci_quirk_amd_sb_acs()
> PCI: Use pci_acs_ctrl_isolated() for pci_quirk_al_acs()
> PCI: Widen the acs_flags to u32 within the quirk callback
> PCI: Add pci_mfd_isolation()
> iommu: Compute iommu_groups properly for PCIe MFDs
> iommu: Validate that pci_for_each_dma_alias() matches the groups
> PCI: Add the ACS Enhanced Capability definitions
> PCI: Enable ACS Enhanced bits for enable_acs and config_acs
> PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
> PCI: Check ACS Extended flags for pci_bus_isolated()
>
> drivers/iommu/iommu.c | 486 +++++++++++++++++++++++-----------
> drivers/pci/ats.c | 4 +-
> drivers/pci/pci.c | 73 ++++-
> drivers/pci/pci.h | 5 +
> drivers/pci/quirks.c | 137 ++++++----
> drivers/pci/search.c | 294 ++++++++++++++++++++
> include/linux/pci.h | 50 ++++
> include/uapi/linux/pci_regs.h | 18 ++
> 8 files changed, 846 insertions(+), 221 deletions(-)
>
>
> base-commit: e04c78d86a9699d136910cfc0bdcf01087e3267e
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-08-02 1:45 ` Ethan Zhao
@ 2025-08-02 15:18 ` Jason Gunthorpe
2025-08-05 3:43 ` Ethan Zhao
0 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-08-02 15:18 UTC (permalink / raw)
To: Ethan Zhao
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Sat, Aug 02, 2025 at 09:45:08AM +0800, Ethan Zhao wrote:
>
>
> On 7/9/2025 10:52 PM, Jason Gunthorpe wrote:
> > The series patches have extensive descriptions as to the problem and
> > solution, but in short the ACS flags are not analyzed according to the
> > spec to form the iommu_groups that VFIO is expecting for security.
> >
> > ACS is an egress control only. For a path the ACS flags on each hop only
> > effect what other devices the TLP is allowed to reach. It does not prevent
> > other devices from reaching into this path.
> Perhaps I was a little confused here, the egress control vector on the
Linux does not support egress control vector. Enabling that is a
different project and we would indeed need to introduce different
logic.
Jason
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-08-02 15:18 ` Jason Gunthorpe
@ 2025-08-05 3:43 ` Ethan Zhao
2025-08-05 12:35 ` Jason Gunthorpe
0 siblings, 1 reply; 46+ messages in thread
From: Ethan Zhao @ 2025-08-05 3:43 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 8/2/2025 11:18 PM, Jason Gunthorpe wrote:
> On Sat, Aug 02, 2025 at 09:45:08AM +0800, Ethan Zhao wrote:
>>
>>
>> On 7/9/2025 10:52 PM, Jason Gunthorpe wrote:
>>> The series patches have extensive descriptions as to the problem and
>>> solution, but in short the ACS flags are not analyzed according to the
>>> spec to form the iommu_groups that VFIO is expecting for security.
>>>
>>> ACS is an egress control only. For a path the ACS flags on each hop only
>>> effect what other devices the TLP is allowed to reach. It does not prevent
>>> other devices from reaching into this path.
>
>> Perhaps I was a little confused here, the egress control vector on the
>
> Linux does not support egress control vector. Enabling that is a
> different project and we would indeed need to introduce different
> logic.
My understanding, iommu has no logic yet to handle the egress control
vector configuration case, the static groups were created according to
FW DRDB tables, also not the case handled by notifiers for Hot-plug
events (BUS_NOTIFY_ADD_DEVICE etc). iommu_groups would need some kind
of { add, remove etc } operation per egress control vector
configuration change.
Thanks,
Ethan
>
> Jason
* Re: [PATCH v2 15/16] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
2025-07-09 14:52 ` [PATCH v2 15/16] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
2025-07-17 22:17 ` Donald Dutile
@ 2025-08-05 4:39 ` Askar Safin
1 sibling, 0 replies; 46+ messages in thread
From: Askar Safin @ 2025-08-05 4:39 UTC (permalink / raw)
To: jgg
Cc: alex.williamson, baolu.lu, bhelgaas, galshalom, iommu, joro,
jroedel, kevin.tian, kvm, linux-pci, maorg, patches, robin.murphy,
tdave, tony.zhu, will
"interpretation" should be, not "interpritation"
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-08-05 3:43 ` Ethan Zhao
@ 2025-08-05 12:35 ` Jason Gunthorpe
2025-08-05 14:41 ` Ethan Zhao
0 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-08-05 12:35 UTC (permalink / raw)
To: Ethan Zhao
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Tue, Aug 05, 2025 at 11:43:29AM +0800, Ethan Zhao wrote:
>
>
> On 8/2/2025 11:18 PM, Jason Gunthorpe wrote:
> > On Sat, Aug 02, 2025 at 09:45:08AM +0800, Ethan Zhao wrote:
> > >
> > >
> > > On 7/9/2025 10:52 PM, Jason Gunthorpe wrote:
> > > > The series patches have extensive descriptions as to the problem and
> > > > solution, but in short the ACS flags are not analyzed according to the
> > > > spec to form the iommu_groups that VFIO is expecting for security.
> > > >
> > > > ACS is an egress control only. For a path the ACS flags on each hop only
> > > > effect what other devices the TLP is allowed to reach. It does not prevent
> > > > other devices from reaching into this path.
> >
> > > Perhaps I was a little confused here, the egress control vector on the
> >
> > Linux does not support egress control vector. Enabling that is a
> > different project and we would indeed need to introduce different
> > logic.
> My understanding, iommu has no logic yet to handle the egress control
> vector configuration case,
We don't support it at all. If some FW leaves it configured then it
will work at the PCI level, but Linux has no awareness of what it is
doing.
Arguably Linux should disable it on boot, but we don't..
> The static groups were created according to
> FW DRDB tables,
?? iommu_groups have nothing to do with FW tables.
> also not the case handled by notifiers for Hot-plug events
> (BUS_NOTIFY_ADD_DEVICE etc).
This is handled.
Jason
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-08-05 12:35 ` Jason Gunthorpe
@ 2025-08-05 14:41 ` Ethan Zhao
2025-08-05 14:43 ` Jason Gunthorpe
0 siblings, 1 reply; 46+ messages in thread
From: Ethan Zhao @ 2025-08-05 14:41 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 8/5/2025 8:35 PM, Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 11:43:29AM +0800, Ethan Zhao wrote:
>>
>>
>> On 8/2/2025 11:18 PM, Jason Gunthorpe wrote:
>>> On Sat, Aug 02, 2025 at 09:45:08AM +0800, Ethan Zhao wrote:
>>>>
>>>>
>>>> On 7/9/2025 10:52 PM, Jason Gunthorpe wrote:
>>>>> The series patches have extensive descriptions as to the problem and
>>>>> solution, but in short the ACS flags are not analyzed according to the
>>>>> spec to form the iommu_groups that VFIO is expecting for security.
>>>>>
>>>>> ACS is an egress control only. For a path the ACS flags on each hop only
>>>>> effect what other devices the TLP is allowed to reach. It does not prevent
>>>>> other devices from reaching into this path.
>>>
>>>> Perhaps I was a little confused here, the egress control vector on the
>>>
>>> Linux does not support egress control vector. Enabling that is a
>>> different project and we would indeed need to introduce different
>>> logic.
>> My understanding, iommu has no logic yet to handle the egress control
>> vector configuration case,
>
> We don't support it at all. If some FW leaves it configured then it
> will work at the PCI level but Linux has no awarness of what it is
> doing.
>
> Arguably Linux should disable it on boot, but we don't..
A Linux tool like setpci can access raw PCIe configuration data, and
so it can reach the ACS control bits too. That is bothersome.
>
>> The static groups were created according to
>> FW DRDB tables,
>
> ?? iommu_groups have nothing to do with FW tables.
Sorry, typo, ACPI drhd table.
Thanks,
Ethan
>
>> also not the case handled by notifiers for Hot-plug events
>> (BUS_NOTIFY_ADD_DEVICE etc).
>
> This is handled.
>
> Jason
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-08-05 14:41 ` Ethan Zhao
@ 2025-08-05 14:43 ` Jason Gunthorpe
2025-08-06 2:22 ` Ethan Zhao
0 siblings, 1 reply; 46+ messages in thread
From: Jason Gunthorpe @ 2025-08-05 14:43 UTC (permalink / raw)
To: Ethan Zhao
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Tue, Aug 05, 2025 at 10:41:03PM +0800, Ethan Zhao wrote:
> > > My understanding, iommu has no logic yet to handle the egress control
> > > vector configuration case,
> >
> > We don't support it at all. If some FW leaves it configured then it
> > will work at the PCI level but Linux has no awarness of what it is
> > doing.
> >
> > Arguably Linux should disable it on boot, but we don't..
> linux tool like setpci could access PCIe configuration raw data, so
> does to the ACS control bits. that is boring.
Any change to ACS after boot is "not supported" - iommu_groups are
computed one time only, from the boot configuration. If someone wants
to customize ACS they need to use the new config_acs kernel parameter.
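For reference, the command-line shape is roughly the following (paraphrasing the kernel-parameters.txt description of config_acs; the mask ordering and device syntax here are from memory and should be checked against your kernel tree, and the BDF is made up):

```
# 7-character ACS mask, one character per bit in the order
# SV, TB, RR, CR, UF, EC, DT:
#   '1' = force-enable, '0' = force-disable, 'x' = leave as-is.
# Example: force SV/RR/CR/UF on for the hypothetical device 0000:02:00.0
pci=config_acs=1x111xx@0000:02:00.0
```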
> > > The static groups were created according to
> > > FW DRDB tables,
> >
> > ?? iommu_groups have nothing to do with FW tables.
> Sorry, typo, ACPI drhd table.
Same answer; AFAIK FW tables have no effect on iommu_groups.
Jason
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-08-05 14:43 ` Jason Gunthorpe
@ 2025-08-06 2:22 ` Ethan Zhao
2025-08-06 2:41 ` Baolu Lu
0 siblings, 1 reply; 46+ messages in thread
From: Ethan Zhao @ 2025-08-06 2:22 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On 8/5/2025 10:43 PM, Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 10:41:03PM +0800, Ethan Zhao wrote:
>
>>>> My understanding, iommu has no logic yet to handle the egress control
>>>> vector configuration case,
>>>
>>> We don't support it at all. If some FW leaves it configured then it
>>> will work at the PCI level but Linux has no awarness of what it is
>>> doing.
>>>
>>> Arguably Linux should disable it on boot, but we don't..
>> linux tool like setpci could access PCIe configuration raw data, so
>> does to the ACS control bits. that is boring.
>
> Any change to ACS after boot is "not supported" - iommu groups are one
> time only using boot config only. If someone wants to customize ACS
> they need to use the new config_acs kernel parameter.
That would leave ACS to boot-time configuration only. Linux never
limits tools from accessing (writing) hardware directly, even though
it could do that. Would it be better to have an
interception/configurable policy in the kernel for such hardware
accesses, like what a hypervisor does for MSRs etc.?
>
>>>> The static groups were created according to
>>>> FW DRDB tables,
>>>
>>> ?? iommu_groups have nothing to do with FW tables.
>> Sorry, typo, ACPI drhd table.
>
> Same answer, AFAIK FW tables have no effect on iommu_groups
My understanding is that FW tables are part of the description of the
device topology and the iommu-device relationship. Did I really
misunderstand something?
Thanks,
Ethan
>
> Jason
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-08-06 2:22 ` Ethan Zhao
@ 2025-08-06 2:41 ` Baolu Lu
2025-08-06 13:40 ` Jason Gunthorpe
` (2 more replies)
0 siblings, 3 replies; 46+ messages in thread
From: Baolu Lu @ 2025-08-06 2:41 UTC (permalink / raw)
To: Ethan Zhao, Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 8/6/25 10:22, Ethan Zhao wrote:
> On 8/5/2025 10:43 PM, Jason Gunthorpe wrote:
>> On Tue, Aug 05, 2025 at 10:41:03PM +0800, Ethan Zhao wrote:
>>
>>>>> My understanding, iommu has no logic yet to handle the egress control
>>>>> vector configuration case,
>>>>
>>>> We don't support it at all. If some FW leaves it configured then it
>>>> will work at the PCI level but Linux has no awarness of what it is
>>>> doing.
>>>>
>>>> Arguably Linux should disable it on boot, but we don't..
>>> linux tool like setpci could access PCIe configuration raw data, so
>>> does to the ACS control bits. that is boring.
>>
>> Any change to ACS after boot is "not supported" - iommu groups are one
>> time only using boot config only. If someone wants to customize ACS
>> they need to use the new config_acs kernel parameter.
> That would leave ACS to boot time configuration only. Linux never
> limits tools to access(write) hardware directly even it could do that.
> Would it be better to have interception/configure-able policy for such
> hardware access behavior in kernel like what hypervisor does to MSR etc ?
A root user could even clear the BME or MSE bits of a device's PCIe
configuration space, even if the device is already bound to a driver and
operating normally. I don't think there's a mechanism to prevent that
from happening, besides permission enforcement. I believe that the same
applies to the ACS control.
>>
>>>>> The static groups were created according to
>>>>> FW DRDB tables,
>>>>
>>>> ?? iommu_groups have nothing to do with FW tables.
>>> Sorry, typo, ACPI drhd table.
>>
>> Same answer, AFAIK FW tables have no effect on iommu_groups
> My understanding, FW tables are part of the description about device
> topology and iommu-device relationship. did I really misunderstand
> something ?
The ACPI/DMAR table describes the platform's IOMMU topology, not the
device topology, which is described by the PCI bus. So, the firmware
table doesn't impact the iommu_group.
Thanks,
baolu
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-08-06 2:41 ` Baolu Lu
@ 2025-08-06 13:40 ` Jason Gunthorpe
2025-08-07 1:36 ` Ethan Zhao
2025-08-08 7:56 ` Ethan Zhao
2 siblings, 0 replies; 46+ messages in thread
From: Jason Gunthorpe @ 2025-08-06 13:40 UTC (permalink / raw)
To: Baolu Lu
Cc: Ethan Zhao, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
Robin Murphy, Will Deacon, Alex Williamson, galshalom,
Joerg Roedel, Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Wed, Aug 06, 2025 at 10:41:41AM +0800, Baolu Lu wrote:
> > > Any change to ACS after boot is "not supported" - iommu groups are one
> > > time only using boot config only. If someone wants to customize ACS
> > > they need to use the new config_acs kernel parameter.
> > That would leave ACS to boot time configuration only. Linux never
> > limits tools to access(write) hardware directly even it could do that.
> > Would it be better to have interception/configure-able policy for such
> > hardware access behavior in kernel like what hypervisor does to MSR etc ?
>
> A root user could even clear the BME or MSE bits of a device's PCIe
> configuration space, even if the device is already bound to a driver and
> operating normally. I don't think there's a mechanism to prevent that
> from happening, besides permission enforcement. I believe that the same
> applies to the ACS control.
Yes, we let the user run setpci and they get to deal with the
resulting mess. Practically the kernel can't restructure the
iommu_groups once they are created.
Jason
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-08-06 2:41 ` Baolu Lu
2025-08-06 13:40 ` Jason Gunthorpe
@ 2025-08-07 1:36 ` Ethan Zhao
2025-08-08 7:56 ` Ethan Zhao
2 siblings, 0 replies; 46+ messages in thread
From: Ethan Zhao @ 2025-08-07 1:36 UTC (permalink / raw)
To: Baolu Lu, Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 8/6/2025 10:41 AM, Baolu Lu wrote:
> On 8/6/25 10:22, Ethan Zhao wrote:
>> On 8/5/2025 10:43 PM, Jason Gunthorpe wrote:
>>> On Tue, Aug 05, 2025 at 10:41:03PM +0800, Ethan Zhao wrote:
>>>
>>>>>> My understanding, iommu has no logic yet to handle the egress control
>>>>>> vector configuration case,
>>>>>
>>>>> We don't support it at all. If some FW leaves it configured then it
>>>>> will work at the PCI level but Linux has no awarness of what it is
>>>>> doing.
>>>>>
>>>>> Arguably Linux should disable it on boot, but we don't..
>>>> linux tool like setpci could access PCIe configuration raw data, so
>>>> does to the ACS control bits. that is boring.
>>>
>>> Any change to ACS after boot is "not supported" - iommu groups are one
>>> time only using boot config only. If someone wants to customize ACS
>>> they need to use the new config_acs kernel parameter.
>> That would leave ACS to boot time configuration only. Linux never
>> limits tools to access(write) hardware directly even it could do that.
>> Would it be better to have interception/configure-able policy for such
>> hardware access behavior in kernel like what hypervisor does to MSR etc ?
>
> A root user could even clear the BME or MSE bits of a device's PCIe
> configuration space, even if the device is already bound to a driver and
> operating normally. I don't think there's a mechanism to prevent that
PCI tools such as setpci access a PCIe device's configuration space
via the sysfs interface, which grants read/write rights to root users
by default; that is one point where the root permission could be
controlled.
The PCIe device configuration space is mapped into the CPU address
space via ECAM, by calling ioremap to set up the CPU page table; the
PTE has permission control bits for read/write/cache etc., so this is
another point of control.
The legacy PCI device configuration space is accessed via 0xCF8/0xCFC
ioport operations; there is a point to intercept there as well.
To prevent a device from DMAing to the configuration space, the same
IOMMU page table PTEs could be set up to control the access.
> from happening, besides permission enforcement. I believe that the same
> applies to the ACS control.
>
>>>
>>>>>> The static groups were created according to
>>>>>> FW DRDB tables,
>>>>>
>>>>> ?? iommu_groups have nothing to do with FW tables.
>>>> Sorry, typo, ACPI drhd table.
>>>
>>> Same answer, AFAIK FW tables have no effect on iommu_groups
>> My understanding, FW tables are part of the description about device
>> topology and iommu-device relationship. did I really misunderstand
>> something ?
>
> The ACPI/DMAR table describes the platform's IOMMU topology, not the
> device topology, which is described by the PCI bus. So, the firmware
> table doesn't impact the iommu_group.
I remember the DRHD table lists the IOMMUs and the devices that belong
to them, but the kernel still needs to traverse the PCIe topology to
make up the iommu_groups.
Thanks,
Ethan
>
> Thanks,
> baolu
* Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
2025-08-06 2:41 ` Baolu Lu
2025-08-06 13:40 ` Jason Gunthorpe
2025-08-07 1:36 ` Ethan Zhao
@ 2025-08-08 7:56 ` Ethan Zhao
2 siblings, 0 replies; 46+ messages in thread
From: Ethan Zhao @ 2025-08-08 7:56 UTC (permalink / raw)
To: Baolu Lu, Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, galshalom, Joerg Roedel, Kevin Tian,
kvm, maorg, patches, tdave, Tony Zhu
On 8/6/2025 10:41 AM, Baolu Lu wrote:
> On 8/6/25 10:22, Ethan Zhao wrote:
>> On 8/5/2025 10:43 PM, Jason Gunthorpe wrote:
>>> On Tue, Aug 05, 2025 at 10:41:03PM +0800, Ethan Zhao wrote:
>>>
>>>>>> My understanding, iommu has no logic yet to handle the egress control
>>>>>> vector configuration case,
>>>>>
>>>>> We don't support it at all. If some FW leaves it configured then it
>>>>> will work at the PCI level but Linux has no awarness of what it is
>>>>> doing.
>>>>>
>>>>> Arguably Linux should disable it on boot, but we don't..
>>>> linux tool like setpci could access PCIe configuration raw data, so
>>>> does to the ACS control bits. that is boring.
>>>
>>> Any change to ACS after boot is "not supported" - iommu groups are one
>>> time only using boot config only. If someone wants to customize ACS
>>> they need to use the new config_acs kernel parameter.
>> That would leave ACS to boot time configuration only. Linux never
>> limits tools to access(write) hardware directly even it could do that.
>> Would it be better to have interception/configure-able policy for such
>> hardware access behavior in kernel like what hypervisor does to MSR etc ?
>
> A root user could even clear the BME or MSE bits of a device's PCIe
> configuration space, even if the device is already bound to a driver and
> operating normally. I don't think there's a mechanism to prevent that
> from happening, besides permission enforcement. I believe that the same
> applies to the ACS control.
>
>>>
>>>>>> The static groups were created according to
>>>>>> FW DRDB tables,
>>>>>
>>>>> ?? iommu_groups have nothing to do with FW tables.
>>>> Sorry, typo, ACPI drhd table.
>>>
>>> Same answer, AFAIK FW tables have no effect on iommu_groups
>> My understanding, FW tables are part of the description about device
>> topology and iommu-device relationship. did I really misunderstand
>> something ?
>
> The ACPI/DMAR table describes the platform's IOMMU topology, not the
> device topology, which is described by the PCI bus. So, the firmware
> table doesn't impact the iommu_group.
There is the kernel lockdown LSM, which works via the kernel parameter
lockdown={integrity | confidentiality}:
"If set to integrity, kernel features that allow userland to modify the
running kernel are disabled. If set to confidentiality, kernel features
that allow userland to extract confidential information from the kernel
are also disabled."
It also covers direct access to the PCIe configuration space from
userland, but the configuration granularity is coarse; the kernel
cannot be configured to prevent only the PCI device accesses from
userland.
The design and implementation are structured and the granularity is
fine-grained, so it would be easy to extend it and add a new kernel
parameter that locks down only PCI device configuration space access:
[LOCKDOWN_PCI_ACCESS] = "direct PCI access"
Thanks,
Ethan
>
> Thanks,
> baolu
* Re: [PATCH v2 10/16] PCI: Add pci_mfd_isolation()
2025-07-09 14:52 ` [PATCH v2 10/16] PCI: Add pci_mfd_isolation() Jason Gunthorpe
@ 2025-08-20 17:21 ` Keith Busch
0 siblings, 0 replies; 46+ messages in thread
From: Keith Busch @ 2025-08-20 17:21 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
Will Deacon, Alex Williamson, Lu Baolu, galshalom, Joerg Roedel,
Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu
On Wed, Jul 09, 2025 at 11:52:13AM -0300, Jason Gunthorpe wrote:
> @@ -2078,6 +2079,9 @@ pci_reachable_set(struct pci_dev *start, struct pci_reachable_set *devfns,
> static inline enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> { return PCIE_NON_ISOLATED; }
>
> +bool pci_mfd_isolated(struct pci_dev *dev)
> +{ return false; }
> +
This needs to be 'static inline'.
end of thread, other threads: [~2025-08-20 17:21 UTC | newest]
Thread overview: 46+ messages
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 01/16] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 02/16] PCI: Add pci_bus_isolation() Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 03/16] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
2025-07-17 22:03 ` Donald Dutile
2025-07-18 18:09 ` Jason Gunthorpe
2025-07-18 19:00 ` Donald Dutile
2025-07-18 20:19 ` Jason Gunthorpe
2025-07-18 21:41 ` Donald Dutile
2025-07-18 22:52 ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 04/16] iommu: Organize iommu_group by member size Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 05/16] PCI: Add pci_reachable_set() Jason Gunthorpe
2025-07-17 22:04 ` Donald Dutile
2025-07-18 17:49 ` Jason Gunthorpe
2025-07-18 19:10 ` Donald Dutile
2025-07-09 14:52 ` [PATCH v2 06/16] PCI: Remove duplication in calling pci_acs_ctrl_enabled() Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 07/16] PCI: Use pci_quirk_mf_endpoint_acs() for pci_quirk_amd_sb_acs() Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 08/16] PCI: Use pci_acs_ctrl_isolated() for pci_quirk_al_acs() Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 09/16] PCI: Widen the acs_flags to u32 within the quirk callback Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 10/16] PCI: Add pci_mfd_isolation() Jason Gunthorpe
2025-08-20 17:21 ` Keith Busch
2025-07-09 14:52 ` [PATCH v2 11/16] iommu: Compute iommu_groups properly for PCIe MFDs Jason Gunthorpe
2025-07-28 9:47 ` Cédric Le Goater
2025-07-28 13:58 ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 12/16] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
2025-07-17 22:07 ` Donald Dutile
2025-07-09 14:52 ` [PATCH v2 13/16] PCI: Add the ACS Enhanced Capability definitions Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 14/16] PCI: Enable ACS Enhanced bits for enable_acs and config_acs Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 15/16] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
2025-07-17 22:17 ` Donald Dutile
2025-07-18 17:52 ` Jason Gunthorpe
2025-08-05 4:39 ` Askar Safin
2025-07-09 14:52 ` [PATCH v2 16/16] PCI: Check ACS Extended flags for pci_bus_isolated() Jason Gunthorpe
2025-07-18 21:29 ` [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Alex Williamson
2025-07-18 22:59 ` Jason Gunthorpe
2025-08-02 1:45 ` Ethan Zhao
2025-08-02 15:18 ` Jason Gunthorpe
2025-08-05 3:43 ` Ethan Zhao
2025-08-05 12:35 ` Jason Gunthorpe
2025-08-05 14:41 ` Ethan Zhao
2025-08-05 14:43 ` Jason Gunthorpe
2025-08-06 2:22 ` Ethan Zhao
2025-08-06 2:41 ` Baolu Lu
2025-08-06 13:40 ` Jason Gunthorpe
2025-08-07 1:36 ` Ethan Zhao
2025-08-08 7:56 ` Ethan Zhao