kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches
@ 2025-06-30 22:28 Jason Gunthorpe
  2025-06-30 22:28 ` [PATCH 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
                   ` (11 more replies)
  0 siblings, 12 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 22:28 UTC (permalink / raw)
  To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
	kvm, maorg, patches, tdave, Tony Zhu

The series patches have extensive descriptions as to the problem and
solution, but in short a PCIe topology like:

                               -- DSP 02:00.0 -> End Point A
 Root 00:00.0 -> USP 01:00.0 --|
                               -- DSP 02:03.0 -> End Point B

Will generate unique single device groups for every device even if ACS is
not enabled on the two DSP ports. This is a serious failure for the VFIO
security model.

This entire series goes further and makes some additional improvements to
the ACS validation found while studying this problem. The groups around a
PCIe to PCI bridge are shrunk to not include the PCIe bridge.

The last patches implement "ACS Enhanced" on top of it. Due to how ACS
Enhanced was defined as a non-backward compatible feature it is important
to get SW support out there.

Due to potential VFIO complaints this should go to a linux-next tree to
give it some more exposure.

This has been tested on a system here with 5 different PCIe switches from
two vendors, a PCIe-PCI bridge, and a complex set of ACS flags.

This is on github: https://github.com/jgunthorpe/linux/commits/pcie_switch_groups

Jason Gunthorpe (11):
  PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED
  PCI: Add pci_bus_isolation()
  iommu: Compute iommu_groups properly for PCIe switches
  iommu: Organize iommu_group by member size
  PCI: Add pci_reachable_set()
  iommu: Use pci_reachable_set() in pci_device_group()
  iommu: Validate that pci_for_each_dma_alias() matches the groups
  PCI: Add the ACS Enhanced Capability definitions
  PCI: Enable ACS Enhanced bits for enable_acs and config_acs
  PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
  PCI: Check ACS Extended flags for pci_bus_isolated()

 drivers/iommu/iommu.c         | 439 ++++++++++++++++++++++------------
 drivers/pci/ats.c             |   4 +-
 drivers/pci/pci.c             |  73 +++++-
 drivers/pci/search.c          | 250 +++++++++++++++++++
 include/linux/pci.h           |  43 ++++
 include/uapi/linux/pci_regs.h |  18 ++
 6 files changed, 661 insertions(+), 166 deletions(-)


base-commit: e04c78d86a9699d136910cfc0bdcf01087e3267e
-- 
2.43.0


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED
  2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
@ 2025-06-30 22:28 ` Jason Gunthorpe
  2025-06-30 22:28 ` [PATCH 02/11] PCI: Add pci_bus_isolation() Jason Gunthorpe
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 22:28 UTC (permalink / raw)
  To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
	kvm, maorg, patches, tdave, Tony Zhu

The next patch wants to use this constant, share it.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommu.c         | 16 +++-------------
 include/uapi/linux/pci_regs.h | 10 ++++++++++
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a4b606c591da66..d265de874b14b6 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1408,16 +1408,6 @@ EXPORT_SYMBOL_GPL(iommu_group_id);
 static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
 					       unsigned long *devfns);
 
-/*
- * To consider a PCI device isolated, we require ACS to support Source
- * Validation, Request Redirection, Completer Redirection, and Upstream
- * Forwarding.  This effectively means that devices cannot spoof their
- * requester ID, requests and completions cannot be redirected, and all
- * transactions are forwarded upstream, even as it passes through a
- * bridge where the target device is downstream.
- */
-#define REQ_ACS_FLAGS   (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
-
 /*
  * For multifunction devices which are not isolated from each other, find
  * all the other non-isolated functions and look for existing groups.  For
@@ -1430,13 +1420,13 @@ static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
 	struct pci_dev *tmp = NULL;
 	struct iommu_group *group;
 
-	if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
+	if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
 		return NULL;
 
 	for_each_pci_dev(tmp) {
 		if (tmp == pdev || tmp->bus != pdev->bus ||
 		    PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) ||
-		    pci_acs_enabled(tmp, REQ_ACS_FLAGS))
+		    pci_acs_enabled(tmp, PCI_ACS_ISOLATED))
 			continue;
 
 		group = get_pci_alias_group(tmp, devfns);
@@ -1580,7 +1570,7 @@ struct iommu_group *pci_device_group(struct device *dev)
 		if (!bus->self)
 			continue;
 
-		if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
+		if (pci_acs_path_enabled(bus->self, NULL, PCI_ACS_ISOLATED))
 			break;
 
 		pdev = bus->self;
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index a3a3e942dedffc..6edffd31cd8e2c 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1008,6 +1008,16 @@
 #define PCI_ACS_CTRL		0x06	/* ACS Control Register */
 #define PCI_ACS_EGRESS_CTL_V	0x08	/* ACS Egress Control Vector */
 
+/*
+ * To consider a PCI device isolated, we require ACS to support Source
+ * Validation, Request Redirection, Completer Redirection, and Upstream
+ * Forwarding.  This effectively means that devices cannot spoof their
+ * requester ID, requests and completions cannot be redirected, and all
+ * transactions are forwarded upstream, even as it passes through a
+ * bridge where the target device is downstream.
+ */
+#define PCI_ACS_ISOLATED (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
+
 /* SATA capability */
 #define PCI_SATA_REGS		4	/* SATA REGs specifier */
 #define  PCI_SATA_REGS_MASK	0xF	/* location - BAR#/inline */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 02/11] PCI: Add pci_bus_isolation()
  2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
  2025-06-30 22:28 ` [PATCH 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
@ 2025-06-30 22:28 ` Jason Gunthorpe
  2025-07-01 19:28   ` Alex Williamson
  2025-06-30 22:28 ` [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 22:28 UTC (permalink / raw)
  To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
	kvm, maorg, patches, tdave, Tony Zhu

Prepare to move the calculation of the bus P2P isolation out of the iommu
code and into the PCI core. This allows using the faster list iteration
under the pci_bus_sem, and the code has a kinship with the logic in
pci_for_each_dma_alias().

Bus isolation is the concept that drives the iommu_groups for the purposes
of VFIO. Stated simply, if device A can send traffic to device B then they
must be in the same group.

Only PCIe provides isolation. The multi-drop electrical topology in
classical PCI allows any bus member to claim the transaction.

In PCIe isolation comes out of ACS. If a PCIe Switch and Root Complex has
ACS flags that prevent peer to peer traffic and funnel all operations to
the IOMMU then devices can be isolated.

If a multi-function devices has ACS flags preventing self loop back then
that device is isolated, though pci_bus_isolation() does not deal with
devices.

As a property of a bus, there are several positive cases:

 - The point to point "bus" on a physical PCIe link is isolated if the
   bridge/root device has something preventing self-access to its own
   MMIO.

 - A Root Port is usually isolated

 - A PCIe switch can be isolated if all it's Down Stream Ports have good
   ACS flags

pci_bus_isolation() implements these rules and returns an enum indicating
the level of isolation the bus has, with five possibilies:

 PCIE_ISOLATED: Traffic on this PCIE bus can not do any P2P.

 PCIE_SWITCH_DSP_NON_ISOLATED: The bus is the internal bus of a PCIE
     switch and the USP is isolated but the DSPs are not.

 PCIE_NON_ISOLATED: The PCIe bus has no isolation between the bridge or
     any downstream devices.

 PCI_BUS_NON_ISOLATED: It is a PCI/PCI-X but the bridge is PCIe, has no
     aliases and is isolated.

 PCI_BRIDGE_NON_ISOLATED: It is a PCI/PCI-X bus and has no isolation, the
     bridge is part of the group.

The calculation is done per-bus, so it is possible for a transactions from
a PCI device to travel through different bus isolation types on its way
upstream. PCIE_SWITCH_DSP_NON_ISOLATED/PCI_BUS_NON_ISOLATED and
PCIE_NON_ISOLATED/PCI_BRIDGE_NON_ISOLATED are the same for the purposes of
creating iommu groups. The distinction between PCIe and PCI allows for
easier understanding and debugging as to why the groups are choosen.

For the iommu groups if all busses on the upstream path are PCIE_ISOLATED
then the end device has a chance to have a single-device iommu_group. Once
any non-isolated bus segment is found that bus segment will have an
iommu_group that captures all downstream devices, and sometimes the
upstream bridge.

pci_bus_isolation() is principally about isolation, but there is an
overlap with grouping requirements for legacy PCI aliasing. For purely
legacy PCI environments pci_bus_isolation() returns
PCI_BRIDGE_NON_ISOLATED for everything and all devices within a hierarchy
are in one group. No need to worry about bridge aliasing.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/pci/search.c | 150 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h  |  31 +++++++++
 2 files changed, 181 insertions(+)

diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index 53840634fbfc2b..540a503b499e3f 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -113,6 +113,156 @@ int pci_for_each_dma_alias(struct pci_dev *pdev,
 	return ret;
 }
 
+static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
+{
+	struct pci_dev *pdev;
+
+	/*
+	 * Within a PCIe switch we have an interior bus that has the Upstream
+	 * port as the bridge and a set of Downstream port bridging to the
+	 * egress ports.
+	 *
+	 * Each DSP has an ACS setting which controls where its traffic is
+	 * permitted to go. Any DSP with a permissive ACS setting can send
+	 * traffic flowing upstream back downstream through another DSP.
+	 *
+	 * Thus any non-permissive DSP spoils the whole bus.
+	 */
+	guard(rwsem_read)(&pci_bus_sem);
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		/* Don't understand what this is, be conservative */
+		if (!pci_is_pcie(pdev) ||
+		    pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM ||
+		    pdev->dma_alias_mask)
+			return PCIE_NON_ISOLATED;
+
+		if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
+			return PCIE_SWITCH_DSP_NON_ISOLATED;
+	}
+	return PCIE_ISOLATED;
+}
+
+static bool pci_has_mmio(struct pci_dev *pdev)
+{
+	unsigned int i;
+
+	for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+		struct resource *res = pci_resource_n(pdev, i);
+
+		if (resource_size(res) && resource_type(res) == IORESOURCE_MEM)
+			return true;
+	}
+	return false;
+}
+
+/**
+ * pci_bus_isolated - Determine how isolated connected devices are
+ * @bus: The bus to check
+ *
+ * Isolation is the ability of devices to talk to each other. Full isolation
+ * means that a device can only communicate with the IOMMU and can not do peer
+ * to peer within the fabric.
+ *
+ * We consider isolation on a bus by bus basis. If the bus will permit a
+ * transaction originated downstream to complete on anything other than the
+ * IOMMU then the bus is not isolated.
+ *
+ * Non-isolation includes all the downstream devices on this bus, and it may
+ * include the upstream bridge or port that is creating this bus.
+ *
+ * The various cases are returned in an enum.
+ *
+ * Broadly speaking this function evaluates the ACS settings in a PCI switch to
+ * determine if a PCI switch is configured to have full isolation.
+ *
+ * Old PCI/PCI-X busses cannot have isolation due to their physical properties,
+ * but they do have some aliasing properties that effect group creation.
+ *
+ * pci_bus_isolated() does not consider loopback internal to devices, like
+ * multi-function devices performing a self-loopback. The caller must check
+ * this separately. It does not considering alasing within the bus.
+ *
+ * It does not currently support the ACS P2P Egress Control Vector, Linux does
+ * not yet have any way to enable this feature. EC will create subsets of the
+ * bus that are isolated from other subsets.
+ */
+enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
+{
+	struct pci_dev *bridge = bus->self;
+	int type;
+
+	/* Consider virtual busses isolated */
+	if (!bridge)
+		return PCIE_ISOLATED;
+	if (pci_is_root_bus(bus))
+		return PCIE_ISOLATED;
+
+	/*
+	 * The bridge is not a PCIe bridge therefore this bus is PCI/PCI-X.
+	 *
+	 * PCI does not have anything like ACS. Any down stream device can bus
+	 * master an address that any other downstream device can claim. No
+	 * isolation is possible.
+	 */
+	if (!pci_is_pcie(bridge)) {
+		if (bridge->dev_flags & PCI_DEV_FLAG_PCIE_BRIDGE_ALIAS)
+			type = PCI_EXP_TYPE_PCI_BRIDGE;
+		else
+			return PCI_BRIDGE_NON_ISOLATED;
+	} else {
+		type = pci_pcie_type(bridge);
+	}
+
+	switch (type) {
+	/*
+	 * Since PCIe links are point to point root and downstream ports are
+	 * isolated if their own MMIO cannot be reached.
+	 */
+	case PCI_EXP_TYPE_ROOT_PORT:
+	case PCI_EXP_TYPE_DOWNSTREAM:
+		if (!pci_acs_enabled(bridge, PCI_ACS_ISOLATED))
+			return PCIE_NON_ISOLATED;
+		return PCIE_ISOLATED;
+
+	/*
+	 * bus is the interior bus of a PCI-E switch where ACS rules apply.
+	 */
+	case PCI_EXP_TYPE_UPSTREAM:
+		return pcie_switch_isolated(bus);
+
+	/*
+	 * PCIe to PCI/PCI-X - this bus is PCI.
+	 */
+	case PCI_EXP_TYPE_PCI_BRIDGE:
+		/*
+		 * A PCIe express bridge will use the subordinate bus number
+		 * with a 0 devfn as the RID in some cases. This causes all
+		 * subordinate devfns to alias with 0, which is the same
+		 * grouping as PCI_BUS_NON_ISOLATED. The RID of the bridge
+		 * itself is only used by the bridge.
+		 *
+		 * However, if the bridge has MMIO then we will assume the MMIO
+		 * is not isolated due to no ACS controls on this bridge type.
+		 */
+		if (pci_has_mmio(bridge))
+			return PCI_BRIDGE_NON_ISOLATED;
+		return PCI_BUS_NON_ISOLATED;
+
+	/*
+	 * PCI/PCI-X to PCIe - this bus is PCIe. We already know there must be a
+	 * PCI bus upstream of this bus, so just return non-isolated. If
+	 * upstream is PCI-X the PCIe RID should be preserved, but for PCI the
+	 * RID will be lost.
+	 */
+	case PCI_EXP_TYPE_PCIE_BRIDGE:
+		return PCI_BRIDGE_NON_ISOLATED;
+
+	default:
+		return PCI_BRIDGE_NON_ISOLATED;
+	}
+}
+EXPORT_SYMBOL_GPL(pci_bus_isolated);
+
 static struct pci_bus *pci_do_find_bus(struct pci_bus *bus, unsigned char busnr)
 {
 	struct pci_bus *child;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 05e68f35f39238..deeb85467f4f38 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -834,6 +834,32 @@ struct pci_dynids {
 	struct list_head	list;	/* For IDs added at runtime */
 };
 
+enum pci_bus_isolation {
+	/*
+	 * The bus is off a root port and the root port has isolated ACS flags
+	 * or the bus is part of a PCIe switch and the switch has isolated ACS
+	 * flags.
+	 */
+	PCIE_ISOLATED,
+	/*
+	 * The switch's DSP's are not isolated from each other but are isolated
+	 * from the USP.
+	 */
+	PCIE_SWITCH_DSP_NON_ISOLATED,
+	/* The above and the USP's MMIO is not isolated. */
+	PCIE_NON_ISOLATED,
+	/*
+	 * A PCI/PCI-X bus, no isolation. This is like
+	 * PCIE_SWITCH_DSP_NON_ISOLATED in that the upstream bridge is isolated
+	 * from the bus. The bus itself may also have a shared alias of devfn=0.
+	 */
+	PCI_BUS_NON_ISOLATED,
+	/*
+	 * The above and the bridge's MMIO is not isolated and the bridge's RID
+	 * may be an alias.
+	 */
+	PCI_BRIDGE_NON_ISOLATED,
+};
 
 /*
  * PCI Error Recovery System (PCI-ERS).  If a PCI device driver provides
@@ -1222,6 +1248,8 @@ struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
 struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
 struct pci_dev *pci_get_base_class(unsigned int class, struct pci_dev *from);
 
+enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus);
+
 int pci_dev_present(const struct pci_device_id *ids);
 
 int pci_bus_read_config_byte(struct pci_bus *bus, unsigned int devfn,
@@ -2035,6 +2063,9 @@ static inline struct pci_dev *pci_get_base_class(unsigned int class,
 						 struct pci_dev *from)
 { return NULL; }
 
+enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
+{ return PCI_NON_ISOLATED; }
+
 static inline int pci_dev_present(const struct pci_device_id *ids)
 { return 0; }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches
  2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
  2025-06-30 22:28 ` [PATCH 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
  2025-06-30 22:28 ` [PATCH 02/11] PCI: Add pci_bus_isolation() Jason Gunthorpe
@ 2025-06-30 22:28 ` Jason Gunthorpe
  2025-07-01 19:29   ` Alex Williamson
  2025-06-30 22:28 ` [PATCH 04/11] iommu: Organize iommu_group by member size Jason Gunthorpe
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 22:28 UTC (permalink / raw)
  To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
	kvm, maorg, patches, tdave, Tony Zhu

The current algorithm does not work if ACS is turned off, and it is not
clear how this has been missed for so long. Alex suspects a multi-function
modeling of the DSPs may have hidden it, but I could not find a switch
using that scheme.

Most likely people were simply happy they had smaller groups oblivious to
the security problems.

For discussion lets consider a simple topology like the below:

                               -- DSP 02:00.0 -> End Point A
 Root 00:00.0 -> USP 01:00.0 --|
                               -- DSP 02:03.0 -> End Point B

If ACS is fully activated we expect 00:00.0, 01:00.0, 02:00.0, 02:03.0, A,
B to all have unique single device groups.

If both DSPs have ACS off then we expect 00:00.0 and 01:00.0 to have
unique single device groups while 02:00.0, 02:03.0, A, B are part of one
multi-device group.

If the DSPs have asymmetric ACS, with one fully isolating and one
non-isolating we also expect the above multi-device group result.

Instead the current algorithm always creates unique single device groups
for this topology. It happens because the pci_device_group(DSP)
immediately moves to the USP and computes pci_acs_path_enabled(USP) ==
true and decides the DSP can get a unique group. The pci_device_group(A)
immediately moves to the DSP, sees pci_acs_path_enabled(DSP) == false and
then takes the DSPs group.

The current algorithm has several issues:

 1) It implicitly depends on ordering. Since the existing group discovery
    only goes in the upstream direction discovering a downstream device
    before its upstream will cause the wrong creation of narrower groups.

 2) It assumes that if the path from the end point to the root is entirely
    ACS isolated then that end point is isolated. This misses cross-traffic
    in the asymmetric ACS case.

 3) When evaluating a non-isolated DSP it does not check peer DSPs for an
    already established group unless the multi-function feature does it.

 4) It does not understand the aliasing rule for PCIe to PCI bridges
    where the alias is to the subordinate bus. The bridge's RID on the
    primary bus is not aliased. This causes the PCIe to PCI bridge to be
    wrongly joined to the group with the downstream devices.

As grouping is a security property for VFIO creating incorrectly narrowed
groups is a security problem for the system.

Revise the design to solve these problems.

Explicitly require ordering, or return EPROBE_DEFER if things are out of
order. This avoids silent errors that created smaller groups and solves
problem #1.

Work on busses, not devices. Isolation is a property of the bus, and the
first non-isolated bus should form a group containing all devices
downstream of that bus. If all busses on the path to an end device are
isolated then the end device has a chance to make a single-device group.

Use pci_bus_isolation() to compute the bus's isolation status based on the
ACS flags and technology. pci_bus_isolation() touches alot of PCI
internals to get the information in the right format.

Add a new flag in the iommu_group to record that the group contains a
non-isolated bus. Any downstream pci_device_group() will see
bus->self->iommu_group is non-isolated and unconditionally join it. This
makes the first non-isolation apply to all downstream devices and solves
problem #2

The bus's non-isolated iommu_group will be stored in either the DSP of
PCIe switch or the bus->self upstream device, depending on the situation.
When storing in the DSP all the DSPs are checked first for a pre-existing
non-isolated iommu_group. When stored in the upstream the flag forces it
to all downstreams. This solves problem #3.

Put the handling of end-device aliases and MFD into pci_get_alias_group()
and only call it in cases where we have a fully isolated path. Otherwise
every downstream device on the bus is going to be joined to the group of
bus->self.

Finally, replace the initial pci_for_each_dma_alias() with a combination
of:

 - Directly checking pci_real_dma_dev() and enforcing ordering.
   The group should contain both pdev and pci_real_dma_dev(pdev) which is
   only possible if pdev is ordered after real_dma_dev. This solves a case
   of #1.

 - Indirectly relying on pci_bus_isolation() to report legacy PCI busses
   as non-isolated, with the enum including the distinction of the PCIe to
   PCI bridge being isolated from the downstream. This solves problem #4.

It is very likely this is going to expand iommu_group membership in
existing systems. After all that is the security bug that is being
fixed. Expanding the iommu_groups risks problems for users using VFIO.

The intention is to have a more accurate reflection of the security
properties in the system and should be seen as a security fix. However
people who have ACS disabled may now need to enable it. As such users may
have had good reason for ACS to be disabled I strongly recommend that
backporting of this also include the new config_acs option so that such
users can potentially minimally enable ACS only where needed.

Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface")
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommu.c | 243 +++++++++++++++++++++++++++++++-----------
 1 file changed, 178 insertions(+), 65 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d265de874b14b6..f4584ffacbc03d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -65,8 +65,16 @@ struct iommu_group {
 	struct list_head entry;
 	unsigned int owner_cnt;
 	void *owner;
+
+	/* Used by the device_group() callbacks */
+	u32 bus_data;
 };
 
+/*
+ * Everything downstream of this group should share it.
+ */
+#define BUS_DATA_PCI_UNISOLATED BIT(0)
+
 struct group_device {
 	struct list_head list;
 	struct device *dev;
@@ -1484,25 +1492,6 @@ static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
 	return NULL;
 }
 
-struct group_for_pci_data {
-	struct pci_dev *pdev;
-	struct iommu_group *group;
-};
-
-/*
- * DMA alias iterator callback, return the last seen device.  Stop and return
- * the IOMMU group if we find one along the way.
- */
-static int get_pci_alias_or_group(struct pci_dev *pdev, u16 alias, void *opaque)
-{
-	struct group_for_pci_data *data = opaque;
-
-	data->pdev = pdev;
-	data->group = iommu_group_get(&pdev->dev);
-
-	return data->group != NULL;
-}
-
 /*
  * Generic device_group call-back function. It just allocates one
  * iommu-group per device.
@@ -1534,52 +1523,11 @@ struct iommu_group *generic_single_device_group(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(generic_single_device_group);
 
-/*
- * Use standard PCI bus topology, isolation features, and DMA alias quirks
- * to find or create an IOMMU group for a device.
- */
-struct iommu_group *pci_device_group(struct device *dev)
+static struct iommu_group *pci_get_alias_group(struct pci_dev *pdev)
 {
-	struct pci_dev *pdev = to_pci_dev(dev);
-	struct group_for_pci_data data;
-	struct pci_bus *bus;
-	struct iommu_group *group = NULL;
+	struct iommu_group *group;
 	u64 devfns[4] = { 0 };
 
-	if (WARN_ON(!dev_is_pci(dev)))
-		return ERR_PTR(-EINVAL);
-
-	/*
-	 * Find the upstream DMA alias for the device.  A device must not
-	 * be aliased due to topology in order to have its own IOMMU group.
-	 * If we find an alias along the way that already belongs to a
-	 * group, use it.
-	 */
-	if (pci_for_each_dma_alias(pdev, get_pci_alias_or_group, &data))
-		return data.group;
-
-	pdev = data.pdev;
-
-	/*
-	 * Continue upstream from the point of minimum IOMMU granularity
-	 * due to aliases to the point where devices are protected from
-	 * peer-to-peer DMA by PCI ACS.  Again, if we find an existing
-	 * group, use it.
-	 */
-	for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
-		if (!bus->self)
-			continue;
-
-		if (pci_acs_path_enabled(bus->self, NULL, PCI_ACS_ISOLATED))
-			break;
-
-		pdev = bus->self;
-
-		group = iommu_group_get(&pdev->dev);
-		if (group)
-			return group;
-	}
-
 	/*
 	 * Look for existing groups on device aliases.  If we alias another
 	 * device or another device aliases us, use the same group.
@@ -1593,12 +1541,177 @@ struct iommu_group *pci_device_group(struct device *dev)
 	 * slot and aliases of those funcions, if any.  No need to clear
 	 * the search bitmap, the tested devfns are still valid.
 	 */
-	group = get_pci_function_alias_group(pdev, (unsigned long *)devfns);
+	return get_pci_function_alias_group(pdev, (unsigned long *)devfns);
+}
+
+static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
+{
+	struct pci_bus *bus = pdev->bus;
+	struct iommu_group *group;
+
+	if (pci_is_root_bus(bus))
+		return NULL;
+
+	/* Skip virtual buses */
+	if (!bus->self)
+		return NULL;
+
+	group = iommu_group_get(&bus->self->dev);
+	if (!group) {
+		/*
+		 * If the upstream bridge needs the same group as pdev then
+		 * there is way for it's pci_device_group() to discover it.
+		 */
+		dev_err(&pdev->dev,
+			"PCI device is probing out of order, upstream bridge device of %s is not probed yet\n",
+			pci_name(bus->self));
+		return ERR_PTR(-EPROBE_DEFER);
+	}
+	if (group->bus_data & BUS_DATA_PCI_UNISOLATED)
+		return group;
+	iommu_group_put(group);
+	return NULL;
+}
+
+/*
+ * For legacy PCI we have two main considerations when forming groups:
+ *
+ *  1) In PCI we can loose the RID inside the fabric, or some devices will use
+ *     the wrong RID. The PCI core calls this aliasing, but from an IOMMU
+ *     perspective it means that a PCI device may have multiple RIDs and a
+ *     single RID may represent many PCI devices. This effectively means all the
+ *     aliases must share a translation, thus group, because the IOMMU cannot
+ *     tell devices apart.
+ *
+ *  2) PCI permits a bus segment to claim an address even if the transaction
+ *     originates from an end point not the CPU. When it happens it is called
+ *     peer to peer. Claiming a transaction in the middle of the bus hierarchy
+ *     bypasses the IOMMU translation. The IOMMU subsystem rules require these
+ *     devices to be placed in the same group because they lack isolation from
+ *     each other. In PCI Express the ACS system can be used to inhibit this and
+ *     force transactions to go to the IOMMU.
+ *
+ *     From a PCI perspective any given PCI bus is either isolating or
+ *     non-isolating. Isolating means downstream originated transactions always
+ *     progress toward the CPU and do not go to other devices on the bus
+ *     segment, while non-isolating means downstream originated transactions can
+ *     progress back downstream through another device on the bus segment.
+ *
+ *     Beyond buses a multi-function device or bridge can also allow
+ *     transactions to loop back internally from one function to another.
+ *
+ *     Once a PCI bus becomes non isolating the entire downstream hierarchy of
+ *     that bus becomes a single group.
+ */
+struct iommu_group *pci_device_group(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct iommu_group *group;
+	struct pci_dev *real_pdev;
+
+	if (WARN_ON(!dev_is_pci(dev)))
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * Arches can supply a completely different PCI device that actually
+	 * does DMA.
+	 */
+	real_pdev = pci_real_dma_dev(pdev);
+	if (real_pdev != pdev) {
+		group = iommu_group_get(&real_pdev->dev);
+		if (!group) {
+			/*
+			 * The real_pdev has not had an iommu probed to it. We
+			 * can't create a new group here because there is no way
+			 * for pci_device_group(real_pdev) to pick it up.
+			 */
+			dev_err(dev,
+				"PCI device is probing out of order, real device of %s is not probed yet\n",
+				pci_name(real_pdev));
+			return ERR_PTR(-EPROBE_DEFER);
+		}
+		return group;
+	}
+
+	if (pdev->dev_flags & PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT)
+		return iommu_group_alloc();
+
+	/* Anything upstream of this enforcing non-isolated? */
+	group = pci_hierarchy_group(pdev);
 	if (group)
 		return group;
 
-	/* No shared group found, allocate new */
-	return iommu_group_alloc();
+	switch (pci_bus_isolated(pdev->bus)) {
+	case PCIE_ISOLATED:
+		/* Check multi-function groups and same-bus devfn aliases */
+		group = pci_get_alias_group(pdev);
+		if (group)
+			return group;
+
+		/* No shared group found, allocate new */
+		return iommu_group_alloc();
+
+	/*
+	 * On legacy PCI there is no RID at an electrical level. On PCI-X the
+	 * RID of the bridge may be used in some cases instead of the
+	 * downstream's RID. This creates aliasing problems. PCI/PCI-X doesn't
+	 * provide isolation either. The end result is that as soon as we hit a
+	 * PCI/PCI-X bus we switch to non-isolated for the whole downstream for
+	 * both aliasing and isolation reasons. The bridge has to be included in
+	 * the group because of the aliasing.
+	 */
+	case PCI_BRIDGE_NON_ISOLATED:
+	/* A PCIe switch where the USP has MMIO and is not isolated. */
+	case PCIE_NON_ISOLATED:
+		group = iommu_group_get(&pdev->bus->self->dev);
+		if (WARN_ON(!group))
+			return ERR_PTR(-EINVAL);
+		/*
+		 * No need to be concerned with aliases here since we are going
+		 * to put the entire downstream tree in the bridge/USP's group.
+		 */
+		group->bus_data |= BUS_DATA_PCI_UNISOLATED;
+		return group;
+
+	/*
+	 * It is a PCI bus and the upstream bridge/port does not alias or allow
+	 * P2P.
+	 */
+	case PCI_BUS_NON_ISOLATED:
+	/*
+	 * It is a PCIe switch and the DSP cannot reach the USP. The DSP's
+	 * are not isolated from each other and share a group.
+	 */
+	case PCIE_SWITCH_DSP_NON_ISOLATED: {
+		struct pci_dev *piter = NULL;
+
+		/*
+		 * All the downstream devices on the bus share a group. If this
+		 * is a PCIe switch then they will all be DSPs
+		 */
+		for_each_pci_dev(piter) {
+			if (piter->bus != pdev->bus)
+				continue;
+			group = iommu_group_get(&piter->dev);
+			if (group) {
+				pci_dev_put(piter);
+				if (WARN_ON(!(group->bus_data &
+					      BUS_DATA_PCI_UNISOLATED)))
+					group->bus_data |=
+						BUS_DATA_PCI_UNISOLATED;
+				return group;
+			}
+		}
+
+		group = iommu_group_alloc();
+		group->bus_data |= BUS_DATA_PCI_UNISOLATED;
+		return group;
+	}
+	default:
+		break;
+	}
+	WARN_ON(true);
+	return ERR_PTR(-EINVAL);
 }
 EXPORT_SYMBOL_GPL(pci_device_group);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 04/11] iommu: Organize iommu_group by member size
  2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
                   ` (2 preceding siblings ...)
  2025-06-30 22:28 ` [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
@ 2025-06-30 22:28 ` Jason Gunthorpe
  2025-06-30 22:28 ` [PATCH 05/11] PCI: Add pci_reachable_set() Jason Gunthorpe
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 22:28 UTC (permalink / raw)
  To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
	kvm, maorg, patches, tdave, Tony Zhu

To avoid some internal padding.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index f4584ffacbc03d..f98c25514bf912 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -58,13 +58,13 @@ struct iommu_group {
 	void *iommu_data;
 	void (*iommu_data_release)(void *iommu_data);
 	char *name;
-	int id;
 	struct iommu_domain *default_domain;
 	struct iommu_domain *blocking_domain;
 	struct iommu_domain *domain;
 	struct list_head entry;
-	unsigned int owner_cnt;
 	void *owner;
+	unsigned int owner_cnt;
+	int id;
 
 	/* Used by the device_group() callbacks */
 	u32 bus_data;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 05/11] PCI: Add pci_reachable_set()
  2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
                   ` (3 preceding siblings ...)
  2025-06-30 22:28 ` [PATCH 04/11] iommu: Organize iommu_group by member size Jason Gunthorpe
@ 2025-06-30 22:28 ` Jason Gunthorpe
  2025-06-30 22:28 ` [PATCH 06/11] iommu: Use pci_reachable_set() in pci_device_group() Jason Gunthorpe
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 22:28 UTC (permalink / raw)
  To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
	kvm, maorg, patches, tdave, Tony Zhu

Implement pci_reachable_set() to efficiently compute a set of devices on
the same bus that are "reachable" from a starting device. The meaning of
reachability is defined by the caller through a callback function.

This is a faster implementation of the same logic in
pci_device_group(). Being inside the PCI core allows use of pci_bus_sem so
it can use list_for_each_entry() on a small list of devices instead of the
expensive for_each_pci_dev(). Server systems can now have hundreds of PCI
devices, but typically only a very small number of devices per bus.

An example of a reachability function would be pci_devs_are_dma_aliases()
which would compute a set of devices on the same bus that are
aliases. This would also be useful in future support for the ACS P2P
Egress Vector which has a similar reachability problem.

This is effectively a graph algorithm where the set of devices on the bus
are vertexes and the reachable() function defines the edges. It returns a
set of vertexes that form a connected graph.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/pci/search.c | 90 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h  | 12 ++++++
 2 files changed, 102 insertions(+)

diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index 540a503b499e3f..3bc20659af6b20 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -571,3 +571,93 @@ int pci_dev_present(const struct pci_device_id *ids)
 	return 0;
 }
 EXPORT_SYMBOL(pci_dev_present);
+
+/**
+ * pci_reachable_set - Generate a bitmap of devices within a reachability set
+ * @start: First device in the set
+ * @devfns: The set of devices on the bus
+ * @reachable: Callback to tell if two devices can reach each other
+ *
+ * Compute a bitmap where every set bit is a device on the bus that is reachable
+ * from the start device, including the start device. Reachability between two
+ * devices is determined by a callback function.
+ *
+ * This is a non-recursive implementation that invokes the callback once per
+ * pair. The callback must be commutative:
+ *    reachable(a, b) == reachable(b, a)
+ * reachable() can form a cyclic graph:
+ *    reachable(a,b) == reachable(b,c) == reachable(c,a) == true
+ *
+ * Since this function is limited to a single bus the largest set can be 256
+ * devices large.
+ */
+void pci_reachable_set(struct pci_dev *start, struct pci_alias_set *devfns,
+		       bool (*reachable)(struct pci_dev *deva,
+					 struct pci_dev *devb))
+{
+	struct pci_alias_set todo_devfns = {};
+	struct pci_alias_set next_devfns = {};
+	struct pci_bus *bus = start->bus;
+	bool again;
+
+	/* Assume devfn of all PCI devices is bounded by MAX_NR_DEVFNS */
+	static_assert(sizeof(next_devfns.devfns) * BITS_PER_BYTE >=
+		      MAX_NR_DEVFNS);
+
+	memset(devfns, 0, sizeof(devfns->devfns));
+	__set_bit(start->devfn, devfns->devfns);
+	__set_bit(start->devfn, next_devfns.devfns);
+
+	down_read(&pci_bus_sem);
+	while (true) {
+		unsigned int devfna;
+		unsigned int i;
+
+		/*
+		 * For each device that hasn't been checked compare every
+		 * device on the bus against it.
+		 */
+		again = false;
+		for_each_set_bit(devfna, next_devfns.devfns, MAX_NR_DEVFNS) {
+			struct pci_dev *deva = NULL;
+			struct pci_dev *devb;
+
+			list_for_each_entry(devb, &bus->devices, bus_list) {
+				if (devb->devfn == devfna)
+					deva = devb;
+
+				if (test_bit(devb->devfn, devfns->devfns))
+					continue;
+
+				if (!deva) {
+					deva = devb;
+					list_for_each_entry_continue(
+						deva, &bus->devices, bus_list)
+						if (deva->devfn == devfna)
+							break;
+				}
+
+				if (!reachable(deva, devb))
+					continue;
+
+				__set_bit(devb->devfn, todo_devfns.devfns);
+				again = true;
+			}
+		}
+
+		if (!again)
+			break;
+
+		/*
+		 * Every new bit adds a new deva to check, reloop the whole
+		 * thing. Expect this to be rare.
+		 */
+		for (i = 0; i != ARRAY_SIZE(devfns->devfns); i++) {
+			devfns->devfns[i] |= todo_devfns.devfns[i];
+			next_devfns.devfns[i] = todo_devfns.devfns[i];
+			todo_devfns.devfns[i] = 0;
+		}
+	}
+	up_read(&pci_bus_sem);
+}
+EXPORT_SYMBOL_GPL(pci_reachable_set);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index deeb85467f4f38..dbcffc77650dd7 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -834,6 +834,10 @@ struct pci_dynids {
 	struct list_head	list;	/* For IDs added at runtime */
 };
 
+struct pci_alias_set {
+	DECLARE_BITMAP(devfns, 256);
+};
+
 enum pci_bus_isolation {
 	/*
 	 * The bus is off a root port and the root port has isolated ACS flags
@@ -1248,6 +1252,9 @@ struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
 struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
 struct pci_dev *pci_get_base_class(unsigned int class, struct pci_dev *from);
 
+void pci_reachable_set(struct pci_dev *start, struct pci_alias_set *devfns,
+		       bool (*reachable)(struct pci_dev *deva,
+					 struct pci_dev *devb));
 enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus);
 
 int pci_dev_present(const struct pci_device_id *ids);
@@ -2063,6 +2070,11 @@ static inline struct pci_dev *pci_get_base_class(unsigned int class,
 						 struct pci_dev *from)
 { return NULL; }
 
+void pci_reachable_set(struct pci_dev *start, struct pci_alias_set *devfns,
+		       bool (*reachable)(struct pci_dev *deva,
+					 struct pci_dev *devb))
+{ }
+
 enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
 { return PCI_NON_ISOLATED; }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 06/11] iommu: Use pci_reachable_set() in pci_device_group()
  2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
                   ` (4 preceding siblings ...)
  2025-06-30 22:28 ` [PATCH 05/11] PCI: Add pci_reachable_set() Jason Gunthorpe
@ 2025-06-30 22:28 ` Jason Gunthorpe
  2025-06-30 22:28 ` [PATCH 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 22:28 UTC (permalink / raw)
  To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
	kvm, maorg, patches, tdave, Tony Zhu

This is intended as a non-functional change. It aims to simplify and
optimize pci_get_alias_group().

The purpose of pci_get_alias_group() is to find the set of devices on the
same bus that are not ACS isolated or part of the alias set. All of these
devices should share a group so pci_get_alias_group() will search that set
for an existing group to join.

pci_reachable_set() does the calculations for figuring out the set of
devices under the pci_bus_sem, which is better than repeatedly searching
across all PCI devices.

Once the set of devices is determined it is a quick pci_get_slot() to
search for any existing groups in the subset.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommu.c | 137 ++++++++++++++----------------------------
 1 file changed, 45 insertions(+), 92 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index f98c25514bf912..5a8077d4b79d41 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1413,85 +1413,6 @@ int iommu_group_id(struct iommu_group *group)
 }
 EXPORT_SYMBOL_GPL(iommu_group_id);
 
-static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
-					       unsigned long *devfns);
-
-/*
- * For multifunction devices which are not isolated from each other, find
- * all the other non-isolated functions and look for existing groups.  For
- * each function, we also need to look for aliases to or from other devices
- * that may already have a group.
- */
-static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
-							unsigned long *devfns)
-{
-	struct pci_dev *tmp = NULL;
-	struct iommu_group *group;
-
-	if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
-		return NULL;
-
-	for_each_pci_dev(tmp) {
-		if (tmp == pdev || tmp->bus != pdev->bus ||
-		    PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) ||
-		    pci_acs_enabled(tmp, PCI_ACS_ISOLATED))
-			continue;
-
-		group = get_pci_alias_group(tmp, devfns);
-		if (group) {
-			pci_dev_put(tmp);
-			return group;
-		}
-	}
-
-	return NULL;
-}
-
-/*
- * Look for aliases to or from the given device for existing groups. DMA
- * aliases are only supported on the same bus, therefore the search
- * space is quite small (especially since we're really only looking at pcie
- * device, and therefore only expect multiple slots on the root complex or
- * downstream switch ports).  It's conceivable though that a pair of
- * multifunction devices could have aliases between them that would cause a
- * loop.  To prevent this, we use a bitmap to track where we've been.
- */
-static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
-					       unsigned long *devfns)
-{
-	struct pci_dev *tmp = NULL;
-	struct iommu_group *group;
-
-	if (test_and_set_bit(pdev->devfn & 0xff, devfns))
-		return NULL;
-
-	group = iommu_group_get(&pdev->dev);
-	if (group)
-		return group;
-
-	for_each_pci_dev(tmp) {
-		if (tmp == pdev || tmp->bus != pdev->bus)
-			continue;
-
-		/* We alias them or they alias us */
-		if (pci_devs_are_dma_aliases(pdev, tmp)) {
-			group = get_pci_alias_group(tmp, devfns);
-			if (group) {
-				pci_dev_put(tmp);
-				return group;
-			}
-
-			group = get_pci_function_alias_group(tmp, devfns);
-			if (group) {
-				pci_dev_put(tmp);
-				return group;
-			}
-		}
-	}
-
-	return NULL;
-}
-
 /*
  * Generic device_group call-back function. It just allocates one
  * iommu-group per device.
@@ -1523,25 +1444,57 @@ struct iommu_group *generic_single_device_group(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(generic_single_device_group);
 
+static bool pci_devs_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
+{
+	/*
+	 * This is allowed to return cycles: a,b -> b,c -> c,a can be aliases.
+	 */
+	if (pci_devs_are_dma_aliases(deva, devb))
+		return true;
+
+	/*
+	 * If any function in a multi function device does not have ACS
+	 * isolation turned on then the specs allows it to loop back internally
+	 * to another function.
+	 */
+	if ((deva->multifunction || devb->multifunction) &&
+	    PCI_SLOT(deva->devfn) == PCI_SLOT(devb->devfn)) {
+		if (!pci_acs_enabled(deva, PCI_ACS_ISOLATED) ||
+		    !pci_acs_enabled(devb, PCI_ACS_ISOLATED))
+			return true;
+	}
+	return false;
+}
+
 static struct iommu_group *pci_get_alias_group(struct pci_dev *pdev)
 {
-	struct iommu_group *group;
-	u64 devfns[4] = { 0 };
+	struct pci_alias_set devfns;
+	unsigned int devfn;
 
 	/*
-	 * Look for existing groups on device aliases.  If we alias another
-	 * device or another device aliases us, use the same group.
+	 * Look for existing groups on device aliases and multi-function ACS. If
+	 * we alias another device or another device aliases us, use the same
+	 * group.
+	 *
+	 * pci_reachable_set() should return the same bitmap if called for any
+	 * device in the set and we want all devices in the set to have the same
+	 * group.
 	 */
-	group = get_pci_alias_group(pdev, (unsigned long *)devfns);
-	if (group)
-		return group;
+	pci_reachable_set(pdev, &devfns, pci_devs_are_same_group);
+	/* start is known to have iommu_group_get() == NULL */
+	__clear_bit(pdev->devfn & 0xFF, devfns.devfns);
+	for_each_set_bit(devfn, devfns.devfns,
+			 sizeof(devfns.devfns) * BITS_PER_BYTE) {
+		struct iommu_group *group;
+		struct pci_dev *pdev_slot;
 
-	/*
-	 * Look for existing groups on non-isolated functions on the same
-	 * slot and aliases of those funcions, if any.  No need to clear
-	 * the search bitmap, the tested devfns are still valid.
-	 */
-	return get_pci_function_alias_group(pdev, (unsigned long *)devfns);
+		pdev_slot = pci_get_slot(pdev->bus, devfn);
+		group = iommu_group_get(&pdev_slot->dev);
+		pci_dev_put(pdev_slot);
+		if (group)
+			return group;
+	}
+	return NULL;
 }
 
 static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups
  2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
                   ` (5 preceding siblings ...)
  2025-06-30 22:28 ` [PATCH 06/11] iommu: Use pci_reachable_set() in pci_device_group() Jason Gunthorpe
@ 2025-06-30 22:28 ` Jason Gunthorpe
  2025-06-30 22:28 ` [PATCH 08/11] PCI: Add the ACS Enhanced Capability definitions Jason Gunthorpe
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 22:28 UTC (permalink / raw)
  To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
	kvm, maorg, patches, tdave, Tony Zhu

Directly check that the devices touched by pci_for_each_dma_alias() match
the groups that were built by pci_device_group(). This helps validate that
pci_for_each_dma_alias() and pci_bus_isolated() are consistent.

This should eventually be hidden behind a debug kconfig, but for now it is
good to get feedback from more diverse systems if there are any problems.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommu.c | 73 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 5a8077d4b79d41..907c1b4eb883ed 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1556,7 +1556,7 @@ static struct iommu_group *pci_hierarchy_group(struct pci_dev *pdev)
  *     Once a PCI bus becomes non isolating the entire downstream hierarchy of
  *     that bus becomes a single group.
  */
-struct iommu_group *pci_device_group(struct device *dev)
+static struct iommu_group *__pci_device_group(struct device *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct iommu_group *group;
@@ -1666,6 +1666,77 @@ struct iommu_group *pci_device_group(struct device *dev)
 	WARN_ON(true);
 	return ERR_PTR(-EINVAL);
 }
+
+struct check_group_aliases_data {
+	struct pci_dev *pdev;
+	struct iommu_group *group;
+};
+
+static void pci_check_group(const struct check_group_aliases_data *data,
+			    u16 alias, struct pci_dev *pdev)
+{
+	struct iommu_group *group;
+
+	group = iommu_group_get(&pdev->dev);
+	if (!group)
+		return;
+
+	if (group != data->group)
+		dev_err(&data->pdev->dev,
+			"During group construction alias processing needed dev %s alias %x to have the same group but %u != %u\n",
+			pci_name(pdev), alias, data->group->id, group->id);
+	iommu_group_put(group);
+}
+
+static int pci_check_group_aliases(struct pci_dev *pdev, u16 alias,
+				   void *opaque)
+{
+	const struct check_group_aliases_data *data = opaque;
+
+	/*
+	 * Sometimes when a PCIe-PCI bridge is performing transactions on behalf
+	 * of its subordinate bus it uses devfn=0 on the subordinate bus as the
+	 * alias. This means that 0 will alias with all devfns on the
+	 * subordinate bus and so we expect to see those in the same group. pdev
+	 * in this case is the bridge itself and pdev->bus is the primary bus of
+	 * the bridge.
+	 */
+	if (pdev->bus->number != PCI_BUS_NUM(alias)) {
+		struct pci_dev *piter = NULL;
+
+		for_each_pci_dev(piter) {
+			if (pci_domain_nr(pdev->bus) ==
+				    pci_domain_nr(piter->bus) &&
+			    PCI_BUS_NUM(alias) == pdev->bus->number)
+				pci_check_group(data, alias, piter);
+		}
+	} else {
+		pci_check_group(data, alias, pdev);
+	}
+
+	return 0;
+}
+
+struct iommu_group *pci_device_group(struct device *dev)
+{
+	struct iommu_group *group = __pci_device_group(dev);
+
+	if (!IS_ERR(group)) {
+		struct check_group_aliases_data data = {
+			.pdev = to_pci_dev(dev), .group = group
+		};
+
+		/*
+		 * The IOMMU driver should use pci_for_each_dma_alias() to
+		 * figure out what RIDs to program and the core requires all the
+		 * RIDs to fall within the same group. Validate that everything
+		 * worked properly.
+		 */
+		pci_for_each_dma_alias(data.pdev, pci_check_group_aliases,
+				       &data);
+	}
+	return group;
+}
 EXPORT_SYMBOL_GPL(pci_device_group);
 
 /* Get the IOMMU group for device on fsl-mc bus */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 08/11] PCI: Add the ACS Enhanced Capability definitions
  2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
                   ` (6 preceding siblings ...)
  2025-06-30 22:28 ` [PATCH 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
@ 2025-06-30 22:28 ` Jason Gunthorpe
  2025-06-30 22:28 ` [PATCH 09/11] PCI: Enable ACS Enhanced bits for enable_acs and config_acs Jason Gunthorpe
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 22:28 UTC (permalink / raw)
  To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
	kvm, maorg, patches, tdave, Tony Zhu

This brings the definitions up to PCI Express revision 5.0:

 * ACS I/O Request Blocking Enable
 * ACS DSP Memory Target Access Control
 * ACS USP Memory Target Access Control
 * ACS Unclaimed Request Redirect

Support for this entire grouping is advertised by the ACS Enhanced
Capability bit.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 include/uapi/linux/pci_regs.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 6edffd31cd8e2c..79d8d317b3c17e 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1004,8 +1004,16 @@
 #define  PCI_ACS_UF		0x0010	/* Upstream Forwarding */
 #define  PCI_ACS_EC		0x0020	/* P2P Egress Control */
 #define  PCI_ACS_DT		0x0040	/* Direct Translated P2P */
+#define  PCI_ACS_ENHANCED	0x0080  /* IORB, DSP_MT_xx, USP_MT_XX. Capability only */
+#define  PCI_ACS_EGRESS_CTL_SZ	GENMASK(15, 8) /* Egress Control Vector Size */
 #define PCI_ACS_EGRESS_BITS	0x05	/* ACS Egress Control Vector Size */
 #define PCI_ACS_CTRL		0x06	/* ACS Control Register */
+#define  PCI_ACS_IORB		0x0080  /* I/O Request Blocking */
+#define  PCI_ACS_DSP_MT_RB	0x0100  /* DSP Memory Target Access Control Request Blocking */
+#define  PCI_ACS_DSP_MT_RR	0x0200  /* DSP Memory Target Access Control Request Redirect */
+#define  PCI_ACS_USP_MT_RB	0x0400  /* USP Memory Target Access Control Request Blocking */
+#define  PCI_ACS_USP_MT_RR	0x0800  /* USP Memory Target Access Control Request Redirect */
+#define  PCI_ACS_UNCLAIMED_RR	0x1000  /* Unclaimed Request Redirect Control */
 #define PCI_ACS_EGRESS_CTL_V	0x08	/* ACS Egress Control Vector */
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 09/11] PCI: Enable ACS Enhanced bits for enable_acs and config_acs
  2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
                   ` (7 preceding siblings ...)
  2025-06-30 22:28 ` [PATCH 08/11] PCI: Add the ACS Enhanced Capability definitions Jason Gunthorpe
@ 2025-06-30 22:28 ` Jason Gunthorpe
  2025-06-30 22:28 ` [PATCH 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 22:28 UTC (permalink / raw)
  To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
	kvm, maorg, patches, tdave, Tony Zhu

The ACS Enhanced bits are intended to address a lack of precision in the
spec about what ACS P2P Request Redirect is supposed to do. While Linux
has long assumed that PCI_ACS_RR would cover MMIO BARs located in the root
port and PCIe Switch ports, the spec took the position that it is
implementation specific.

To get the behavior Linux has long assumed it should be setting:

  PCI_ACS_RR | PCI_ACS_DSP_MT_RR | PCI_ACS_USP_MT_RR | PCI_ACS_UNCLAMED_RR

Follow this guidance in enable_acs and set the additional bits if ACS
Enhanced is supported.

Allow config_acs to control these bits if the device has ACS Enhanced.

The spec permits the HW to wire the bits, so after setting them
pci_acs_flags_enabled() does do a pci_read_config_word() to read the
actual value in effect.

Note that currently Linux sets these bits to 0, so any new HW that comes
supporting ACS Enhanced will end up with historical Linux disabling these
functions. Devices wanting to be compatible with old Linux will need to
wire the ctrl bits to follow ACS_RR. Devices that implement ACS Enhanced
and support the ctrl=0 behavior will break PASID SVA support and VFIO
isolation when ACS_RR is enabled.

Due to the above I strongly encourage backporting this change otherwise
old kernels may have issues with new generations of PCI switches.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/pci/pci.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e9448d55113bdf..d16b92f3a0c881 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -957,6 +957,7 @@ static void __pci_config_acs(struct pci_dev *dev, struct pci_acs *caps,
 			     const char *p, const u16 acs_mask, const u16 acs_flags)
 {
 	u16 flags = acs_flags;
+	u16 supported_flags;
 	u16 mask = acs_mask;
 	char *delimit;
 	int ret = 0;
@@ -1001,8 +1002,14 @@ static void __pci_config_acs(struct pci_dev *dev, struct pci_acs *caps,
 			}
 		}
 
-		if (mask & ~(PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR |
-			    PCI_ACS_UF | PCI_ACS_EC | PCI_ACS_DT)) {
+		supported_flags = PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
+				  PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_EC |
+				  PCI_ACS_DT;
+		if (caps->cap & PCI_ACS_ENHANCED)
+			supported_flags |= PCI_ACS_USP_MT_RR |
+					   PCI_ACS_DSP_MT_RR |
+					   PCI_ACS_UNCLAIMED_RR;
+		if (mask & ~supported_flags) {
 			pci_err(dev, "Invalid ACS flags specified\n");
 			return;
 		}
@@ -1062,6 +1069,14 @@ static void pci_std_enable_acs(struct pci_dev *dev, struct pci_acs *caps)
 	/* Upstream Forwarding */
 	caps->ctrl |= (caps->cap & PCI_ACS_UF);
 
+	/*
+	 * USP/DSP Memory Target Access Control and Unclaimed Request Redirect
+	 */
+	if (caps->cap & PCI_ACS_ENHANCED) {
+		caps->ctrl |= PCI_ACS_USP_MT_RR | PCI_ACS_DSP_MT_RR |
+			      PCI_ACS_UNCLAIMED_RR;
+	}
+
 	/* Enable Translation Blocking for external devices and noats */
 	if (pci_ats_disabled() || dev->external_facing || dev->untrusted)
 		caps->ctrl |= (caps->cap & PCI_ACS_TB);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid()
  2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
                   ` (8 preceding siblings ...)
  2025-06-30 22:28 ` [PATCH 09/11] PCI: Enable ACS Enhanced bits for enable_acs and config_acs Jason Gunthorpe
@ 2025-06-30 22:28 ` Jason Gunthorpe
  2025-06-30 22:28 ` [PATCH 11/11] PCI: Check ACS Extended flags for pci_bus_isolated() Jason Gunthorpe
  2025-07-01 21:48 ` [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Alex Williamson
  11 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 22:28 UTC (permalink / raw)
  To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
	kvm, maorg, patches, tdave, Tony Zhu

Switches ignore the PASID when routing TLPs. This means the path from the
PASID issuing end point to the IOMMU must be direct with no possibility
for another device to claim the addresses.

This is done using ACS flags and pci_enable_pasid() checks for this.

The new ACS Enhanced bits clarify some undefined behaviors in the spec
around what P2P Request Redirect means.

Linux has long assumed that PCI_ACS_RR implies PCI_ACS_DSP_MT_RR |
PCI_ACS_USP_MT_RR | PCI_ACS_UNCLAIMED_RR.

If the device supports ACS Enhanced then use the information it reports to
determine if PASID SVA is supported or not.

 PCI_ACS_DSP_MT_RR: Prevents Downstream Port BAR's from claiming upstream
                    flowing transactions

 PCI_ACS_USP_MT_RR: Prevents Upstream Port BAR's from claiming upstream
                    flowing transactions

 PCI_ACS_UNCLAIMED_RR: Prevents a hole in the USP bridge window compared
                       to all the DSP bridge windows from generating a
                       error.

Each of these cases would poke a hole in the PASID address space which is
not permitted.

Enhance the comments around pci_acs_flags_enabled() to better explain the
reasoning for its logic. Continue to take the approach of assuming the
device is doing the "right ACS" if it does not explicitly declare
otherwise.

Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/pci/ats.c |  4 +++-
 drivers/pci/pci.c | 54 +++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index ec6c8dbdc5e9c9..00603c2c4ff0ea 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -416,7 +416,9 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
 	if (!pasid)
 		return -EINVAL;
 
-	if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
+	if (!pci_acs_path_enabled(pdev, NULL,
+				  PCI_ACS_RR | PCI_ACS_UF | PCI_ACS_USP_MT_RR |
+				  PCI_ACS_DSP_MT_RR | PCI_ACS_UNCLAIMED_RR))
 		return -EINVAL;
 
 	pci_read_config_word(pdev, pasid + PCI_PASID_CAP, &supported);
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d16b92f3a0c881..e49370c90ec890 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3601,6 +3601,52 @@ void pci_configure_ari(struct pci_dev *dev)
 	}
 }
 
+
+/*
+ * The spec is not clear what it means if the capability bit is 0. One view is
+ * that the device acts as though the ctrl bit is zero, another view is the
+ * device behavior is undefined.
+ *
+ * Historically Linux has taken the position that the capability bit as 0 means
+ * the device supports the most favorable interpritation of the spec - ie that
+ * things like P2P RR are always on. As this is security sensitive we expect
+ * devices that do not follow this rule to be quirked.
+ *
+ * ACS Enhanced eliminated undefined areas of the spec around MMIO in root ports
+ * and switch ports. If those ports have no MMIO then it is not relavent.
+ * PCI_ACS_UNCLAIMED_RR eliminates the undefined area around an upstream switch
+ * window that is not fully decoded by the downstream windows.
+ *
+ * This takes the same approach with ACS Enhanced, if the device does not
+ * support it then we assume the ACS P2P RR has all the enhanced behaviors too.
+ *
+ * Due to ACS Enhanced bits being force set to 0 by older Linux kernels, and
+ * those values would break old kernels on the edge cases they cover, the only
+ * compatible thing for a new device to implement is ACS Enhanced supported with
+ * the control bits (except PCI_ACS_IORB) wired to follow ACS_RR.
+ */
+static u16 pci_acs_ctrl_mask(struct pci_dev *pdev, u16 hw_cap)
+{
+	/*
+	 * Egress Control enables use of the Egress Control Vector which is not
+	 * present without the cap.
+	 */
+	u16 mask = PCI_ACS_EC;
+
+	mask = hw_cap & (PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
+				      PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
+
+	/*
+	 * If ACS Enhanced is supported the device reports what it is doing
+	 * through these bits which may not be settable.
+	 */
+	if (hw_cap & PCI_ACS_ENHANCED)
+		mask |= PCI_ACS_IORB | PCI_ACS_DSP_MT_RB | PCI_ACS_DSP_MT_RR |
+			PCI_ACS_USP_MT_RB | PCI_ACS_USP_MT_RR |
+			PCI_ACS_UNCLAIMED_RR;
+	return mask;
+}
+
 static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
 {
 	int pos;
@@ -3610,15 +3656,9 @@ static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
 	if (!pos)
 		return false;
 
-	/*
-	 * Except for egress control, capabilities are either required
-	 * or only required if controllable.  Features missing from the
-	 * capability field can therefore be assumed as hard-wired enabled.
-	 */
 	pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
-	acs_flags &= (cap | PCI_ACS_EC);
-
 	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+	acs_flags &= pci_acs_ctrl_mask(pdev, cap);
 	return (ctrl & acs_flags) == acs_flags;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 11/11] PCI: Check ACS Extended flags for pci_bus_isolated()
  2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
                   ` (9 preceding siblings ...)
  2025-06-30 22:28 ` [PATCH 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
@ 2025-06-30 22:28 ` Jason Gunthorpe
  2025-07-01 21:48 ` [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Alex Williamson
  11 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 22:28 UTC (permalink / raw)
  To: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian,
	kvm, maorg, patches, tdave, Tony Zhu

When looking at a PCIe switch we want to see that the USP/DSP MMIO have
request redirect enabled. Detect the case where the USP is expressly not
isolated from the DSP and ensure the USP is included in the group.

The DSP Memory Target also applies to the Root Port, check it there
too. If upstream directed transactions can reach the root port MMIO then
it is not isolated.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/pci/search.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index 3bc20659af6b20..779c5979443def 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -127,6 +127,8 @@ static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
 	 * traffic flowing upstream back downstream through another DSP.
 	 *
 	 * Thus any non-permissive DSP spoils the whole bus.
+	 * PCI_ACS_UNCLAIMED_RR is not required since rejecting requests with
+	 * error is still isolation.
 	 */
 	guard(rwsem_read)(&pci_bus_sem);
 	list_for_each_entry(pdev, &bus->devices, bus_list) {
@@ -136,8 +138,14 @@ static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
 		    pdev->dma_alias_mask)
 			return PCIE_NON_ISOLATED;
 
-		if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
+		if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED |
+						   PCI_ACS_DSP_MT_RR |
+						   PCI_ACS_USP_MT_RR)) {
+			/* The USP is isolated from the DSP */
+			if (!pci_acs_enabled(pdev, PCI_ACS_USP_MT_RR))
+				return PCIE_NON_ISOLATED;
 			return PCIE_SWITCH_DSP_NON_ISOLATED;
+		}
 	}
 	return PCIE_ISOLATED;
 }
@@ -216,11 +224,13 @@ enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
 	switch (type) {
 	/*
 	 * Since PCIe links are point to point root and downstream ports are
-	 * isolated if their own MMIO cannot be reached.
+	 * isolated if their own MMIO cannot be reached. The root port
+	 * uses DSP_MT_RR for its own MMIO.
 	 */
 	case PCI_EXP_TYPE_ROOT_PORT:
 	case PCI_EXP_TYPE_DOWNSTREAM:
-		if (!pci_acs_enabled(bridge, PCI_ACS_ISOLATED))
+		if (!pci_acs_enabled(bridge,
+				     PCI_ACS_ISOLATED | PCI_ACS_DSP_MT_RR))
 			return PCIE_NON_ISOLATED;
 		return PCIE_ISOLATED;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 02/11] PCI: Add pci_bus_isolation()
  2025-06-30 22:28 ` [PATCH 02/11] PCI: Add pci_bus_isolation() Jason Gunthorpe
@ 2025-07-01 19:28   ` Alex Williamson
  2025-07-02  1:00     ` Jason Gunthorpe
  2025-07-03 15:30     ` Jason Gunthorpe
  0 siblings, 2 replies; 33+ messages in thread
From: Alex Williamson @ 2025-07-01 19:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Mon, 30 Jun 2025 19:28:32 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> diff --git a/drivers/pci/search.c b/drivers/pci/search.c
> index 53840634fbfc2b..540a503b499e3f 100644
> --- a/drivers/pci/search.c
> +++ b/drivers/pci/search.c
> @@ -113,6 +113,156 @@ int pci_for_each_dma_alias(struct pci_dev *pdev,
>  	return ret;
>  }
>  
> +static enum pci_bus_isolation pcie_switch_isolated(struct pci_bus *bus)
> +{
> +	struct pci_dev *pdev;
> +
> +	/*
> +	 * Within a PCIe switch we have an interior bus that has the Upstream
> +	 * port as the bridge and a set of Downstream port bridging to the
> +	 * egress ports.
> +	 *
> +	 * Each DSP has an ACS setting which controls where its traffic is
> +	 * permitted to go. Any DSP with a permissive ACS setting can send
> +	 * traffic flowing upstream back downstream through another DSP.
> +	 *
> +	 * Thus any non-permissive DSP spoils the whole bus.
> +	 */
> +	guard(rwsem_read)(&pci_bus_sem);
> +	list_for_each_entry(pdev, &bus->devices, bus_list) {
> +		/* Don't understand what this is, be conservative */
> +		if (!pci_is_pcie(pdev) ||
> +		    pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM ||
> +		    pdev->dma_alias_mask)
> +			return PCIE_NON_ISOLATED;
> +
> +		if (!pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
> +			return PCIE_SWITCH_DSP_NON_ISOLATED;
> +	}
> +	return PCIE_ISOLATED;
> +}
> +
> +static bool pci_has_mmio(struct pci_dev *pdev)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
> +		struct resource *res = pci_resource_n(pdev, i);
> +
> +		if (resource_size(res) && resource_type(res) == IORESOURCE_MEM)
> +			return true;
> +	}
> +	return false;
> +}

Maybe the intent is to make this as generic as possible, but it seems
to only be used for bridge devices, so technically it could get
away with testing only the first two resources, right?

> +
> +/**
> + * pci_bus_isolated - Determine how isolated connected devices are
> + * @bus: The bus to check
> + *
> + * Isolation is the ability of devices to talk to each other. Full isolation
> + * means that a device can only communicate with the IOMMU and can not do peer
> + * to peer within the fabric.
> + *
> + * We consider isolation on a bus by bus basis. If the bus will permit a
> + * transaction originated downstream to complete on anything other than the
> + * IOMMU then the bus is not isolated.
> + *
> + * Non-isolation includes all the downstream devices on this bus, and it may
> + * include the upstream bridge or port that is creating this bus.
> + *
> + * The various cases are returned in an enum.
> + *
> + * Broadly speaking this function evaluates the ACS settings in a PCI switch to
> + * determine if a PCI switch is configured to have full isolation.
> + *
> + * Old PCI/PCI-X busses cannot have isolation due to their physical properties,
> + * but they do have some aliasing properties that effect group creation.
> + *
> + * pci_bus_isolated() does not consider loopback internal to devices, like
> + * multi-function devices performing a self-loopback. The caller must check
> + * this separately. It does not considering alasing within the bus.
> + *
> + * It does not currently support the ACS P2P Egress Control Vector, Linux does
> + * not yet have any way to enable this feature. EC will create subsets of the
> + * bus that are isolated from other subsets.
> + */
> +enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> +{
> +	struct pci_dev *bridge = bus->self;
> +	int type;
> +
> +	/* Consider virtual busses isolated */
> +	if (!bridge)
> +		return PCIE_ISOLATED;
> +	if (pci_is_root_bus(bus))
> +		return PCIE_ISOLATED;

How do we know the root bus isn't conventional?  I suppose this is only
called by IOMMU code, but QEMU can make some pretty weird
configurations.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches
  2025-06-30 22:28 ` [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
@ 2025-07-01 19:29   ` Alex Williamson
  2025-07-02  1:04     ` Jason Gunthorpe
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Williamson @ 2025-07-01 19:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Mon, 30 Jun 2025 19:28:33 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index d265de874b14b6..f4584ffacbc03d 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -65,8 +65,16 @@ struct iommu_group {
>  	struct list_head entry;
>  	unsigned int owner_cnt;
>  	void *owner;
> +
> +	/* Used by the device_group() callbacks */
> +	u32 bus_data;
>  };
>  
> +/*
> + * Everything downstream of this group should share it.
> + */
> +#define BUS_DATA_PCI_UNISOLATED BIT(0)

NON_ISOLATED for consistency w/ enum from the previous patch?

...
> +struct iommu_group *pci_device_group(struct device *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct iommu_group *group;
> +	struct pci_dev *real_pdev;
> +
> +	if (WARN_ON(!dev_is_pci(dev)))
> +		return ERR_PTR(-EINVAL);
> +
> +	/*
> +	 * Arches can supply a completely different PCI device that actually
> +	 * does DMA.
> +	 */
> +	real_pdev = pci_real_dma_dev(pdev);
> +	if (real_pdev != pdev) {
> +		group = iommu_group_get(&real_pdev->dev);
> +		if (!group) {
> +			/*
> +			 * The real_pdev has not had an iommu probed to it. We
> +			 * can't create a new group here because there is no way
> +			 * for pci_device_group(real_pdev) to pick it up.
> +			 */
> +			dev_err(dev,
> +				"PCI device is probing out of order, real device of %s is not probed yet\n",
> +				pci_name(real_pdev));
> +			return ERR_PTR(-EPROBE_DEFER);
> +		}
> +		return group;
> +	}
> +
> +	if (pdev->dev_flags & PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT)
> +		return iommu_group_alloc();
> +
> +	/* Anything upstream of this enforcing non-isolated? */
> +	group = pci_hierarchy_group(pdev);
>  	if (group)
>  		return group;
>  
> -	/* No shared group found, allocate new */
> -	return iommu_group_alloc();
> +	switch (pci_bus_isolated(pdev->bus)) {
> +	case PCIE_ISOLATED:
> +		/* Check multi-function groups and same-bus devfn aliases */
> +		group = pci_get_alias_group(pdev);
> +		if (group)
> +			return group;
> +
> +		/* No shared group found, allocate new */
> +		return iommu_group_alloc();

I'm not following how we'd handle a multi-function root port w/o
consistent ACS isolation here.  How/where does the resulting group get
the UNISOLATED flag set?

I think that's necessary for the look-back in pci_hierarchy_group(),
right?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches
  2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
                   ` (10 preceding siblings ...)
  2025-06-30 22:28 ` [PATCH 11/11] PCI: Check ACS Extended flags for pci_bus_isolated() Jason Gunthorpe
@ 2025-07-01 21:48 ` Alex Williamson
  2025-07-02  1:47   ` Jason Gunthorpe
                     ` (2 more replies)
  11 siblings, 3 replies; 33+ messages in thread
From: Alex Williamson @ 2025-07-01 21:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

Testing on some systems here...

I have an AMD system:

# lspci -tv
-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Root Complex
           +-00.2  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge IOMMU
           +-01.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
           +-01.1-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon Pro W5700]
           |                                            +-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           |                                            +-00.2  Advanced Micro Devices, Inc. [AMD/ATI] Device 7316
           |                                            \-00.3  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 USB
           +-01.2-[04]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
           +-02.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
           +-02.1-[05]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a (DRAM-less)
           +-02.2-[06-0b]----00.0-[07-0b]--+-01.0-[08]--+-00.0  MosChip Semiconductor Technology Ltd. MCS9922 PCIe Multi-I/O Controller
           |                               |            \-00.1  MosChip Semiconductor Technology Ltd. MCS9922 PCIe Multi-I/O Controller
           |                               +-02.0-[09-0a]--+-00.0  Intel Corporation 82576 Gigabit Network Connection
           |                               |               \-00.1  Intel Corporation 82576 Gigabit Network Connection
           |                               \-03.0-[0b]----00.0  Fresco Logic FL1100 USB 3.0 Host Controller
           +-03.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
           +-03.1-[0c]----00.0  JMicron Technology Corp. JMB58x AHCI SATA controller
           +-03.2-[0d]----00.0  Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller
           +-04.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
           +-08.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
           +-08.1-[0e]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Raphael
           |            +-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Radeon High Definition Audio Controller [Rembrandt/Strix]
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 19h PSP/CCP
           |            +-00.3  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI
           |            +-00.4  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI
           |            \-00.6  Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller
           +-08.3-[0f]----00.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 2.0 xHCI
           +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 0
           +-18.1  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 1
           +-18.2  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 2
           +-18.3  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 3
           +-18.4  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 4
           +-18.5  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 5
           +-18.6  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 6
           \-18.7  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 7

Notably, each case where there's a dummy host bridge followed by some
number of additional functions (ie. 01.0, 02.0, 03.0, 08.0), that dummy
host bridge is tainting the function isolation and merging the group.
For instance each of these were previously a separate group and are now
combined into one group.

# lspci -vvvs 00:01. [manually edited]
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge

00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge (prog-if 00 [Normal decode])
	Capabilities: [58] Express (v2) Root Port (Slot+), IntMsgNum 0
	Capabilities: [2a0 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge (prog-if 00 [Normal decode])
	Capabilities: [58] Express (v2) Root Port (Slot+), IntMsgNum 0
	Capabilities: [2a0 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

The endpoints result in equivalent grouping, but this is a case where I
don't understand how we have non-isolated functions yet isolated
subordinate buses.

An Alder Lake system shows something similar:

# lspci -tv
-[0000:00]-+-00.0  Intel Corporation 12th Gen Core Processor Host Bridge
           +-01.0-[01-02]----00.0-[02]--
           +-02.0  Intel Corporation Alder Lake-S GT1 [UHD Graphics 770]
           +-04.0  Intel Corporation Alder Lake Innovation Platform Framework Processor Participant
           +-06.0-[03]----00.0  Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less)
           +-08.0  Intel Corporation 12th Gen Core Processor Gaussian & Neural Accelerator
           +-14.0  Intel Corporation Raptor Lake USB 3.2 Gen 2x2 (20 Gb/s) XHCI Host Controller
           +-14.2  Intel Corporation Raptor Lake-S PCH Shared SRAM
           +-15.0  Intel Corporation Raptor Lake Serial IO I2C Host Controller #0
           +-15.1  Intel Corporation Raptor Lake Serial IO I2C Host Controller #1
           +-15.2  Intel Corporation Raptor Lake Serial IO I2C Host Controller #2
           +-15.3  Intel Corporation Device 7a4f
           +-16.0  Intel Corporation Raptor Lake CSME HECI #1
           +-17.0  Intel Corporation Raptor Lake SATA AHCI Controller
           +-19.0  Intel Corporation Device 7a7c
           +-19.1  Intel Corporation Device 7a7d
           +-1a.0-[04]----00.0  Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less)
           +-1c.0-[05]--
           +-1c.1-[06]----00.0  Fresco Logic FL1100 USB 3.0 Host Controller
           +-1c.2-[07]----00.0  Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller
           +-1c.3-[08-0c]----00.0-[09-0c]--+-01.0-[0a]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller
           |                               +-02.0-[0b]--
           |                               \-03.0-[0c]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller
           +-1f.0  Intel Corporation Device 7a06
           +-1f.3  Intel Corporation Raptor Lake High Definition Audio Controller
           +-1f.4  Intel Corporation Raptor Lake-S PCH SMBus Controller
           \-1f.5  Intel Corporation Raptor Lake SPI (flash) Controller

00:1c. are all grouped together.  Here 1c.0 does not report ACS, but
the other root ports do:

# lspci -vvvs 1c. | grep -e ^0 -e "Access Control Services"
00:1c.0 PCI bridge: Intel Corporation Raptor Lake PCI Express Root Port #1 (rev 11) (prog-if 00 [Normal decode])
00:1c.1 PCI bridge: Intel Corporation Device 7a39 (rev 11) (prog-if 00 [Normal decode])
	Capabilities: [220 v1] Access Control Services
00:1c.2 PCI bridge: Intel Corporation Raptor Point-S PCH - PCI Express Root Port 3 (rev 11) (prog-if 00 [Normal decode])
	Capabilities: [220 v1] Access Control Services
00:1c.3 PCI bridge: Intel Corporation Raptor Lake PCI Express Root Port #4 (rev 11) (prog-if 00 [Normal decode])
	Capabilities: [220 v1] Access Control Services

So again the group is tainted by a device that cannot generate DMA, the
endpoint grouping remains equivalent, but isolated buses downstream of
this non-isolated group doesn't seem to make sense.

I'll try to generate further interesting configs.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 02/11] PCI: Add pci_bus_isolation()
  2025-07-01 19:28   ` Alex Williamson
@ 2025-07-02  1:00     ` Jason Gunthorpe
  2025-07-03 15:30     ` Jason Gunthorpe
  1 sibling, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-07-02  1:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Tue, Jul 01, 2025 at 01:28:59PM -0600, Alex Williamson wrote:
> > +static bool pci_has_mmio(struct pci_dev *pdev)
> > +{
> > +	unsigned int i;
> > +
> > +	for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
> > +		struct resource *res = pci_resource_n(pdev, i);
> > +
> > +		if (resource_size(res) && resource_type(res) == IORESOURCE_MEM)
> > +			return true;
> > +	}
> > +	return false;
> > +}
> 
> Maybe the intent is to make this as generic as possible, but it seems
> to only be used for bridge devices, so technically it could get
> away with testing only the first two resources, right?

Yes, the intent was to be general, yes it could probably check only
the two type1 BARs, however I was thinking the ROM should be included
too, but I don't recall if type 1 has a ROM BAR or not..

> > +enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> > +{
> > +	struct pci_dev *bridge = bus->self;
> > +	int type;
> > +
> > +	/* Consider virtual busses isolated */
> > +	if (!bridge)
> > +		return PCIE_ISOLATED;
> > +	if (pci_is_root_bus(bus))
> > +		return PCIE_ISOLATED;
> 
> How do we know the root bus isn't conventional?  I suppose this is only
> called by IOMMU code, but QEMU can make some pretty weird
> configurations.

I feel pretty wobbly on the root bus and root port parts here. So I'm
not sure about this. My ARM system doesn't seem to have these in the
same way.

Since we have a bus->self maybe it should be checking the bus->self's
type the same as normal but we should not inherit bus->self's group in
the iommu.c code?

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches
  2025-07-01 19:29   ` Alex Williamson
@ 2025-07-02  1:04     ` Jason Gunthorpe
  2025-07-17 19:25       ` Donald Dutile
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2025-07-02  1:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Tue, Jul 01, 2025 at 01:29:05PM -0600, Alex Williamson wrote:
> On Mon, 30 Jun 2025 19:28:33 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index d265de874b14b6..f4584ffacbc03d 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -65,8 +65,16 @@ struct iommu_group {
> >  	struct list_head entry;
> >  	unsigned int owner_cnt;
> >  	void *owner;
> > +
> > +	/* Used by the device_group() callbacks */
> > +	u32 bus_data;
> >  };
> >  
> > +/*
> > + * Everything downstream of this group should share it.
> > + */
> > +#define BUS_DATA_PCI_UNISOLATED BIT(0)
> 
> NON_ISOLATED for consistency w/ enum from the previous patch?

Yes

> > -	/* No shared group found, allocate new */
> > -	return iommu_group_alloc();
> > +	switch (pci_bus_isolated(pdev->bus)) {
> > +	case PCIE_ISOLATED:
> > +		/* Check multi-function groups and same-bus devfn aliases */
> > +		group = pci_get_alias_group(pdev);
> > +		if (group)
> > +			return group;
> > +
> > +		/* No shared group found, allocate new */
> > +		return iommu_group_alloc();
> 
> I'm not following how we'd handle a multi-function root port w/o
> consistent ACS isolation here.  How/where does the resulting group get
> the UNISOLATED flag set?

Still wobbly on the root port/root bus.. So the answer is probably
that it doesn't.

What does a multi-function root port with different ACS flags even
mean and how should we treat it? I had in mind that the first root
port is the TA and immediately goes the IOMMU.

If you can explain a bit more about how you see the root ports working
I can try to make an implementation.

AFAICT the spec sort of says 'implementation defined' for ACS on root
ports??

Thanks,
Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches
  2025-07-01 21:48 ` [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Alex Williamson
@ 2025-07-02  1:47   ` Jason Gunthorpe
  2025-07-04  0:37   ` Jason Gunthorpe
  2025-07-08 20:47   ` Jason Gunthorpe
  2 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-07-02  1:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Tue, Jul 01, 2025 at 03:48:26PM -0600, Alex Williamson wrote:
> Testing on some systems here...
> 
> I have an AMD system:

I have to admit I don't really like lspci -t, mostly because the man
page doesn't describe the notation '-[0000:00]- +-01.1-[01-03]' means
(I guess it is the subordinate bus range), and it drops any lableing
of the interior switch devices.

I've found lspci -PP to be alot easier to follow for this work:

 00:06.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 (rev 02)
 00:06.0/02:00.0 Non-Volatile memory controller: SK hynix Platinum P41/PC801 NVMe Solid State Drive

However it is curious that doesn't include in the path

 00:00.0 Host bridge: Intel Corporation 12th Gen Core Processor Host Bridge/DRAM Registers (rev 02)

but lspci -t does..

> # lspci -tv
> -[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Root Complex
>            +-00.2  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge IOMMU
>            +-01.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
>            +-01.1-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon Pro W5700]
>            |                                            +-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
>            |                                            +-00.2  Advanced Micro Devices, Inc. [AMD/ATI] Device 7316
>            |                                            \-00.3  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 USB
>            +-01.2-[04]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
>            +-02.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
>            +-02.1-[05]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a (DRAM-less)
>            +-02.2-[06-0b]----00.0-[07-0b]--+-01.0-[08]--+-00.0  MosChip Semiconductor Technology Ltd. MCS9922 PCIe Multi-I/O Controller
>            |                               |            \-00.1  MosChip Semiconductor Technology Ltd. MCS9922 PCIe Multi-I/O Controller
>            |                               +-02.0-[09-0a]--+-00.0  Intel Corporation 82576 Gigabit Network Connection
>            |                               |               \-00.1  Intel Corporation 82576 Gigabit Network Connection
>            |                               \-03.0-[0b]----00.0  Fresco Logic FL1100 USB 3.0 Host Controller
>            +-03.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
>            +-03.1-[0c]----00.0  JMicron Technology Corp. JMB58x AHCI SATA controller
>            +-03.2-[0d]----00.0  Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller
>            +-04.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
>            +-08.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
>            +-08.1-[0e]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Raphael
>            |            +-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Radeon High Definition Audio Controller [Rembrandt/Strix]
>            |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 19h PSP/CCP
>            |            +-00.3  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI
>            |            +-00.4  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI
>            |            \-00.6  Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller
>            +-08.3-[0f]----00.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 2.0 xHCI
>            +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
>            +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
>            +-18.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 0
>            +-18.1  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 1
>            +-18.2  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 2
>            +-18.3  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 3
>            +-18.4  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 4
>            +-18.5  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 5
>            +-18.6  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 6
>            \-18.7  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 7
> 
> Notably, each case where there's a dummy host bridge followed by some
> number of additional functions (ie. 01.0, 02.0, 03.0, 08.0), that dummy
> host bridge is tainting the function isolation and merging the group.
> For instance each of these were previously a separate group and are now
> combined into one group.

Okay.. So what is this topology trying to represent and what should we
be doing in Linux here for groups?

I note that the spec left ACS flags for root ports as implementation
specific.. So I have no idea what this actually is trying to tell the
OS :\

> # lspci -vvvs 00:01. [manually edited]
> 00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
> 
> 00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge (prog-if 00 [Normal decode])
> 	Capabilities: [58] Express (v2) Root Port (Slot+), IntMsgNum 0
> 	Capabilities: [2a0 v1] Access Control Services
> 		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
> 		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
> 
> 00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge (prog-if 00 [Normal decode])
> 	Capabilities: [58] Express (v2) Root Port (Slot+), IntMsgNum 0
> 	Capabilities: [2a0 v1] Access Control Services
> 		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
> 		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
> 
> The endpoints result in equivalent grouping, but this is a case where I
> don't understand how we have non-isolated functions yet isolated
> subordinate buses.

Sorry, I'm not sure I followed exactly, let me repeat what I think:

The new code is putting 00:01.0, 00:01.1, 00:01.2 in a group because
it is a MFD and not all functions in the MFD have ACS? This sounds
does sound correct? I would have expected the original code to do this
also? Why does it avoid it?

But then you mean 04:00.0 "Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983"
gets its own group?

I think this happens because the MFD code in pci_get_alias_group()
joins all functions together but does not set BUS_DATA_PCI_UNISOLATED
within the group. So the downstreams of the bridge remain isolated,
and the bus 00 was never NON_ISOLATED.

This does not seem right. Probably pci_get_alias_group() should be
setting BUS_DATA_PCI_UNISOLATED if the ACS is not isolated in the
function. I did not even slightly think about how a bridge USP on a
MFD would even work :\

> An Alder Lake system shows something similar:
> 
> # lspci -tv
> -[0000:00]-+-00.0  Intel Corporation 12th Gen Core Processor Host Bridge
>            +-01.0-[01-02]----00.0-[02]--
>            +-02.0  Intel Corporation Alder Lake-S GT1 [UHD Graphics 770]
>            +-04.0  Intel Corporation Alder Lake Innovation Platform Framework Processor Participant
>            +-06.0-[03]----00.0  Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less)
>            +-08.0  Intel Corporation 12th Gen Core Processor Gaussian & Neural Accelerator
>            +-14.0  Intel Corporation Raptor Lake USB 3.2 Gen 2x2 (20 Gb/s) XHCI Host Controller
>            +-14.2  Intel Corporation Raptor Lake-S PCH Shared SRAM
>            +-15.0  Intel Corporation Raptor Lake Serial IO I2C Host Controller #0
>            +-15.1  Intel Corporation Raptor Lake Serial IO I2C Host Controller #1
>            +-15.2  Intel Corporation Raptor Lake Serial IO I2C Host Controller #2
>            +-15.3  Intel Corporation Device 7a4f
>            +-16.0  Intel Corporation Raptor Lake CSME HECI #1
>            +-17.0  Intel Corporation Raptor Lake SATA AHCI Controller
>            +-19.0  Intel Corporation Device 7a7c
>            +-19.1  Intel Corporation Device 7a7d
>            +-1a.0-[04]----00.0  Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less)
>            +-1c.0-[05]--
>            +-1c.1-[06]----00.0  Fresco Logic FL1100 USB 3.0 Host Controller
>            +-1c.2-[07]----00.0  Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller
>            +-1c.3-[08-0c]----00.0-[09-0c]--+-01.0-[0a]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller
>            |                               +-02.0-[0b]--
>            |                               \-03.0-[0c]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller
>            +-1f.0  Intel Corporation Device 7a06
>            +-1f.3  Intel Corporation Raptor Lake High Definition Audio Controller
>            +-1f.4  Intel Corporation Raptor Lake-S PCH SMBus Controller
>            \-1f.5  Intel Corporation Raptor Lake SPI (flash) Controller
> 
> 00:1c. are all grouped together.  Here 1c.0 does not report ACS, but
> the other root ports do:
> 
> # lspci -vvvs 1c. | grep -e ^0 -e "Access Control Services"
> 00:1c.0 PCI bridge: Intel Corporation Raptor Lake PCI Express Root Port #1 (rev 11) (prog-if 00 [Normal decode])

So this is a PCI bridge not a host brdige like AMD.. 
What are the PCI types for this? Is it a root port?

> 00:1c.1 PCI bridge: Intel Corporation Device 7a39 (rev 11) (prog-if 00 [Normal decode])
> 	Capabilities: [220 v1] Access Control Services
> 00:1c.2 PCI bridge: Intel Corporation Raptor Point-S PCH - PCI Express Root Port 3 (rev 11) (prog-if 00 [Normal decode])
> 	Capabilities: [220 v1] Access Control Services
> 00:1c.3 PCI bridge: Intel Corporation Raptor Lake PCI Express Root Port #4 (rev 11) (prog-if 00 [Normal decode])
> 	Capabilities: [220 v1] Access Control Services

Same question, are these all root ports?

My desktop has:

00:06.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 (rev 02) (prog-if 00 [Normal decode])
        Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00

So maybe yes?

> So again the group is tainted by a device that cannot generate DMA, the
> endpoint grouping remains equivalent, but isolated buses downstream of
> this non-isolated group doesn't seem to make sense.
> 
> I'll try to generate further interesting configs.  Thanks,

Thanks a lot, this is very different from my ARM systems here.

I really am at a bit of a loss what Linux should do here.. Your point
about a "device that cannot generate DMA" makes alot of sense. I've
thought the same way about the DSPs too.

Another thought - should we draw a line across the root ports and
assume that if a TLP reaches a root port/bridge that it goes the
IOMMU? Essentially we don't let the BUS_DATA_PCI_UNISOLATED of the
bus->self propogate if bus->self is a root port?

This would allow fixing the bridge MFD miss and still generate the
groupings you see here in a deliberate way.

Alternatively, should we try to identify these "no DMA" devices and
then improve the various ACS calculations? A device with no MMIO, no
IO, and no downstream can reasonably be considered to have no
DMA. Will that describe the two cases you saw that "spolied" the group?
We can detect that and enhance the ACS function to report they have ACS
RR/etc enabled.

I was thinking about this already in terms of the DSPs not really
needed to be group'd with their downstreams if they don't have MMIO
and can't initiate DMA.

To summarize:
 1) We are getting acceptable groupings for the downstream devices. So
    a lot is working well
 2) The root complex integrated devices, upstream of root ports, are
    not working the same way
 3) There is a miss of MFD ACS propogation on bridge/switch USPs

Also, I updated the github with an extra patch that has the debugging
I've been using. It may be helpful. It shows bus by bus what the
isolation is and various other decision points.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 02/11] PCI: Add pci_bus_isolation()
  2025-07-01 19:28   ` Alex Williamson
  2025-07-02  1:00     ` Jason Gunthorpe
@ 2025-07-03 15:30     ` Jason Gunthorpe
  2025-07-03 22:17       ` Alex Williamson
  1 sibling, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2025-07-03 15:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Tue, Jul 01, 2025 at 01:28:59PM -0600, Alex Williamson wrote:
> > +enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> > +{
> > +	struct pci_dev *bridge = bus->self;
> > +	int type;
> > +
> > +	/* Consider virtual busses isolated */
> > +	if (!bridge)
> > +		return PCIE_ISOLATED;
> > +	if (pci_is_root_bus(bus))
> > +		return PCIE_ISOLATED;
> 
> How do we know the root bus isn't conventional?

If I read this right this is dead code..

/*
 * Returns true if the PCI bus is root (behind host-PCI bridge),
 * false otherwise
 *
 * Some code assumes that "bus->self == NULL" means that bus is a root bus.
 * This is incorrect because "virtual" buses added for SR-IOV (via
 * virtfn_add_bus()) have "bus->self == NULL" but are not root buses.
 */
static inline bool pci_is_root_bus(struct pci_bus *pbus)
{
	return !(pbus->parent);

Looking at the call chain of pci_alloc_bus():
 pci_alloc_child_bus() - Parent bus may not be NULL
 pci_add_new_bus() - All callers pass !NULL bus
 pci_register_host_bridge() - Sets self and parent to NULL

Thus if pci_is_root() == true implies bus->self == NULL so we can't
get here.

So I will change it to be like:

	/*
	 * This bus was created by pci_register_host_bridge(). There is nothing
	 * upstream of this, assume it contains the TA and that the root complex
	 * does not allow P2P without going through the IOMMU.
	 */
	if (pci_is_root_bus(bus))
		return PCIE_ISOLATED;

	/*
	 * Sometimes SRIOV VFs can have a "virtual" bus if the SRIOV RID's
	 * extend past the bus numbers of the parent. The spec says that SRIOV
	 * VFs and PFs should act the same as functions in a MFD. MFD isolation
	 * is handled outside this function.
	 */
	if (!bridge)
		return PCIE_ISOLATED;

And now it seems we never took care with SRIOV, along with the PF
every SRIOV VF needs to have its ACS checked as though it was a MFD..

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 02/11] PCI: Add pci_bus_isolation()
  2025-07-03 15:30     ` Jason Gunthorpe
@ 2025-07-03 22:17       ` Alex Williamson
  2025-07-03 23:08         ` Alex Williamson
  2025-07-03 23:15         ` Jason Gunthorpe
  0 siblings, 2 replies; 33+ messages in thread
From: Alex Williamson @ 2025-07-03 22:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Thu, 3 Jul 2025 12:30:30 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jul 01, 2025 at 01:28:59PM -0600, Alex Williamson wrote:
> > > +enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> > > +{
> > > +	struct pci_dev *bridge = bus->self;
> > > +	int type;
> > > +
> > > +	/* Consider virtual busses isolated */
> > > +	if (!bridge)
> > > +		return PCIE_ISOLATED;
> > > +	if (pci_is_root_bus(bus))
> > > +		return PCIE_ISOLATED;  
> > 
> > How do we know the root bus isn't conventional?  
> 
> If I read this right this is dead code..
> 
> /*
>  * Returns true if the PCI bus is root (behind host-PCI bridge),
>  * false otherwise
>  *
>  * Some code assumes that "bus->self == NULL" means that bus is a root bus.
>  * This is incorrect because "virtual" buses added for SR-IOV (via
>  * virtfn_add_bus()) have "bus->self == NULL" but are not root buses.
>  */
> static inline bool pci_is_root_bus(struct pci_bus *pbus)
> {
> 	return !(pbus->parent);
> 
> Looking at the call chain of pci_alloc_bus():
>  pci_alloc_child_bus() - Parent bus may not be NULL
>  pci_add_new_bus() - All callers pass !NULL bus
>  pci_register_host_bridge() - Sets self and parent to NULL
> 
> Thus if pci_is_root() == true implies bus->self == NULL so we can't
> get here.

Yep, seems correct.

> So I will change it to be like:
> 
> 	/*
> 	 * This bus was created by pci_register_host_bridge(). There is nothing
> 	 * upstream of this, assume it contains the TA and that the root complex
> 	 * does not allow P2P without going through the IOMMU.
> 	 */
> 	if (pci_is_root_bus(bus))
> 		return PCIE_ISOLATED;

Ok, but did we sidestep the question of whether the root bus can be
conventional?

> 
> 	/*
> 	 * Sometimes SRIOV VFs can have a "virtual" bus if the SRIOV RID's
> 	 * extend past the bus numbers of the parent. The spec says that SRIOV
> 	 * VFs and PFs should act the same as functions in a MFD. MFD isolation
> 	 * is handled outside this function.
> 	 */
> 	if (!bridge)
> 		return PCIE_ISOLATED;
> 
> And now it seems we never took care with SRIOV, along with the PF
> every SRIOV VF needs to have its ACS checked as though it was a MFD..

There's actually evidence that we did take care to make sure VFs never
flag themselves as multifunction in order to avoid the multifunction
ACS tests.  I think we'd see lots of devices suddenly unusable for one
of their intended use cases if we grouped VFs that don't expose an ACS
capability.  Also VFs from multiple PFs exist on the same virtual bus,
so I imagine if the PF supports ACS but the VF doesn't, you'd end up
with multiple isolation domains on the same bus.  Thus, we've so far
take the approach that "surely the hw vendor intended these to be used
independently", and only considered the isolation upstream from the VFs.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 02/11] PCI: Add pci_bus_isolation()
  2025-07-03 22:17       ` Alex Williamson
@ 2025-07-03 23:08         ` Alex Williamson
  2025-07-03 23:21           ` Jason Gunthorpe
  2025-07-03 23:15         ` Jason Gunthorpe
  1 sibling, 1 reply; 33+ messages in thread
From: Alex Williamson @ 2025-07-03 23:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Thu, 3 Jul 2025 16:17:27 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Thu, 3 Jul 2025 12:30:30 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Jul 01, 2025 at 01:28:59PM -0600, Alex Williamson wrote:  
> > > > +enum pci_bus_isolation pci_bus_isolated(struct pci_bus *bus)
> > > > +{
> > > > +	struct pci_dev *bridge = bus->self;
> > > > +	int type;
> > > > +
> > > > +	/* Consider virtual busses isolated */
> > > > +	if (!bridge)
> > > > +		return PCIE_ISOLATED;
> > > > +	if (pci_is_root_bus(bus))
> > > > +		return PCIE_ISOLATED;    
> > > 
> > > How do we know the root bus isn't conventional?    
> > 
> > If I read this right this is dead code..
> > 
> > /*
> >  * Returns true if the PCI bus is root (behind host-PCI bridge),
> >  * false otherwise
> >  *
> >  * Some code assumes that "bus->self == NULL" means that bus is a root bus.
> >  * This is incorrect because "virtual" buses added for SR-IOV (via
> >  * virtfn_add_bus()) have "bus->self == NULL" but are not root buses.
> >  */
> > static inline bool pci_is_root_bus(struct pci_bus *pbus)
> > {
> > 	return !(pbus->parent);
> > 
> > Looking at the call chain of pci_alloc_bus():
> >  pci_alloc_child_bus() - Parent bus may not be NULL
> >  pci_add_new_bus() - All callers pass !NULL bus
> >  pci_register_host_bridge() - Sets self and parent to NULL
> > 
> > Thus if pci_is_root() == true implies bus->self == NULL so we can't
> > get here.  
> 
> Yep, seems correct.
> 
> > So I will change it to be like:
> > 
> > 	/*
> > 	 * This bus was created by pci_register_host_bridge(). There is nothing
> > 	 * upstream of this, assume it contains the TA and that the root complex
> > 	 * does not allow P2P without going through the IOMMU.
> > 	 */
> > 	if (pci_is_root_bus(bus))
> > 		return PCIE_ISOLATED;  
> 
> Ok, but did we sidestep the question of whether the root bus can be
> conventional?
> 
> > 
> > 	/*
> > 	 * Sometimes SRIOV VFs can have a "virtual" bus if the SRIOV RID's
> > 	 * extend past the bus numbers of the parent. The spec says that SRIOV
> > 	 * VFs and PFs should act the same as functions in a MFD. MFD isolation
> > 	 * is handled outside this function.
> > 	 */
> > 	if (!bridge)
> > 		return PCIE_ISOLATED;
> > 
> > And now it seems we never took care with SRIOV, along with the PF
> > every SRIOV VF needs to have its ACS checked as though it was a MFD..  
> 
> There's actually evidence that we did take care to make sure VFs never
> flag themselves as multifunction in order to avoid the multifunction
> ACS tests.  I think we'd see lots of devices suddenly unusable for one
> of their intended use cases if we grouped VFs that don't expose an ACS
> capability.  Also VFs from multiple PFs exist on the same virtual bus,
> so I imagine if the PF supports ACS but the VF doesn't, you'd end up
> with multiple isolation domains on the same bus.  Thus, we've so far
> take the approach that "surely the hw vendor intended these to be used
> independently", and only considered the isolation upstream from the VFs.

BTW, spec 6.0.1, section 6.12:

  For ACS requirements, single-Function devices that are SR-IOV capable
  must be handled as if they were Multi-Function Devices, since they
  essentially behave as Multi-Function Devices after their Virtual
  Functions (VFs) are enabled.

Also, section 7.7.11:

  If an SR-IOV Capable Device other than one in a Root Complex
  implements internal peer-to-peer transactions, ACS is required, and
  ACS P2P Egress Control must be supported.

The latter says to me that a non root complex SR-IOV device that does
not implement ACS does not implement internal p2p routing.  OTOH, the
former seems to suggest that we need to consider VFs as peers of the
PF, maybe even governed by ACS on the PF, relative to MF routing.  I'm
not really finding anything that says the VF itself needs to implement
ACS.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 02/11] PCI: Add pci_bus_isolation()
  2025-07-03 22:17       ` Alex Williamson
  2025-07-03 23:08         ` Alex Williamson
@ 2025-07-03 23:15         ` Jason Gunthorpe
  1 sibling, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-07-03 23:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Thu, Jul 03, 2025 at 04:17:27PM -0600, Alex Williamson wrote:
> > So I will change it to be like:
> > 
> > 	/*
> > 	 * This bus was created by pci_register_host_bridge(). There is nothing
> > 	 * upstream of this, assume it contains the TA and that the root complex
> > 	 * does not allow P2P without going through the IOMMU.
> > 	 */
> > 	if (pci_is_root_bus(bus))
> > 		return PCIE_ISOLATED;
> 
> Ok, but did we sidestep the question of whether the root bus can be
> conventional?

The root bus doesn't exist, it is the thing on the other side of the
host bridge.. So it is implementation specific and I don't think we
can make any guesses about it.

> > And now it seems we never took care with SRIOV, along with the PF
> > every SRIOV VF needs to have its ACS checked as though it was a MFD..
> 
> There's actually evidence that we did take care to make sure VFs never
> flag themselves as multifunction in order to avoid the multifunction
> ACS tests.

That's not what I mean.. The spec says:

 6.12 Access Control Services (ACS)

 ACS defines a set of control points within a PCI Express topology to
 determine whether a TLP is to be routed normally, blocked, or
 redirected. ACS is applicable to RCs, Switches, and Multi-Function
 Devices. For ACS requirements, single-Function devices that are
 SR-IOV capable must be handled as if they were Multi-Function
 Devices, since they essentially behave as Multi-Function Devices
 after their Virtual Functions (VFs) are enabled.

I read "essentially behave as Multi-Function Devices" as meaning the
VFs and PFs are all together and have possible internal loopback
similar to a MFDs.

Meaning we can have P2P between VFs and ACS is present to inhibit
that.

>  I think we'd see lots of devices suddenly unusable for one
> of their intended use cases if we grouped VFs that don't expose an ACS
> capability.  

That may be, but how should the spec be understood?

> Also VFs from multiple PFs exist on the same virtual bus,
> so I imagine if the PF supports ACS but the VF doesn't, you'd end up
> with multiple isolation domains on the same bus.  

AFAICT the virtual bus thing is a Linux-ism to handle the bus numbers
taken over by SRIOV VF RIDS going past a single bus number. I've
revised things to better handle that by processing only the physical
busses.

I wasn't thinking multiple groups because ACS is all or nothing - if
one VF/PF has ACS that permits it to P2P internally to all other
functions then the entire PF and all VFs are one group. Same logic as
MFDs.

> Thus, we've so far take the approach that "surely the hw vendor
> intended these to be used independently", and only considered the
> isolation upstream from the VFs.  Thanks,

So maybe if the ACS capability is not present we assume that it means
there is no internal loopback, otherwise the ACS must be correct?

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 02/11] PCI: Add pci_bus_isolation()
  2025-07-03 23:08         ` Alex Williamson
@ 2025-07-03 23:21           ` Jason Gunthorpe
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-07-03 23:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Thu, Jul 03, 2025 at 05:08:46PM -0600, Alex Williamson wrote:
> BTW, spec 6.0.1, section 6.12:
> 
>   For ACS requirements, single-Function devices that are SR-IOV capable
>   must be handled as if they were Multi-Function Devices, since they
>   essentially behave as Multi-Function Devices after their Virtual
>   Functions (VFs) are enabled.
> 
> Also, section 7.7.11:
> 
>   If an SR-IOV Capable Device other than one in a Root Complex
>   implements internal peer-to-peer transactions, ACS is required, and
>   ACS P2P Egress Control must be supported.

Oh, I haven't seen that one yet..

> The latter says to me that a non root complex SR-IOV device that does
> not implement ACS does not implement internal p2p routing.  

Great

> OTOH, the former seems to suggest that we need to consider VFs as
> peers of the PF

Yes, I think both are saying that. 7.7.11 talks about "internal
peer-to-peer transactions" which is exactly what ACS for a MFD is
about.

So it seems VF to VF is possible through internal transactions.

> maybe even governed by ACS on the PF, relative to
> MF routing.  

But here I think it is not clear - does the ACS of the PF globally
control all the VFs or should each VF have its own ACS in addition to
the PF?

The latter is certainly more useful, especially if the egress control
feature is used.

Let me ask some expert people what they know.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches
  2025-07-01 21:48 ` [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Alex Williamson
  2025-07-02  1:47   ` Jason Gunthorpe
@ 2025-07-04  0:37   ` Jason Gunthorpe
  2025-07-11 14:55     ` Alex Williamson
  2025-07-08 20:47   ` Jason Gunthorpe
  2 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2025-07-04  0:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Tue, Jul 01, 2025 at 03:48:26PM -0600, Alex Williamson wrote:

> 00:1c. are all grouped together.  Here 1c.0 does not report ACS, but
> the other root ports do:

I dug an older Intel system out of my closet and got it to run this
kernel, it has another odd behavior, maybe related to what you are
seeing..

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1b.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)
00:1f.0 ISA bridge: Intel Corporation C236 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
00:01.0/01:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
00:01.0/01:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
00:1b.0/02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961/SM963

And here we are interested in this group:

00:1f.0 ISA bridge: Intel Corporation C236 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)

Which the current code puts into two groups
  [00:1f.0 00:1f.2 00:1f.3 00:1f.4]
  [00:1f.6]

While this series puts them all in one group.

No device in the MFD 00:1f has an ACS capability however only 00:1f.6 has a quirk:

	{ PCI_VENDOR_ID_INTEL, 0x15b7, pci_quirk_mf_endpoint_acs },
	/*
	 * SV, TB, and UF are not relevant to multifunction endpoints.
	 *
	 * Multifunction devices are only required to implement RR, CR, and DT
	 * in their ACS capability if they support peer-to-peer transactions.
	 * Devices matching this quirk have been verified by the vendor to not
	 * perform peer-to-peer with other functions, allowing us to mask out
	 * these bits as if they were unimplemented in the ACS capability.
	 */

Giving these ACS results:

pci 0000:00:1f.0: pci_acs_enabled:3693   result=0 1d
pci 0000:00:1f.2: pci_acs_enabled:3693   result=0 1d
pci 0000:00:1f.3: pci_acs_enabled:3693   result=0 1d
pci 0000:00:1f.4: pci_acs_enabled:3693   result=0 1d
pci 0000:00:1f.6: pci_acs_enabled:3693   result=1 1d

Which shows the logic here:

static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
							unsigned long *devfns)
{
	if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
		return NULL;

Is causing the grouping difference. When it checks 00:1f.6 it sees
pci_acs_enabled = true and then ignores the rest of the MFD.  This is
basically part of my issue #2 that off-path ACS is not considered.

AFAIK ACS is a per-function egress property (eg it is why it is called
the ACS Egress Vector). Meaning if 01f.4 sends a P2P DMA targetting
MMIO in 1f.6 it is the ACS of 01f.4 as the egress that is responsible
to block it. The ACS of 1f.6 as the ingress is not considered.

By our rules if 01f.4 can DMA into 01f.6 they should be in the same
group.

I point to "Table 6-10 ACS P2P Request Redirect and ACS P2P Egress
Control Interactions" as supporting this. None of these options are
'block incoming request' - they are all talking about how to route the
original outgoing request.

So I think the above is a bug in the current kernel, the logic should
require that all functions in the MFD have ACS on, otherwise they need
to share a single group. It is what is implemented in this series, and
I think it is why you saw other cases where a single bad ACS "spoils"
the MFD?

It seems the qurking should have included all the functions in this
MFD, not just the NIC.

Does this seem right to you?

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches
  2025-07-01 21:48 ` [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Alex Williamson
  2025-07-02  1:47   ` Jason Gunthorpe
  2025-07-04  0:37   ` Jason Gunthorpe
@ 2025-07-08 20:47   ` Jason Gunthorpe
  2025-07-11 15:40     ` Alex Williamson
  2 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2025-07-08 20:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Tue, Jul 01, 2025 at 03:48:26PM -0600, Alex Williamson wrote:

> Notably, each case where there's a dummy host bridge followed by some
> number of additional functions (ie. 01.0, 02.0, 03.0, 08.0), that dummy
> host bridge is tainting the function isolation and merging the group.
> For instance each of these were previously a separate group and are now
> combined into one group.

I was able to run some testing on a Milan system that seems similar.

It has the weird "Dummy Host Bridge" MFD. I fixed it with this:

/*
 * For some reason AMD likes to put "dummy functions" in their PCI hierarchy as
 * part of a multi function device. These are notable because they can't do
 * anything. No BARs and no downstream bus. Since they cannot accept P2P or
 * initiate any MMIO we consider them to be isolated from the rest of MFD. Since
 * they often accompany a real PCI bridge with downstream devices it is
 * important that the MFD be isolated. Annoyingly there is no ACS capability
 * reported we have to special case it.
 */
static bool pci_dummy_function(struct pci_dev *pdev)
{
	if (pdev->class >> 8 == PCI_CLASS_BRIDGE_HOST && !pci_has_mmio(pdev))
		return true;
	return false;
}

This AMD system has second weirdness:

40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
        Capabilities: [2a0 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
                ACSCtl: SrcValid- TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
40:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
        Capabilities: [2a0 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
                ACSCtl: SrcValid- TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
40:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
        Capabilities: [2a0 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

Notice the SrcValid- 

The kernel definately set SrcValid+, the device stored it, and it
never set SrcValid-, yet somehow it got changed:

[    0.483828] pci 0000:40:01.1: pci_enable_acs:1089
[    0.483828] pci 0000:40:01.1: pci_write_config_word:604 9 678 = 1d
[    0.483831] pci 0000:40:01.1: ACS Set to 1d, readback=1d
[..]
[    0.826514] pci 0000:40:01.1: __pci_device_group:1635 Starting
[    0.826517] pci 0000:40:01.1: pci_acs_flags_enabled:3668   ctrl=1c acs_flags=1d cap=5f

I instrumented pci_write_config_word() and it isn't being called a
second time. I didn't try to narrow this down, too weird. Guessing
ACPI or FW?

So the new logic puts all the above and the downstream into group due
to insuffucient isolation which is the only degredation on this
system, the LOM ethernet gets grouped together with the above MFD.

Given in this case we explicitly have ACS flags we consider
non-isolated I'm not sure there is anything to be done about it.

Which raises a question if SrcValid should be part of grouping or not,
it is more of a security enhancement, it doesn't permit/deny P2P
between devices?

> The endpoints result in equivalent grouping, but this is a case where I
> don't understand how we have non-isolated functions yet isolated
> subordinate buses.

And I fixed this too, as above is showing, by marking the group of the
MFD as non-isolated, thus forcing it to propogate downstream.

> An Alder Lake system shows something similar:

I also tested a bunch of Intel client systems. Some with an ACS quirk
and one with the VMD/non transparent bridge setup. Those had no
grouping changes, but no raptor lake in this group.

> # lspci -vvvs 1c. | grep -e ^0 -e "Access Control Services"
> 00:1c.0 PCI bridge: Intel Corporation Raptor Lake PCI Express Root Port #1 (rev 11) (prog-if 00 [Normal decode])
> 00:1c.1 PCI bridge: Intel Corporation Device 7a39 (rev 11) (prog-if 00 [Normal decode])
> 	Capabilities: [220 v1] Access Control Services
> 00:1c.2 PCI bridge: Intel Corporation Raptor Point-S PCH - PCI Express Root Port 3 (rev 11) (prog-if 00 [Normal decode])
> 	Capabilities: [220 v1] Access Control Services
> 00:1c.3 PCI bridge: Intel Corporation Raptor Lake PCI Express Root Port #4 (rev 11) (prog-if 00 [Normal decode])
> 	Capabilities: [220 v1] Access Control Services
> 
> So again the group is tainted by a device that cannot generate DMA, 

It looks like 00:1c.0 is advertised as a root port, so it can generate
DMA as part of its root port function bridging to something outside
the root complex.

This system doesn't seem to have anything downstream of that root port
(currently plugged in?), but IMHO that port should have ACS. By spec I
think it is correct to assume that without ACS traffic from downstream
of the root port would be able to follow the internal loopback of the
MFD.

This will probably need a quirk, and it is different from the AMD case
which used a host bridge..

Any other idea?

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches
  2025-07-04  0:37   ` Jason Gunthorpe
@ 2025-07-11 14:55     ` Alex Williamson
  2025-07-11 16:08       ` Jason Gunthorpe
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Williamson @ 2025-07-11 14:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Thu, 3 Jul 2025 21:37:09 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jul 01, 2025 at 03:48:26PM -0600, Alex Williamson wrote:
> 
> > 00:1c. are all grouped together.  Here 1c.0 does not report ACS, but
> > the other root ports do:  
> 
> I dug an older Intel system out of my closet and got it to run this
> kernel, it has another odd behavior, maybe related to what you are
> seeing..
> 
> 00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
> 00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
> 00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
> 00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
> 00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
> 00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
> 00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
> 00:1b.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)
> 00:1f.0 ISA bridge: Intel Corporation C236 Chipset LPC/eSPI Controller (rev 31)
> 00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
> 00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
> 00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
> 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
> 00:01.0/01:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
> 00:01.0/01:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
> 00:1b.0/02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961/SM963
> 
> And here we are interested in this group:
> 
> 00:1f.0 ISA bridge: Intel Corporation C236 Chipset LPC/eSPI Controller (rev 31)
> 00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
> 00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
> 00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
> 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
> 
> Which the current code puts into two groups
>   [00:1f.0 00:1f.2 00:1f.3 00:1f.4]
>   [00:1f.6]
> 
> While this series puts them all in one group.
> 
> No device in the MFD 00:1f has an ACS capability however only 00:1f.6 has a quirk:
> 
> 	{ PCI_VENDOR_ID_INTEL, 0x15b7, pci_quirk_mf_endpoint_acs },
> 	/*
> 	 * SV, TB, and UF are not relevant to multifunction endpoints.
> 	 *
> 	 * Multifunction devices are only required to implement RR, CR, and DT
> 	 * in their ACS capability if they support peer-to-peer transactions.
> 	 * Devices matching this quirk have been verified by the vendor to not
> 	 * perform peer-to-peer with other functions, allowing us to mask out
> 	 * these bits as if they were unimplemented in the ACS capability.
> 	 */
> 
> Giving these ACS results:
> 
> pci 0000:00:1f.0: pci_acs_enabled:3693   result=0 1d
> pci 0000:00:1f.2: pci_acs_enabled:3693   result=0 1d
> pci 0000:00:1f.3: pci_acs_enabled:3693   result=0 1d
> pci 0000:00:1f.4: pci_acs_enabled:3693   result=0 1d
> pci 0000:00:1f.6: pci_acs_enabled:3693   result=1 1d
> 
> Which shows the logic here:
> 
> static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
> 							unsigned long *devfns)
> {
> 	if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> 		return NULL;
> 
> Is causing the grouping difference. When it checks 00:1f.6 it sees
> pci_acs_enabled = true and then ignores the rest of the MFD.  This is
> basically part of my issue #2 that off-path ACS is not considered.
> 
> AFAIK ACS is a per-function egress property (eg it is why it is called
> the ACS Egress Vector). Meaning if 01f.4 sends a P2P DMA targetting
> MMIO in 1f.6 it is the ACS of 01f.4 as the egress that is responsible
> to block it. The ACS of 1f.6 as the ingress is not considered.
> 
> By our rules if 01f.4 can DMA into 01f.6 they should be in the same
> group.
> 
> I point to "Table 6-10 ACS P2P Request Redirect and ACS P2P Egress
> Control Interactions" as supporting this. None of these options are
> 'block incoming request' - they are all talking about how to route the
> original outgoing request.
> 
> So I think the above is a bug in the current kernel, the logic should
> require that all functions in the MFD have ACS on, otherwise they need
> to share a single group. It is what is implemented in this series, and
> I think it is why you saw other cases where a single bad ACS "spoils"
> the MFD?
> 
> It seems the qurking should have included all the functions in this
> MFD, not just the NIC.
> 
> Does this seem right to you?

Sorry, you hit me right before holiday and PTO here.  I agree that
we're currently looking at isolation primarily from an egress
perspective.  Unfortunately it's not always symmetric.  In your case
above, I think we'd consider it safe to assign 1f.6 to a userspace
driver because 1f.6 cannot generate DMA out of its isolation domain.
On the other hand, 1f.4 can theoretically DMA into 1f.6, so it would be
unwise to attach 1f.4 to a userspace driver.  In practice there's not
much utility in assigning 1f.4 to a userspace driver, it's generally
bound to a "trusted" kernel driver, so all is well.

If we say that 1f.4 taints the group, including 1f.6, I think we're
going to see a bunch of functional regressions for not much actual gain
in security.  Maybe we need some extension to the concept of groups to
represent the asymmetry.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches
  2025-07-08 20:47   ` Jason Gunthorpe
@ 2025-07-11 15:40     ` Alex Williamson
  2025-07-11 16:14       ` Jason Gunthorpe
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Williamson @ 2025-07-11 15:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Tue, 8 Jul 2025 17:47:15 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jul 01, 2025 at 03:48:26PM -0600, Alex Williamson wrote:
> 
> > Notably, each case where there's a dummy host bridge followed by some
> > number of additional functions (ie. 01.0, 02.0, 03.0, 08.0), that dummy
> > host bridge is tainting the function isolation and merging the group.
> > For instance each of these were previously a separate group and are now
> > combined into one group.  
> 
> I was able to run some testing on a Milan system that seems similar.
> 
> It has the weird "Dummy Host Bridge" MFD. I fixed it with this:
> 
> /*
>  * For some reason AMD likes to put "dummy functions" in their PCI hierarchy as
>  * part of a multi function device. These are notable because they can't do
>  * anything. No BARs and no downstream bus. Since they cannot accept P2P or
>  * initiate any MMIO we consider them to be isolated from the rest of MFD. Since
>  * they often accompany a real PCI bridge with downstream devices it is
>  * important that the MFD be isolated. Annoyingly there is no ACS capability
>  * reported we have to special case it.
>  */
> static bool pci_dummy_function(struct pci_dev *pdev)
> {
> 	if (pdev->class >> 8 == PCI_CLASS_BRIDGE_HOST && !pci_has_mmio(pdev))
> 		return true;
> 	return false;
> }

Yeah, that might work since it does report itself as a host bridge.
Probably noteworthy that you'd end up catching the Intel host bridge
with this too.
 
> This AMD system has second weirdness:
> 
> 40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
>         Capabilities: [2a0 v1] Access Control Services
>                 ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
>                 ACSCtl: SrcValid- TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
> 40:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
>         Capabilities: [2a0 v1] Access Control Services
>                 ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
>                 ACSCtl: SrcValid- TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
> 40:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
>         Capabilities: [2a0 v1] Access Control Services
>                 ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
>                 ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
> 
> Notice the SrcValid- 
> 
> The kernel definately set SrcValid+, the device stored it, and it
> never set SrcValid-, yet somehow it got changed:
> 
> [    0.483828] pci 0000:40:01.1: pci_enable_acs:1089
> [    0.483828] pci 0000:40:01.1: pci_write_config_word:604 9 678 = 1d
> [    0.483831] pci 0000:40:01.1: ACS Set to 1d, readback=1d
> [..]
> [    0.826514] pci 0000:40:01.1: __pci_device_group:1635 Starting
> [    0.826517] pci 0000:40:01.1: pci_acs_flags_enabled:3668   ctrl=1c acs_flags=1d cap=5f
> 
> I instrumented pci_write_config_word() and it isn't being called a
> second time. I didn't try to narrow this down, too weird. Guessing
> ACPI or FW?
> 
> So the new logic puts all the above and the downstream into group due
> to insuffucient isolation which is the only degredation on this
> system, the LOM ethernet gets grouped together with the above MFD.
> 
> Given in this case we explicitly have ACS flags we consider
> non-isolated I'm not sure there is anything to be done about it.
> 
> Which raises a question if SrcValid should be part of grouping or not,
> it is more of a security enhancement, it doesn't permit/deny P2P
> between devices?

Strange issue.  If a device can spoof a RID then it can theoretically
inject a DMA payload as if it were another device.  That seems like
basic security, not just an enhancement.
 
> > The endpoints result in equivalent grouping, but this is a case where I
> > don't understand how we have non-isolated functions yet isolated
> > subordinate buses.  
> 
> And I fixed this too, as above is showing, by marking the group of the
> MFD as non-isolated, thus forcing it to propogate downstream.
> 
> > An Alder Lake system shows something similar:  
> 
> I also tested a bunch of Intel client systems. Some with an ACS quirk
> and one with the VMD/non transparent bridge setup. Those had no
> grouping changes, but no raptor lake in this group.
> 
> > # lspci -vvvs 1c. | grep -e ^0 -e "Access Control Services"
> > 00:1c.0 PCI bridge: Intel Corporation Raptor Lake PCI Express Root Port #1 (rev 11) (prog-if 00 [Normal decode])
> > 00:1c.1 PCI bridge: Intel Corporation Device 7a39 (rev 11) (prog-if 00 [Normal decode])
> > 	Capabilities: [220 v1] Access Control Services
> > 00:1c.2 PCI bridge: Intel Corporation Raptor Point-S PCH - PCI Express Root Port 3 (rev 11) (prog-if 00 [Normal decode])
> > 	Capabilities: [220 v1] Access Control Services
> > 00:1c.3 PCI bridge: Intel Corporation Raptor Lake PCI Express Root Port #4 (rev 11) (prog-if 00 [Normal decode])
> > 	Capabilities: [220 v1] Access Control Services
> > 
> > So again the group is tainted by a device that cannot generate DMA,   
> 
> It looks like 00:1c.0 is advertised as a root port, so it can generate
> DMA as part of its root port function bridging to something outside
> the root complex.
> 
> This system doesn't seem to have anything downstream of that root port
> (currently plugged in?), but IMHO that port should have ACS. By spec I
> think it is correct to assume that without ACS traffic from downstream
> of the root port would be able to follow the internal loopback of the
> MFD.
> 
> This will probably need a quirk, and it is different from the AMD case
> which used a host bridge..
> 
> Any other idea?

This root port at 1c.0 does look like it could have a subordinate
device, but there is no unpopulated slot/socket on this motherboard.
Possibly another motherboard SKU could use these links for a wifi card.
Versus the other root ports, it lacks routing of its interrupt line, it
does have a secondary bus and apertures assigned, it has a PCIe
capability that claims it has a slot (LnkCap 8GT/s x1), it has an MSI
capability and a NULL capability, no extended config space.  I'm not
sure if this vendor (Gigabyte) is unique in incompletely stubbing out
this link or if this is par for the course.

Again, it should be perfectly safe to assign things downstream of the
ACS isolated root ports in the MFD to userspace drivers, their egress
DMA is isolated.  It would only be ingress from an endpoint that seems
like it cannot exist that would be troublesome.  I don't have a good
solution.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches
  2025-07-11 14:55     ` Alex Williamson
@ 2025-07-11 16:08       ` Jason Gunthorpe
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-07-11 16:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Fri, Jul 11, 2025 at 08:55:04AM -0600, Alex Williamson wrote:
> Sorry, you hit me right before holiday and PTO here.  I agree that
> we're currently looking at isolation primarily from an egress
> perspective.  Unfortunately it's not always symmetric.  In your case
> above, I think we'd consider it safe to assign 1f.6 to a userspace
> driver because 1f.6 cannot generate DMA out of its isolation domain.
> On the other hand, 1f.4 can theoretically DMA into 1f.6, so it would be
> unwise to attach 1f.4 to a userspace driver.  In practice there's not
> much utility in assigning 1f.4 to a userspace driver, it's generally
> bound to a "trusted" kernel driver, so all is well.

That is not how we've defined groups as a security object though. If
you flip the direction and say 1f.4 is used by VFIO and 1f.6 is in a
kernel driver this would not be acceptable.

While I like the idea, I think if we keep the current group system we
can't really make arguments like this.

FWIW, my v2 does seem to solve this problem class well enough.

> If we say that 1f.4 taints the group, including 1f.6, I think we're
> going to see a bunch of functional regressions for not much actual gain
> in security.

I have been thinking if we should have an isolation=relaxed/strict
boot option and relaxed has assumptions that are closer to what the
current kernel does in practice while strict would basically require
ACS everywhere to get smaller grouping.

For this problem relaxed could assume that a MFD function with no ACS
also has no internal loopback at all. Realistically this is probably
true in most cases.

What concerns me is the enterprise market that does have a strong need
for security here but does not have the resources of a CSP to properly
self-audit their systems. I think if the enterprise world could be
happy in a strict mode where quirks and ACS caps are mandatory which
would force their vendors to audit.

Consumer/etc doesn't really have the same security need and could be
use with relaxed if they have any problems.

> Maybe we need some extension to the concept of groups to
> represent the asymmetry.  Thanks,

Do you have any ideas? 

Arguably the right answer is for groups to only be about DMA aliases
and a separate 'P2P security graph' of what devices are allowed to P2P
to other devices is used to enforce the VFIO opening rules. I can
imagine how to make most of that work in iommufd, but not how to fully
retain the VFIO group uAPI. It would be alot of work.

But my main concern right now is the switch ACS which is the first
three patches only. If we remove the MFD loopback downstream
propagation from pci_get_alias_group() I think it will behave quite
like today unless switches are present. Would you be comfortable going
that far as a first step?

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches
  2025-07-11 15:40     ` Alex Williamson
@ 2025-07-11 16:14       ` Jason Gunthorpe
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-07-11 16:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu

On Fri, Jul 11, 2025 at 09:40:20AM -0600, Alex Williamson wrote:
> Yeah, that might work since it does report itself as a host bridge.
> Probably noteworthy that you'd end up catching the Intel host bridge
> with this too.

Yes, I think the argument holds for Intel too. A device with no way to
initiate DMA and no MMIO to target P2P can always be its own group,
regardless of type.

> Again, it should be perfectly safe to assign things downstream of the
> ACS isolated root ports in the MFD to userspace drivers, their egress
> DMA is isolated.  It would only be ingress from an endpoint that seems
> like it cannot exist that would be troublesome.  I don't have a good
> solution.

I don't know what the kernel can do with this, IMHO the PCIE
capabilities are simply out of spec.

If we want to do my relaxed idea then we could also say that root
ports without ACS are assumed to have no possible P2P as well.

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches
  2025-07-02  1:04     ` Jason Gunthorpe
@ 2025-07-17 19:25       ` Donald Dutile
  2025-07-17 20:27         ` Jason Gunthorpe
  0 siblings, 1 reply; 33+ messages in thread
From: Donald Dutile @ 2025-07-17 19:25 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Bjorn Helgaas, iommu, Joerg Roedel, linux-pci, Robin Murphy,
	Will Deacon, Lu Baolu, galshalom, Joerg Roedel, Kevin Tian, kvm,
	maorg, patches, tdave, Tony Zhu



On 7/1/25 9:04 PM, Jason Gunthorpe wrote:
> On Tue, Jul 01, 2025 at 01:29:05PM -0600, Alex Williamson wrote:
>> On Mon, 30 Jun 2025 19:28:33 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index d265de874b14b6..f4584ffacbc03d 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -65,8 +65,16 @@ struct iommu_group {
>>>   	struct list_head entry;
>>>   	unsigned int owner_cnt;
>>>   	void *owner;
>>> +
>>> +	/* Used by the device_group() callbacks */
>>> +	u32 bus_data;
>>>   };
>>>   
>>> +/*
>>> + * Everything downstream of this group should share it.
>>> + */
>>> +#define BUS_DATA_PCI_UNISOLATED BIT(0)
>>
>> NON_ISOLATED for consistency w/ enum from the previous patch?
> 
> Yes
> 
>>> -	/* No shared group found, allocate new */
>>> -	return iommu_group_alloc();
>>> +	switch (pci_bus_isolated(pdev->bus)) {
>>> +	case PCIE_ISOLATED:
>>> +		/* Check multi-function groups and same-bus devfn aliases */
>>> +		group = pci_get_alias_group(pdev);
>>> +		if (group)
>>> +			return group;
>>> +
>>> +		/* No shared group found, allocate new */
>>> +		return iommu_group_alloc();
>>
>> I'm not following how we'd handle a multi-function root port w/o
>> consistent ACS isolation here.  How/where does the resulting group get
>> the UNISOLATED flag set?
> 
> Still wobbly on the root port/root bus.. So the answer is probably
> that it doesn't.
> 
> What does a multi-function root port with different ACS flags even
> mean and how should we treat it? I had in mind that the first root
> port is the TA and immediately goes the IOMMU.
> 
I'm looking for clarification what you are asking...

when you say 'multi-function root port', do you mean an RP that is a function
in a MFD in an RC ?  other?  A more explicit (complex?) example be given to
clarify?

IMO, the rule of MFD in an RC applies here, and that means the per-function ACS rules
for an MFD apply -- well, that's how I read section 6.12 (PCIe 7.0.-1.0-PUB).
This may mean checking ACS P2P Egress Control.  Table 6-11 may help wrt Egress control bits & RPs & Fcns.

If no (optional) ACS P2P Egress control, and no other ACS control, then I read/decode
the spec to mean no p2p btwn functions is possible, b/c if it is possible, by spec,
it must have an ACS cap to control it; ergo, no ACS cap, no p2p capability/routing.

- Don
> If you can explain a bit more about how you see the root ports working
> I can try to make an implementation.
> 
> AFAICT the spec sort of says 'implementation defined' for ACS on root
> ports??
> 
> Thanks,
> Jason
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches
  2025-07-17 19:25       ` Donald Dutile
@ 2025-07-17 20:27         ` Jason Gunthorpe
  2025-07-18  2:31           ` Donald Dutile
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2025-07-17 20:27 UTC (permalink / raw)
  To: Donald Dutile
  Cc: Alex Williamson, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
	Robin Murphy, Will Deacon, Lu Baolu, galshalom, Joerg Roedel,
	Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu

On Thu, Jul 17, 2025 at 03:25:35PM -0400, Donald Dutile wrote:
> > What does a multi-function root port with different ACS flags even
> > mean and how should we treat it? I had in mind that the first root
> > port is the TA and immediately goes the IOMMU.
> > 
> I'm looking for clarification what you are asking...
> 
> when you say 'multi-function root port', do you mean an RP that is a function
> in a MFD in an RC ?  other?  A more explicit (complex?) example be given to
> clarify?

A PCIE Root port with a downstream bus that is part of a MFD.

Maybe like this imaginary thing:

00:1f.0 ISA bridge: Intel Corporation C236 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.5 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)

> IMO, the rule of MFD in an RC applies here, and that means the per-function ACS rules
> for an MFD apply -- well, that's how I read section 6.12 (PCIe 7.0.-1.0-PUB).
> This may mean checking ACS P2P Egress Control.  Table 6-11 may help wrt Egress control bits & RPs & Fcns.

The spec says "I donno"

 Implementation of ACS in RCiEPs is permitted but not required. It is
 explicitly permitted that, within a single Root Complex, some RCiEPs
 implement ACS and some do not. It is strongly recommended that Root
 Complex implementations ensure that all accesses originating from
 RCiEPs (PFs, VFs, and SDIs) without ACS support are first subjected to
 processing by the Translation Agent (TA) in the Root Complex before

"strongly recommended" is not "required".

> If no (optional) ACS P2P Egress control, and no other ACS control, then I read/decode
> the spec to mean no p2p btwn functions is possible, b/c if it is possible, by spec,
> it must have an ACS cap to control it; ergo, no ACS cap, no p2p capability/routing.

Where did you see this? Linux has never worked this way, we have
extensive ACS quirks specifically because we've assumed no ACS cap
means P2P is possible and not controllable.

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches
  2025-07-17 20:27         ` Jason Gunthorpe
@ 2025-07-18  2:31           ` Donald Dutile
  2025-07-18 13:32             ` Jason Gunthorpe
  0 siblings, 1 reply; 33+ messages in thread
From: Donald Dutile @ 2025-07-18  2:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
	Robin Murphy, Will Deacon, Lu Baolu, galshalom, Joerg Roedel,
	Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu



On 7/17/25 4:27 PM, Jason Gunthorpe wrote:
> On Thu, Jul 17, 2025 at 03:25:35PM -0400, Donald Dutile wrote:
>>> What does a multi-function root port with different ACS flags even
>>> mean and how should we treat it? I had in mind that the first root
>>> port is the TA and immediately goes the IOMMU.
>>>
>> I'm looking for clarification what you are asking...
>>
>> when you say 'multi-function root port', do you mean an RP that is a function
>> in a MFD in an RC ?  other?  A more explicit (complex?) example be given to
>> clarify?
> 
> A PCIE Root port with a downstream bus that is part of a MFD.
> 
> Maybe like this imaginary thing:
> 
> 00:1f.0 ISA bridge: Intel Corporation C236 Chipset LPC/eSPI Controller (rev 31)
> 00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
> 00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
> 00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
> 00:1f.5 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)
> 
>> IMO, the rule of MFD in an RC applies here, and that means the per-function ACS rules
>> for an MFD apply -- well, that's how I read section 6.12 (PCIe 7.0.-1.0-PUB).
>> This may mean checking ACS P2P Egress Control.  Table 6-11 may help wrt Egress control bits & RPs & Fcns.
> 
> The spec says "I donno"
> 
>   Implementation of ACS in RCiEPs is permitted but not required. It is
>   explicitly permitted that, within a single Root Complex, some RCiEPs
>   implement ACS and some do not. It is strongly recommended that Root
>   Complex implementations ensure that all accesses originating from
>   RCiEPs (PFs, VFs, and SDIs) without ACS support are first subjected to
>   processing by the Translation Agent (TA) in the Root Complex before
> 
> "strongly recommended" is not "required".
> 
A bridge (00:1f.5) is not an EndPt(RCiEP). Thus the above doesn't apply to it.
[A PF, VF or SDI can be a RCiEP -- 00:1f.3, 00:1f.2 ]

>> If no (optional) ACS P2P Egress control, and no other ACS control, then I read/decode
>> the spec to mean no p2p btwn functions is possible, b/c if it is possible, by spec,
>> it must have an ACS cap to control it; ergo, no ACS cap, no p2p capability/routing.
> 
> Where did you see this? Linux has never worked this way, we have
> extensive ACS quirks specifically because we've assumed no ACS cap
> means P2P is possible and not controllable.
> 
e.g., Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and Multi-Function Devices
  ...
  ACS P2P Request Redirect: must be implemented by Functions that support peer-to-peer traffic with other Functions.
                            ^^^^

It's been noted/stated/admitted that MFDs have not followed the ACS rules, and thus the quirks may/are needed.

Linux default code should not be opposite of the spec, i.e., if no ACS, then P2P is possible, thus all fcns are part of an IOMMU group.
The spec states that ACS support must be provided if p2p traffic with other functions is supported.


> Jason
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches
  2025-07-18  2:31           ` Donald Dutile
@ 2025-07-18 13:32             ` Jason Gunthorpe
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2025-07-18 13:32 UTC (permalink / raw)
  To: Donald Dutile
  Cc: Alex Williamson, Bjorn Helgaas, iommu, Joerg Roedel, linux-pci,
	Robin Murphy, Will Deacon, Lu Baolu, galshalom, Joerg Roedel,
	Kevin Tian, kvm, maorg, patches, tdave, Tony Zhu

On Thu, Jul 17, 2025 at 10:31:42PM -0400, Donald Dutile wrote:
 
> > > If no (optional) ACS P2P Egress control, and no other ACS control, then I read/decode
> > > the spec to mean no p2p btwn functions is possible, b/c if it is possible, by spec,
> > > it must have an ACS cap to control it; ergo, no ACS cap, no p2p capability/routing.
> > 
> > Where did you see this? Linux has never worked this way, we have
> > extensive ACS quirks specifically because we've assumed no ACS cap
> > means P2P is possible and not controllable.
> > 
> e.g., Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and Multi-Function Devices
>  ...
>  ACS P2P Request Redirect: must be implemented by Functions that support peer-to-peer traffic with other Functions.
>                            ^^^^
> 
> It's been noted/stated/admitted that MFDs have not followed the ACS
> rules, and thus the quirks may/are needed.
> 
> Linux default code should not be opposite of the spec, i.e., if no
> ACS, then P2P is possible, thus all fcns are part of an IOMMU group.
> The spec states that ACS support must be provided if p2p traffic
> with other functions is supported.

Linux is definately the opposite of this.

Alex would you agree to reverse this logic for MFDs? If the MFD does
not have ACS cap then the MFD does not do internal loopback P2P?

I think that solves all the MFD related problems.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2025-07-18 13:33 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 02/11] PCI: Add pci_bus_isolation() Jason Gunthorpe
2025-07-01 19:28   ` Alex Williamson
2025-07-02  1:00     ` Jason Gunthorpe
2025-07-03 15:30     ` Jason Gunthorpe
2025-07-03 22:17       ` Alex Williamson
2025-07-03 23:08         ` Alex Williamson
2025-07-03 23:21           ` Jason Gunthorpe
2025-07-03 23:15         ` Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
2025-07-01 19:29   ` Alex Williamson
2025-07-02  1:04     ` Jason Gunthorpe
2025-07-17 19:25       ` Donald Dutile
2025-07-17 20:27         ` Jason Gunthorpe
2025-07-18  2:31           ` Donald Dutile
2025-07-18 13:32             ` Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 04/11] iommu: Organize iommu_group by member size Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 05/11] PCI: Add pci_reachable_set() Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 06/11] iommu: Use pci_reachable_set() in pci_device_group() Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 08/11] PCI: Add the ACS Enhanced Capability definitions Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 09/11] PCI: Enable ACS Enhanced bits for enable_acs and config_acs Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 11/11] PCI: Check ACS Extended flags for pci_bus_isolated() Jason Gunthorpe
2025-07-01 21:48 ` [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Alex Williamson
2025-07-02  1:47   ` Jason Gunthorpe
2025-07-04  0:37   ` Jason Gunthorpe
2025-07-11 14:55     ` Alex Williamson
2025-07-11 16:08       ` Jason Gunthorpe
2025-07-08 20:47   ` Jason Gunthorpe
2025-07-11 15:40     ` Alex Williamson
2025-07-11 16:14       ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).