From: Alex Williamson <alex.williamson@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Donald Dutile <ddutile@redhat.com>,
Bjorn Helgaas <bhelgaas@google.com>,
iommu@lists.linux.dev, Joerg Roedel <joro@8bytes.org>,
linux-pci@vger.kernel.org, Robin Murphy <robin.murphy@arm.com>,
Will Deacon <will@kernel.org>,
Lu Baolu <baolu.lu@linux.intel.com>,
galshalom@nvidia.com, Joerg Roedel <jroedel@suse.de>,
Kevin Tian <kevin.tian@intel.com>,
kvm@vger.kernel.org, maorg@nvidia.com, patches@lists.linux.dev,
tdave@nvidia.com, Tony Zhu <tony.zhu@intel.com>
Subject: Re: [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches
Date: Tue, 23 Sep 2025 15:29:52 -0600
Message-ID: <20250923152952.1f6c4b2f.alex.williamson@redhat.com>
In-Reply-To: <20250923130341.GJ1391379@nvidia.com>

On Tue, 23 Sep 2025 10:03:41 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Mon, Sep 22, 2025 at 07:10:29PM -0600, Alex Williamson wrote:
> > On Mon, 22 Sep 2025 20:15:41 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > On Mon, Sep 22, 2025 at 04:32:00PM -0600, Alex Williamson wrote:
> > > > The ACS capability was only introduced in PCIe 2.0 and vendors have
> > > > only become more diligent about implementing it as it's become
> > > > important for device isolation and assignment.
> > >
> > > IDK about this, I have very new systems and they still do not have
> > > ACS flags according to this interpretation.
> >
> > But how can we assume that lack of a non-required capability means
> > anything at all??
> >
> > > > IMO, we can't assume anything at all about a multifunction device
> > > > that does not implement ACS.
> > >
> > > Yeah this is all true.
> > >
> > > But we are already assuming. Today we assume MFDs without caps must
> > > have internal loopback in some cases, and then in other cases we
> > > assume they don't.
> >
> > Where? Is this in reference to our handling of multi-function
> > endpoints vs whether downstream switch ports are represented as
> > multi-function vs multi-slot?
>
> If you have a MFD with no ACS, Linux will group the whole MFD if any
> of it lacks ACS caps because it assumes there is an internal loopback
> between functions.

Yes

> If the MFD has a single function with ACS then only that function is
> removed from the group. The only way we can understand this as correct
> by our grouping definition is to require the MFD have no internal
> loopback. ACS is an egress control, not an ingress control.

Yes, current grouping is focused on creating sets of devices that
cannot perform DMA outside of their group without passing through a
translation agent. It doesn't account for ingress from other devices.
One of the few examples of this that seems to exist is something like
you're describing where we have a MFD and one of the functions is
quirked or reports an empty ACS capability to create another group.
The ACS/quirk device is believed not to have the capability to DMA into
the non-ACS/quirk devices, but the opposite is not guaranteed. In
practice the ACS/quirk device is typically the only device that's
worthwhile to assign, so the host is still isolated from the userspace
driver. Arguably the userspace device may not be isolated from the
host devices, but without things like TDISP, there's already a degree
of trust in other host drivers and devices.

I'm afraid that including the ingress potential in the group
configuration is going to blow up existing groups, for not much
practical gain. I wonder if there's an approach where a group split in
this way might taint the non-ACS/quirk group to prevent vfio use cases
and whether that would sufficiently close this gap with minimal
breakage.
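
The egress-only grouping described above can be sketched as a toy model.
This is an illustration of the policy as discussed in this thread, not
the kernel's implementation; the Function type, the has_acs_or_quirk
flag, and both helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Function:
    """One function of a multi-function device (hypothetical model)."""
    name: str
    # Assumption: True when an ACS capability or a quirk vouches for
    # this function's egress DMA.
    has_acs_or_quirk: bool


def egress_groups(functions):
    """Split out only the functions whose egress is provably controlled.

    Everything else is treated as non-isolated and shares one group.
    """
    isolated = [[f.name] for f in functions if f.has_acs_or_quirk]
    shared = [f.name for f in functions if not f.has_acs_or_quirk]
    return isolated + ([shared] if shared else [])


def unchecked_ingress(target, functions):
    """Functions whose DMA toward `target` this model does not rule out."""
    return [f.name for f in functions
            if f.name != target.name and not f.has_acs_or_quirk]


mfd = [Function("00.0", False), Function("00.1", True), Function("00.2", False)]
groups = egress_groups(mfd)            # ACS function split out, rest shared
exposed_to = unchecked_ingress(mfd[1], mfd)  # the ingress gap for 00.1
```

In this model the ACS/quirk function gets its own group, yet the
non-ACS functions still show up as possible ingress sources for it,
which is the gap under discussion.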
> If a MFD function is a bridge/port then the group doesn't propagate
> downstream of the bridge - again this requires assuming
> there is no internal loopback between functions.

I think that if we have a multi-function root port without ACS/quirks
that all the functions and downstream devices are grouped together.
For a long while this was the typical case on consumer grade hardware,
CPU root ports were multifunction without ACS and we only had
ACS/quirks for chipset-based root ports on such systems.

> It is taking the undefined behavior in the spec and selectively making
> both interpretations at once.

The intention is that undefined behavior should be considered
non-isolated. We try to define that boundary of a group based on
provable egress DMA.
> > > Assuming the MFD does not have internal loopback, while not entirely
> > > satisfactory, is the one that gives the least practical breakage.
> >
> > Seems like it's fixing one gap and opening another. I don't see that we
> > can implement ingress and egress isolation without breakage.
>
> Yeah, either we risk more insecurities or we risk large group sizes.
>
> > We may need an opt-in to continue egress only isolation.
>
> It isn't "egress only isolation" - the thing is I can't really
> articulate what the current rules even fully are..
>
> I'm not keen on an opt in. I'd rather find some rules we can live
> with.
>
> How about we answer the question "does this MFD have internal
> loopback" as:
> - NO if any function has an appropriate ACS cap or quirk.

In this case rather than split the one ACS/quirk function into a group
we split each function into a group. Now we potentially have singleton
groups for non-ACS/quirk functions that we really have no basis to
believe are isolated from other similar devices. Currently the poor
grouping of such devices generally deters assignment, narrowing the
exposure.

> - NO if any function is bridge/port

This would hand-wave away grouping multi-function downstream ports
without ACS/quirks with no justification afaict.

> - YES otherwise - all functions are end functions and no ACS declared
>
> As above this is quite a bit closer to what Linux is doing now. It is
> a practical estimation of the undefined spec behavior based on the
> historical security posture of Linux.

It's really not what we're doing now. We currently consider undefined
behavior to be non-isolated, or we try to. The above makes broad and
unwarranted (IMO) isolation claims.
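
For reference, the three proposed rules can be written down as a small
predicate. This is a sketch of the proposal as quoted above, not kernel
code; the Function type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Function:
    """One function of a multi-function device (hypothetical model)."""
    name: str
    has_acs_or_quirk: bool = False  # assumption: appropriate ACS cap or quirk
    is_bridge: bool = False         # assumption: function is a bridge/port


def proposed_has_loopback(functions):
    """Answer "does this MFD have internal loopback" per the proposal."""
    if any(f.has_acs_or_quirk for f in functions):
        return False  # NO if any function has an ACS cap or quirk
    if any(f.is_bridge for f in functions):
        return False  # NO if any function is a bridge/port
    return True       # YES otherwise: all end functions, no ACS declared


# One ACS function vouches for its non-ACS sibling as well.
mixed = [Function("00.0"), Function("00.1", has_acs_or_quirk=True)]
# Multi-function ports with no ACS/quirk at all.
ports = [Function("01.0", is_bridge=True), Function("01.1", is_bridge=True)]
# Plain endpoints with no ACS.
endpoints = [Function("02.0"), Function("02.1")]

results = [proposed_has_loopback(m) for m in (mixed, ports, endpoints)]
```

The first two cases are the ones objected to here: a single ACS/quirk
function, or the mere presence of a port, is enough for the predicate
to report no loopback for functions that declare nothing themselves.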
> > And hardware vendors are going to volunteer that they lack p2p
> > isolation and we need to add a quirk to reduce the isolation...
> > dynamics are not in our favor. Hardware vendors have no incentive to
> > do the right thing.
>
> They do, otherwise they have major security holes in
> virtualization. In an enterprise setting I have no doubt it is already
> being done right, and has been for a decade.
>
> I think the above rules will broadly be pessimistic toward add in
> cards and optimistic toward the root complex.

This puts data at risk more so than assuming undefined behavior is
non-isolated. Currently bad grouping makes it difficult to attach
devices to userspace drivers. If the bad grouping reaches a
sufficiently high profile and is the result of lack of ACS then we
reach out to the hardware vendor to determine if isolation is actually
present, create quirks if confirmed, and encourage use of ACS to avoid
such problems in the future. If not confirmed, then the grouping is
unfortunate, users can and do patch their kernel to create overrides,
but they're on their own when they meet unreliable behavior. I think
this model has been working.

Should we re-evaluate how we handle downstream switch ports exposed as
separate slots? Certainly. Should we consider how to handle a
potential lack of ingress isolation? Probably, though we really need a
compelling example. Should we fundamentally reverse various policies
we've been using for over a decade in determining DMA isolation? IMO, no.

Thanks,
Alex