public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Donald Dutile <ddutile@redhat.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	iommu@lists.linux.dev, Joerg Roedel <joro@8bytes.org>,
	linux-pci@vger.kernel.org, Robin Murphy <robin.murphy@arm.com>,
	Will Deacon <will@kernel.org>,
	Lu Baolu <baolu.lu@linux.intel.com>,
	galshalom@nvidia.com, Joerg Roedel <jroedel@suse.de>,
	Kevin Tian <kevin.tian@intel.com>,
	kvm@vger.kernel.org, maorg@nvidia.com, patches@lists.linux.dev,
	tdave@nvidia.com, Tony Zhu <tony.zhu@intel.com>
Subject: Re: [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches
Date: Thu, 25 Sep 2025 09:20:26 -0300	[thread overview]
Message-ID: <20250925122026.GV2617119@nvidia.com> (raw)
In-Reply-To: <20250923152952.1f6c4b2f.alex.williamson@redhat.com>

On Tue, Sep 23, 2025 at 03:29:52PM -0600, Alex Williamson wrote:
> > > Where?  Is this in reference to our handling of multi-function
> > > endpoints vs whether downstream switch ports are represented as
> > > multi-function vs multi-slot?  
> > 
> > If you have a MFD Linux with no ACS it will group the whole MFD if any
> > of it lacks ACS caps because it assumes there is an internal loopback
> > between functions.
> 
> Yes
> 
> > If the MFD has a single function with ACS then only that function is
> > removed from the group. The only way we can understand this as correct
> > by our grouping definition is to require the MFD have no internal
> > loopback. ACS is an egress control, not an ingress control.
> 
> Yes, current grouping is focused on creating sets of devices that
> cannot perform DMA outside of their group without passing through a
> translation agent.  It doesn't account for ingress from other devices.

That isn't the definition of groups at all. If any two devices can DMA
to each other they must be in the same group.

There is no ingress or egress rational with groups. The fact the
implementation is creating these inconsistencies with ingress vs
egress is a BUG and not in anyway part of the security definition.

> One of the few examples of this that seems to exist is something like
> you're describing where we have a MFD and one of the functions is
> quirked or reports an empty ACS capability to create another group.

Yes, one of my test systems has an intel NIC like this. 

> The ACS/quirk device is believed not to have the capability to DMA into
> the non-ACS/quirk devices, but the opposite is not guaranteed.  

I don't think so. The quirk should have been applied to the whole
MFD. Because the code has this behavior people got away with quirking
the NIC only. IMHO the way to understand that quirk is applying to the
whole MFD.

> In practice the ACS/quirk device is typically the only device that's
> worthwhile to assign, so the host is still isolated from the
> userspace driver.  Arguably the userspace device may not be isolated
> from the host devices, but without things like TDISP, there's
> already a degree of trust in other host drivers and devices.

No way! That isn't how any of this should work with hand waving around
"worthwhile to assign" guessing.

> I'm afraid that including the ingress potential in the group
> configuration is going to blow up existing groups, for not much
> practical gain.  

You are aruging both ways. The current design does not follow the
spec, and does not implement the security defintion for groups.

Dismissing that as "no practical gain" for correcting the security is
perhaps true, but then don't argue against this series that it
slightly weakens the security if it has "no practical gain". :(

> I wonder if there's an approach where a group split in
> this way might taint the non-ACS/quirk group to prevent vfio use cases
> and whether that would sufficiently close this gap with minimal
> breakage.

You want to build asymmetric groups now?? It is already too
complicated - this really brings no value IMHO.

> > If a MFD function is a bridge/port then the group doesn't propogate
> > the group downstream of the bridge - again this requires assuming
> > there is no internal loopback between functions.
> 
> I think that if we have a multi-function root port without ACS/quirks
> that all the functions and downstream devices are grouped together.

Yes, if the MFD bridge has ACS then yes it wrongly blocks the
propogation. Again the only way to understand this behavior as making
sense is if it is assuming there is no internal MFD loopback.

> > It is taking the undefined behavior in the spec and selectively making
> > both interpretations at once.
> 
> The intention is that undefined behavior should be considered
> non-isolated.

Sure, but it doesn't do that

> We try to define that boundary of a group based on provable egress
> DMA.

It's a bug. That logic doesn't match the security defintion of groups.

> > How about we answer the question "does this MFD have internal
> > loopback" as:
> >  - NO if any function has an appropriate ACS cap or quirk.
> 
> In this case rather than split the one ACS/quirk function into a group
> we split each function into a group.  Now we potentially have singleton
> groups for non-ACS/quirk functions that we really have no basis to
> believe are isolated from other similar devices.  

No, that's too strong. Given that ACS is an egress control it is
totally pointless for a vendor to make an ACS that egress controls one
function but the MFD has internal loopback allowing other functions to
ingress.

We are making the assumption, that Linux is already making, where if
some vendor has added/quirked ACS then it actually provides egress and
ingress isolation to that function. Meaning there is no MFD internal
loopback.

I'm propsing the consistently broaden this assumption to the whole MFD.

> >  - NO if any function is bridge/port
> 
> This would hand-wave away grouping multi-function downstream ports
> without ACS/quirks with no justification afaict.

We can drop this if the bridge functions have ACS already today to get
their groups split.

> >  - YES otherwise - all functions are end functions and no ACS declared
> > 
> > As above this is quite a bit closer to what Linux is doing now. It is
> > a practical estimation of the undefined spec behavior based on the
> > historical security posture of Linux.
> 
> It's really not what we're doing now.  We currently consider undefined
> behavior to be non-isolated, or we try to.  

Maybe it wanted to, but it isn't implemented right.

> The above makes broad and unwarranted (IMO) isolation claims.

I agree, it is trying to match the bugs in the current code with a
grounding in something explainable and understandable that matches our
security definition for groups.

> This puts data at risk more so than assuming undefined behavior is
> non-isolated.

I already sent a series that actually assumed undefined behavior was
non-isolated.

It breaks alot of systems because doing that *correctly* is extremely
pessimistic toward real world HW.

> Should we re-evaluate how we handle downstream switch ports exposed as
> separate slots, certainly.  

So lets start with that, we can stop this series after it fixes the
switch ports.

Jason

  reply	other threads:[~2025-09-25 12:20 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-30 22:28 [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 01/11] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 02/11] PCI: Add pci_bus_isolation() Jason Gunthorpe
2025-07-01 19:28   ` Alex Williamson
2025-07-02  1:00     ` Jason Gunthorpe
2025-07-03 15:30     ` Jason Gunthorpe
2025-07-03 22:17       ` Alex Williamson
2025-07-03 23:08         ` Alex Williamson
2025-07-03 23:21           ` Jason Gunthorpe
2025-07-03 23:15         ` Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 03/11] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
2025-07-01 19:29   ` Alex Williamson
2025-07-02  1:04     ` Jason Gunthorpe
2025-07-17 19:25       ` Donald Dutile
2025-07-17 20:27         ` Jason Gunthorpe
2025-07-18  2:31           ` Donald Dutile
2025-07-18 13:32             ` Jason Gunthorpe
2025-09-22 22:32               ` Alex Williamson
2025-09-22 23:15                 ` Jason Gunthorpe
2025-09-23  0:51                   ` Donald Dutile
2025-09-23  1:17                     ` Alex Williamson
2025-09-23  1:10                   ` Alex Williamson
2025-09-23  2:26                     ` Donald Dutile
2025-09-23  2:50                       ` Alex Williamson
2025-09-23 12:32                         ` Jason Gunthorpe
2025-09-23 12:58                           ` Alex Williamson
2025-09-23 13:03                     ` Jason Gunthorpe
2025-09-23 21:29                       ` Alex Williamson
2025-09-25 12:20                         ` Jason Gunthorpe [this message]
2025-06-30 22:28 ` [PATCH 04/11] iommu: Organize iommu_group by member size Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 05/11] PCI: Add pci_reachable_set() Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 06/11] iommu: Use pci_reachable_set() in pci_device_group() Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 07/11] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 08/11] PCI: Add the ACS Enhanced Capability definitions Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 09/11] PCI: Enable ACS Enhanced bits for enable_acs and config_acs Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 10/11] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
2025-06-30 22:28 ` [PATCH 11/11] PCI: Check ACS Extended flags for pci_bus_isolated() Jason Gunthorpe
2025-07-01 21:48 ` [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches Alex Williamson
2025-07-02  1:47   ` Jason Gunthorpe
2025-07-04  0:37   ` Jason Gunthorpe
2025-07-11 14:55     ` Alex Williamson
2025-07-11 16:08       ` Jason Gunthorpe
2025-07-08 20:47   ` Jason Gunthorpe
2025-07-11 15:40     ` Alex Williamson
2025-07-11 16:14       ` Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250925122026.GV2617119@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=baolu.lu@linux.intel.com \
    --cc=bhelgaas@google.com \
    --cc=ddutile@redhat.com \
    --cc=galshalom@nvidia.com \
    --cc=iommu@lists.linux.dev \
    --cc=joro@8bytes.org \
    --cc=jroedel@suse.de \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=maorg@nvidia.com \
    --cc=patches@lists.linux.dev \
    --cc=robin.murphy@arm.com \
    --cc=tdave@nvidia.com \
    --cc=tony.zhu@intel.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox