linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Alex Williamson <alex.williamson@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	iommu@lists.linux.dev, Joerg Roedel <joro@8bytes.org>,
	linux-pci@vger.kernel.org, Robin Murphy <robin.murphy@arm.com>,
	Will Deacon <will@kernel.org>,
	Lu Baolu <baolu.lu@linux.intel.com>,
	galshalom@nvidia.com, Joerg Roedel <jroedel@suse.de>,
	Kevin Tian <kevin.tian@intel.com>,
	kvm@vger.kernel.org, maorg@nvidia.com, patches@lists.linux.dev,
	tdave@nvidia.com, Tony Zhu <tony.zhu@intel.com>
Subject: Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS
Date: Fri, 18 Jul 2025 15:29:34 -0600	[thread overview]
Message-ID: <20250718152934.0cdb768f.alex.williamson@redhat.com> (raw)
In-Reply-To: <0-v2-4a9b9c983431+10e2-pcie_switch_groups_jgg@nvidia.com>

On Wed,  9 Jul 2025 11:52:03 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> The series patches have extensive descriptions as to the problem and
> solution, but in short the ACS flags are not analyzed according to the
> spec to form the iommu_groups that VFIO is expecting for security.
> 
> ACS is an egress control only. For a path the ACS flags on each hop only
> effect what other devices the TLP is allowed to reach. It does not prevent
> other devices from reaching into this path.
> 
> For VFIO if device A is permitted to access device B's MMIO then A and B
> must be grouped together. This says that even if a path has isolating ACS
> flags on each hop, off-path devices with non-isolating ACS can still reach
> into that path and must be grouped gother.
> 
> For switches, a PCIe topology like:
> 
>                                -- DSP 02:00.0 -> End Point A
>  Root 00:00.0 -> USP 01:00.0 --|
>                                -- DSP 02:03.0 -> End Point B
> 
> Will generate unique single device groups for every device even if ACS is
> not enabled on the two DSP ports. It should at least group A/B together
> because no ACS means A can reach the MMIO of B. This is a serious failure
> for the VFIO security model.
> 
> For multi-function-devices, a PCIe topology like:
> 
>                   -- MFD 00:1f.0 ACS != REQ_ACS_FLAGS
>   Root 00:00.00 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS
>                   |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS
> 
> Will group [1f.0, 1f.2] and 1f.6 gets a single device group. In many cases
> we suspect that the MFD actually doesn't need ACS, so this is probably not
> as important a security failure, but from a spec perspective the correct
> answer is one group of [1f.0, 1f.2, 1f.6] beacuse 1f.0/2 have no ACS
> preventing them from reaching the MMIO of 1f.6.

This will break various LOM configurations where the NIC is a function
within a MFD RCiEP which has or quirks ACS while the other functions
have no ACS.

> There is also some confusing spec language about how ACS and SRIOV works
> which this series does not address.
> 
> This entire series goes further and makes some additional improvements to
> the ACS validation found while studying this problem. The groups around a
> PCIe to PCI bridge are shrunk to not include the PCIe bridge.
> 
> The last patches implement "ACS Enhanced" on top of it. Due to how ACS
> Enhanced was defined as a non-backward compatible feature it is important
> to get SW support out there.
> 
> Due to the potential of iommu_groups becoming winder and thus non-usable
> for VFIO this should go to a linux-next tree to give it some more
> exposure.
> 
> I have now tested this a few systems I could get:
> 
>  - Various Intel client systems:
>    * Raptor Lake, with VMD enabled and using the real_dev mechanism
>    * 6/7th generation 100 Series/C320
>    * 5/6th generation 100 Series/C320 with a NIC MFD quirk
>    * Tiger Lake
>    * 5/6th generation Sunrise Point
>   No change in grouping on any of these systems

Sorry I haven't had much time to look at this, but it would still cause
a regression on my AlderLake system.  I get the following new
mega-group:

IOMMU Group 12:
	0000:00:1c.0 PCI bridge [0604]: Intel Corporation Raptor Lake PCI Express Root Port #1 [8086:7a38] (rev 11)
		Express Root Port (Slot+)
	0000:00:1c.1 PCI bridge [0604]: Intel Corporation Device [8086:7a39] (rev 11)
		Express Root Port (Slot+)
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
	0000:00:1c.2 PCI bridge [0604]: Intel Corporation Raptor Point-S PCH - PCI Express Root Port 3 [8086:7a3a] (rev 11)
		Express Root Port (Slot+)
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
	0000:00:1c.3 PCI bridge [0604]: Intel Corporation Raptor Lake PCI Express Root Port #4 [8086:7a3b] (rev 11)
		Express Root Port (Slot+)
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
	0000:00:1c.1/06:00.0 USB controller [0c03]: Fresco Logic FL1100 USB 3.0 Host Controller [1b73:1100] (rev 10)
		Express Endpoint
	0000:00:1c.2/07:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125] (rev 05)
		Express Endpoint
	0000:00:1c.3/08:00.0 PCI bridge [0604]: Pericom Semiconductor PI7C9X2G404 EL/SL PCIe2 4-Port/4-Lane Packet Switch [12d8:2404] (rev 05)
		Express Upstream Port
	0000:00:1c.3/08:00.0/09:01.0 PCI bridge [0604]: Pericom Semiconductor PI7C9X2G404 EL/SL PCIe2 4-Port/4-Lane Packet Switch [12d8:2404] (rev 05)
		Express Downstream Port (Slot-)
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
	0000:00:1c.3/08:00.0/09:02.0 PCI bridge [0604]: Pericom Semiconductor PI7C9X2G404 EL/SL PCIe2 4-Port/4-Lane Packet Switch [12d8:2404] (rev 05)
		Express Downstream Port (Slot+)
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
	0000:00:1c.3/08:00.0/09:03.0 PCI bridge [0604]: Pericom Semiconductor PI7C9X2G404 EL/SL PCIe2 4-Port/4-Lane Packet Switch [12d8:2404] (rev 05)
		Express Downstream Port (Slot-)
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
	0000:00:1c.3/08:00.0/09:01.0/0a:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 07)
		Express Endpoint
	0000:00:1c.3/08:00.0/09:03.0/0c:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 07)
		Express Endpoint

The source of the issue is the root port at 00:1c.0, which does not
have ACS support, claims that it has a slot but there is none, and
therefore has no subordinate DMA capable devices, nor does the root
port itself have an MMIO BAR.  I don't know if there's something we can
key on for the root port to mark it isolated.

00:1c.0 PCI bridge [0604]: Intel Corporation Raptor Lake PCI Express Root Port #1 [8086:7a38] (rev 11) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin ? routed to IRQ 125
	IOMMU group: 12
	Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
	I/O behind bridge: 6000-6fff [size=4K] [16-bit]
	Memory behind bridge: 40800000-411fffff [size=10M] [32-bit]
	Prefetchable memory behind bridge: 60e0000000-60e09fffff [size=10M] [32-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Express (v2) Root Port (Slot+), IntMsgNum 0
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag- RBE+ TEE-IO-
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 256 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
		LnkCap:	Port #1, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes, LnkDisable- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt+
		LnkSta:	Speed 2.5GT/s, Width x0
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+
			Slot #0, PowerLimit 0W; Interlock- NoCompl+
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt- HPIrq+ LinkChg+
			Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
			Changed: MRL- PresDet- LinkState-
		RootCap: CRSVisible-
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 2
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
			 AtomicOpsCtl: ReqEn- EgressBlck-
			 IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
			 10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee002b8  Data: 0000
	Capabilities: [90] Null
	Kernel driver in use: pcieport

This is seen on a Gigabyte B760M DS3H DDR4 motherboard.  There's a
version of this board with wifi whereas this one has empty pads where
that m.2 slot might go.  I'd guess wifi might sit downstream of this
port if it were present, but I don't know how it'd change the feature
set of the root port.  The populated root ports show a more reasonable
set of capabilities:

00:1c.1 PCI bridge [0604]: Intel Corporation Device [8086:7a39] (rev 11) (prog-if 00 [Normal decode])
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:5001]
	Flags: bus master, fast devsel, latency 0, IRQ 126, IOMMU group 12
	Bus: primary=00, secondary=06, subordinate=06, sec-latency=0
	I/O behind bridge: [disabled] [16-bit]
	Memory behind bridge: 41800000-419fffff [size=2M] [32-bit]
	Prefetchable memory behind bridge: [disabled] [64-bit]
	Capabilities: [40] Express Root Port (Slot+), IntMsgNum 0
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [98] Subsystem: Gigabyte Technology Co., Ltd Device [1458:5001]
	Capabilities: [a0] Power Management version 3
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [220] Access Control Services
	Capabilities: [200] L1 PM Substates
	Capabilities: [150] Precision Time Measurement
	Capabilities: [a30] Secondary PCI Express
	Capabilities: [a90] Data Link Feature <?>
	Kernel driver in use: pcieport

I can't say that the proposed code here is doing the wrong thing by
propagating the lack of isolation, but it's gratuitous when there is no
DMA initiator on the non-isolated branch and it causes a significant
usage problem for vfio.  Thanks,

Alex


  parent reply	other threads:[~2025-07-18 21:29 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-09 14:52 [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 01/16] PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 02/16] PCI: Add pci_bus_isolation() Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 03/16] iommu: Compute iommu_groups properly for PCIe switches Jason Gunthorpe
2025-07-17 22:03   ` Donald Dutile
2025-07-18 18:09     ` Jason Gunthorpe
2025-07-18 19:00       ` Donald Dutile
2025-07-18 20:19         ` Jason Gunthorpe
2025-07-18 21:41           ` Donald Dutile
2025-07-18 22:52             ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 04/16] iommu: Organize iommu_group by member size Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 05/16] PCI: Add pci_reachable_set() Jason Gunthorpe
2025-07-17 22:04   ` Donald Dutile
2025-07-18 17:49     ` Jason Gunthorpe
2025-07-18 19:10       ` Donald Dutile
2025-07-09 14:52 ` [PATCH v2 06/16] PCI: Remove duplication in calling pci_acs_ctrl_enabled() Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 07/16] PCI: Use pci_quirk_mf_endpoint_acs() for pci_quirk_amd_sb_acs() Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 08/16] PCI: Use pci_acs_ctrl_isolated() for pci_quirk_al_acs() Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 09/16] PCI: Widen the acs_flags to u32 within the quirk callback Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 10/16] PCI: Add pci_mfd_isolation() Jason Gunthorpe
2025-08-20 17:21   ` Keith Busch
2025-07-09 14:52 ` [PATCH v2 11/16] iommu: Compute iommu_groups properly for PCIe MFDs Jason Gunthorpe
2025-07-28  9:47   ` Cédric Le Goater
2025-07-28 13:58     ` Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 12/16] iommu: Validate that pci_for_each_dma_alias() matches the groups Jason Gunthorpe
2025-07-17 22:07   ` Donald Dutile
2025-07-09 14:52 ` [PATCH v2 13/16] PCI: Add the ACS Enhanced Capability definitions Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 14/16] PCI: Enable ACS Enhanced bits for enable_acs and config_acs Jason Gunthorpe
2025-07-09 14:52 ` [PATCH v2 15/16] PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() Jason Gunthorpe
2025-07-17 22:17   ` Donald Dutile
2025-07-18 17:52     ` Jason Gunthorpe
2025-08-05  4:39   ` Askar Safin
2025-07-09 14:52 ` [PATCH v2 16/16] PCI: Check ACS Extended flags for pci_bus_isolated() Jason Gunthorpe
2025-07-18 21:29 ` Alex Williamson [this message]
2025-07-18 22:59   ` [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS Jason Gunthorpe
2025-08-02  1:45 ` Ethan Zhao
2025-08-02 15:18   ` Jason Gunthorpe
2025-08-05  3:43     ` Ethan Zhao
2025-08-05 12:35       ` Jason Gunthorpe
2025-08-05 14:41         ` Ethan Zhao
2025-08-05 14:43           ` Jason Gunthorpe
2025-08-06  2:22             ` Ethan Zhao
2025-08-06  2:41               ` Baolu Lu
2025-08-06 13:40                 ` Jason Gunthorpe
2025-08-07  1:36                 ` Ethan Zhao
2025-08-08  7:56                 ` Ethan Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250718152934.0cdb768f.alex.williamson@redhat.com \
    --to=alex.williamson@redhat.com \
    --cc=baolu.lu@linux.intel.com \
    --cc=bhelgaas@google.com \
    --cc=galshalom@nvidia.com \
    --cc=iommu@lists.linux.dev \
    --cc=jgg@nvidia.com \
    --cc=joro@8bytes.org \
    --cc=jroedel@suse.de \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=maorg@nvidia.com \
    --cc=patches@lists.linux.dev \
    --cc=robin.murphy@arm.com \
    --cc=tdave@nvidia.com \
    --cc=tony.zhu@intel.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).