From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 28137770FE for ; Mon, 22 Sep 2025 22:39:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758580795; cv=none; b=tzqbS4DH5edqeMZ1DR8eY5wx77CqmKK7ePifM7V9hVIdJiC4jVkuF2VPpIgO1xO37IxRpzcfifpFjuxKd2OLKKeodJ/KDRIeg/rS3sJE3fcoMwbRwc400/SDVHurNMLPA05WDvN1NH2pkGchq4mVbpkoCNs+l0sVsgp6iIQSi3o= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758580795; c=relaxed/simple; bh=xxzc2/gsSbiwET7MEYllT68rGvqkghrFsdTf5eYlnns=; h=Date:From:To:Cc:Subject:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=RO+bf3GQme1o3SmrirjOv0DtiR/aa6EQZa/ITjus7k5ZpAVSQrT7PLeCTbF02Uv1cDpP1mxvqibVXYuRq8kS4UNCehXtXBcsxhGotzx0+yeZC2imNbT8bKoJlhcJZ+Kj8lt1pPSy3qTHbLtXcEd+QA7GeTDFxzAlBZEgrYyIW60= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=RM861iy7; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="RM861iy7" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1758580792; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Hua1YNDvrnVTXTDaulH5D0zPjatQnVFqU+8AkiHzeM0=; b=RM861iy7r0WGsxCtpNXyEICTKXixmRqBAFeIiYIWUj+x8of8ULIdwF0vRR90uHh0+oKTzs NFISD9jlWmmxQnV159CV0HfdCfxYoNWR5o0ZNkdyLCIjZmxFsoV9xcSGa+CuWi4nB32rrw A1hKIKZbv5WIZ4yqRZriGYdZnQmg5oI= Received: from mail-il1-f197.google.com (mail-il1-f197.google.com [209.85.166.197]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-369-DiK07SjhO4qFLE9DPR1fzQ-1; Mon, 22 Sep 2025 18:39:51 -0400 X-MC-Unique: DiK07SjhO4qFLE9DPR1fzQ-1 X-Mimecast-MFC-AGG-ID: DiK07SjhO4qFLE9DPR1fzQ_1758580790 Received: by mail-il1-f197.google.com with SMTP id e9e14a558f8ab-42484e75d3bso6535815ab.1 for ; Mon, 22 Sep 2025 15:39:50 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1758580790; x=1759185590; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=Hua1YNDvrnVTXTDaulH5D0zPjatQnVFqU+8AkiHzeM0=; b=VU4HSKY2Z55JNIxuEljr0NoMGw/66galjwMJpfvvIPz8r8H+0w3TyQ81RTcofXQ+vN AXHLPuLnItjX9Gb4nhvjKn1dpzGbO2N7HYkdrD/W1L07FqNkoPAIqbSYP+uQLerto/gz nysgVak4Rcwe+VVF1RF15k+dEYuytL6uL0WmODZUkGOlbVRSEIOBVPw0TG722+T/2OPl EpHNcADrJHi20CNokUtUolow//ceaQmHhZPiLm7OmIuIEmv/cjYHa763sFuwzefy+J8m zItRhtUaLZapes3gRojU+005sTfpbFz3mdkZnP3jp7WYWhox/+SJK38RsB41YCas3xlr OS7Q== X-Forwarded-Encrypted: i=1; AJvYcCVI2r3wCxNdIi2JDVX2Kh7ryYf1AZ6oULRMAE+FFGQzYfDhImFcejELkEv9ja4fD7zS/CFWgdtU@lists.linux.dev X-Gm-Message-State: AOJu0Yzgk8BJHvZPcuGr8IYkvOMOhV9pxFLDUNCRW3ud0kr7xYQ/wCE5 4bXUS01H2yP6a484U4XJaXIfHofaZgYQ8x/uIOiIjqoFOA2o3T3xmw9ihvU013LLMTntyvXTb38 FAWHt/uWysV1+o3bmY+0GRP/fGg2dBP+pbQcPAhnOFmq78tVw096Jmi00bZ0= X-Gm-Gg: ASbGncu3Uo3iL8Z2G8Y7adrt5lZYPEjE8JXNOuTfbhCM0vRzsNwgufe2FowLDytU/Kt PAG8jDUXzMa6tTfCx85914HKBU35NYoL6PlWNLDiJapJ79+8YZNd6RIShhc7g+k6lJLzdlaijt8 NG5xHWMjD38XQJ83eS4i8tsFPO8e+YbK2Mab6gVrY0GsKGaH52AGKpaK3k0O1NLPmVVbZSD765l wGJFBUyxFPT2A9bIPrRw3r71xwamtMPqJi+A8Dtk1j7R1FNNsigygl4Zsoj9qZ9FlBFIHAyWNHp CkyC+kmJNKwxowqSOp+E9OoW99KAiCbvS8OpYbzcRw8= X-Received: by 2002:a05:6602:168e:b0:896:fcb5:4bb3 with SMTP id ca18e2360f4ac-8e216156f1emr42492339f.4.1758580790050; Mon, 22 Sep 2025 15:39:50 -0700 (PDT) X-Google-Smtp-Source: AGHT+IH6RnVqdk31F1uAF17Tm38XXvYq834lf3FKKUs0gGagizXj54kUsRUd21kDpI5BwyMPzZ14Xw== X-Received: by 2002:a05:6602:168e:b0:896:fcb5:4bb3 with SMTP id ca18e2360f4ac-8e216156f1emr42490639f.4.1758580789544; Mon, 22 Sep 2025 15:39:49 -0700 (PDT) Received: from redhat.com ([38.15.36.11]) by smtp.gmail.com with ESMTPSA id 8926c6da1cb9f-53d55460864sm6214071173.66.2025.09.22.15.39.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Sep 2025 15:39:48 -0700 (PDT) Date: Mon, 22 Sep 2025 16:39:47 -0600 From: Alex Williamson To: Jason Gunthorpe Cc: Bjorn Helgaas , iommu@lists.linux.dev, Joerg Roedel , linux-pci@vger.kernel.org, Robin Murphy , Will Deacon , Lu Baolu , Donald Dutile , galshalom@nvidia.com, Joerg Roedel , Kevin Tian , kvm@vger.kernel.org, maorg@nvidia.com, patches@lists.linux.dev, tdave@nvidia.com, Tony Zhu Subject: Re: [PATCH v3 00/11] Fix incorrect iommu_groups with PCIe ACS Message-ID: <20250922163947.5a8304d4.alex.williamson@redhat.com> In-Reply-To: <0-v3-8827cc7fc4e0+23f-pcie_switch_groups_jgg@nvidia.com> References: <0-v3-8827cc7fc4e0+23f-pcie_switch_groups_jgg@nvidia.com> Organization: Red Hat Precedence: bulk X-Mailing-List: patches@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: sEqHcpXohRPxdp3mnKhsiBT3ql4T1wGu2_PcOxC193Q_1758580790 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit On Fri, 5 Sep 2025 15:06:15 -0300 Jason Gunthorpe wrote: > The series patches have extensive descriptions as to the problem and > solution, but in short the ACS flags are not analyzed according to the > spec to form the iommu_groups that VFIO is expecting for security. > > ACS is an egress control only. For a path the ACS flags on each hop only > effect what other devices the TLP is allowed to reach. It does not prevent > other devices from reaching into this path. > > For VFIO if device A is permitted to access device B's MMIO then A and B > must be grouped together. This says that even if a path has isolating ACS > flags on each hop, off-path devices with non-isolating ACS can still reach > into that path and must be grouped gother. > > For switches, a PCIe topology like: > > -- DSP 02:00.0 -> End Point A > Root 00:00.0 -> USP 01:00.0 --| > -- DSP 02:03.0 -> End Point B > > Will generate unique single device groups for every device even if ACS is > not enabled on the two DSP ports. It should at least group A/B together > because no ACS means A can reach the MMIO of B. This is a serious failure > for the VFIO security model. > > For multi-function-devices, a PCIe topology like: > > -- MFD 00:1f.0 ACS not supported > Root 00:00.00 --|- MFD 00:1f.2 ACS not supported > |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS > > Will group [1f.0, 1f.2] and 1f.6 gets a single device group. However from > a spec perspective each device should get its own group, because ACS not > supported can assume no loopback is possible by spec. I just dug through the thread with Don that I think tries to justify this, but I have a lot of concerns about this. I think the "must be implemented by Functions that support peer-to-peer traffic with other Functions" language is specifying that IF the device implements an ACS capability AND does not implement the specific ACS P2P flag being described, then and only then can we assume that form of P2P is not supported. OTOH, we cannot assume anything regarding internal P2P of an MFD that does not implement an ACS capability at all. I believe we even reached agreement with some NIC vendors in the early days of IOMMU groups that they needed to implement an "empty" ACS capability on their multifunction NICs such that they could describe in this way that internal P2P is not supported by the device. Thanks, Alex > > For root-ports a PCIe topology like: > -- Dev 01:00.0 > Root 00:00.00 --- Root Port 00:01.0 --| > | -- Dev 01:00.1 > |- Dev 00:17.0 > > Previously would group [00:01.0, 01:00.0, 01:00.1] together if there is no > ACS capability in the root port. > > While ACS on root ports is underspecified in the spec, it should still > function as an egress control and limit access to either the MMIO of the > root port itself, or perhaps some other devices upstream of the root > complex - 00:17.0 perhaps in this example. > > Historically the grouping in Linux has assumed the root port routes all > traffic into the TA/IOMMU and never bypasses the TA to go to other > functions in the root complex. Following the new understanding that ACS is > required for internal loopback also treat root ports with no ACS > capability as lacking internal loopback as well. > > There is also some confusing spec language about how ACS and SRIOV works > which this series does not address. > > > This entire series goes further and makes some additional improvements to > the ACS validation found while studying this problem. The groups around a > PCIe to PCI bridge are shrunk to not include the PCIe bridge. > > The last patches implement "ACS Enhanced" on top of it. Due to how ACS > Enhanced was defined as a non-backward compatible feature it is important > to get SW support out there. > > Due to the potential of iommu_groups becoming wider and thus non-usable > for VFIO this should go to a linux-next tree to give it some more > exposure. > > I have now tested this a few systems I could get: > > - Various Intel client systems: > * Raptor Lake, with VMD enabled and using the real_dev mechanism > * 6/7th generation 100 Series/C320 > * 5/6th generation 100 Series/C320 with a NIC MFD quirk > * Tiger Lake > * 5/6th generation Sunrise Point > > The 6/7th gen system has a root port without an ACS capability and it > becomes ungrouped as described above. > > All systems have changes, the MFDs in the root complex all become ungrouped. > > - NVIDIA Grace system with 5 different PCI switches from two vendors > Bug fix widening the iommu_groups works as expected here > > This is on github: https://github.com/jgunthorpe/linux/commits/pcie_switch_groups > > v3: > - Rebase to v6.17-rc4 > - Drop the quirks related patches > - Change the MFD logic to process no ACS cap as meaning no internal > loopback. This avoids creating non-isolated groups for MFD root ports in > common AMD and Intel systems > - Fix matching MFDs to ignore SRIOV VFs > - Fix some kbuild splats > v2: https://patch.msgid.link/r/0-v2-4a9b9c983431+10e2-pcie_switch_groups_jgg@nvidia.com > - Revise comments and commit messages > - Rename struct pci_alias_set to pci_reachable_set > - Make more sense of the special bus->self = NULL case for SRIOV > - Add pci_group_alloc_non_isolated() for readability > - Rename BUS_DATA_PCI_UNISOLATED to BUS_DATA_PCI_NON_ISOLATED > - Propogate BUS_DATA_PCI_NON_ISOLATED downstream from a MFD in case a MFD > function is a bridge > - New patches to add pci_mfd_isolation() to retain more cases of narrow > groups on MFDs with missing ACS. > - Redescribe the MFD related change as a bug fix. For a MFD to be > isolated all functions must have egress control on their P2P. > v1: https://patch.msgid.link/r/0-v1-74184c5043c6+195-pcie_switch_groups_jgg@nvidia.com > > Cc: galshalom@nvidia.com > Cc: tdave@nvidia.com > Cc: maorg@nvidia.com > Cc: kvm@vger.kernel.org > Cc: Ceric Le Goater" > Cc: Donald Dutile > Signed-off-by: Jason Gunthorpe > > Jason Gunthorpe (11): > PCI: Move REQ_ACS_FLAGS into pci_regs.h as PCI_ACS_ISOLATED > PCI: Add pci_bus_isolated() > iommu: Compute iommu_groups properly for PCIe switches > iommu: Organize iommu_group by member size > PCI: Add pci_reachable_set() > iommu: Compute iommu_groups properly for PCIe MFDs > iommu: Validate that pci_for_each_dma_alias() matches the groups > PCI: Add the ACS Enhanced Capability definitions > PCI: Enable ACS Enhanced bits for enable_acs and config_acs > PCI: Check ACS DSP/USP redirect bits in pci_enable_pasid() > PCI: Check ACS Extended flags for pci_bus_isolated() > > drivers/iommu/iommu.c | 510 +++++++++++++++++++++++----------- > drivers/pci/ats.c | 4 +- > drivers/pci/pci.c | 73 ++++- > drivers/pci/search.c | 274 ++++++++++++++++++ > include/linux/pci.h | 46 +++ > include/uapi/linux/pci_regs.h | 18 ++ > 6 files changed, 759 insertions(+), 166 deletions(-) > > > base-commit: b320789d6883cc00ac78ce83bccbfe7ed58afcf0