From: Baolu Lu <baolu.lu@linux.intel.com>
To: Jason Gunthorpe <jgg@nvidia.com>,
David Woodhouse <dwmw2@infradead.org>,
iommu@lists.linux.dev, Joerg Roedel <joro@8bytes.org>,
Robin Murphy <robin.murphy@arm.com>,
Will Deacon <will@kernel.org>
Cc: Kevin Tian <kevin.tian@intel.com>,
patches@lists.linux.dev, Tina Zhang <tina.zhang@intel.com>,
Wei Wang <wei.w.wang@intel.com>
Subject: Re: [PATCH 8/9] iommu/vt-d: Use the generic iommu page table
Date: Tue, 22 Jul 2025 14:44:01 +0800 [thread overview]
Message-ID: <e46aeb10-b596-4bb7-8f52-638a07bdbc6f@linux.intel.com> (raw)
In-Reply-To: <8-v1-bdb01ffac49c+be-iommu_pt_vtd_jgg@nvidia.com>
On 7/17/25 03:57, Jason Gunthorpe wrote:
> Replace the VT-D iommu_domain implementation of the VTD second stage and
> first stage page tables with the iommupt VTDSS and x86_64
> pagetables. x86_64 is shared with the AMD driver.
>
> There are a couple notable things in VT-D:
> - Like AMD the second stage format is not sign extended, unlike AMD it
> cannot decode a full 64 bits. The first stage format is a normal sign
> extended x86 page table
> - The HW caps can indicate how many levels, how many address bits and what
> leaf page sizes are supported in HW. As before the highest number of
> levels that can translate the entire supported address width is used.
> The supported page sizes are adjusted directly from the dedicated
> first/second stage cap bits.
> - VTD requires flushing 'write buffers'. This logic is left unchanged,
> the write buffer flushes on any gather flush or through iotlb_sync_map.
> - Like ARM, VTD has an optional non-coherent page table walker that
> requires cache flushing. This is supported through PT_FEAT_DMA_INCOHERENT
> the same as ARM, however x86 can't use the DMA API for flush, it must
> call the arch function clflush_cache_range()
> - The PT_FEAT_DYNAMIC_TOP can probably be supported on VTD someday for the
> second stage when it uses 128 bit atomic stores for the HW context
> structures.
> - PT_FEAT_VTDSS_FORCE_WRITEABLE is used to work around ERRATA_772415_SPR17
> - A kernel command line parameter "sp_off" disables all page sizes except 4k
>
> Remove all the unused iommu_domain page table code. The debufs paths have
> their own independent page table walker that is left alone for now.
>
> This corrects a race with the non-coherent walker that the ARM
> implementations have fixed:
>
> CPU 0 CPU 1
> pfn_to_dma_pte() pfn_to_dma_pte()
> pte = &parent[offset];
> if (!dma_pte_present(pte)) {
> try_cmpxchg64(&pte->val)
> pte = &parent[offset];
> .. dma_pte_present(pte) ..
> [...]
> // iommu_map() completes
> // Device does DMA
> domain_flush_cache(pte)
>
> The CPU 1 mapping operation shares a page table level with the CPU 0
> mapping operation. CPU 0 installed a new page table level but has not
> flushed it yet. CPU1 returns from iommu_map() and the device does DMA. The
> non coherent walker fails to see the new table level installed by CPU 0
> and fails the DMA with non-present.
>
> The iommupt PT_FEAT_DMA_INCOHERENT implementation uses the ARM design of
> storing a flag when CPU 0 completes the flush. If the flag is not set CPU
> 1 will also flush to ensure the HW can fully walk to the PTE being
> installed.
>
> Cc: Tina Zhang <tina.zhang@intel.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/iommu/intel/Kconfig | 4 +
> drivers/iommu/intel/iommu.c | 903 ++++++-----------------------------
> drivers/iommu/intel/iommu.h | 99 +---
> drivers/iommu/intel/nested.c | 5 -
> drivers/iommu/intel/pasid.c | 29 +-
> 5 files changed, 182 insertions(+), 858 deletions(-)
>
[...]
> @@ -3338,7 +2805,9 @@ static struct iommu_domain *
> intel_iommu_domain_alloc_first_stage(struct device *dev,
> struct intel_iommu *iommu, u32 flags)
> {
> + struct pt_iommu_x86_64_cfg cfg = {};
> struct dmar_domain *dmar_domain;
> + int ret;
>
> if (flags & ~IOMMU_HWPT_ALLOC_PASID)
> return ERR_PTR(-EOPNOTSUPP);
> @@ -3347,19 +2816,72 @@ intel_iommu_domain_alloc_first_stage(struct device *dev,
> if (!sm_supported(iommu) || !ecap_flts(iommu->ecap))
> return ERR_PTR(-EOPNOTSUPP);
>
> - dmar_domain = paging_domain_alloc(dev, true);
> + dmar_domain = paging_domain_alloc();
> if (IS_ERR(dmar_domain))
> return ERR_CAST(dmar_domain);
>
> + if (cap_fl5lp_support(iommu->cap))
> + cfg.common.hw_max_vasz_lg2 = 57;
> + else
> + cfg.common.hw_max_vasz_lg2 = 48;
> + cfg.common.hw_max_oasz_lg2 = 52;
> + cfg.common.features = BIT(PT_FEAT_SIGN_EXTEND) |
> + BIT(PT_FEAT_FLUSH_RANGE);
> + /* First stage always uses scalable mode */
> + if (WARN_ON(!ecap_smpwc(iommu->ecap)))
> + return ERR_PTR(-EINVAL);
I don't follow here. Why WARN_ON and return failure when hardware
doesn't support a feature?
> + if (ecap_smpwc(iommu->ecap))
> + cfg.common.features |= BIT(PT_FEAT_DMA_INCOHERENT);
My understanding is that if hardware possibly walks the page table
incoherently, we need to set up the PT_FEAT_DMA_INCOHERENT feature;
otherwise, there is no need.
If that's correct, perhaps what we need here is:
if (!ecap_smpwc(iommu->ecap))
cfg.common.features |= BIT(PT_FEAT_DMA_INCOHERENT);
The Intel iommu driver always enforces page walk coherence in the PASID
table entry if the capability is supported.
> + dmar_domain->iommu.iommu_device = dev;
> + dmar_domain->iommu.nid = dev_to_node(dev);
> dmar_domain->domain.ops = &intel_fs_paging_domain_ops;
> +
> + ret = pt_iommu_x86_64_init(&dmar_domain->fspt, &cfg, GFP_KERNEL);
> + if (ret) {
> + kfree(dmar_domain);
> + return ERR_PTR(ret);
> + }
> +
> + if (!cap_fl1gp_support(iommu->cap))
> + dmar_domain->domain.pgsize_bitmap &= ~(u64)SZ_1G;
> + if (!intel_iommu_superpage)
> + dmar_domain->domain.pgsize_bitmap = SZ_4K;
> +
> return &dmar_domain->domain;
> }
[...]
> static struct iommu_domain *
> intel_iommu_domain_alloc_second_stage(struct device *dev,
> struct intel_iommu *iommu, u32 flags)
> {
> + struct pt_iommu_vtdss_cfg cfg = {};
> struct dmar_domain *dmar_domain;
> + unsigned int sslps;
> + int ret;
>
> if (flags &
> (~(IOMMU_HWPT_ALLOC_NEST_PARENT | IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
> @@ -3376,15 +2898,48 @@ intel_iommu_domain_alloc_second_stage(struct device *dev,
> if (sm_supported(iommu) && !ecap_slts(iommu->ecap))
> return ERR_PTR(-EOPNOTSUPP);
>
> - dmar_domain = paging_domain_alloc(dev, false);
> + dmar_domain = paging_domain_alloc();
> if (IS_ERR(dmar_domain))
> return ERR_CAST(dmar_domain);
>
> + cfg.common.hw_max_vasz_lg2 = compute_vasz_lg2_ss(iommu);
> + cfg.common.hw_max_oasz_lg2 = 52;
> + cfg.common.features = BIT(PT_FEAT_FLUSH_RANGE);
> +
> + /*
> + * Read-only mapping is disallowed on the domain which serves as the
> + * parent in a nested configuration, due to HW errata
> + * (ERRATA_772415_SPR17)
> + */
> + if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT)
> + cfg.common.features |= BIT(PT_FEAT_VTDSS_FORCE_WRITEABLE);
> +
> + if (WARN_ON(!iommu_paging_structure_coherency(iommu)))
> + return ERR_PTR(-EINVAL);
> + if (!iommu_paging_structure_coherency(iommu))
> + cfg.common.features |= BIT(PT_FEAT_DMA_INCOHERENT);
Similarly here.
> + dmar_domain->iommu.iommu_device = dev;
> + dmar_domain->iommu.nid = dev_to_node(dev);
Thanks,
baolu
next prev parent reply other threads:[~2025-07-22 6:46 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-16 19:57 [PATCH 0/9] Convert Intel VT-D to use the generic iommu page table Jason Gunthorpe
2025-07-16 19:57 ` [PATCH 1/9] iommu/pages: Add support for a incoherent IOMMU page walker Jason Gunthorpe
2025-07-21 8:41 ` Baolu Lu
2025-07-29 22:32 ` Jason Gunthorpe
2025-07-30 1:49 ` Baolu Lu
2025-08-11 21:21 ` Jason Gunthorpe
2025-08-15 11:28 ` Tian, Kevin
2025-08-22 21:13 ` Jason Gunthorpe
2025-07-16 19:57 ` [PATCH 2/9] iommupt: Add basic support for SW bits in the page table Jason Gunthorpe
2025-08-15 11:29 ` Tian, Kevin
2025-08-18 23:35 ` Jason Gunthorpe
2025-07-16 19:57 ` [PATCH 3/9] iommupt: Use the incoherent start/stop functions for PT_FEAT_DMA_INCOHERENT Jason Gunthorpe
2025-08-15 11:35 ` Tian, Kevin
2025-08-22 20:45 ` Jason Gunthorpe
2025-07-16 19:57 ` [PATCH 4/9] iommupt: Flush the CPU cache after any writes to the page table Jason Gunthorpe
2025-07-16 19:57 ` [PATCH 5/9] iommupt: Add the Intel VT-D second stage page table format Jason Gunthorpe
2025-07-22 3:11 ` Baolu Lu
2025-07-29 23:05 ` Jason Gunthorpe
2025-07-30 2:00 ` Baolu Lu
2025-08-22 9:14 ` Tian, Kevin
2025-08-22 14:53 ` Jason Gunthorpe
2025-07-16 19:57 ` [PATCH 6/9] iommupt/x86: Set the dirty bit only for writable PTEs Jason Gunthorpe
2025-07-21 10:02 ` Baolu Lu
2025-07-16 19:57 ` [PATCH 7/9] iommupt/x86: Support SW bits and permit PT_FEAT_DMA_INCOHERENT Jason Gunthorpe
2025-07-22 5:17 ` Baolu Lu
2025-07-29 23:13 ` Jason Gunthorpe
2025-07-30 2:35 ` Baolu Lu
2025-08-22 9:17 ` Tian, Kevin
2025-08-22 14:55 ` Jason Gunthorpe
2025-07-16 19:57 ` [PATCH 8/9] iommu/vt-d: Use the generic iommu page table Jason Gunthorpe
2025-07-22 6:44 ` Baolu Lu [this message]
2025-07-29 23:39 ` Jason Gunthorpe
2025-08-22 9:35 ` Tian, Kevin
2025-08-22 20:43 ` Jason Gunthorpe
2025-07-16 19:57 ` [PATCH 9/9] iommupt: Add a kunit test for the SW bits Jason Gunthorpe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e46aeb10-b596-4bb7-8f52-638a07bdbc6f@linux.intel.com \
--to=baolu.lu@linux.intel.com \
--cc=dwmw2@infradead.org \
--cc=iommu@lists.linux.dev \
--cc=jgg@nvidia.com \
--cc=joro@8bytes.org \
--cc=kevin.tian@intel.com \
--cc=patches@lists.linux.dev \
--cc=robin.murphy@arm.com \
--cc=tina.zhang@intel.com \
--cc=wei.w.wang@intel.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.