* [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
@ 2026-04-20 12:32 Evangelos Petrongonas
  2026-04-20 12:40 ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread

From: Evangelos Petrongonas @ 2026-04-20 12:32 UTC (permalink / raw)
To: Will Deacon
Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Jason Gunthorpe,
    Nicolin Chen, Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
    linux-kernel, nh-open-source, Zeev Zilberman

When the hardware advertises both Stage 1 and Stage 2 translation, the
driver prefers Stage 1 for DMA domain allocation and only falls back to
Stage 2 if Stage 1 is not supported.

Some configurations may want to force Stage 2 translation even when the
hardware supports Stage 1. Introduce a module parameter 'disable_s1'
that, when set, prevents the driver from advertising
ARM_SMMU_FEAT_TRANS_S1, causing all DMA domains to use Stage 2 instead.

Co-developed-by: Zeev Zilberman <zeev@amazon.com>
Signed-off-by: Zeev Zilberman <zeev@amazon.com>
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index e8d7dbe495f0..afb21c210e24 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -39,6 +39,11 @@
 module_param(disable_msipolling, bool, 0444);
 MODULE_PARM_DESC(disable_msipolling,
	"Disable MSI-based polling for CMD_SYNC completion.");
 
+static bool disable_s1;
+module_param(disable_s1, bool, 0444);
+MODULE_PARM_DESC(disable_s1,
+	"Disable Stage 1 translation even if supported by hardware.");
+
 static const struct iommu_ops arm_smmu_ops;
 static struct iommu_dirty_ops arm_smmu_dirty_ops;
 
@@ -5087,13 +5092,13 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
		smmu->features |= ARM_SMMU_FEAT_STALLS;
	}
 
-	if (reg & IDR0_S1P)
+	if ((reg & IDR0_S1P) && !disable_s1)
		smmu->features |= ARM_SMMU_FEAT_TRANS_S1;
 
	if (reg & IDR0_S2P)
		smmu->features |= ARM_SMMU_FEAT_TRANS_S2;
 
-	if (!(reg & (IDR0_S1P | IDR0_S2P))) {
+	if (!(smmu->features & (ARM_SMMU_FEAT_TRANS_S1 | ARM_SMMU_FEAT_TRANS_S2))) {
		dev_err(smmu->dev, "no translation support!\n");
		return -ENXIO;
	}
-- 
2.47.3

Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation 2026-04-20 12:32 [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation Evangelos Petrongonas @ 2026-04-20 12:40 ` Jason Gunthorpe 2026-04-22 6:44 ` Evangelos Petrongonas 0 siblings, 1 reply; 16+ messages in thread From: Jason Gunthorpe @ 2026-04-20 12:40 UTC (permalink / raw) To: Evangelos Petrongonas Cc: Will Deacon, Robin Murphy, Joerg Roedel, Nicolin Chen, Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu, linux-kernel, nh-open-source, Zeev Zilberman On Mon, Apr 20, 2026 at 12:32:01PM +0000, Evangelos Petrongonas wrote: > When the hardware advertises both Stage 1 and Stage 2 translation, the > driver prefers Stage 1 for DMA domain allocation and only falls back to > Stage 2 if Stage 1 is not supported. > > Some configurations may want to force Stage 2 translation even when the > hardware supports Stage 1. Why? You really need to explain why for a patch like this. If there really is some HW issue I think it is more appropriate to get an IORT flag or IDR detection that the HW has a problem. Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
  2026-04-20 12:32 [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation Evangelos Petrongonas
  2026-04-20 12:40 ` Jason Gunthorpe
@ 2026-04-22  6:44   ` Evangelos Petrongonas
  2026-04-22 15:44     ` Pranjal Shrivastava
  2026-04-22 16:23     ` Jason Gunthorpe
  1 sibling, 2 replies; 16+ messages in thread

From: Evangelos Petrongonas @ 2026-04-22 6:44 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Will Deacon, Robin Murphy, Joerg Roedel, Nicolin Chen,
    Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu, linux-kernel,
    nh-open-source, Zeev Zilberman

On Mon, Apr 20, 2026 at 09:40:32AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 20, 2026 at 12:32:01PM +0000, Evangelos Petrongonas wrote:
> > When the hardware advertises both Stage 1 and Stage 2 translation, the
> > driver prefers Stage 1 for DMA domain allocation and only falls back to
> > Stage 2 if Stage 1 is not supported.
> >
> > Some configurations may want to force Stage 2 translation even when the
> > hardware supports Stage 1.
>
> Why? You really need to explain why for a patch like this.
>
> If there really is some HW issue I think it is more appropriate to get
> an IORT flag or IDR detection that the HW has a problem.

It's not a hardware bug, so there's no IORT or IDR bit that would make
sense here.

The motivation is live update of the hypervisor: we want to kexec into a
new kernel while keeping DMA from passthrough devices flowing, which
means the SMMU's translation state has to survive the handover. The Live
Update Orchestrator work [1] and the in-progress "iommu: Add live
update state preservation" series [2] are building exactly this plumbing
on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future
work, and an earlier RFC from Amazon [3] sketched the same idea for
iommufd.

For this use case, Stage 2 is materially easier to persist than Stage 1,
for structural rather than performance reasons: an S2 STE carries the
whole translation configuration inline.
To hand over an S2 domain, the pre-kexec kernel
only needs to preserve the stream table pages and the S2 pgtable pages.
An S1 STE instead points at a Context Descriptor table, so persisting S1
requires preserving the CD table pages too and, because the CD is keyed
by ASID, coordinating ASID identity across the handover.

In the long term the plan should be to persist both stages. However,
until a patch series that properly introduces SMMU support for live
update is developed/posted, we would like to experiment with
S1+S2-capable hardware using an easier-to-implement handover machinery
that relies on S2 translations only.

[1] https://lwn.net/Articles/1021442/ — Live Update Orchestrator
[2] https://lore.kernel.org/all/20260203220948.2176157-1-skhawaja@google.com/ —
    [PATCH 00/14] iommu: Add live update state preservation
[3] https://lore.kernel.org/all/20240916113102.710522-1-jgowans@amazon.com/ —
    [RFC PATCH 00/13] Support iommu(fd) persistence for live update

> Jason

Kind Regards,
Evangelos

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation 2026-04-22 6:44 ` Evangelos Petrongonas @ 2026-04-22 15:44 ` Pranjal Shrivastava 2026-04-22 16:23 ` Jason Gunthorpe 1 sibling, 0 replies; 16+ messages in thread From: Pranjal Shrivastava @ 2026-04-22 15:44 UTC (permalink / raw) To: Evangelos Petrongonas Cc: Jason Gunthorpe, Will Deacon, Robin Murphy, Joerg Roedel, Nicolin Chen, Lu Baolu, linux-arm-kernel, iommu, linux-kernel, nh-open-source, Zeev Zilberman On Wed, Apr 22, 2026 at 06:44:31AM +0000, Evangelos Petrongonas wrote: > On Mon, Apr 20, 2026 at 09:40:32AM -0300 Jason Gunthorpe wrote: > > On Mon, Apr 20, 2026 at 12:32:01PM +0000, Evangelos Petrongonas wrote: > > > When the hardware advertises both Stage 1 and Stage 2 translation, the > > > driver prefers Stage 1 for DMA domain allocation and only falls back to > > > Stage 2 if Stage 1 is not supported. > > > > > > Some configurations may want to force Stage 2 translation even when the > > > hardware supports Stage 1. > > > > Why? You really need to explain why for a patch like this. > > > > If there really is some HW issue I think it is more appropriate to get > > an IORT flag or IDR detection that the HW has a problem. > > It's not a hardware bug there's no IORT or IDR bit that would make sense > here. > > The motivation is live update of the hypervisor: we want to kexec into a > new kernel while keeping DMA from passthrough devices flowing, which > means the SMMU's translation state has to survive the handover. The Live > Update Orchestrator work [1] and the in-progress "iommu: Add live > update state preservation" series [2] are building exactly this plumbing > on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future > work, and an earlier RFC from Amazon [3] sketched the same idea for > iommufd. 
>
> For this use case, Stage 2 is materially easier to persist than Stage 1,
> for structural rather than performance reasons: An S2 STE carries the
> whole translation configuration inline. To hand over an S2 domain, the
> pre-kexec kernel only needs to preserve the stream table pages and the
> S2 pgtable pages. An S1 STE points at a Context Descriptor table and as
> a result Persisting S1 therefore requires preserving the CD table pages
> too, and because the CD is keyed by ASID coordinating ASID identity
> across the handover.
>
> In the long term the plan should be to persist both stages.
> However, until a patch series that properly introduces SMMU support for
> is developed/posted we would like to experiment with S1+S2-capable
> hardware with an easier to implement handover machinery, that relies on
> S2 translations.
>

Hi Evangelos,

We (Google) currently have a series in the works specifically for
arm-smmu-v3 state preservation. Our plan is to post it in phases (S2
preservation first, then the S1 + CD series) once the iommu liveupdate
persistence series has stabilized.

Since the iommu core liveupdate framework itself is still in flux, it’s
a bit premature to accept/merge this patch before both series land.
Furthermore, it must be noted that even if the iommu liveupdate series
is merged, liveupdate shall remain essentially non-functional or
'broken' for drivers that haven't yet implemented the necessary support
hooks, until the framework is fully integrated with the SMMU driver.

We’d prefer to wait until the core infrastructure is solid so we can
ensure the SMMUv3 implementation aligns with the final requirements of
the iommu liveupdate persistence series.

That said, we don't mind posting our arm-smmu-v3 series with S2-only
preservation as an early RFC if that helps align on the design and
implementation details.
Thanks,
Praan

> [1] https://lwn.net/Articles/1021442/ — Live Update Orchestrator
> [2] https://lore.kernel.org/all/20260203220948.2176157-1-skhawaja@google.com/ —
> [PATCH 00/14] iommu: Add live update state preservation
> [3] https://lore.kernel.org/all/20240916113102.710522-1-jgowans@amazon.com/ — [RFC
> PATCH 00/13] Support iommu(fd) persistence for live update
>
> > Jason
>
> Kind Regards,
> Evangelos

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation 2026-04-22 6:44 ` Evangelos Petrongonas 2026-04-22 15:44 ` Pranjal Shrivastava @ 2026-04-22 16:23 ` Jason Gunthorpe 2026-04-22 16:36 ` Robin Murphy 2026-04-23 9:44 ` Will Deacon 1 sibling, 2 replies; 16+ messages in thread From: Jason Gunthorpe @ 2026-04-22 16:23 UTC (permalink / raw) To: Evangelos Petrongonas Cc: Will Deacon, Robin Murphy, Joerg Roedel, Nicolin Chen, Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu, linux-kernel, nh-open-source, Zeev Zilberman On Wed, Apr 22, 2026 at 06:44:31AM +0000, Evangelos Petrongonas wrote: > The motivation is live update of the hypervisor: we want to kexec into a > new kernel while keeping DMA from passthrough devices flowing, which > means the SMMU's translation state has to survive the handover. The Live > Update Orchestrator work [1] and the in-progress "iommu: Add live > update state preservation" series [2] are building exactly this plumbing > on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future > work, and an earlier RFC from Amazon [3] sketched the same idea for > iommufd. It would be appropriate to keep this patch with the rest of that out of tree pile, for example in the series that enables s2 only support in smmuv3. > For this use case, Stage 2 is materially easier to persist than Stage 1, > for structural rather than performance reasons: I don't think so. The driver needs to know each and every STE that will survive KHO. The ones that don't survive need to be reset to abort STEs. From that point it is trivial enough to include the CD memory in the preservation. It would help to send a preparation series to switch the ARM STE and CD logic away from dma_alloc_coherent and use iommu-pages instead, since we only expect iommu-pages to support preservation.. I could maybe see only supporting non-PASID as a first-series, but a CD table with SSID 0 only populated is still pretty trivial. 
Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation 2026-04-22 16:23 ` Jason Gunthorpe @ 2026-04-22 16:36 ` Robin Murphy 2026-04-23 9:44 ` Will Deacon 1 sibling, 0 replies; 16+ messages in thread From: Robin Murphy @ 2026-04-22 16:36 UTC (permalink / raw) To: Jason Gunthorpe, Evangelos Petrongonas Cc: Will Deacon, Joerg Roedel, Nicolin Chen, Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu, linux-kernel, nh-open-source, Zeev Zilberman On 2026-04-22 5:23 pm, Jason Gunthorpe wrote: > On Wed, Apr 22, 2026 at 06:44:31AM +0000, Evangelos Petrongonas wrote: >> The motivation is live update of the hypervisor: we want to kexec into a >> new kernel while keeping DMA from passthrough devices flowing, which >> means the SMMU's translation state has to survive the handover. The Live >> Update Orchestrator work [1] and the in-progress "iommu: Add live >> update state preservation" series [2] are building exactly this plumbing >> on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future >> work, and an earlier RFC from Amazon [3] sketched the same idea for >> iommufd. > > It would be appropriate to keep this patch with the rest of that out > of tree pile, for example in the series that enables s2 only support > in smmuv3. Or even better, just make sure that whatever hypervisor supports this half-finished WIP mechanism also uses IOMMU_HWPT_ALLOC_NEST_PARENT to explicitly get stage 2 domains for VM-assigned devices in the first place, rather than swing a big hammer at the kernel (that takes out SVA/PASID support as collateral damage...) Thanks, Robin. >> For this use case, Stage 2 is materially easier to persist than Stage 1, >> for structural rather than performance reasons: > > I don't think so. The driver needs to know each and every STE that > will survive KHO. The ones that don't survive need to be reset to > abort STEs. From that point it is trivial enough to include the CD > memory in the preservation. 
> > It would help to send a preparation series to switch the ARM STE and > CD logic away from dma_alloc_coherent and use iommu-pages instead, > since we only expect iommu-pages to support preservation.. > > I could maybe see only supporting non-PASID as a first-series, but a > CD table with SSID 0 only populated is still pretty trivial. > > Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation 2026-04-22 16:23 ` Jason Gunthorpe 2026-04-22 16:36 ` Robin Murphy @ 2026-04-23 9:44 ` Will Deacon 2026-04-23 9:47 ` Will Deacon 1 sibling, 1 reply; 16+ messages in thread From: Will Deacon @ 2026-04-23 9:44 UTC (permalink / raw) To: Jason Gunthorpe Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen, Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu, linux-kernel, nh-open-source, Zeev Zilberman On Wed, Apr 22, 2026 at 01:23:51PM -0300, Jason Gunthorpe wrote: > On Wed, Apr 22, 2026 at 06:44:31AM +0000, Evangelos Petrongonas wrote: > > The motivation is live update of the hypervisor: we want to kexec into a > > new kernel while keeping DMA from passthrough devices flowing, which > > means the SMMU's translation state has to survive the handover. The Live > > Update Orchestrator work [1] and the in-progress "iommu: Add live > > update state preservation" series [2] are building exactly this plumbing > > on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future > > work, and an earlier RFC from Amazon [3] sketched the same idea for > > iommufd. > > It would be appropriate to keep this patch with the rest of that out > of tree pile, for example in the series that enables s2 only support > in smmuv3. > > > For this use case, Stage 2 is materially easier to persist than Stage 1, > > for structural rather than performance reasons: > > I don't think so. The driver needs to know each and every STE that > will survive KHO. The ones that don't survive need to be reset to > abort STEs. From that point it is trivial enough to include the CD > memory in the preservation. > > It would help to send a preparation series to switch the ARM STE and > CD logic away from dma_alloc_coherent and use iommu-pages instead, > since we only expect iommu-pages to support preservation.. Does iommu-pages provide a mechanism to map the memory as non-cacheable if the SMMU isn't coherent? 
I really don't want to entertain CMOs for the queues. Will ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation 2026-04-23 9:44 ` Will Deacon @ 2026-04-23 9:47 ` Will Deacon 2026-04-23 14:23 ` Jason Gunthorpe 0 siblings, 1 reply; 16+ messages in thread From: Will Deacon @ 2026-04-23 9:47 UTC (permalink / raw) To: Jason Gunthorpe Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen, Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu, linux-kernel, nh-open-source, Zeev Zilberman On Thu, Apr 23, 2026 at 10:44:08AM +0100, Will Deacon wrote: > On Wed, Apr 22, 2026 at 01:23:51PM -0300, Jason Gunthorpe wrote: > > On Wed, Apr 22, 2026 at 06:44:31AM +0000, Evangelos Petrongonas wrote: > > > The motivation is live update of the hypervisor: we want to kexec into a > > > new kernel while keeping DMA from passthrough devices flowing, which > > > means the SMMU's translation state has to survive the handover. The Live > > > Update Orchestrator work [1] and the in-progress "iommu: Add live > > > update state preservation" series [2] are building exactly this plumbing > > > on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future > > > work, and an earlier RFC from Amazon [3] sketched the same idea for > > > iommufd. > > > > It would be appropriate to keep this patch with the rest of that out > > of tree pile, for example in the series that enables s2 only support > > in smmuv3. > > > > > For this use case, Stage 2 is materially easier to persist than Stage 1, > > > for structural rather than performance reasons: > > > > I don't think so. The driver needs to know each and every STE that > > will survive KHO. The ones that don't survive need to be reset to > > abort STEs. From that point it is trivial enough to include the CD > > memory in the preservation. > > > > It would help to send a preparation series to switch the ARM STE and > > CD logic away from dma_alloc_coherent and use iommu-pages instead, > > since we only expect iommu-pages to support preservation.. 
> > Does iommu-pages provide a mechanism to map the memory as non-cacheable > if the SMMU isn't coherent? I really don't want to entertain CMOs for > the queues. Sorry, I said "queues" here but I was really referring to any of the current dma_alloc_coherent() allocations and it's the CDs that matter in this thread. The rationale being that: 1. A cacheable mapping is going to pollute the cache unnecessarily. 2. Reasoning about atomicity and ordering is a lot more subtle with CMOs. 3. It seems like a pretty invasive driver change to support live update, which isn't relevant for a lot of systems. Will ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation 2026-04-23 9:47 ` Will Deacon @ 2026-04-23 14:23 ` Jason Gunthorpe 2026-04-23 17:07 ` Will Deacon 0 siblings, 1 reply; 16+ messages in thread From: Jason Gunthorpe @ 2026-04-23 14:23 UTC (permalink / raw) To: Will Deacon Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen, Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu, linux-kernel, nh-open-source, Zeev Zilberman On Thu, Apr 23, 2026 at 10:47:49AM +0100, Will Deacon wrote: > On Thu, Apr 23, 2026 at 10:44:08AM +0100, Will Deacon wrote: > > On Wed, Apr 22, 2026 at 01:23:51PM -0300, Jason Gunthorpe wrote: > > > On Wed, Apr 22, 2026 at 06:44:31AM +0000, Evangelos Petrongonas wrote: > > > > The motivation is live update of the hypervisor: we want to kexec into a > > > > new kernel while keeping DMA from passthrough devices flowing, which > > > > means the SMMU's translation state has to survive the handover. The Live > > > > Update Orchestrator work [1] and the in-progress "iommu: Add live > > > > update state preservation" series [2] are building exactly this plumbing > > > > on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future > > > > work, and an earlier RFC from Amazon [3] sketched the same idea for > > > > iommufd. > > > > > > It would be appropriate to keep this patch with the rest of that out > > > of tree pile, for example in the series that enables s2 only support > > > in smmuv3. > > > > > > > For this use case, Stage 2 is materially easier to persist than Stage 1, > > > > for structural rather than performance reasons: > > > > > > I don't think so. The driver needs to know each and every STE that > > > will survive KHO. The ones that don't survive need to be reset to > > > abort STEs. From that point it is trivial enough to include the CD > > > memory in the preservation. 
> > > It would help to send a preparation series to switch the ARM STE and
> > > CD logic away from dma_alloc_coherent and use iommu-pages instead,
> > > since we only expect iommu-pages to support preservation..
> >
> > Does iommu-pages provide a mechanism to map the memory as non-cacheable
> > if the SMMU isn't coherent?

No, it has to use CMOs today.

It looks like all the stuff dma_alloc_coherent does to make a
non-cached mapping is pretty arch specific. I don't know if there is
a way we could make more general code get a struct page into an
uncached KVA and meet all the arch rules?

I also think dma_alloc_coherent is far too complex, with pools and
more, to support KHO.

> > I really don't want to entertain CMOs for the queues.
>
> Sorry, I said "queues" here but I was really referring to any of the
> current dma_alloc_coherent() allocations and it's the CDs that matter
> in this thread.

queues shouldn't change; they are too performance sensitive

> The rationale being that:
>
> 1. A cacheable mapping is going to pollute the cache unnecessarily.
> 2. Reasoning about atomicity and ordering is a lot more subtle with CMOs.

The page table suffers from all of these drawbacks, and the STE/CD is
touched a lot less frequently. It is kind of odd to focus on these
issues with STE/CD when the page table is a much bigger problem.

STE/CD is pretty simple now, there is only one place to put the CMO
and the ordering is all handled with that shared code. We no longer
care about ordering beyond all the writes must be visible to HW before
issuing the CMDQ invalidation command - which is the same environment
as the pagetable.

> 3. It seems like a pretty invasive driver change to support live update,
>    which isn't relevant for a lot of systems.

That's sort of the whole story of live update.. Trying to keep it
small means using the abstractions that support it like iommu-pages.
IMHO live update is OK to require coherent only, so at worst it could use iommu-pages on coherent systems and keep using the dma_alloc_coherent() for others. I also don't like this "lot of systems thing". I don't want these powerful capabilities locked up in some giant CSP's proprietary kernel. I want all the companies in the cloud market to have access to the same feature set. That's what open source is supposed to be driving toward. I have several interesting use cases for this functionality already. It will run probably $50-100B of AI cloud servers at least, I think that is enough justification. Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation 2026-04-23 14:23 ` Jason Gunthorpe @ 2026-04-23 17:07 ` Will Deacon 2026-04-23 18:43 ` Samiullah Khawaja 2026-04-23 22:37 ` Jason Gunthorpe 0 siblings, 2 replies; 16+ messages in thread From: Will Deacon @ 2026-04-23 17:07 UTC (permalink / raw) To: Jason Gunthorpe Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen, Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu, linux-kernel, nh-open-source, Zeev Zilberman On Thu, Apr 23, 2026 at 11:23:26AM -0300, Jason Gunthorpe wrote: > On Thu, Apr 23, 2026 at 10:47:49AM +0100, Will Deacon wrote: > > > Does iommu-pages provide a mechanism to map the memory as non-cacheable > > > if the SMMU isn't coherent? > > No, it has to use CMOs today. > > It looks like all the stuff dma_alloc_coherent does to make a > non-cached mapping are pretty arch specific. I don't know if there is > a way we could make more general code get a struct page into an > uncached KVA and meet all the arch rules? > > I also think dma_alloc_coherent is far to complex, with pools and > more, to support KHO. I wonder if there's scope for supporting just some subset of it? > > > I really don't want to entertain CMOs for > the queues. > > > > Sorry, I said "queues" here but I was really referring to any of the > > current dma_alloc_coherent() allocations and it's the CDs that matter > > in this thread. > > queues shouldn't change they are too performance sensitive > > > The rationale being that: > > > > 1. A cacheable mapping is going to pollute the cache unnecessarily. > > 2. Reasoning about atomicity and ordering is a lot more subtle with CMOs. > > The page table suffers from all of these draw backs, and the STE/CD is > touched alot less frequently. It is kind of odd to focus on these > issues with STE/CD when page table is a much bigger problem. 
I don't think it's that odd given that the STE/CD entries are bigger than PTEs and the SMMU permits a lot more relaxations about how they are accessed and cached compared to the PTW. Having said that, the page-table code looks broken to me even in the coherent case: ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data); as the compiler can theoretically make a right mess of that. The non-coherent case looks more fragile, because I don't _think_ the architecture provides any ordering or atomicity guarantees about cache cleaning to the PoC. Presumably, the correct sequence would be to write the PTE with the valid bit clear, do the CMO (with completion barrier), *then* write the bottom byte with the valid bit set and do another CMO. Sounds great! > STE/CD is pretty simple now, there is only one place to put the CMO > and the ordering is all handled with that shared code. We no longer > care about ordering beyond all the writes must be visible to HW before > issuing the CMDQ invalidation command - which is the same environment > as the pagetable. You presumably rely on 64-bit single-copy atomicity for hitless updates, no? > > 3. It seems like a pretty invasive driver change to support live update, > > which isn't relevant for a lot of systems. > > That's sort of the whole story of live update.. Trying to keep it > small means using the abstractions that support it like iommu-pages. > > IMHO live update is OK to require coherent only, so at worst it could > use iommu-pages on coherent systems and keep using the > dma_alloc_coherent() for others. That would be unfortunate, but if we can wrap the two allocators in some common helpers then it's probably fine. > I also don't like this "lot of systems thing". I don't want these > powerful capabilities locked up in some giant CSP's proprietary > kernel. I want all the companies in the cloud market to have access > to the same feature set. That's what open source is supposed to be > driving toward. 
> I have several interesting use cases for this
> functionality already.

Sorry, the point here was definitely _not_ about keeping this out of
tree, nor was I trying to say that this stuff isn't important. But the
mobile world doesn't give a hoot about KHO and _does_ tend to care about
the impact of CMO, so we have to find a way to balance the two worlds.

> It will run probably $50-100B of AI cloud servers at least, I think
> that is enough justification.

I wasn't asking for justification but I honestly don't care about the
money involved :) People need this, so we should find a way to support
it -- it just needs to fit in with everything else.

Will

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation 2026-04-23 17:07 ` Will Deacon @ 2026-04-23 18:43 ` Samiullah Khawaja 2026-04-23 22:37 ` Jason Gunthorpe 1 sibling, 0 replies; 16+ messages in thread From: Samiullah Khawaja @ 2026-04-23 18:43 UTC (permalink / raw) To: Will Deacon Cc: Jason Gunthorpe, Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen, Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu, linux-kernel, nh-open-source, Zeev Zilberman, dmatlack, pasha.tatashin On Thu, Apr 23, 2026 at 06:07:23PM +0100, Will Deacon wrote: >On Thu, Apr 23, 2026 at 11:23:26AM -0300, Jason Gunthorpe wrote: >> On Thu, Apr 23, 2026 at 10:47:49AM +0100, Will Deacon wrote: >> > > Does iommu-pages provide a mechanism to map the memory as non-cacheable >> > > if the SMMU isn't coherent? >> >> No, it has to use CMOs today. >> >> It looks like all the stuff dma_alloc_coherent does to make a >> non-cached mapping are pretty arch specific. I don't know if there is >> a way we could make more general code get a struct page into an >> uncached KVA and meet all the arch rules? >> >> I also think dma_alloc_coherent is far to complex, with pools and >> more, to support KHO. Agreed. dma_alloc_* is too complex with pools, CMAs and what not to support fully in KHO. > >I wonder if there's scope for supporting just some subset of it? We have been experimenting with something like this. We have a usecase where memory needs to be preserved but we want to avoid invasive changes in the driver. If it's not a crazy idea, maybe we can start with a very limited scope of providing preservation for a subset of allocations done through the DMA API? I can send out my proof of concept as an RFC after I'm done with the next revision of my IOMMU persistence series. WDYT? Sami ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
  2026-04-23 17:07 ` Will Deacon
  2026-04-23 18:43   ` Samiullah Khawaja
@ 2026-04-23 22:37   ` Jason Gunthorpe
  2026-04-24 15:16     ` Will Deacon
  1 sibling, 1 reply; 16+ messages in thread

From: Jason Gunthorpe @ 2026-04-23 22:37 UTC (permalink / raw)
To: Will Deacon
Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen,
    Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu, linux-kernel,
    nh-open-source, Zeev Zilberman

On Thu, Apr 23, 2026 at 06:07:23PM +0100, Will Deacon wrote:
> I don't think it's that odd given that the STE/CD entries are bigger
> than PTEs and the SMMU permits a lot more relaxations about how they are
> accessed and cached compared to the PTW.

Well I'm not sure bigger really matters, but I wasn't aware there was
a spec relaxation here that would make the cacheable path not viable
for STE but not PTW...

> Having said that, the page-table code looks broken to me even in the
> coherent case:
>
>   ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data);
>
> as the compiler can theoretically make a right mess of that.

Heh, great. The iommupt stuff does better.. It does a 64 bit cmpxchg
to store a table pointer and a 64 bit WRITE_ONCE to store the pte,
then a CMO through the DMA API.

DMA API has to guarantee the right ordering, so we only have the
question below:

> > STE/CD is pretty simple now, there is only one place to put the CMO
> > and the ordering is all handled with that shared code. We no longer
> > care about ordering beyond all the writes must be visible to HW before
> > issuing the CMDQ invalidation command - which is the same environment
> > as the pagetable.
>
> You presumably rely on 64-bit single-copy atomicity for hitless updates,
> no?

Yes, just like the page table does.. I hope that's not a problem or we
have an issue with the PTW :)

> > I also don't like this "lot of systems thing".
I don't want these > > powerful capabilities locked up in some giant CSP's proprietary > > kernel. I want all the companies in the cloud market to have access > > to the same feature set. That's what open source is supposed to be > > driving toward. I have several interesting use cases for this > > functionality already. > > Sorry, the point here was definitely _not_ about keeping this out of > tree, nor was I trying to say that this stuff isn't important. But the > mobile world doesn't give a hoot about KHO and _does_ tend to care about > the impact of CMO, so we have to find a way to balance the two worlds. Yes, that make sense. My argument is that the CMO on STE/CD shouldn't bother mobile, you could even view it as an micro-optimization because we do occasionally read-back the STE/CD fields. But if you say the SMM STE/CD fetch doesn't have to follow the single copy rules and PTW does, then ok.. And if Samiullah can tackle dma_alloc_coherent then maybe the whole question is moot. Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
  2026-04-23 22:37 ` Jason Gunthorpe
@ 2026-04-24 15:16 ` Will Deacon
  2026-04-24 15:42 ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Will Deacon @ 2026-04-24 15:16 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen,
	Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
	linux-kernel, nh-open-source, Zeev Zilberman

On Thu, Apr 23, 2026 at 07:37:16PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 23, 2026 at 06:07:23PM +0100, Will Deacon wrote:
> > I don't think it's that odd given that the STE/CD entries are bigger
> > than PTEs and the SMMU permits a lot more relaxations about how they are
> > accessed and cached compared to the PTW.
>
> Well, I'm not sure bigger really matters, but I wasn't aware there was
> a spec relaxation here that would make the cacheable path not viable
> for STE but not PTW...

Things like the SMMU being allowed to cache invalid structures and
loading structures using multiple, unordered accesses are the things
that worry me relative to the page-tables. But see below.

> > Having said that, the page-table code looks broken to me even in the
> > coherent case:
> >
> > 	ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data);
> >
> > as the compiler can theoretically make a right mess of that.
>
> Heh, great. The iommupt stuff does better: it does a 64-bit cmpxchg
> to store a table pointer and a 64-bit WRITE_ONCE to store the pte,
> then a CMO through the DMA API.
>
> The DMA API has to guarantee the right ordering, so we only have the
> question below:
>
> > > STE/CD is pretty simple now, there is only one place to put the CMO
> > > and the ordering is all handled with that shared code. We no longer
> > > care about ordering beyond all the writes must be visible to HW before
> > > issuing the CMDQ invalidation command - which is the same environment
> > > as the pagetable.
> >
> > You presumably rely on 64-bit single-copy atomicity for hitless updates,
> > no?
>
> Yes, just like the page table does.
>
> I hope that's not a problem, or we have an issue with the PTW :)

You trimmed the part from my reply where I think we _do_ have an issue
with the PTW. Here it is again:

The non-coherent case looks more fragile, because I don't _think_ the
architecture provides any ordering or atomicity guarantees about cache
cleaning to the PoC. Presumably, the correct sequence would be to write
the PTE with the valid bit clear, do the CMO (with completion barrier),
*then* write the bottom byte with the valid bit set and do another CMO.

> > > I also don't like this "lot of systems thing". I don't want these
> > > powerful capabilities locked up in some giant CSP's proprietary
> > > kernel. I want all the companies in the cloud market to have access
> > > to the same feature set. That's what open source is supposed to be
> > > driving toward. I have several interesting use cases for this
> > > functionality already.
> >
> > Sorry, the point here was definitely _not_ about keeping this out of
> > tree, nor was I trying to say that this stuff isn't important. But the
> > mobile world doesn't give a hoot about KHO and _does_ tend to care about
> > the impact of CMO, so we have to find a way to balance the two worlds.
>
> Yes, that makes sense.
>
> My argument is that the CMO on STE/CD shouldn't bother mobile; you
> could even view it as a micro-optimization, because we do occasionally
> read back the STE/CD fields.

I was against that read-back, iirc :)

> And if Samiullah can tackle dma_alloc_coherent then maybe the whole
> question is moot.

Yes, that would be great, but we probably need to fix the page-table
code too.

Will

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
  2026-04-24 15:16 ` Will Deacon
@ 2026-04-24 15:42 ` Jason Gunthorpe
  2026-04-24 16:01 ` Will Deacon
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2026-04-24 15:42 UTC (permalink / raw)
To: Will Deacon
Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen,
	Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
	linux-kernel, nh-open-source, Zeev Zilberman

On Fri, Apr 24, 2026 at 04:16:17PM +0100, Will Deacon wrote:
> > > > STE/CD is pretty simple now, there is only one place to put the CMO
> > > > and the ordering is all handled with that shared code. We no longer
> > > > care about ordering beyond all the writes must be visible to HW before
> > > > issuing the CMDQ invalidation command - which is the same environment
> > > > as the pagetable.
> > >
> > > You presumably rely on 64-bit single-copy atomicity for hitless updates,
> > > no?
> >
> > Yes, just like the page table does.
> >
> > I hope that's not a problem, or we have an issue with the PTW :)
>
> You trimmed the part from my reply where I think we _do_ have an issue
> with the PTW. Here it is again:
>
> The non-coherent case looks more fragile, because I don't _think_ the
> architecture provides any ordering or atomicity guarantees about cache
> cleaning to the PoC. Presumably, the correct sequence would be to write
> the PTE with the valid bit clear, do the CMO (with completion barrier),
> *then* write the bottom byte with the valid bit set and do another CMO.

I wasn't sure if you were being serious.

CMO + barriers must provide an ordering guarantee about cache cleaning
to the PoC, otherwise the entire Linux DMA API is broken. dma_sync must
order with following device DMA. IMHO that's not negotiable for Linux.

All ARM IOMMUs rely on 64-bit atomic, non-tearing stores. No bugs
reported?

Any fix to that is going to have major performance downsides.

I also strongly suspect it is provided on real HW. It would be hard to
even build HW where <= 64-bit quanta can tear.

Maybe this is something ARM should take a look at. At the very least it
would warrant an IORT flag for safe HW to use to opt into the faster
cacheable flow.

> > My argument is that the CMO on STE/CD shouldn't bother mobile; you
> > could even view it as a micro-optimization, because we do occasionally
> > read back the STE/CD fields.
>
> I was against that read-back, iirc :)

Yes, but it is OK :)

> > And if Samiullah can tackle dma_alloc_coherent then maybe the whole
> > question is moot.
>
> Yes, that would be great, but we probably need to fix the page-table
> code too.

You really want to deal with the likely perf regressions that would
cause on Android/etc?

Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
  2026-04-24 15:42 ` Jason Gunthorpe
@ 2026-04-24 16:01 ` Will Deacon
  2026-04-24 16:39 ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Will Deacon @ 2026-04-24 16:01 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen,
	Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
	linux-kernel, nh-open-source, Zeev Zilberman

On Fri, Apr 24, 2026 at 12:42:56PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 24, 2026 at 04:16:17PM +0100, Will Deacon wrote:
> > > > > STE/CD is pretty simple now, there is only one place to put the CMO
> > > > > and the ordering is all handled with that shared code. We no longer
> > > > > care about ordering beyond all the writes must be visible to HW before
> > > > > issuing the CMDQ invalidation command - which is the same environment
> > > > > as the pagetable.
> > > >
> > > > You presumably rely on 64-bit single-copy atomicity for hitless updates,
> > > > no?
> > >
> > > Yes, just like the page table does.
> > >
> > > I hope that's not a problem, or we have an issue with the PTW :)
> >
> > You trimmed the part from my reply where I think we _do_ have an issue
> > with the PTW. Here it is again:
> >
> > The non-coherent case looks more fragile, because I don't _think_ the
> > architecture provides any ordering or atomicity guarantees about cache
> > cleaning to the PoC. Presumably, the correct sequence would be to write
> > the PTE with the valid bit clear, do the CMO (with completion barrier),
> > *then* write the bottom byte with the valid bit set and do another CMO.
>
> I wasn't sure if you were being serious.
>
> CMO + barriers must provide an ordering guarantee about cache cleaning
> to the PoC, otherwise the entire Linux DMA API is broken. dma_sync must
> order with following device DMA. IMHO that's not negotiable for Linux.

The problem is with concurrent DMA (from the page-table walker) and I
don't see anything that guarantees that in the CPU architecture. I don't
think the streaming DMA API pretends to handle that case, does it? It
relies on a pretty rigid ownership concept from what I understand.

> All ARM IOMMUs rely on 64-bit atomic, non-tearing stores. No bugs
> reported?

It's hard to judge, as I don't think SMMUs tend to perform a lot of
speculative address translation when DMA isn't active.

> Any fix to that is going to have major performance downsides.
>
> I also strongly suspect it is provided on real HW. It would be hard to
> even build HW where <= 64-bit quanta can tear.
>
> Maybe this is something ARM should take a look at.

Yes, we should ask. Maybe I missed something in the Arm ARM, but I can
also see it being a pain to specify this behaviour all the way out to
the PoC, and I wouldn't be so bold as to say that it's hard to build HW
that would exhibit problems here.

> > > And if Samiullah can tackle dma_alloc_coherent then maybe the whole
> > > question is moot.
> >
> > Yes, that would be great, but we probably need to fix the page-table
> > code too.
>
> You really want to deal with the likely perf regressions that would
> cause on Android/etc?

Of course I'd rather that the architecture said that our current code
is fine, but if it doesn't then I don't have much choice, really. At the
very least, we should minimise the number of places where we rely on
non-architected behaviour, and so keeping the CDs and STEs non-cacheable
remains my preference.

Will

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
  2026-04-24 16:01 ` Will Deacon
@ 2026-04-24 16:39 ` Jason Gunthorpe
  0 siblings, 0 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2026-04-24 16:39 UTC (permalink / raw)
To: Will Deacon
Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen,
	Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
	linux-kernel, nh-open-source, Zeev Zilberman

On Fri, Apr 24, 2026 at 05:01:27PM +0100, Will Deacon wrote:
> On Fri, Apr 24, 2026 at 12:42:56PM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 24, 2026 at 04:16:17PM +0100, Will Deacon wrote:
> > > > > > STE/CD is pretty simple now, there is only one place to put the CMO
> > > > > > and the ordering is all handled with that shared code. We no longer
> > > > > > care about ordering beyond all the writes must be visible to HW before
> > > > > > issuing the CMDQ invalidation command - which is the same environment
> > > > > > as the pagetable.
> > > > >
> > > > > You presumably rely on 64-bit single-copy atomicity for hitless updates,
> > > > > no?
> > > >
> > > > Yes, just like the page table does.
> > > >
> > > > I hope that's not a problem, or we have an issue with the PTW :)
> > >
> > > You trimmed the part from my reply where I think we _do_ have an issue
> > > with the PTW. Here it is again:
> > >
> > > The non-coherent case looks more fragile, because I don't _think_ the
> > > architecture provides any ordering or atomicity guarantees about cache
> > > cleaning to the PoC. Presumably, the correct sequence would be to write
> > > the PTE with the valid bit clear, do the CMO (with completion barrier),
> > > *then* write the bottom byte with the valid bit set and do another CMO.
> >
> > I wasn't sure if you were being serious.
> >
> > CMO + barriers must provide an ordering guarantee about cache cleaning
> > to the PoC, otherwise the entire Linux DMA API is broken. dma_sync must
> > order with following device DMA. IMHO that's not negotiable for Linux.
>
> The problem is with concurrent DMA (from the page-table walker) and I
> don't see anything that guarantees that in the CPU architecture. I don't
> think the streaming DMA API pretends to handle that case, does it? It
> relies on a pretty rigid ownership concept from what I understand.

I think you pointed out two things: ordering and tearing.

Ordering is OK. If I write a PTE, dma_sync, then command a device to
use that IOVA, the PTW must observe the new PTE value. Otherwise
dma_sync isn't doing what Linux requires.

Tearing is a different issue: if the device uses the IOVA and races
with the PTE write changing it, then you say maybe it can mis-read it
with tearing. However, this race only happens if the PTE is currently
non-valid or being changed to non-valid, meaning randomly you will be
getting an invalid IOVA event.

In non-coherent mode we don't allow SVA and we don't allow VFIO; only
the DMA API and drivers open-coding things. For VFIO and SVA, yes, we
need the HW to work properly: userspace can trigger invalid IOVA, and
we can't tolerate a corrupted PTE.

In embedded I suppose you could make an argument that you don't care
about it, since an invalid IOVA would have to be caused by a buggy
kernel driver, it should never happen, and thus this is really a debug
feature. If the race will never be hit in a working system, maybe it is
fine to leave it as is. Would be good to document this detail :)

> Of course I'd rather that the architecture said that our current code
> is fine, but if it doesn't then I don't have much choice, really. At the
> very least, we should minimise the number of places where we rely on
> non-architected behaviour and so keeping the CDs and STEs non-cacheable
> remains my preference.

So, I am convinced: the PTW has that escape above that doesn't apply to
STE/CD. Those can be accessed truly at any time, and we can't ever
leave a 64-bit value in a strange state.

Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2026-04-24 16:39 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-20 12:32 [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation Evangelos Petrongonas
2026-04-20 12:40 ` Jason Gunthorpe
2026-04-22  6:44 ` Evangelos Petrongonas
2026-04-22 15:44 ` Pranjal Shrivastava
2026-04-22 16:23 ` Jason Gunthorpe
2026-04-22 16:36 ` Robin Murphy
2026-04-23  9:44 ` Will Deacon
2026-04-23  9:47 ` Will Deacon
2026-04-23 14:23 ` Jason Gunthorpe
2026-04-23 17:07 ` Will Deacon
2026-04-23 18:43 ` Samiullah Khawaja
2026-04-23 22:37 ` Jason Gunthorpe
2026-04-24 15:16 ` Will Deacon
2026-04-24 15:42 ` Jason Gunthorpe
2026-04-24 16:01 ` Will Deacon
2026-04-24 16:39 ` Jason Gunthorpe