* [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
@ 2026-04-20 12:32 Evangelos Petrongonas
2026-04-20 12:40 ` Jason Gunthorpe
0 siblings, 1 reply; 12+ messages in thread
From: Evangelos Petrongonas @ 2026-04-20 12:32 UTC (permalink / raw)
To: Will Deacon
Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel,
Jason Gunthorpe, Nicolin Chen, Pranjal Shrivastava, Lu Baolu,
linux-arm-kernel, iommu, linux-kernel, nh-open-source,
Zeev Zilberman
When the hardware advertises both Stage 1 and Stage 2 translation, the
driver prefers Stage 1 for DMA domain allocation and only falls back to
Stage 2 if Stage 1 is not supported.
Some configurations may want to force Stage 2 translation even when the
hardware supports Stage 1. Introduce a module parameter 'disable_s1'
that, when set, prevents the driver from advertising
ARM_SMMU_FEAT_TRANS_S1, causing all DMA domains to use Stage 2 instead.
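As a usage sketch (illustrative only; this assumes the driver's usual
arm_smmu_v3 module name, so adjust if your build differs):

```
# Built-in driver: set the parameter on the kernel command line
arm_smmu_v3.disable_s1=1

# Loadable module: pass it at load time
modprobe arm_smmu_v3 disable_s1=1

# The 0444 permissions expose it read-only at runtime
cat /sys/module/arm_smmu_v3/parameters/disable_s1
```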
Co-developed-by: Zeev Zilberman <zeev@amazon.com>
Signed-off-by: Zeev Zilberman <zeev@amazon.com>
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index e8d7dbe495f0..afb21c210e24 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -39,6 +39,11 @@ module_param(disable_msipolling, bool, 0444);
MODULE_PARM_DESC(disable_msipolling,
"Disable MSI-based polling for CMD_SYNC completion.");
+static bool disable_s1;
+module_param(disable_s1, bool, 0444);
+MODULE_PARM_DESC(disable_s1,
+ "Disable Stage 1 translation even if supported by hardware.");
+
static const struct iommu_ops arm_smmu_ops;
static struct iommu_dirty_ops arm_smmu_dirty_ops;
@@ -5087,13 +5092,13 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
smmu->features |= ARM_SMMU_FEAT_STALLS;
}
- if (reg & IDR0_S1P)
+ if ((reg & IDR0_S1P) && !disable_s1)
smmu->features |= ARM_SMMU_FEAT_TRANS_S1;
if (reg & IDR0_S2P)
smmu->features |= ARM_SMMU_FEAT_TRANS_S2;
- if (!(reg & (IDR0_S1P | IDR0_S2P))) {
+ if (!(smmu->features & (ARM_SMMU_FEAT_TRANS_S1 | ARM_SMMU_FEAT_TRANS_S2))) {
dev_err(smmu->dev, "no translation support!\n");
return -ENXIO;
}
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
2026-04-20 12:32 [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation Evangelos Petrongonas
@ 2026-04-20 12:40 ` Jason Gunthorpe
2026-04-22 6:44 ` Evangelos Petrongonas
0 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2026-04-20 12:40 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Will Deacon, Robin Murphy, Joerg Roedel, Nicolin Chen,
Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
linux-kernel, nh-open-source, Zeev Zilberman
On Mon, Apr 20, 2026 at 12:32:01PM +0000, Evangelos Petrongonas wrote:
> When the hardware advertises both Stage 1 and Stage 2 translation, the
> driver prefers Stage 1 for DMA domain allocation and only falls back to
> Stage 2 if Stage 1 is not supported.
>
> Some configurations may want to force Stage 2 translation even when the
> hardware supports Stage 1.
Why? You really need to explain why for a patch like this.
If there really is some HW issue I think it is more appropriate to get
an IORT flag or IDR detection that the HW has a problem.
Jason
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
2026-04-20 12:40 ` Jason Gunthorpe
@ 2026-04-22 6:44 ` Evangelos Petrongonas
2026-04-22 15:44 ` Pranjal Shrivastava
2026-04-22 16:23 ` Jason Gunthorpe
0 siblings, 2 replies; 12+ messages in thread
From: Evangelos Petrongonas @ 2026-04-22 6:44 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Will Deacon, Robin Murphy, Joerg Roedel, Nicolin Chen,
Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
linux-kernel, nh-open-source, Zeev Zilberman
On Mon, Apr 20, 2026 at 09:40:32AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 20, 2026 at 12:32:01PM +0000, Evangelos Petrongonas wrote:
> > When the hardware advertises both Stage 1 and Stage 2 translation, the
> > driver prefers Stage 1 for DMA domain allocation and only falls back to
> > Stage 2 if Stage 1 is not supported.
> >
> > Some configurations may want to force Stage 2 translation even when the
> > hardware supports Stage 1.
>
> Why? You really need to explain why for a patch like this.
>
> If there really is some HW issue I think it is more appropriate to get
> an IORT flag or IDR detection that the HW has a problem.
It's not a hardware bug, so there's no IORT or IDR bit that would make
sense here.
The motivation is live update of the hypervisor: we want to kexec into a
new kernel while keeping DMA from passthrough devices flowing, which
means the SMMU's translation state has to survive the handover. The Live
Update Orchestrator work [1] and the in-progress "iommu: Add live
update state preservation" series [2] are building exactly this plumbing
on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future
work, and an earlier RFC from Amazon [3] sketched the same idea for
iommufd.
For this use case, Stage 2 is materially easier to persist than Stage 1,
for structural rather than performance reasons: an S2 STE carries the
whole translation configuration inline, so to hand over an S2 domain the
pre-kexec kernel only needs to preserve the stream table pages and the
S2 pgtable pages. An S1 STE instead points at a Context Descriptor (CD)
table, so persisting S1 requires preserving the CD table pages as well
and, because CDs are keyed by ASID, coordinating ASID identity across
the handover.
In the long term the plan should be to persist both stages. However,
until a series that properly introduces SMMU live update support is
developed and posted, we would like to experiment on S1+S2-capable
hardware with an easier-to-implement handover machinery that relies on
S2 translations.
[1] https://lwn.net/Articles/1021442/ — Live Update Orchestrator
[2] https://lore.kernel.org/all/20260203220948.2176157-1-skhawaja@google.com/ —
[PATCH 00/14] iommu: Add live update state preservation
[3] https://lore.kernel.org/all/20240916113102.710522-1-jgowans@amazon.com/ — [RFC
PATCH 00/13] Support iommu(fd) persistence for live update
> Jason
Kind Regards,
Evangelos
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
2026-04-22 6:44 ` Evangelos Petrongonas
@ 2026-04-22 15:44 ` Pranjal Shrivastava
2026-04-22 16:23 ` Jason Gunthorpe
1 sibling, 0 replies; 12+ messages in thread
From: Pranjal Shrivastava @ 2026-04-22 15:44 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Jason Gunthorpe, Will Deacon, Robin Murphy, Joerg Roedel,
Nicolin Chen, Lu Baolu, linux-arm-kernel, iommu, linux-kernel,
nh-open-source, Zeev Zilberman
On Wed, Apr 22, 2026 at 06:44:31AM +0000, Evangelos Petrongonas wrote:
> On Mon, Apr 20, 2026 at 09:40:32AM -0300 Jason Gunthorpe wrote:
> > On Mon, Apr 20, 2026 at 12:32:01PM +0000, Evangelos Petrongonas wrote:
> > > When the hardware advertises both Stage 1 and Stage 2 translation, the
> > > driver prefers Stage 1 for DMA domain allocation and only falls back to
> > > Stage 2 if Stage 1 is not supported.
> > >
> > > Some configurations may want to force Stage 2 translation even when the
> > > hardware supports Stage 1.
> >
> > Why? You really need to explain why for a patch like this.
> >
> > If there really is some HW issue I think it is more appropriate to get
> > an IORT flag or IDR detection that the HW has a problem.
>
> It's not a hardware bug there's no IORT or IDR bit that would make sense
> here.
>
> The motivation is live update of the hypervisor: we want to kexec into a
> new kernel while keeping DMA from passthrough devices flowing, which
> means the SMMU's translation state has to survive the handover. The Live
> Update Orchestrator work [1] and the in-progress "iommu: Add live
> update state preservation" series [2] are building exactly this plumbing
> on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future
> work, and an earlier RFC from Amazon [3] sketched the same idea for
> iommufd.
>
> For this use case, Stage 2 is materially easier to persist than Stage 1,
> for structural rather than performance reasons: an S2 STE carries the
> whole translation configuration inline, so to hand over an S2 domain the
> pre-kexec kernel only needs to preserve the stream table pages and the
> S2 pgtable pages. An S1 STE instead points at a Context Descriptor (CD)
> table, so persisting S1 requires preserving the CD table pages as well
> and, because CDs are keyed by ASID, coordinating ASID identity across
> the handover.
>
> In the long term the plan should be to persist both stages. However,
> until a series that properly introduces SMMU live update support is
> developed and posted, we would like to experiment on S1+S2-capable
> hardware with an easier-to-implement handover machinery that relies on
> S2 translations.
>
Hi Evangelos,
We (Google) currently have a series in the works specifically for
arm-smmu-v3 state preservation. Our plan is to post it in phases (S2
preservation first then the S1 + CD series) once the iommu: liveupdate
persistence series has stabilized.
Since the iommu core liveupdate framework itself is still in flux,
it's a bit premature to accept/merge this patch before both series land.
Furthermore, it must be noted that even once the iommu liveupdate series
is merged, liveupdate will remain essentially non-functional ('broken')
for drivers that haven't yet implemented the necessary support hooks,
until the framework is fully integrated with the SMMU driver.
We’d prefer to wait until the core infrastructure is solid so we can
ensure the SMMUv3 implementation aligns perfectly with the final
requirements of the iommu liveupdate persistence series.
That said, we don't mind posting our arm-smmu-v3 series with S2-only
preservation as an early RFC if that helps align on the design and
implementation details.
Thanks,
Praan
> [1] https://lwn.net/Articles/1021442/ — Live Update Orchestrator
> [2] https://lore.kernel.org/all/20260203220948.2176157-1-skhawaja@google.com/ —
> [PATCH 00/14] iommu: Add live update state preservation
> [3] https://lore.kernel.org/all/20240916113102.710522-1-jgowans@amazon.com/ — [RFC
> PATCH 00/13] Support iommu(fd) persistence for live update
>
> > Jason
>
> Kind Regards,
> Evangelos
>
>
>
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
2026-04-22 6:44 ` Evangelos Petrongonas
2026-04-22 15:44 ` Pranjal Shrivastava
@ 2026-04-22 16:23 ` Jason Gunthorpe
2026-04-22 16:36 ` Robin Murphy
2026-04-23 9:44 ` Will Deacon
1 sibling, 2 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-04-22 16:23 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Will Deacon, Robin Murphy, Joerg Roedel, Nicolin Chen,
Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
linux-kernel, nh-open-source, Zeev Zilberman
On Wed, Apr 22, 2026 at 06:44:31AM +0000, Evangelos Petrongonas wrote:
> The motivation is live update of the hypervisor: we want to kexec into a
> new kernel while keeping DMA from passthrough devices flowing, which
> means the SMMU's translation state has to survive the handover. The Live
> Update Orchestrator work [1] and the in-progress "iommu: Add live
> update state preservation" series [2] are building exactly this plumbing
> on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future
> work, and an earlier RFC from Amazon [3] sketched the same idea for
> iommufd.
It would be appropriate to keep this patch with the rest of that out
of tree pile, for example in the series that enables s2 only support
in smmuv3.
> For this use case, Stage 2 is materially easier to persist than Stage 1,
> for structural rather than performance reasons:
I don't think so. The driver needs to know each and every STE that
will survive KHO. The ones that don't survive need to be reset to
abort STEs. From that point it is trivial enough to include the CD
memory in the preservation.
It would help to send a preparation series to switch the ARM STE and
CD logic away from dma_alloc_coherent and use iommu-pages instead,
since we only expect iommu-pages to support preservation.
I could maybe see only supporting non-PASID as a first-series, but a
CD table with SSID 0 only populated is still pretty trivial.
Jason
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
2026-04-22 16:23 ` Jason Gunthorpe
@ 2026-04-22 16:36 ` Robin Murphy
2026-04-23 9:44 ` Will Deacon
1 sibling, 0 replies; 12+ messages in thread
From: Robin Murphy @ 2026-04-22 16:36 UTC (permalink / raw)
To: Jason Gunthorpe, Evangelos Petrongonas
Cc: Will Deacon, Joerg Roedel, Nicolin Chen, Pranjal Shrivastava,
Lu Baolu, linux-arm-kernel, iommu, linux-kernel, nh-open-source,
Zeev Zilberman
On 2026-04-22 5:23 pm, Jason Gunthorpe wrote:
> On Wed, Apr 22, 2026 at 06:44:31AM +0000, Evangelos Petrongonas wrote:
>> The motivation is live update of the hypervisor: we want to kexec into a
>> new kernel while keeping DMA from passthrough devices flowing, which
>> means the SMMU's translation state has to survive the handover. The Live
>> Update Orchestrator work [1] and the in-progress "iommu: Add live
>> update state preservation" series [2] are building exactly this plumbing
>> on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future
>> work, and an earlier RFC from Amazon [3] sketched the same idea for
>> iommufd.
>
> It would be appropriate to keep this patch with the rest of that out
> of tree pile, for example in the series that enables s2 only support
> in smmuv3.
Or even better, just make sure that whatever hypervisor supports this
half-finished WIP mechanism also uses IOMMU_HWPT_ALLOC_NEST_PARENT to
explicitly get stage 2 domains for VM-assigned devices in the first
place, rather than swing a big hammer at the kernel (that takes out
SVA/PASID support as collateral damage...)
Thanks,
Robin.
>> For this use case, Stage 2 is materially easier to persist than Stage 1,
>> for structural rather than performance reasons:
>
> I don't think so. The driver needs to know each and every STE that
> will survive KHO. The ones that don't survive need to be reset to
> abort STEs. From that point it is trivial enough to include the CD
> memory in the preservation.
>
> It would help to send a preparation series to switch the ARM STE and
> CD logic away from dma_alloc_coherent and use iommu-pages instead,
> since we only expect iommu-pages to support preservation..
>
> I could maybe see only supporting non-PASID as a first-series, but a
> CD table with SSID 0 only populated is still pretty trivial.
>
> Jason
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
2026-04-22 16:23 ` Jason Gunthorpe
2026-04-22 16:36 ` Robin Murphy
@ 2026-04-23 9:44 ` Will Deacon
2026-04-23 9:47 ` Will Deacon
1 sibling, 1 reply; 12+ messages in thread
From: Will Deacon @ 2026-04-23 9:44 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen,
Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
linux-kernel, nh-open-source, Zeev Zilberman
On Wed, Apr 22, 2026 at 01:23:51PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 22, 2026 at 06:44:31AM +0000, Evangelos Petrongonas wrote:
> > The motivation is live update of the hypervisor: we want to kexec into a
> > new kernel while keeping DMA from passthrough devices flowing, which
> > means the SMMU's translation state has to survive the handover. The Live
> > Update Orchestrator work [1] and the in-progress "iommu: Add live
> > update state preservation" series [2] are building exactly this plumbing
> > on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future
> > work, and an earlier RFC from Amazon [3] sketched the same idea for
> > iommufd.
>
> It would be appropriate to keep this patch with the rest of that out
> of tree pile, for example in the series that enables s2 only support
> in smmuv3.
>
> > For this use case, Stage 2 is materially easier to persist than Stage 1,
> > for structural rather than performance reasons:
>
> I don't think so. The driver needs to know each and every STE that
> will survive KHO. The ones that don't survive need to be reset to
> abort STEs. From that point it is trivial enough to include the CD
> memory in the preservation.
>
> It would help to send a preparation series to switch the ARM STE and
> CD logic away from dma_alloc_coherent and use iommu-pages instead,
> since we only expect iommu-pages to support preservation..
Does iommu-pages provide a mechanism to map the memory as non-cacheable
if the SMMU isn't coherent? I really don't want to entertain CMOs for
the queues.
Will
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
2026-04-23 9:44 ` Will Deacon
@ 2026-04-23 9:47 ` Will Deacon
2026-04-23 14:23 ` Jason Gunthorpe
0 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2026-04-23 9:47 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen,
Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
linux-kernel, nh-open-source, Zeev Zilberman
On Thu, Apr 23, 2026 at 10:44:08AM +0100, Will Deacon wrote:
> On Wed, Apr 22, 2026 at 01:23:51PM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 22, 2026 at 06:44:31AM +0000, Evangelos Petrongonas wrote:
> > > The motivation is live update of the hypervisor: we want to kexec into a
> > > new kernel while keeping DMA from passthrough devices flowing, which
> > > means the SMMU's translation state has to survive the handover. The Live
> > > Update Orchestrator work [1] and the in-progress "iommu: Add live
> > > update state preservation" series [2] are building exactly this plumbing
> > > on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future
> > > work, and an earlier RFC from Amazon [3] sketched the same idea for
> > > iommufd.
> >
> > It would be appropriate to keep this patch with the rest of that out
> > of tree pile, for example in the series that enables s2 only support
> > in smmuv3.
> >
> > > For this use case, Stage 2 is materially easier to persist than Stage 1,
> > > for structural rather than performance reasons:
> >
> > I don't think so. The driver needs to know each and every STE that
> > will survive KHO. The ones that don't survive need to be reset to
> > abort STEs. From that point it is trivial enough to include the CD
> > memory in the preservation.
> >
> > It would help to send a preparation series to switch the ARM STE and
> > CD logic away from dma_alloc_coherent and use iommu-pages instead,
> > since we only expect iommu-pages to support preservation..
>
> Does iommu-pages provide a mechanism to map the memory as non-cacheable
> if the SMMU isn't coherent? I really don't want to entertain CMOs for
> the queues.
Sorry, I said "queues" here but I was really referring to any of the
current dma_alloc_coherent() allocations, and it's the CDs that matter
in this thread.
The rationale being that:
1. A cacheable mapping is going to pollute the cache unnecessarily.
2. Reasoning about atomicity and ordering is a lot more subtle with CMOs.
3. It seems like a pretty invasive driver change to support live update,
which isn't relevant for a lot of systems.
Will
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
2026-04-23 9:47 ` Will Deacon
@ 2026-04-23 14:23 ` Jason Gunthorpe
2026-04-23 17:07 ` Will Deacon
0 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2026-04-23 14:23 UTC (permalink / raw)
To: Will Deacon
Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen,
Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
linux-kernel, nh-open-source, Zeev Zilberman
On Thu, Apr 23, 2026 at 10:47:49AM +0100, Will Deacon wrote:
> On Thu, Apr 23, 2026 at 10:44:08AM +0100, Will Deacon wrote:
> > On Wed, Apr 22, 2026 at 01:23:51PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Apr 22, 2026 at 06:44:31AM +0000, Evangelos Petrongonas wrote:
> > > > The motivation is live update of the hypervisor: we want to kexec into a
> > > > new kernel while keeping DMA from passthrough devices flowing, which
> > > > means the SMMU's translation state has to survive the handover. The Live
> > > > Update Orchestrator work [1] and the in-progress "iommu: Add live
> > > > update state preservation" series [2] are building exactly this plumbing
> > > > on top of KHO; [2]'s cover letter calls out Arm SMMUv3 support as future
> > > > work, and an earlier RFC from Amazon [3] sketched the same idea for
> > > > iommufd.
> > >
> > > It would be appropriate to keep this patch with the rest of that out
> > > of tree pile, for example in the series that enables s2 only support
> > > in smmuv3.
> > >
> > > > For this use case, Stage 2 is materially easier to persist than Stage 1,
> > > > for structural rather than performance reasons:
> > >
> > > I don't think so. The driver needs to know each and every STE that
> > > will survive KHO. The ones that don't survive need to be reset to
> > > abort STEs. From that point it is trivial enough to include the CD
> > > memory in the preservation.
> > >
> > > It would help to send a preparation series to switch the ARM STE and
> > > CD logic away from dma_alloc_coherent and use iommu-pages instead,
> > > since we only expect iommu-pages to support preservation..
> >
> > Does iommu-pages provide a mechanism to map the memory as non-cacheable
> > if the SMMU isn't coherent?
No, it has to use CMOs today.
It looks like all the stuff dma_alloc_coherent does to make a
non-cached mapping is pretty arch specific. I don't know if there is
a way we could make more general code get a struct page into an
uncached KVA and meet all the arch rules?
I also think dma_alloc_coherent is far too complex, with pools and
more, to support KHO.
> > I really don't want to entertain CMOs for the queues.
>
> Sorry, I said "queues" here but I was really referring to any of the
> current dma_alloc_coherent() allocations and it's the CDs that matter
> in this thread.
The queues shouldn't change; they are too performance sensitive.
> The rationale being that:
>
> 1. A cacheable mapping is going to pollute the cache unnecessarily.
> 2. Reasoning about atomicity and ordering is a lot more subtle with CMOs.
The page table suffers from all of these drawbacks, and the STE/CD is
touched a lot less frequently. It is kind of odd to focus on these
issues for the STE/CD when the page table is a much bigger problem.
STE/CD is pretty simple now: there is only one place to put the CMO,
and the ordering is all handled by that shared code. We no longer
care about ordering beyond requiring that all writes be visible to the
HW before issuing the CMDQ invalidation command - which is the same
environment as the page table.
> 3. It seems like a pretty invasive driver change to support live update,
> which isn't relevant for a lot of systems.
That's sort of the whole story of live update.. Trying to keep it
small means using the abstractions that support it like iommu-pages.
IMHO live update is OK to require coherent only, so at worst it could
use iommu-pages on coherent systems and keep using the
dma_alloc_coherent() for others.
I also don't like this "lot of systems thing". I don't want these
powerful capabilities locked up in some giant CSP's proprietary
kernel. I want all the companies in the cloud market to have access
to the same feature set. That's what open source is supposed to be
driving toward. I have several interesting use cases for this
functionality already.
It will probably run $50-100B of AI cloud servers at least; I think
that is enough justification.
Jason
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
2026-04-23 14:23 ` Jason Gunthorpe
@ 2026-04-23 17:07 ` Will Deacon
2026-04-23 18:43 ` Samiullah Khawaja
2026-04-23 22:37 ` Jason Gunthorpe
0 siblings, 2 replies; 12+ messages in thread
From: Will Deacon @ 2026-04-23 17:07 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen,
Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
linux-kernel, nh-open-source, Zeev Zilberman
On Thu, Apr 23, 2026 at 11:23:26AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 23, 2026 at 10:47:49AM +0100, Will Deacon wrote:
> > > Does iommu-pages provide a mechanism to map the memory as non-cacheable
> > > if the SMMU isn't coherent?
>
> No, it has to use CMOs today.
>
> It looks like all the stuff dma_alloc_coherent does to make a
> non-cached mapping are pretty arch specific. I don't know if there is
> a way we could make more general code get a struct page into an
> uncached KVA and meet all the arch rules?
>
> I also think dma_alloc_coherent is far to complex, with pools and
> more, to support KHO.
I wonder if there's scope for supporting just some subset of it?
> > > I really don't want to entertain CMOs for the queues.
> >
> > Sorry, I said "queues" here but I was really referring to any of the
> > current dma_alloc_coherent() allocations and it's the CDs that matter
> > in this thread.
>
> queues shouldn't change they are too performance sensitive
>
> > The rationale being that:
> >
> > 1. A cacheable mapping is going to pollute the cache unnecessarily.
> > 2. Reasoning about atomicity and ordering is a lot more subtle with CMOs.
>
> The page table suffers from all of these draw backs, and the STE/CD is
> touched alot less frequently. It is kind of odd to focus on these
> issues with STE/CD when page table is a much bigger problem.
I don't think it's that odd given that the STE/CD entries are bigger
than PTEs and the SMMU permits a lot more relaxations about how they are
accessed and cached compared to the PTW.
Having said that, the page-table code looks broken to me even in the
coherent case:
ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data);
as the compiler can theoretically make a right mess of that.
The non-coherent case looks more fragile, because I don't _think_ the
architecture provides any ordering or atomicity guarantees about cache
cleaning to the PoC. Presumably, the correct sequence would be to write
the PTE with the valid bit clear, do the CMO (with completion barrier),
*then* write the bottom byte with the valid bit set and do another CMO.
Sounds great!
> STE/CD is pretty simple now, there is only one place to put the CMO
> and the ordering is all handled with that shared code. We no longer
> care about ordering beyond all the writes must be visible to HW before
> issuing the CMDQ invalidation command - which is the same environment
> as the pagetable.
You presumably rely on 64-bit single-copy atomicity for hitless updates,
no?
> > 3. It seems like a pretty invasive driver change to support live update,
> > which isn't relevant for a lot of systems.
>
> That's sort of the whole story of live update.. Trying to keep it
> small means using the abstractions that support it like iommu-pages.
>
> IMHO live update is OK to require coherent only, so at worst it could
> use iommu-pages on coherent systems and keep using the
> dma_alloc_coherent() for others.
That would be unfortunate, but if we can wrap the two allocators in
some common helpers then it's probably fine.
> I also don't like this "lot of systems thing". I don't want these
> powerful capabilities locked up in some giant CSP's proprietary
> kernel. I want all the companies in the cloud market to have access
> to the same feature set. That's what open source is supposed to be
> driving toward. I have several interesting use cases for this
> functionality already.
Sorry, the point here was definitely _not_ about keeping this out of
tree, nor was I trying to say that this stuff isn't important. But the
mobile world doesn't give a hoot about KHO and _does_ tend to care about
the impact of CMO, so we have to find a way to balance the two worlds.
> It will run probably $50-100B of AI cloud servers at least, I think
> that is enough justification.
I wasn't asking for justification but I honestly don't care about the
money involved :) People need this, so we should find a way to support
it -- it just needs to fit in with everything else.
Will
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
2026-04-23 17:07 ` Will Deacon
@ 2026-04-23 18:43 ` Samiullah Khawaja
2026-04-23 22:37 ` Jason Gunthorpe
1 sibling, 0 replies; 12+ messages in thread
From: Samiullah Khawaja @ 2026-04-23 18:43 UTC (permalink / raw)
To: Will Deacon
Cc: Jason Gunthorpe, Evangelos Petrongonas, Robin Murphy,
Joerg Roedel, Nicolin Chen, Pranjal Shrivastava, Lu Baolu,
linux-arm-kernel, iommu, linux-kernel, nh-open-source,
Zeev Zilberman, dmatlack, pasha.tatashin
> On Thu, Apr 23, 2026 at 06:07:23PM +0100, Will Deacon wrote:
>On Thu, Apr 23, 2026 at 11:23:26AM -0300, Jason Gunthorpe wrote:
>> On Thu, Apr 23, 2026 at 10:47:49AM +0100, Will Deacon wrote:
>> > > Does iommu-pages provide a mechanism to map the memory as non-cacheable
>> > > if the SMMU isn't coherent?
>>
>> No, it has to use CMOs today.
>>
>> It looks like all the stuff dma_alloc_coherent does to make a
>> non-cached mapping are pretty arch specific. I don't know if there is
>> a way we could make more general code get a struct page into an
>> uncached KVA and meet all the arch rules?
>>
>> I also think dma_alloc_coherent is far to complex, with pools and
>> more, to support KHO.
Agreed. dma_alloc_* is too complex with pools, CMAs and what not to
support fully in KHO.
>
> I wonder if there's scope for supporting just some subset of it?
We have been experimenting with something like this. We have a use case
where memory needs to be preserved, but we want to avoid invasive
changes in the driver.
If it's not a crazy idea, maybe we can start with a very limited scope
of providing preservation for a subset of allocations done through the
DMA API? I can send out my proof of concept as an RFC after I'm done
with the next revision of my IOMMU persistence series. WDYT?
Sami
* Re: [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation
2026-04-23 17:07 ` Will Deacon
2026-04-23 18:43 ` Samiullah Khawaja
@ 2026-04-23 22:37 ` Jason Gunthorpe
1 sibling, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-04-23 22:37 UTC (permalink / raw)
To: Will Deacon
Cc: Evangelos Petrongonas, Robin Murphy, Joerg Roedel, Nicolin Chen,
Pranjal Shrivastava, Lu Baolu, linux-arm-kernel, iommu,
linux-kernel, nh-open-source, Zeev Zilberman
On Thu, Apr 23, 2026 at 06:07:23PM +0100, Will Deacon wrote:
> I don't think it's that odd given that the STE/CD entries are bigger
> than PTEs and the SMMU permits a lot more relaxations about how they are
> accessed and cached compared to the PTW.
Well, I'm not sure bigger really matters, but I wasn't aware there was
a spec relaxation here that would make the cacheable path not viable
for the STE but not the PTW...
> Having said that, the page-table code looks broken to me even in the
> coherent case:
>
> ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data);
>
> as the compiler can theoretically make a right mess of that.
Heh, great. The iommupt stuff does better: it does a 64-bit cmpxchg
to store a table pointer and a 64-bit WRITE_ONCE to store the pte,
then a CMO through the DMA API.
The DMA API has to guarantee the right ordering, so we are only left
with the question below:
> > STE/CD is pretty simple now, there is only one place to put the CMO
> > and the ordering is all handled with that shared code. We no longer
> > care about ordering beyond all the writes must be visible to HW before
> > issuing the CMDQ invalidation command - which is the same environment
> > as the pagetable.
>
> You presumably rely on 64-bit single-copy atomicity for hitless updates,
> no?
Yes, just like the page table does.
I hope that's not a problem, or we have an issue with the PTW :)
> > I also don't like this "lot of systems thing". I don't want these
> > powerful capabilities locked up in some giant CSP's proprietary
> > kernel. I want all the companies in the cloud market to have access
> > to the same feature set. That's what open source is supposed to be
> > driving toward. I have several interesting use cases for this
> > functionality already.
>
> Sorry, the point here was definitely _not_ about keeping this out of
> tree, nor was I trying to say that this stuff isn't important. But the
> mobile world doesn't give a hoot about KHO and _does_ tend to care about
> the impact of CMO, so we have to find a way to balance the two worlds.
Yes, that makes sense.
My argument is that the CMO on the STE/CD shouldn't bother mobile; you
could even view it as a micro-optimization, because we do occasionally
read back the STE/CD fields.
But if you say the SMMU STE/CD fetch doesn't have to follow the
single-copy rules and the PTW does, then OK..
And if Samiullah can tackle dma_alloc_coherent then maybe the whole
question is moot.
Jason
end of thread [last message: 2026-04-23 22:37 UTC]
Thread overview: 12+ messages
2026-04-20 12:32 [PATCH] iommu/arm-smmu-v3: Allow disabling Stage 1 translation Evangelos Petrongonas
2026-04-20 12:40 ` Jason Gunthorpe
2026-04-22 6:44 ` Evangelos Petrongonas
2026-04-22 15:44 ` Pranjal Shrivastava
2026-04-22 16:23 ` Jason Gunthorpe
2026-04-22 16:36 ` Robin Murphy
2026-04-23 9:44 ` Will Deacon
2026-04-23 9:47 ` Will Deacon
2026-04-23 14:23 ` Jason Gunthorpe
2026-04-23 17:07 ` Will Deacon
2026-04-23 18:43 ` Samiullah Khawaja
2026-04-23 22:37 ` Jason Gunthorpe