[PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests

* [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
@ 2025-08-07 16:55 Milan Djokic
  2025-08-07 16:57 ` [PATCH 01/20] xen/arm: smmuv3: Maintain a SID->device structure Milan Djokic
                   ` (20 more replies)
  0 siblings, 21 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:55 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Milan Djokic, Bertrand Marquis, Rahul Singh, Stefano Stabellini,
	Julien Grall, Michal Orzel, Volodymyr Babchuk, Jan Beulich,
	Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

This patch series is a rebase of an older patch series implemented and
submitted by Rahul Singh as an RFC:
https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
The content of the original series has been aligned with the latest Xen
tree in terms of the split between common and arch-specific code.
Some minor bugfixes are also applied:
- Sanity checks / error handling
- Non-PCI device support for the emulated IOMMU

An overall description of stage-1 support is available in the cover letter
of the original patch series. The original commit structure, with a detailed
explanation of the functionality of each commit, is maintained.

The patch series was tested in a QEMU Arm environment. Additionally,
stage-1 translation for non-PCI devices was verified on a Renesas platform.

Jean-Philippe Brucker (1):
  xen/arm: smmuv3: Maintain a SID->device structure

Rahul Singh (19):
  xen/arm: smmuv3: Add support for stage-1 and nested stage translation
  xen/arm: smmuv3: Alloc io_domain for each device
  xen/arm: vIOMMU: add generic vIOMMU framework
  xen/arm: vsmmuv3: Add dummy support for virtual SMMUv3 for guests
  xen/domctl: Add XEN_DOMCTL_CONFIG_VIOMMU_* and viommu config param
  xen/arm: vIOMMU: Add cmdline boot option "viommu = <boolean>"
  xen/arm: vsmmuv3: Add support for registers emulation
  xen/arm: vsmmuv3: Add support for cmdqueue handling
  xen/arm: vsmmuv3: Add support for command CMD_CFGI_STE
  xen/arm: vsmmuv3: Attach Stage-1 configuration to SMMUv3 hardware
  xen/arm: vsmmuv3: Add support for event queue and global error
  xen/arm: vsmmuv3: Add "iommus" property node for dom0 devices
  xen/arm: vIOMMU: IOMMU device tree node for dom0
  xen/arm: vsmmuv3: Emulated SMMUv3 device tree node for dom0less
  arm/libxl: vsmmuv3: Emulated SMMUv3 device tree node in libxl
  xen/arm: vsmmuv3: Alloc virq for virtual SMMUv3
  xen/arm: vsmmuv3: Add support to send stage-1 event to guest
  libxl/arm: vIOMMU: Modify the partial device tree for iommus
  xen/arm: vIOMMU: Modify the partial device tree for dom0less

 docs/man/xl.cfg.5.pod.in                |  13 +
 docs/misc/xen-command-line.pandoc       |   7 +
 tools/golang/xenlight/helpers.gen.go    |   2 +
 tools/golang/xenlight/types.gen.go      |   1 +
 tools/include/libxl.h                   |   5 +
 tools/libs/light/libxl_arm.c            | 123 +++-
 tools/libs/light/libxl_types.idl        |   6 +
 tools/xl/xl_parse.c                     |  10 +
 xen/arch/arm/dom0less-build.c           |  72 ++
 xen/arch/arm/domain.c                   |  26 +
 xen/arch/arm/domain_build.c             | 103 ++-
 xen/arch/arm/include/asm/domain.h       |   4 +
 xen/arch/arm/include/asm/viommu.h       | 102 +++
 xen/common/device-tree/dom0less-build.c |  31 +-
 xen/drivers/passthrough/Kconfig         |  14 +
 xen/drivers/passthrough/arm/Makefile    |   2 +
 xen/drivers/passthrough/arm/smmu-v3.c   | 369 +++++++++-
 xen/drivers/passthrough/arm/smmu-v3.h   |  49 +-
 xen/drivers/passthrough/arm/viommu.c    |  87 +++
 xen/drivers/passthrough/arm/vsmmu-v3.c  | 895 ++++++++++++++++++++++++
 xen/drivers/passthrough/arm/vsmmu-v3.h  |  32 +
 xen/include/public/arch-arm.h           |  14 +-
 xen/include/public/device_tree_defs.h   |   1 +
 xen/include/xen/iommu.h                 |  14 +
 24 files changed, 1935 insertions(+), 47 deletions(-)
 create mode 100644 xen/arch/arm/include/asm/viommu.h
 create mode 100644 xen/drivers/passthrough/arm/viommu.c
 create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.c
 create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.h

-- 
2.43.0


* [PATCH 01/20] xen/arm: smmuv3: Maintain a SID->device structure
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
@ 2025-08-07 16:57 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 02/20] xen/arm: smmuv3: Add support for stage-1 and nested stage translation Milan Djokic
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:57 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Jean-Philippe Brucker, Bertrand Marquis, Rahul Singh,
	Stefano Stabellini, Julien Grall, Michal Orzel, Volodymyr Babchuk,
	Jonathan Cameron, Eric Auger, Keqian Zhu, Will Deacon,
	Joerg Roedel

From: Jean-Philippe Brucker <jean-philippe@linaro.org>

Backport Linux commit cdf315f907d4. This is a clean backport without
any changes.

When handling faults from the event or PRI queue, we need to find the
struct device associated with a SID. Add a rb_tree to keep track of
SIDs.
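
As an illustration only, the SID lookup can be sketched with a plain
(unbalanced) binary search tree. This is not the code in this patch: the
patch uses Xen's rb_tree, and the names struct master, struct stream,
stream_insert() and find_master() below are hypothetical stand-ins. The
ordering and the duplicate-SID rejection follow the same comparisons as
arm_smmu_insert_master() and arm_smmu_find_master().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct arm_smmu_master. */
struct master {
    const char *name;
};

/* Hypothetical stand-in for struct arm_smmu_stream, keyed by SID. */
struct stream {
    uint32_t id;                 /* StreamID (SID) */
    struct master *master;       /* device that owns this SID */
    struct stream *left, *right; /* plain BST links, no rebalancing */
};

/* Insert a stream keyed by SID; returns -1 if the SID is already present. */
static int stream_insert(struct stream **root, struct stream *new_stream)
{
    struct stream **link = root;

    while (*link) {
        if ((*link)->id > new_stream->id)
            link = &(*link)->left;
        else if ((*link)->id < new_stream->id)
            link = &(*link)->right;
        else
            return -1; /* duplicate SID */
    }

    new_stream->left = new_stream->right = NULL;
    *link = new_stream;
    return 0;
}

/* Walk the tree to find the master that owns a SID, or NULL if none. */
static struct master *find_master(struct stream *root, uint32_t sid)
{
    struct stream *node = root;

    while (node) {
        if (node->id < sid)
            node = node->right;
        else if (node->id > sid)
            node = node->left;
        else
            return node->master;
    }

    return NULL;
}
```

A fault handler would call find_master() with the SID taken from the
event record. The rb_tree in the patch additionally keeps the tree
balanced, which this sketch does not attempt.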

Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210401154718.307519-8-jean-philippe@linaro.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git cdf315f907d4
Signed-off-by: Rahul Singh <rahul.singh@arm.com>
---
 xen/drivers/passthrough/arm/smmu-v3.c | 131 +++++++++++++++++++++-----
 xen/drivers/passthrough/arm/smmu-v3.h |  13 ++-
 2 files changed, 118 insertions(+), 26 deletions(-)

diff --git a/xen/drivers/passthrough/arm/smmu-v3.c b/xen/drivers/passthrough/arm/smmu-v3.c
index 5e9e3e048e..307057ad8a 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -793,6 +793,27 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
 	return 0;
 }
 
+__maybe_unused
+static struct arm_smmu_master *
+arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
+{
+	struct rb_node *node;
+	struct arm_smmu_stream *stream;
+
+	node = smmu->streams.rb_node;
+	while (node) {
+		stream = rb_entry(node, struct arm_smmu_stream, node);
+		if (stream->id < sid)
+			node = node->rb_right;
+		else if (stream->id > sid)
+			node = node->rb_left;
+		else
+			return stream->master;
+	}
+
+	return NULL;
+}
+
 /* IRQ and event handlers */
 static void arm_smmu_evtq_tasklet(void *dev)
 {
@@ -1026,8 +1047,8 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
 	if (!master->ats_enabled)
 		return 0;
 
-	for (i = 0; i < master->num_sids; i++) {
-		cmd->atc.sid = master->sids[i];
+	for (i = 0; i < master->num_streams; i++) {
+		cmd->atc.sid = master->streams[i].id;
 		arm_smmu_cmdq_issue_cmd(master->smmu, cmd);
 	}
 
@@ -1255,13 +1276,13 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
 	int i, j;
 	struct arm_smmu_device *smmu = master->smmu;
 
-	for (i = 0; i < master->num_sids; ++i) {
-		u32 sid = master->sids[i];
+	for (i = 0; i < master->num_streams; ++i) {
+		u32 sid = master->streams[i].id;
 		__le64 *step = arm_smmu_get_step_for_sid(smmu, sid);
 
 		/* Bridged PCI devices may end up with duplicated IDs */
 		for (j = 0; j < i; j++)
-			if (master->sids[j] == sid)
+			if (master->streams[j].id == sid)
 				break;
 		if (j < i)
 			continue;
@@ -1470,6 +1491,80 @@ static bool arm_smmu_sid_in_range(struct arm_smmu_device *smmu, u32 sid)
 
 	return sid < limit;
 }
+
+static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
+				  struct arm_smmu_master *master)
+{
+	int i;
+	int ret = 0;
+	struct arm_smmu_stream *new_stream, *cur_stream;
+	struct rb_node **new_node, *parent_node = NULL;
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
+
+	master->streams = _xzalloc_array(sizeof(*master->streams), sizeof(void *),
+					fwspec->num_ids);
+	if (!master->streams)
+		return -ENOMEM;
+	master->num_streams = fwspec->num_ids;
+
+	mutex_lock(&smmu->streams_mutex);
+	for (i = 0; i < fwspec->num_ids; i++) {
+		u32 sid = fwspec->ids[i];
+
+		new_stream = &master->streams[i];
+		new_stream->id = sid;
+		new_stream->master = master;
+
+		/*
+		 * Check the SIDs are in range of the SMMU and our stream table
+		 */
+		if (!arm_smmu_sid_in_range(smmu, sid)) {
+			ret = -ERANGE;
+			break;
+		}
+
+		/* Ensure l2 strtab is initialised */
+		if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
+			ret = arm_smmu_init_l2_strtab(smmu, sid);
+			if (ret)
+				break;
+		}
+
+		/* Insert into SID tree */
+		new_node = &(smmu->streams.rb_node);
+		while (*new_node) {
+			cur_stream = rb_entry(*new_node, struct arm_smmu_stream,
+					      node);
+			parent_node = *new_node;
+			if (cur_stream->id > new_stream->id) {
+				new_node = &((*new_node)->rb_left);
+			} else if (cur_stream->id < new_stream->id) {
+				new_node = &((*new_node)->rb_right);
+			} else {
+				dev_warn(master->dev,
+					 "stream %u already in tree\n",
+					 cur_stream->id);
+				ret = -EINVAL;
+				break;
+			}
+		}
+		if (ret)
+			break;
+
+		rb_link_node(&new_stream->node, parent_node, new_node);
+		rb_insert_color(&new_stream->node, &smmu->streams);
+	}
+
+	if (ret) {
+		for (i--; i >= 0; i--)
+			rb_erase(&master->streams[i].node, &smmu->streams);
+		xfree(master->streams);
+	}
+	mutex_unlock(&smmu->streams_mutex);
+
+	return ret;
+}
+
 /* Forward declaration */
 static struct arm_smmu_device *arm_smmu_get_by_dev(const struct device *dev);
 static int arm_smmu_assign_dev(struct domain *d, u8 devfn, struct device *dev,
@@ -1479,7 +1574,7 @@ static int arm_smmu_deassign_dev(struct domain *d, uint8_t devfn,
 
 static int arm_smmu_add_device(u8 devfn, struct device *dev)
 {
-	int i, ret;
+	int ret;
 	struct arm_smmu_device *smmu;
 	struct arm_smmu_master *master;
 	struct iommu_fwspec *fwspec;
@@ -1516,26 +1611,11 @@ static int arm_smmu_add_device(u8 devfn, struct device *dev)
 
 	master->dev = dev;
 	master->smmu = smmu;
-	master->sids = fwspec->ids;
-	master->num_sids = fwspec->num_ids;
 	dev_iommu_priv_set(dev, master);
 
-	/* Check the SIDs are in range of the SMMU and our stream table */
-	for (i = 0; i < master->num_sids; i++) {
-		u32 sid = master->sids[i];
-
-		if (!arm_smmu_sid_in_range(smmu, sid)) {
-			ret = -ERANGE;
-			goto err_free_master;
-		}
-
-		/* Ensure l2 strtab is initialised */
-		if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
-			ret = arm_smmu_init_l2_strtab(smmu, sid);
-			if (ret)
-				goto err_free_master;
-		}
-	}
+	ret = arm_smmu_insert_master(smmu, master);
+	if (ret)
+		goto err_free_master;
 
 	/*
 	 * Note that PASID must be enabled before, and disabled after ATS:
@@ -1777,6 +1857,9 @@ static int __init arm_smmu_init_structures(struct arm_smmu_device *smmu)
 {
 	int ret;
 
+	mutex_init(&smmu->streams_mutex);
+	smmu->streams = RB_ROOT;
+
 	ret = arm_smmu_init_queues(smmu);
 	if (ret)
 		return ret;
diff --git a/xen/drivers/passthrough/arm/smmu-v3.h b/xen/drivers/passthrough/arm/smmu-v3.h
index f09048812c..fe651ca5a7 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.h
+++ b/xen/drivers/passthrough/arm/smmu-v3.h
@@ -636,6 +636,15 @@ struct arm_smmu_device {
 	struct tasklet		evtq_irq_tasklet;
 	struct tasklet		priq_irq_tasklet;
 	struct tasklet		combined_irq_tasklet;
+
+	struct rb_root		streams;
+	struct mutex		streams_mutex;
+};
+
+struct arm_smmu_stream {
+	u32							id;
+	struct arm_smmu_master		*master;
+	struct rb_node				node;
 };
 
 /* SMMU private data for each master */
@@ -644,8 +653,8 @@ struct arm_smmu_master {
 	struct device			*dev;
 	struct arm_smmu_domain		*domain;
 	struct list_head		domain_head;
-	u32				*sids;
-	unsigned int			num_sids;
+	struct arm_smmu_stream		*streams;
+	unsigned int				num_streams;
 	bool				ats_enabled;
 };
 
-- 
2.43.0



* [PATCH 02/20] xen/arm: smmuv3: Add support for stage-1 and nested stage translation
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
  2025-08-07 16:57 ` [PATCH 01/20] xen/arm: smmuv3: Maintain a SID->device structure Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 03/20] xen/arm: smmuv3: Alloc io_domain for each device Milan Djokic
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Bertrand Marquis, Stefano Stabellini, Julien Grall,
	Michal Orzel, Volodymyr Babchuk, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

The Xen SMMUv3 driver only supports stage-2 translation. Add support for
stage-1 translation, which is required to support nested stage
translation.

In true nested mode, both s1_cfg and s2_cfg coexist, so remove the
union. When nested stage translation is set up, both s1_cfg and s2_cfg
are valid.

Introduce a new smmu_domain abort field that will be set when the guest
stage-1 configuration is passed. If no guest stage-1 config has been
attached, it is ignored when writing the STE.

arm_smmu_write_strtab_ent() is modified to write both stage
fields in the STE and to deal with the abort field.
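
As a rough sketch of the resulting decision, assuming the CFG encodings
from smmu-v3.h (BYPASS = 4, S1_TRANS = 5, S2_TRANS = 6, NESTED = 7) and
ABORT = 0, the STE config could be chosen as below. The types and the
pick_ste_cfg() helper are hypothetical and only mirror the control flow
of arm_smmu_write_strtab_ent(). Note that OR-ing the S1_TRANS and
S2_TRANS encodings yields the NESTED encoding (5 | 6 == 7).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of enum arm_smmu_domain_stage. */
enum stage { STAGE_S1, STAGE_S2, STAGE_NESTED };

/* STE.Config encodings; ABORT = 0 is an assumption not shown in the diff. */
enum ste_cfg {
    CFG_ABORT    = 0,
    CFG_BYPASS   = 4,
    CFG_S1_TRANS = 5,
    CFG_S2_TRANS = 6,
    CFG_NESTED   = 7,
};

/* Hypothetical subset of struct arm_smmu_domain. */
struct domain_cfg {
    enum stage stage;
    bool abort;
};

/* Pick the STE config for a device; dom == NULL means "no domain attached". */
static enum ste_cfg pick_ste_cfg(const struct domain_cfg *dom,
                                 bool disable_bypass)
{
    bool s1 = false, s2 = false, abort;
    unsigned int cfg = 0;

    if (dom) {
        /* Nested uses both stages, S2 only stage 2, anything else none. */
        s1 = dom->stage == STAGE_NESTED;
        s2 = dom->stage == STAGE_NESTED || dom->stage == STAGE_S2;
        abort = dom->abort;
    } else {
        abort = disable_bypass;
    }

    if (abort)
        return CFG_ABORT;
    if (!s1 && !s2)
        return CFG_BYPASS;

    if (s1)
        cfg |= CFG_S1_TRANS;
    if (s2)
        cfg |= CFG_S2_TRANS;

    return (enum ste_cfg)cfg;
}
```

The abort flag takes priority over any translation setup, which matches
the "abort || !translate" branch in the diff below.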

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
 xen/drivers/passthrough/arm/smmu-v3.c | 93 +++++++++++++++++++++++----
 xen/drivers/passthrough/arm/smmu-v3.h |  9 +++
 2 files changed, 91 insertions(+), 11 deletions(-)

diff --git a/xen/drivers/passthrough/arm/smmu-v3.c b/xen/drivers/passthrough/arm/smmu-v3.c
index 307057ad8a..5c96e8ec7c 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -667,8 +667,10 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	 * 3. Update Config, sync
 	 */
 	u64 val = le64_to_cpu(dst[0]);
-	bool ste_live = false;
+	bool s1_live = false, s2_live = false, ste_live = false;
+	bool abort, translate = false;
 	struct arm_smmu_device *smmu = NULL;
+	struct arm_smmu_s1_cfg *s1_cfg = NULL;
 	struct arm_smmu_s2_cfg *s2_cfg = NULL;
 	struct arm_smmu_domain *smmu_domain = NULL;
 	struct arm_smmu_cmdq_ent prefetch_cmd = {
@@ -683,30 +685,54 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		smmu = master->smmu;
 	}
 
-	if (smmu_domain)
-		s2_cfg = &smmu_domain->s2_cfg;
+	if (smmu_domain) {
+		switch (smmu_domain->stage) {
+		case ARM_SMMU_DOMAIN_NESTED:
+			s1_cfg = &smmu_domain->s1_cfg;
+			fallthrough;
+		case ARM_SMMU_DOMAIN_S2:
+			s2_cfg = &smmu_domain->s2_cfg;
+			break;
+		default:
+			break;
+		}
+		translate = !!s1_cfg || !!s2_cfg;
+	}
 
 	if (val & STRTAB_STE_0_V) {
 		switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
 		case STRTAB_STE_0_CFG_BYPASS:
 			break;
+		case STRTAB_STE_0_CFG_S1_TRANS:
+			s1_live = true;
+			break;
 		case STRTAB_STE_0_CFG_S2_TRANS:
-			ste_live = true;
+			s2_live = true;
+			break;
+		case STRTAB_STE_0_CFG_NESTED:
+			s1_live = true;
+			s2_live = true;
 			break;
 		case STRTAB_STE_0_CFG_ABORT:
-			BUG_ON(!disable_bypass);
 			break;
 		default:
 			BUG(); /* STE corruption */
 		}
 	}
 
+	ste_live = s1_live || s2_live;
+
 	/* Nuke the existing STE_0 value, as we're going to rewrite it */
 	val = STRTAB_STE_0_V;
 
 	/* Bypass/fault */
-	if (!smmu_domain || !(s2_cfg)) {
-		if (!smmu_domain && disable_bypass)
+	if (!smmu_domain)
+		abort = disable_bypass;
+	else
+		abort = smmu_domain->abort;
+
+	if (abort || !translate) {
+		if (abort)
 			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
 		else
 			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
@@ -724,7 +750,33 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		return;
 	}
 
+	if (ste_live) {
+		/* First invalidate the live STE */
+		dst[0] = cpu_to_le64(STRTAB_STE_0_CFG_ABORT);
+		arm_smmu_sync_ste_for_sid(smmu, sid);
+	}
+
+	if (s1_cfg) {
+		BUG_ON(s1_live);
+		dst[1] = cpu_to_le64(
+			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
+			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
+			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
+			 FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH) |
+			 FIELD_PREP(STRTAB_STE_1_STRW, STRTAB_STE_1_STRW_NSEL1));
+
+		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
+		   !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE))
+			dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
+
+		val |= (s1_cfg->s1ctxptr & STRTAB_STE_0_S1CTXPTR_MASK) |
+			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
+			FIELD_PREP(STRTAB_STE_0_S1CDMAX, s1_cfg->s1cdmax) |
+			FIELD_PREP(STRTAB_STE_0_S1FMT, s1_cfg->s1fmt);
+	}
+
 	if (s2_cfg) {
+		u64 vttbr = s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK;
 		u64 strtab =
 			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
 			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
@@ -734,12 +786,19 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
 			 STRTAB_STE_2_S2R;
 
-		BUG_ON(ste_live);
+		if (s2_live) {
+			u64 s2ttb = le64_to_cpu(dst[3]) & STRTAB_STE_3_S2TTB_MASK;
+			BUG_ON(s2ttb != vttbr);
+		}
+
 		dst[2] = cpu_to_le64(strtab);
 
-		dst[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+		dst[3] = cpu_to_le64(vttbr);
 
 		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
+	} else {
+		dst[2] = 0;
+		dst[3] = 0;
 	}
 
 	if (master->ats_enabled)
@@ -1238,6 +1297,15 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 {
 	int ret;
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+	if (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED &&
+		(!(smmu->features & ARM_SMMU_FEAT_TRANS_S1) ||
+		 !(smmu->features & ARM_SMMU_FEAT_TRANS_S2))) {
+			dev_info(smmu_domain->smmu->dev,
+					"does not implement two stages\n");
+			return -EINVAL;
+	}
 
 	/* Restrict the stage to what we can actually support */
 	smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
@@ -2334,11 +2402,14 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 		break;
 	}
 
+	if (reg & IDR0_S1P)
+		smmu->features |= ARM_SMMU_FEAT_TRANS_S1;
+
 	if (reg & IDR0_S2P)
 		smmu->features |= ARM_SMMU_FEAT_TRANS_S2;
 
-	if (!(reg & IDR0_S2P)) {
-		dev_err(smmu->dev, "no stage-2 translation support!\n");
+	if (!(reg & (IDR0_S1P | IDR0_S2P))) {
+		dev_err(smmu->dev, "no translation support!\n");
 		return -ENXIO;
 	}
 
diff --git a/xen/drivers/passthrough/arm/smmu-v3.h b/xen/drivers/passthrough/arm/smmu-v3.h
index fe651ca5a7..3411edc47f 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.h
+++ b/xen/drivers/passthrough/arm/smmu-v3.h
@@ -197,6 +197,7 @@
 #define STRTAB_STE_0_CFG_BYPASS		4
 #define STRTAB_STE_0_CFG_S1_TRANS	5
 #define STRTAB_STE_0_CFG_S2_TRANS	6
+#define STRTAB_STE_0_CFG_NESTED		7
 
 #define STRTAB_STE_0_S1FMT		GENMASK_ULL(5, 4)
 #define STRTAB_STE_0_S1FMT_LINEAR	0
@@ -546,6 +547,12 @@ struct arm_smmu_strtab_l1_desc {
 	dma_addr_t			l2ptr_dma;
 };
 
+struct arm_smmu_s1_cfg {
+	u64				s1ctxptr;
+	u8				s1fmt;
+	u8				s1cdmax;
+};
+
 struct arm_smmu_s2_cfg {
 	u16				vmid;
 	u64				vttbr;
@@ -666,7 +673,9 @@ struct arm_smmu_domain {
 	atomic_t			nr_ats_masters;
 
 	enum arm_smmu_domain_stage	stage;
+	struct arm_smmu_s1_cfg	s1_cfg;
 	struct arm_smmu_s2_cfg	s2_cfg;
+	bool			abort;
 
 	/* Xen domain associated with this SMMU domain */
 	struct domain		*d;
-- 
2.43.0



* [PATCH 03/20] xen/arm: smmuv3: Alloc io_domain for each device
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
  2025-08-07 16:57 ` [PATCH 01/20] xen/arm: smmuv3: Maintain a SID->device structure Milan Djokic
  2025-08-07 16:59 ` [PATCH 02/20] xen/arm: smmuv3: Add support for stage-1 and nested stage translation Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 04/20] xen/arm: vIOMMU: add generic vIOMMU framework Milan Djokic
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Bertrand Marquis, Stefano Stabellini, Julien Grall,
	Michal Orzel, Volodymyr Babchuk, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

In the current implementation, io_domain is allocated once for each Xen
domain, as stage-2 translation is common to all devices in the same Xen
domain.

Nested stage supports S1 and S2 configuration at the same time. Stage-1
translation will be different for each device, as the Linux kernel will
allocate a page table for each device.

Allocate an io_domain for each device so that each device can have its
own stage-1 and stage-2 configuration structure.
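
A minimal sketch of the changed lookup, using hypothetical singly linked
lists in place of Xen's list_head and leaving out the devices_lock
handling: instead of returning the first io_domain that belongs to the
same SMMU, the lookup only returns an io_domain that already has the
given device attached.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the Xen/SMMU structures. */
struct device {
    int id;
};

struct master {
    struct device *dev;
    struct master *next;     /* next master attached to the same io_domain */
};

struct io_domain {
    struct master *devices;  /* masters attached to this io_domain */
    struct io_domain *next;  /* next context of the same Xen domain */
};

/* Return the io_domain that has 'dev' attached, or NULL if there is none. */
static struct io_domain *get_domain_for_dev(struct io_domain *contexts,
                                            const struct device *dev)
{
    struct io_domain *d;
    struct master *m;

    for (d = contexts; d; d = d->next)
        for (m = d->devices; m; m = m->next)
            if (m->dev == dev)
                return d;

    return NULL;
}
```

A NULL result means the caller has to allocate a fresh io_domain for the
device, which is what gives each device its own configuration.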

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
 xen/drivers/passthrough/arm/smmu-v3.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/arm/smmu-v3.c b/xen/drivers/passthrough/arm/smmu-v3.c
index 5c96e8ec7c..15ff703458 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -2790,11 +2790,13 @@ static struct arm_smmu_device *arm_smmu_get_by_dev(const struct device *dev)
 static struct iommu_domain *arm_smmu_get_domain(struct domain *d,
 				struct device *dev)
 {
+	unsigned long flags;
 	struct iommu_domain *io_domain;
 	struct arm_smmu_domain *smmu_domain;
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
 	struct arm_smmu_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
 	struct arm_smmu_device *smmu = arm_smmu_get_by_dev(fwspec->iommu_dev);
+	struct arm_smmu_master *master;
 
 	if (!smmu)
 		return NULL;
@@ -2805,8 +2807,15 @@ static struct iommu_domain *arm_smmu_get_domain(struct domain *d,
 	 */
 	list_for_each_entry(io_domain, &xen_domain->contexts, list) {
 		smmu_domain = to_smmu_domain(io_domain);
-		if (smmu_domain->smmu == smmu)
-			return io_domain;
+
+		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+		list_for_each_entry(master, &smmu_domain->devices, domain_head) {
+			if (master->dev == dev) {
+				spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+				return io_domain;
+			}
+		}
+		spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 	}
 	return NULL;
 }
-- 
2.43.0



* [PATCH 04/20] xen/arm: vIOMMU: add generic vIOMMU framework
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (2 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 03/20] xen/arm: smmuv3: Alloc io_domain for each device Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-08  6:16   ` Jan Beulich
  2025-08-07 16:59 ` [PATCH 05/20] xen/arm: vsmmuv3: Add dummy support for virtual SMMUv3 for guests Milan Djokic
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Jan Beulich,
	Roger Pau Monné, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

This patch adds a basic framework for vIOMMU.
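
The dispatch done by the framework can be sketched roughly as follows.
This is a stand-alone illustration, not the Xen code: struct domain is
reduced to a single field, and VIOMMU_DEMO, demo_init(), demo_relinquish()
and demo_desc are hypothetical names standing in for a real vIOMMU
implementation that registers itself through cur_viommu.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define VIOMMU_NONE 0 /* mirrors XEN_DOMCTL_CONFIG_VIOMMU_NONE */
#define VIOMMU_DEMO 1 /* hypothetical vIOMMU type, for illustration only */

/* Hypothetical, heavily reduced stand-in for Xen's struct domain. */
struct domain {
    int viommu_ready;
};

struct viommu_ops {
    int (*domain_init)(struct domain *d);
    int (*relinquish_resources)(struct domain *d);
};

struct viommu_desc {
    const struct viommu_ops *ops;
    uint16_t viommu_type;
};

/* The single registered vIOMMU implementation, if any. */
static const struct viommu_desc *cur_viommu;

static int domain_viommu_init(struct domain *d, uint16_t viommu_type)
{
    if (viommu_type == VIOMMU_NONE)
        return 0;               /* nothing requested */
    if (!cur_viommu)
        return -ENODEV;         /* requested, but nothing registered */
    if (cur_viommu->viommu_type != viommu_type)
        return -EINVAL;         /* requested type does not match */

    return cur_viommu->ops->domain_init(d);
}

/* Hypothetical implementation plugged into the framework. */
static int demo_init(struct domain *d)
{
    d->viommu_ready = 1;
    return 0;
}

static int demo_relinquish(struct domain *d)
{
    d->viommu_ready = 0;
    return 0;
}

static const struct viommu_ops demo_ops = {
    .domain_init = demo_init,
    .relinquish_resources = demo_relinquish,
};

static const struct viommu_desc demo_desc = {
    .ops = &demo_ops,
    .viommu_type = VIOMMU_DEMO,
};
```

With no implementation registered, only VIOMMU_NONE succeeds; once
cur_viommu points at demo_desc, a request for VIOMMU_DEMO is forwarded
to demo_init().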

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
 xen/arch/arm/dom0less-build.c        |  2 +
 xen/arch/arm/domain.c                | 25 ++++++++++
 xen/arch/arm/domain_build.c          |  2 +
 xen/arch/arm/include/asm/viommu.h    | 70 ++++++++++++++++++++++++++++
 xen/drivers/passthrough/Kconfig      |  6 +++
 xen/drivers/passthrough/arm/Makefile |  1 +
 xen/drivers/passthrough/arm/viommu.c | 48 +++++++++++++++++++
 xen/include/public/arch-arm.h        |  4 ++
 8 files changed, 158 insertions(+)
 create mode 100644 xen/arch/arm/include/asm/viommu.h
 create mode 100644 xen/drivers/passthrough/arm/viommu.c

diff --git a/xen/arch/arm/dom0less-build.c b/xen/arch/arm/dom0less-build.c
index c8d07213e2..1fa6df214b 100644
--- a/xen/arch/arm/dom0less-build.c
+++ b/xen/arch/arm/dom0less-build.c
@@ -22,6 +22,7 @@
 
 #include <asm/arm64/sve.h>
 #include <asm/domain_build.h>
+#include <asm/viommu.h>
 #include <asm/grant_table.h>
 #include <asm/setup.h>
 
@@ -280,6 +281,7 @@ int __init arch_parse_dom0less_node(struct dt_device_node *node,
     uint32_t val;
 
     d_cfg->arch.gic_version = XEN_DOMCTL_CONFIG_GIC_NATIVE;
+    d_cfg->arch.viommu_type = viommu_get_type();
     d_cfg->flags |= XEN_DOMCTL_CDF_hvm | XEN_DOMCTL_CDF_hap;
 
     if ( !dt_property_read_u32(node, "nr_spis", &d_cfg->arch.nr_spis) )
diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 310c578909..a66f71da17 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -27,6 +27,7 @@
 #include <asm/tee/tee.h>
 #include <asm/vfp.h>
 #include <asm/vgic.h>
+#include <asm/viommu.h>
 #include <asm/vtimer.h>
 
 #include "vpci.h"
@@ -699,6 +700,21 @@ int arch_sanitise_domain_config(struct xen_domctl_createdomain *config)
         return -EINVAL;
     }
 
+    if ( !(config->flags & XEN_DOMCTL_CDF_iommu) &&
+         config->arch.viommu_type != XEN_DOMCTL_CONFIG_VIOMMU_NONE )
+    {
+        dprintk(XENLOG_INFO,
+                "vIOMMU requested while iommu not enabled for domain\n");
+        return -EINVAL;
+    }
+
+    if ( config->arch.viommu_type != XEN_DOMCTL_CONFIG_VIOMMU_NONE )
+    {
+        dprintk(XENLOG_INFO,
+                "vIOMMU type requested not supported by the platform or Xen\n");
+        return -EINVAL;
+    }
+
     return 0;
 }
 
@@ -791,6 +807,9 @@ int arch_domain_create(struct domain *d,
     d->arch.sve_vl = config->arch.sve_vl;
 #endif
 
+    if ( (rc = domain_viommu_init(d, config->arch.viommu_type)) != 0 )
+        goto fail;
+
     return 0;
 
 fail:
@@ -1038,6 +1057,7 @@ static int relinquish_memory(struct domain *d, struct page_list_head *list)
 enum {
     PROG_pci = 1,
     PROG_tee,
+    PROG_viommu,
     PROG_xen,
     PROG_page,
     PROG_mapping,
@@ -1089,6 +1109,11 @@ int domain_relinquish_resources(struct domain *d)
         if (ret )
             return ret;
 
+    PROGRESS(viommu):
+        ret = viommu_relinquish_resources(d);
+        if ( ret )
+            return ret;
+
     PROGRESS(xen):
         ret = relinquish_memory(d, &d->xenpage_list);
         if ( ret )
diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 463ae4474d..50e4f5fee4 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -34,6 +34,7 @@
 #include <asm/arm64/sve.h>
 #include <asm/cpufeature.h>
 #include <asm/domain_build.h>
+#include <asm/viommu.h>
 #include <xen/event.h>
 
 #include <xen/irq.h>
@@ -2057,6 +2058,7 @@ void __init create_dom0(void)
     dom0_cfg.arch.nr_spis = VGIC_DEF_NR_SPIS;
     dom0_cfg.arch.tee_type = tee_get_type();
     dom0_cfg.max_vcpus = dom0_max_vcpus();
+    dom0_cfg.arch.viommu_type = viommu_get_type();
 
     if ( iommu_enabled )
         dom0_cfg.flags |= XEN_DOMCTL_CDF_iommu;
diff --git a/xen/arch/arm/include/asm/viommu.h b/xen/arch/arm/include/asm/viommu.h
new file mode 100644
index 0000000000..7cd3818a12
--- /dev/null
+++ b/xen/arch/arm/include/asm/viommu.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: (GPL-2.0-or-later OR BSD-2-Clause) */
+#ifndef __ARCH_ARM_VIOMMU_H__
+#define __ARCH_ARM_VIOMMU_H__
+
+#ifdef CONFIG_VIRTUAL_IOMMU
+
+#include <xen/lib.h>
+#include <xen/types.h>
+#include <public/xen.h>
+
+struct viommu_ops {
+    /*
+     * Called during domain construction if toolstack requests to enable
+     * vIOMMU support.
+     */
+    int (*domain_init)(struct domain *d);
+
+    /*
+     * Called during domain destruction to free resources used by vIOMMU.
+     */
+    int (*relinquish_resources)(struct domain *d);
+};
+
+struct viommu_desc {
+    /* vIOMMU domains init/free operations described above. */
+    const struct viommu_ops *ops;
+
+    /*
+     * ID of vIOMMU. Corresponds to xen_arch_domainconfig.viommu_type.
+     * Should be one of XEN_DOMCTL_CONFIG_VIOMMU_xxx
+     */
+    uint16_t viommu_type;
+};
+
+int domain_viommu_init(struct domain *d, uint16_t viommu_type);
+int viommu_relinquish_resources(struct domain *d);
+uint16_t viommu_get_type(void);
+
+#else
+
+static inline uint8_t viommu_get_type(void)
+{
+    return XEN_DOMCTL_CONFIG_VIOMMU_NONE;
+}
+
+static inline int domain_viommu_init(struct domain *d, uint16_t viommu_type)
+{
+    if ( likely(viommu_type == XEN_DOMCTL_CONFIG_VIOMMU_NONE) )
+        return 0;
+
+    return -ENODEV;
+}
+
+static inline int viommu_relinquish_resources(struct domain *d)
+{
+    return 0;
+}
+
+#endif /* CONFIG_VIRTUAL_IOMMU */
+
+#endif /* __ARCH_ARM_VIOMMU_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/drivers/passthrough/Kconfig b/xen/drivers/passthrough/Kconfig
index b413c33a4c..fbd56eb990 100644
--- a/xen/drivers/passthrough/Kconfig
+++ b/xen/drivers/passthrough/Kconfig
@@ -35,6 +35,12 @@ config IPMMU_VMSA
 	  (H3 ES3.0, M3-W+, etc) or Gen4 SoCs which IPMMU hardware supports stage 2
 	  translation table format and is able to use CPU's P2M table as is.
 
+config VIRTUAL_IOMMU
+	bool "Virtual IOMMU Support (UNSUPPORTED)" if UNSUPPORTED
+	default n
+	help
+	 Support virtual IOMMU infrastructure to implement vIOMMU.
+
 endif
 
 config AMD_IOMMU
diff --git a/xen/drivers/passthrough/arm/Makefile b/xen/drivers/passthrough/arm/Makefile
index c5fb3b58a5..4cc54f3f4d 100644
--- a/xen/drivers/passthrough/arm/Makefile
+++ b/xen/drivers/passthrough/arm/Makefile
@@ -2,3 +2,4 @@ obj-y += iommu.o iommu_helpers.o iommu_fwspec.o
 obj-$(CONFIG_ARM_SMMU) += smmu.o
 obj-$(CONFIG_IPMMU_VMSA) += ipmmu-vmsa.o
 obj-$(CONFIG_ARM_SMMU_V3) += smmu-v3.o
+obj-$(CONFIG_VIRTUAL_IOMMU) += viommu.o
diff --git a/xen/drivers/passthrough/arm/viommu.c b/xen/drivers/passthrough/arm/viommu.c
new file mode 100644
index 0000000000..7ab6061e34
--- /dev/null
+++ b/xen/drivers/passthrough/arm/viommu.c
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: (GPL-2.0-or-later OR BSD-2-Clause) */
+
+#include <xen/errno.h>
+#include <xen/init.h>
+#include <xen/types.h>
+
+#include <asm/viommu.h>
+
+const struct viommu_desc __read_mostly *cur_viommu;
+
+int domain_viommu_init(struct domain *d, uint16_t viommu_type)
+{
+    if ( viommu_type == XEN_DOMCTL_CONFIG_VIOMMU_NONE )
+        return 0;
+
+    if ( !cur_viommu )
+        return -ENODEV;
+
+    if ( cur_viommu->viommu_type != viommu_type )
+        return -EINVAL;
+
+    return cur_viommu->ops->domain_init(d);
+}
+
+int viommu_relinquish_resources(struct domain *d)
+{
+    if ( !cur_viommu )
+        return 0;
+
+    return cur_viommu->ops->relinquish_resources(d);
+}
+
+uint16_t viommu_get_type(void)
+{
+    if ( !cur_viommu )
+        return XEN_DOMCTL_CONFIG_VIOMMU_NONE;
+
+    return cur_viommu->viommu_type;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/public/arch-arm.h b/xen/include/public/arch-arm.h
index e2412a1747..0a155fe615 100644
--- a/xen/include/public/arch-arm.h
+++ b/xen/include/public/arch-arm.h
@@ -327,12 +327,16 @@ DEFINE_XEN_GUEST_HANDLE(vcpu_guest_context_t);
 #define XEN_DOMCTL_CONFIG_TEE_OPTEE     1
 #define XEN_DOMCTL_CONFIG_TEE_FFA       2
 
+#define XEN_DOMCTL_CONFIG_VIOMMU_NONE   0
+
 struct xen_arch_domainconfig {
     /* IN/OUT */
     uint8_t gic_version;
     /* IN - Contains SVE vector length divided by 128 */
     uint8_t sve_vl;
     /* IN */
+    uint8_t viommu_type;
+    /* IN */
     uint16_t tee_type;
     /* IN */
     uint32_t nr_spis;
-- 
2.43.0



* Re: [PATCH 04/20] xen/arm: vIOMMU: add generic vIOMMU framework
  2025-08-07 16:59 ` [PATCH 04/20] xen/arm: vIOMMU: add generic vIOMMU framework Milan Djokic
@ 2025-08-08  6:16   ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-08-08  6:16 UTC (permalink / raw)
  To: Milan Djokic
  Cc: Rahul Singh, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Roger Pau Monné,
	xen-devel@lists.xenproject.org

On 07.08.2025 18:59, Milan Djokic wrote:
> --- a/xen/drivers/passthrough/Kconfig
> +++ b/xen/drivers/passthrough/Kconfig
> @@ -35,6 +35,12 @@ config IPMMU_VMSA
>  	  (H3 ES3.0, M3-W+, etc) or Gen4 SoCs which IPMMU hardware supports stage 2
>  	  translation table format and is able to use CPU's P2M table as is.
>  
> +config VIRTUAL_IOMMU
> +	bool "Virtual IOMMU Support (UNSUPPORTED)" if UNSUPPORTED
> +	default n

Nit: Please omit this redundant line.

Jan



* [PATCH 05/20] xen/arm: vsmmuv3: Add dummy support for virtual SMMUv3 for guests
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (3 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 04/20] xen/arm: vIOMMU: add generic vIOMMU framework Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 06/20] xen/domctl: Add XEN_DOMCTL_CONFIG_VIOMMU_* and viommu config param Milan Djokic
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Jan Beulich,
	Roger Pau Monné, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

domain_viommu_init() is called during domain creation and registers a
dummy trap handler for the guest's virtual IOMMUs.

A host IOMMU list is built as host IOMMU devices are probed, and this
list is used to create the IOMMU device tree nodes for dom0. For dom0,
a 1-1 mapping is established between each vIOMMU in dom0 and the
corresponding physical IOMMU.

For domUs, a 1-N mapping is established between the single domU vIOMMU
and the physical IOMMUs. A new area is reserved in the Arm guest
physical map, at which the emulated vIOMMU node is created in the
device tree.

Also set the vIOMMU type to vSMMUv3 so that the vIOMMU framework can
call the vSMMUv3 domain creation/destruction functions.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
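Illustration only, not part of the patch: the dom0/domU split described
in the commit message can be modelled outside Xen roughly as follows.
This is a minimal sketch under stated assumptions. model_domain,
model_host_iommu, model_init_single() and model_viommu_init() are
hypothetical stand-ins for struct domain, struct host_iommu,
vsmmuv3_init_single() and domain_vsmmuv3_init(), and a fixed-size array
replaces the Xen list helpers. Only the GUEST_VSMMUV3_* values are taken
from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the patch (xen/include/public/arch-arm.h). */
#define GUEST_VSMMUV3_BASE 0x04040000ULL
#define GUEST_VSMMUV3_SIZE 0x00040000ULL

#define MODEL_MAX_IOMMUS 8 /* hypothetical bound, for this sketch only */

/* Hypothetical stand-in for struct host_iommu: one MMIO region. */
struct model_host_iommu {
    uint64_t addr;
    uint64_t size;
};

/* Hypothetical stand-in for the per-domain vIOMMU list. */
struct model_domain {
    bool is_hardware_domain;
    size_t nr_viommus;
    struct model_host_iommu viommu[MODEL_MAX_IOMMUS];
};

/* Record one emulated SMMU region; returns -1 if the table is full. */
static int model_init_single(struct model_domain *d, uint64_t addr,
                             uint64_t size)
{
    if ( d->nr_viommus >= MODEL_MAX_IOMMUS )
        return -1;

    d->viommu[d->nr_viommus].addr = addr;
    d->viommu[d->nr_viommus].size = size;
    d->nr_viommus++;

    return 0;
}

/*
 * dom0: one vIOMMU per host IOMMU, at the host address (1-1).
 * domU: a single vIOMMU at the reserved guest address.
 */
static int model_viommu_init(struct model_domain *d,
                             const struct model_host_iommu *host,
                             size_t nr_host)
{
    size_t i;

    d->nr_viommus = 0;

    if ( !d->is_hardware_domain )
        return model_init_single(d, GUEST_VSMMUV3_BASE, GUEST_VSMMUV3_SIZE);

    for ( i = 0; i < nr_host; i++ )
    {
        int ret = model_init_single(d, host[i].addr, host[i].size);

        if ( ret )
            return ret;
    }

    return 0;
}
```

With two host SMMUs, a model dom0 ends up with two regions at the host
addresses, while a model domU always gets exactly one region at
GUEST_VSMMUV3_BASE, regardless of how many host SMMUs exist.
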
 xen/arch/arm/domain.c                  |   3 +-
 xen/arch/arm/include/asm/domain.h      |   4 +
 xen/arch/arm/include/asm/viommu.h      |  20 ++++
 xen/drivers/passthrough/Kconfig        |   8 ++
 xen/drivers/passthrough/arm/Makefile   |   1 +
 xen/drivers/passthrough/arm/smmu-v3.c  |   7 ++
 xen/drivers/passthrough/arm/viommu.c   |  30 ++++++
 xen/drivers/passthrough/arm/vsmmu-v3.c | 124 +++++++++++++++++++++++++
 xen/drivers/passthrough/arm/vsmmu-v3.h |  20 ++++
 xen/include/public/arch-arm.h          |   7 +-
 10 files changed, 222 insertions(+), 2 deletions(-)
 create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.c
 create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.h

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index a66f71da17..f0971a76cb 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -708,7 +708,8 @@ int arch_sanitise_domain_config(struct xen_domctl_createdomain *config)
         return -EINVAL;
     }
 
-    if ( config->arch.viommu_type != XEN_DOMCTL_CONFIG_VIOMMU_NONE )
+    if ( config->arch.viommu_type != XEN_DOMCTL_CONFIG_VIOMMU_NONE &&
+         config->arch.viommu_type != viommu_get_type() )
     {
         dprintk(XENLOG_INFO,
                 "vIOMMU type requested not supported by the platform or Xen\n");
diff --git a/xen/arch/arm/include/asm/domain.h b/xen/arch/arm/include/asm/domain.h
index a3487ca713..25ee7c4290 100644
--- a/xen/arch/arm/include/asm/domain.h
+++ b/xen/arch/arm/include/asm/domain.h
@@ -121,6 +121,10 @@ struct arch_domain
     void *tee;
 #endif
 
+#ifdef CONFIG_VIRTUAL_IOMMU
+    struct list_head viommu_list;     /* List of virtual IOMMUs */
+#endif
+
 }  __cacheline_aligned;
 
 struct arch_vcpu
diff --git a/xen/arch/arm/include/asm/viommu.h b/xen/arch/arm/include/asm/viommu.h
index 7cd3818a12..4785877e2a 100644
--- a/xen/arch/arm/include/asm/viommu.h
+++ b/xen/arch/arm/include/asm/viommu.h
@@ -5,9 +5,21 @@
 #ifdef CONFIG_VIRTUAL_IOMMU
 
 #include <xen/lib.h>
+#include <xen/list.h>
 #include <xen/types.h>
 #include <public/xen.h>
 
+extern struct list_head host_iommu_list;
+
+/* data structure for each hardware IOMMU */
+struct host_iommu {
+    struct list_head entry;
+    const struct dt_device_node *dt_node;
+    paddr_t addr;
+    paddr_t size;
+    uint32_t irq;
+};
+
 struct viommu_ops {
     /*
      * Called during domain construction if toolstack requests to enable
@@ -35,6 +47,8 @@ struct viommu_desc {
 int domain_viommu_init(struct domain *d, uint16_t viommu_type);
 int viommu_relinquish_resources(struct domain *d);
 uint16_t viommu_get_type(void);
+void add_to_host_iommu_list(paddr_t addr, paddr_t size,
+                            const struct dt_device_node *node);
 
 #else
 
@@ -56,6 +70,12 @@ static inline int viommu_relinquish_resources(struct domain *d)
     return 0;
 }
 
+static inline void add_to_host_iommu_list(paddr_t addr, paddr_t size,
+                                          const struct dt_device_node *node)
+{
+    return;
+}
+
 #endif /* CONFIG_VIRTUAL_IOMMU */
 
 #endif /* __ARCH_ARM_VIOMMU_H__ */
diff --git a/xen/drivers/passthrough/Kconfig b/xen/drivers/passthrough/Kconfig
index fbd56eb990..d711123cac 100644
--- a/xen/drivers/passthrough/Kconfig
+++ b/xen/drivers/passthrough/Kconfig
@@ -41,6 +41,14 @@ config VIRTUAL_IOMMU
 	help
 	 Support virtual IOMMU infrastructure to implement vIOMMU.
 
+config VIRTUAL_ARM_SMMU_V3
+	bool "ARM Ltd. Virtual SMMUv3 Support (UNSUPPORTED)" if UNSUPPORTED
+	depends on ARM_SMMU_V3 && VIRTUAL_IOMMU
+	help
+	 Support for implementations of the virtual ARM System MMU architecture
+	 version 3. Virtual SMMUv3 is an unsupported feature and should not be used
+	 in production.
+
 endif
 
 config AMD_IOMMU
diff --git a/xen/drivers/passthrough/arm/Makefile b/xen/drivers/passthrough/arm/Makefile
index 4cc54f3f4d..e758a9d6aa 100644
--- a/xen/drivers/passthrough/arm/Makefile
+++ b/xen/drivers/passthrough/arm/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_ARM_SMMU) += smmu.o
 obj-$(CONFIG_IPMMU_VMSA) += ipmmu-vmsa.o
 obj-$(CONFIG_ARM_SMMU_V3) += smmu-v3.o
 obj-$(CONFIG_VIRTUAL_IOMMU) += viommu.o
+obj-$(CONFIG_VIRTUAL_ARM_SMMU_V3) += vsmmu-v3.o
diff --git a/xen/drivers/passthrough/arm/smmu-v3.c b/xen/drivers/passthrough/arm/smmu-v3.c
index 15ff703458..193c892fcd 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -93,6 +93,7 @@
 #include <asm/platform.h>
 
 #include "smmu-v3.h"
+#include "vsmmu-v3.h"
 
 #define ARM_SMMU_VTCR_SH_IS		3
 #define ARM_SMMU_VTCR_RGN_WBWA		1
@@ -2708,6 +2709,9 @@ static int __init arm_smmu_device_probe(struct platform_device *pdev)
 	list_add(&smmu->devices, &arm_smmu_devices);
 	spin_unlock(&arm_smmu_devices_lock);
 
+	/* Add to host IOMMU list to initialize vIOMMU for dom0 */
+	add_to_host_iommu_list(ioaddr, iosize, dev_to_dt(pdev));
+
 	return 0;
 
 
@@ -3039,6 +3043,9 @@ static __init int arm_smmu_dt_init(struct dt_device_node *dev,
 
 	platform_features &= smmu->features;
 
+	/* Set vIOMMU type to SMMUv3 */
+	vsmmuv3_set_type();
+
 	return 0;
 }
 
diff --git a/xen/drivers/passthrough/arm/viommu.c b/xen/drivers/passthrough/arm/viommu.c
index 7ab6061e34..53ae46349a 100644
--- a/xen/drivers/passthrough/arm/viommu.c
+++ b/xen/drivers/passthrough/arm/viommu.c
@@ -2,12 +2,42 @@
 
 #include <xen/errno.h>
 #include <xen/init.h>
+#include <xen/irq.h>
 #include <xen/types.h>
 
 #include <asm/viommu.h>
 
+/* List of all host IOMMUs */
+LIST_HEAD(host_iommu_list);
+
 const struct viommu_desc __read_mostly *cur_viommu;
 
+/* Common function for adding to host_iommu_list */
+void add_to_host_iommu_list(paddr_t addr, paddr_t size,
+                            const struct dt_device_node *node)
+{
+    struct host_iommu *iommu_data;
+
+    iommu_data = xzalloc(struct host_iommu);
+    if ( !iommu_data )
+        panic("vIOMMU: Cannot allocate memory for host IOMMU data\n");
+
+    iommu_data->addr = addr;
+    iommu_data->size = size;
+    iommu_data->dt_node = node;
+    iommu_data->irq = platform_get_irq(node, 0);
+    if ( (int)iommu_data->irq < 0 )
+    {
+        gdprintk(XENLOG_ERR, "vIOMMU: Cannot find a valid IOMMU irq\n");
+        xfree(iommu_data);
+        return;
+    }
+
+    printk("vIOMMU: Found IOMMU @0x%"PRIx64"\n", addr);
+
+    list_add_tail(&iommu_data->entry, &host_iommu_list);
+}
+
 int domain_viommu_init(struct domain *d, uint16_t viommu_type)
 {
     if ( viommu_type == XEN_DOMCTL_CONFIG_VIOMMU_NONE )
diff --git a/xen/drivers/passthrough/arm/vsmmu-v3.c b/xen/drivers/passthrough/arm/vsmmu-v3.c
new file mode 100644
index 0000000000..6b4009e5ef
--- /dev/null
+++ b/xen/drivers/passthrough/arm/vsmmu-v3.c
@@ -0,0 +1,124 @@
+/* SPDX-License-Identifier: (GPL-2.0-or-later OR BSD-2-Clause) */
+
+#include <xen/param.h>
+#include <xen/sched.h>
+#include <asm/mmio.h>
+#include <asm/viommu.h>
+
+/* Struct to hold the vIOMMU ops and vIOMMU type */
+extern const struct viommu_desc __read_mostly *cur_viommu;
+
+struct virt_smmu {
+    struct      domain *d;
+    struct      list_head viommu_list;
+};
+
+static int vsmmuv3_mmio_write(struct vcpu *v, mmio_info_t *info,
+                              register_t r, void *priv)
+{
+    return IO_HANDLED;
+}
+
+static int vsmmuv3_mmio_read(struct vcpu *v, mmio_info_t *info,
+                             register_t *r, void *priv)
+{
+    return IO_HANDLED;
+}
+
+static const struct mmio_handler_ops vsmmuv3_mmio_handler = {
+    .read  = vsmmuv3_mmio_read,
+    .write = vsmmuv3_mmio_write,
+};
+
+static int vsmmuv3_init_single(struct domain *d, paddr_t addr, paddr_t size)
+{
+    struct virt_smmu *smmu;
+
+    smmu = xzalloc(struct virt_smmu);
+    if ( !smmu )
+        return -ENOMEM;
+
+    smmu->d = d;
+
+    register_mmio_handler(d, &vsmmuv3_mmio_handler, addr, size, smmu);
+
+    /* Register the vIOMMU to be able to clean it up later. */
+    list_add_tail(&smmu->viommu_list, &d->arch.viommu_list);
+
+    return 0;
+}
+
+int domain_vsmmuv3_init(struct domain *d)
+{
+    int ret;
+    INIT_LIST_HEAD(&d->arch.viommu_list);
+
+    if ( is_hardware_domain(d) )
+    {
+        struct host_iommu *hw_iommu;
+
+        list_for_each_entry(hw_iommu, &host_iommu_list, entry)
+        {
+            ret = vsmmuv3_init_single(d, hw_iommu->addr, hw_iommu->size);
+            if ( ret )
+                return ret;
+        }
+    }
+    else
+    {
+        ret = vsmmuv3_init_single(d, GUEST_VSMMUV3_BASE, GUEST_VSMMUV3_SIZE);
+        if ( ret )
+            return ret;
+    }
+
+    return 0;
+}
+
+int vsmmuv3_relinquish_resources(struct domain *d)
+{
+    struct virt_smmu *pos, *temp;
+
+    /* Cope with uninitialized vIOMMU */
+    if ( list_head_is_null(&d->arch.viommu_list) )
+        return 0;
+
+    list_for_each_entry_safe(pos, temp, &d->arch.viommu_list, viommu_list )
+    {
+        list_del(&pos->viommu_list);
+        xfree(pos);
+    }
+
+    return 0;
+}
+
+static const struct viommu_ops vsmmuv3_ops = {
+    .domain_init = domain_vsmmuv3_init,
+    .relinquish_resources = vsmmuv3_relinquish_resources,
+};
+
+static const struct viommu_desc vsmmuv3_desc = {
+    .ops = &vsmmuv3_ops,
+    .viommu_type = XEN_DOMCTL_CONFIG_VIOMMU_SMMUV3,
+};
+
+void __init vsmmuv3_set_type(void)
+{
+    const struct viommu_desc *desc = &vsmmuv3_desc;
+
+    if ( cur_viommu && (cur_viommu != desc) )
+    {
+        printk("WARNING: Cannot set vIOMMU, already set to a different value\n");
+        return;
+    }
+
+    cur_viommu = desc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/drivers/passthrough/arm/vsmmu-v3.h b/xen/drivers/passthrough/arm/vsmmu-v3.h
new file mode 100644
index 0000000000..e11f85b431
--- /dev/null
+++ b/xen/drivers/passthrough/arm/vsmmu-v3.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: (GPL-2.0-or-later OR BSD-2-Clause) */
+#ifndef __ARCH_ARM_VSMMU_V3_H__
+#define __ARCH_ARM_VSMMU_V3_H__
+
+#include <asm/viommu.h>
+
+#ifdef CONFIG_VIRTUAL_ARM_SMMU_V3
+
+void vsmmuv3_set_type(void);
+
+#else
+
+static inline void vsmmuv3_set_type(void)
+{
+    return;
+}
+
+#endif /* CONFIG_VIRTUAL_ARM_SMMU_V3 */
+
+#endif /* __ARCH_ARM_VSMMU_V3_H__ */
diff --git a/xen/include/public/arch-arm.h b/xen/include/public/arch-arm.h
index 0a155fe615..d6201df06a 100644
--- a/xen/include/public/arch-arm.h
+++ b/xen/include/public/arch-arm.h
@@ -327,7 +327,8 @@ DEFINE_XEN_GUEST_HANDLE(vcpu_guest_context_t);
 #define XEN_DOMCTL_CONFIG_TEE_OPTEE     1
 #define XEN_DOMCTL_CONFIG_TEE_FFA       2
 
-#define XEN_DOMCTL_CONFIG_VIOMMU_NONE   0
+#define XEN_DOMCTL_CONFIG_VIOMMU_NONE       0
+#define XEN_DOMCTL_CONFIG_VIOMMU_SMMUV3     1
 
 struct xen_arch_domainconfig {
     /* IN/OUT */
@@ -450,6 +451,10 @@ typedef uint64_t xen_callback_t;
 #define GUEST_GICV3_GICR0_BASE     xen_mk_ullong(0x03020000) /* vCPU0..127 */
 #define GUEST_GICV3_GICR0_SIZE     xen_mk_ullong(0x01000000)
 
+/* vSMMUv3 mappings */
+#define GUEST_VSMMUV3_BASE     xen_mk_ullong(0x04040000)
+#define GUEST_VSMMUV3_SIZE     xen_mk_ullong(0x00040000)
+
 /*
  * 256 MB is reserved for VPCI configuration space based on calculation
  * 256 buses x 32 devices x 8 functions x 4 KB = 256 MB
-- 
2.43.0


* [PATCH 06/20] xen/domctl: Add XEN_DOMCTL_CONFIG_VIOMMU_* and viommu config param
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (4 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 05/20] xen/arm: vsmmuv3: Add dummy support for virtual SMMUv3 for guests Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 07/20] xen/arm: vIOMMU: Add cmdline boot option "viommu = <boolean>" Milan Djokic
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Anthony PERARD, Nick Rosbrook, George Dunlap,
	Juergen Gross, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

Add a new viommu_type field to xen_arch_domainconfig, together with the
values XEN_DOMCTL_CONFIG_VIOMMU_NONE and XEN_DOMCTL_CONFIG_VIOMMU_SMMUV3,
to enable/disable vIOMMU support for domains.

Also add a viommu="N" parameter to the xl domain configuration to enable
the vIOMMU for a domain. Currently, only the "smmuv3" type is supported
for Arm.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
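Illustration only, not part of the patch: the path from the xl.cfg
string to the value placed in xen_arch_domainconfig can be sketched as
two small mappings. This is a self-contained model, not the libxl code.
model_viommu_type, model_viommu_type_from_string() and
model_prepare_config() are hypothetical names standing in for the
IDL-generated enum, its from-string helper and the switch added to
libxl__arch_domain_prepare_config(). The two XEN_DOMCTL_CONFIG_VIOMMU_*
values are taken from the series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values taken from the series (xen/include/public/arch-arm.h). */
#define XEN_DOMCTL_CONFIG_VIOMMU_NONE   0
#define XEN_DOMCTL_CONFIG_VIOMMU_SMMUV3 1

/* Hypothetical mirror of the libxl_viommu_type enumeration in the IDL. */
enum model_viommu_type {
    MODEL_VIOMMU_TYPE_NONE = 0,
    MODEL_VIOMMU_TYPE_SMMUV3 = 1,
};

/* Parse the xl.cfg string; returns 0 on success, -1 for an unknown name. */
static int model_viommu_type_from_string(const char *s,
                                         enum model_viommu_type *out)
{
    if ( !strcmp(s, "none") )
        *out = MODEL_VIOMMU_TYPE_NONE;
    else if ( !strcmp(s, "smmuv3") )
        *out = MODEL_VIOMMU_TYPE_SMMUV3;
    else
        return -1;

    return 0;
}

/*
 * Translate the toolstack enum into the hypervisor constant.
 * Returns 0 on success, -1 for a value the toolstack does not know.
 */
static int model_prepare_config(enum model_viommu_type t,
                                uint8_t *viommu_type)
{
    switch ( t )
    {
    case MODEL_VIOMMU_TYPE_NONE:
        *viommu_type = XEN_DOMCTL_CONFIG_VIOMMU_NONE;
        return 0;

    case MODEL_VIOMMU_TYPE_SMMUV3:
        *viommu_type = XEN_DOMCTL_CONFIG_VIOMMU_SMMUV3;
        return 0;

    default:
        return -1;
    }
}
```

Keeping the toolstack enum and the hypervisor constants separate, with an
explicit switch between them, means an unknown value is rejected in the
toolstack instead of being passed through to the hypervisor unchecked.
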
 docs/man/xl.cfg.5.pod.in             | 13 +++++++++++++
 tools/golang/xenlight/helpers.gen.go |  2 ++
 tools/golang/xenlight/types.gen.go   |  1 +
 tools/include/libxl.h                |  5 +++++
 tools/libs/light/libxl_arm.c         | 13 +++++++++++++
 tools/libs/light/libxl_types.idl     |  6 ++++++
 tools/xl/xl_parse.c                  | 10 ++++++++++
 7 files changed, 50 insertions(+)

diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in
index f0c920b39d..25e9be073c 100644
--- a/docs/man/xl.cfg.5.pod.in
+++ b/docs/man/xl.cfg.5.pod.in
@@ -3125,6 +3125,19 @@ This option is only implemented for Arm where the default is enabled.
 
 =back
 
+=over 4
+
+=item B<viommu="N">
+
+To enable the vIOMMU, specify the following option in the VM
+config file:
+
+viommu = "smmuv3"
+
+Currently, only the "smmuv3" type is supported for ARM.
+
+=back
+
 =head3 x86
 
 =over 4
diff --git a/tools/golang/xenlight/helpers.gen.go b/tools/golang/xenlight/helpers.gen.go
index 667030cbd7..23635afefe 100644
--- a/tools/golang/xenlight/helpers.gen.go
+++ b/tools/golang/xenlight/helpers.gen.go
@@ -1163,6 +1163,7 @@ x.ArchArm.GicVersion = GicVersion(xc.arch_arm.gic_version)
 x.ArchArm.Vuart = VuartType(xc.arch_arm.vuart)
 x.ArchArm.SveVl = SveType(xc.arch_arm.sve_vl)
 x.ArchArm.NrSpis = uint32(xc.arch_arm.nr_spis)
+x.ArchArm.Viommu = ViommuType(xc.arch_arm.viommu_type)
 if err := x.ArchX86.MsrRelaxed.fromC(&xc.arch_x86.msr_relaxed);err != nil {
 return fmt.Errorf("converting field ArchX86.MsrRelaxed: %v", err)
 }
@@ -1699,6 +1700,7 @@ xc.arch_arm.gic_version = C.libxl_gic_version(x.ArchArm.GicVersion)
 xc.arch_arm.vuart = C.libxl_vuart_type(x.ArchArm.Vuart)
 xc.arch_arm.sve_vl = C.libxl_sve_type(x.ArchArm.SveVl)
 xc.arch_arm.nr_spis = C.uint32_t(x.ArchArm.NrSpis)
+xc.arch_arm.viommu_type = C.libxl_viommu_type(x.ArchArm.Viommu)
 if err := x.ArchX86.MsrRelaxed.toC(&xc.arch_x86.msr_relaxed); err != nil {
 return fmt.Errorf("converting field ArchX86.MsrRelaxed: %v", err)
 }
diff --git a/tools/golang/xenlight/types.gen.go b/tools/golang/xenlight/types.gen.go
index e26b3cdfc7..23f5a300e0 100644
--- a/tools/golang/xenlight/types.gen.go
+++ b/tools/golang/xenlight/types.gen.go
@@ -599,6 +599,7 @@ GicVersion GicVersion
 Vuart VuartType
 SveVl SveType
 NrSpis uint32
+Viommu ViommuType
 }
 ArchX86 struct {
 MsrRelaxed Defbool
diff --git a/tools/include/libxl.h b/tools/include/libxl.h
index 185f74d8a8..f985426ae1 100644
--- a/tools/include/libxl.h
+++ b/tools/include/libxl.h
@@ -313,6 +313,11 @@
  */
 #define LIBXL_HAVE_BUILDINFO_ARCH_NR_SPIS 1
 
+/*
+ * libxl_domain_build_info has the arch_arm.viommu_type field.
+ */
+#define LIBXL_HAVE_BUILDINFO_ARM_VIOMMU 1
+
 /*
  * LIBXL_HAVE_SOFT_RESET indicates that libxl supports performing
  * 'soft reset' for domains and there is 'soft_reset' shutdown reason
diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
index 4a19a8d22b..44b7bcd672 100644
--- a/tools/libs/light/libxl_arm.c
+++ b/tools/libs/light/libxl_arm.c
@@ -233,6 +233,19 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
         config->arch.sve_vl = d_config->b_info.arch_arm.sve_vl / 128U;
     }
 
+    switch (d_config->b_info.arch_arm.viommu_type) {
+    case LIBXL_VIOMMU_TYPE_NONE:
+        config->arch.viommu_type = XEN_DOMCTL_CONFIG_VIOMMU_NONE;
+        break;
+    case LIBXL_VIOMMU_TYPE_SMMUV3:
+        config->arch.viommu_type = XEN_DOMCTL_CONFIG_VIOMMU_SMMUV3;
+        break;
+    default:
+        LOG(ERROR, "Unknown vIOMMU type %d",
+            d_config->b_info.arch_arm.viommu_type);
+        return ERROR_FAIL;
+    }
+
     return 0;
 }
 
diff --git a/tools/libs/light/libxl_types.idl b/tools/libs/light/libxl_types.idl
index a6030a2dbd..0c8d2c7156 100644
--- a/tools/libs/light/libxl_types.idl
+++ b/tools/libs/light/libxl_types.idl
@@ -551,6 +551,11 @@ libxl_sve_type = Enumeration("sve_type", [
     (2048, "2048")
     ], init_val = "LIBXL_SVE_TYPE_DISABLED")
 
+libxl_viommu_type = Enumeration("viommu_type", [
+    (0, "none"),
+    (1, "smmuv3")
+    ], init_val = "LIBXL_VIOMMU_TYPE_NONE")
+
 libxl_rdm_reserve = Struct("rdm_reserve", [
     ("strategy",    libxl_rdm_reserve_strategy),
     ("policy",      libxl_rdm_reserve_policy),
@@ -725,6 +730,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
                                ("vuart", libxl_vuart_type),
                                ("sve_vl", libxl_sve_type),
                                ("nr_spis", uint32, {'init_val': 'LIBXL_NR_SPIS_DEFAULT'}),
+                               ("viommu_type", libxl_viommu_type),
                               ])),
     ("arch_x86", Struct(None, [("msr_relaxed", libxl_defbool),
                               ])),
diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c
index 90c9386f5b..32236f7e5f 100644
--- a/tools/xl/xl_parse.c
+++ b/tools/xl/xl_parse.c
@@ -2989,6 +2989,16 @@ skip_usbdev:
     xlu_cfg_get_defbool(config, "trap_unmapped_accesses",
                         &b_info->trap_unmapped_accesses, 0);
 
+
+    if (!xlu_cfg_get_string (config, "viommu", &buf, 1)) {
+        e = libxl_viommu_type_from_string(buf, &b_info->arch_arm.viommu_type);
+        if (e) {
+            fprintf(stderr,
+                    "Unknown vIOMMU type \"%s\" specified\n", buf);
+            exit(-ERROR_FAIL);
+        }
+    }
+
     parse_vkb_list(config, d_config);
 
     d_config->virtios = NULL;
-- 
2.43.0



* [PATCH 07/20] xen/arm: vIOMMU: Add cmdline boot option "viommu = <boolean>"
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (5 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 06/20] xen/domctl: Add XEN_DOMCTL_CONFIG_VIOMMU_* and viommu config param Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-08  6:20   ` Jan Beulich
  2025-08-07 16:59 ` [PATCH 08/20] xen/arm: vsmmuv3: Add support for registers emulation Milan Djokic
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, Bertrand Marquis, Volodymyr Babchuk,
	Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

Add cmdline boot option "viommu = <boolean>" to enable or disable the
virtual iommu support for guests on ARM.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
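Illustration only, not part of the patch: with this change the order of
checks in domain_viommu_init() matters, so here is a small standalone
model of it. This is a sketch under stated assumptions, not the Xen
code. model_viommu_desc flattens the viommu_desc/viommu_ops pair into
one hypothetical struct, the global viommu_enabled and cur_viommu are
passed as parameters, and model_count_init() is a made-up callback that
only counts invocations.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the series (xen/include/public/arch-arm.h). */
#define XEN_DOMCTL_CONFIG_VIOMMU_NONE   0
#define XEN_DOMCTL_CONFIG_VIOMMU_SMMUV3 1

/* Hypothetical, flattened stand-in for struct viommu_desc and its ops. */
struct model_viommu_desc {
    uint16_t viommu_type;
    int (*domain_init)(void *d);
};

/* Made-up init callback: 'd' points at an int counting the calls. */
static int model_count_init(void *d)
{
    ++*(int *)d;
    return 0;
}

/*
 * Same order of checks as domain_viommu_init() after this patch:
 * the command line switch first, then the requested type, then the
 * registered vIOMMU implementation.
 */
static int model_domain_viommu_init(bool enabled,
                                    const struct model_viommu_desc *cur,
                                    void *d, uint16_t viommu_type)
{
    if ( !enabled )
        return 0;

    if ( viommu_type == XEN_DOMCTL_CONFIG_VIOMMU_NONE )
        return 0;

    if ( !cur )
        return -ENODEV;

    if ( cur->viommu_type != viommu_type )
        return -EINVAL;

    return cur->domain_init(d);
}
```

Note that in this model a disabled vIOMMU makes the function return 0
without calling the init callback, even when a specific type was
requested. Rejecting such a request is left to the caller's earlier
configuration check.
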
 docs/misc/xen-command-line.pandoc      |  7 +++++++
 xen/arch/arm/include/asm/viommu.h      | 11 +++++++++++
 xen/drivers/passthrough/arm/viommu.c   |  9 +++++++++
 xen/drivers/passthrough/arm/vsmmu-v3.c |  3 +++
 4 files changed, 30 insertions(+)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 6865a61220..d6a2a6fbdd 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2058,6 +2058,13 @@ This option can be specified more than once (up to 8 times at present).
 
 Flag to enable or disable support for PCI passthrough
 
+### viommu (arm)
+> `= <boolean>`
+
+> Default: `false`
+
+Flag to enable or disable support for Virtual IOMMU for guests.
+
 ### pcid (x86)
 > `= <boolean> | xpti=<bool>`
 
diff --git a/xen/arch/arm/include/asm/viommu.h b/xen/arch/arm/include/asm/viommu.h
index 4785877e2a..4de4cceeda 100644
--- a/xen/arch/arm/include/asm/viommu.h
+++ b/xen/arch/arm/include/asm/viommu.h
@@ -10,6 +10,7 @@
 #include <public/xen.h>
 
 extern struct list_head host_iommu_list;
+extern bool viommu_enabled;
 
 /* data structure for each hardware IOMMU */
 struct host_iommu {
@@ -50,6 +51,11 @@ uint16_t viommu_get_type(void);
 void add_to_host_iommu_list(paddr_t addr, paddr_t size,
                             const struct dt_device_node *node);
 
+static always_inline bool is_viommu_enabled(void)
+{
+    return viommu_enabled;
+}
+
 #else
 
 static inline uint8_t viommu_get_type(void)
@@ -76,6 +82,11 @@ static inline void add_to_host_iommu_list(paddr_t addr, paddr_t size,
     return;
 }
 
+static always_inline bool is_viommu_enabled(void)
+{
+    return false;
+}
+
 #endif /* CONFIG_VIRTUAL_IOMMU */
 
 #endif /* __ARCH_ARM_VIOMMU_H__ */
diff --git a/xen/drivers/passthrough/arm/viommu.c b/xen/drivers/passthrough/arm/viommu.c
index 53ae46349a..a1d6a04ba9 100644
--- a/xen/drivers/passthrough/arm/viommu.c
+++ b/xen/drivers/passthrough/arm/viommu.c
@@ -3,6 +3,7 @@
 #include <xen/errno.h>
 #include <xen/init.h>
 #include <xen/irq.h>
+#include <xen/param.h>
 #include <xen/types.h>
 
 #include <asm/viommu.h>
@@ -38,8 +39,16 @@ void add_to_host_iommu_list(paddr_t addr, paddr_t size,
     list_add_tail(&iommu_data->entry, &host_iommu_list);
 }
 
+/* By default viommu is disabled. */
+bool __read_mostly viommu_enabled;
+boolean_param("viommu", viommu_enabled);
+
 int domain_viommu_init(struct domain *d, uint16_t viommu_type)
 {
+    /* Only set up the vIOMMU if it was enabled explicitly (viommu=on). */
+    if ( !viommu_enabled )
+        return 0;
+
     if ( viommu_type == XEN_DOMCTL_CONFIG_VIOMMU_NONE )
         return 0;
 
diff --git a/xen/drivers/passthrough/arm/vsmmu-v3.c b/xen/drivers/passthrough/arm/vsmmu-v3.c
index 6b4009e5ef..e36f200ba5 100644
--- a/xen/drivers/passthrough/arm/vsmmu-v3.c
+++ b/xen/drivers/passthrough/arm/vsmmu-v3.c
@@ -105,6 +105,9 @@ void __init vsmmuv3_set_type(void)
 {
     const struct viommu_desc *desc = &vsmmuv3_desc;
 
+    if ( !is_viommu_enabled() )
+        return;
+
     if ( cur_viommu && (cur_viommu != desc) )
     {
         printk("WARNING: Cannot set vIOMMU, already set to a different value\n");
-- 
2.43.0



* Re: [PATCH 07/20] xen/arm: vIOMMU: Add cmdline boot option "viommu = <boolean>"
  2025-08-07 16:59 ` [PATCH 07/20] xen/arm: vIOMMU: Add cmdline boot option "viommu = <boolean>" Milan Djokic
@ 2025-08-08  6:20   ` Jan Beulich
  2025-08-13 10:30     ` Milan Djokic
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2025-08-08  6:20 UTC (permalink / raw)
  To: Milan Djokic
  Cc: Rahul Singh, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	Bertrand Marquis, Volodymyr Babchuk,
	xen-devel@lists.xenproject.org

On 07.08.2025 18:59, Milan Djokic wrote:
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -2058,6 +2058,13 @@ This option can be specified more than once (up to 8 times at present).
>  
>  Flag to enable or disable support for PCI passthrough
>  
> +### viommu (arm)
> +> `= <boolean>`
> +
> +> Default: `false`
> +
> +Flag to enable or disable support for Virtual IOMMU for guests.

Isn't this too coarse-grained? IOW don't you want to specify the kind of IOMMU
to use, even if right now only one is being implemented? Also - why on the
command line? Isn't this rather a per-guest setting?

>  ### pcid (x86)
>  > `= <boolean> | xpti=<bool>`

Also, seeing this adjacent entry, you're inserting in the wrong place. The
options are supposed to be alphabetically sorted.

Jan



* Re: [PATCH 07/20] xen/arm: vIOMMU: Add cmdline boot option "viommu = <boolean>"
  2025-08-08  6:20   ` Jan Beulich
@ 2025-08-13 10:30     ` Milan Djokic
  0 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-13 10:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Rahul Singh, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	Bertrand Marquis, Volodymyr Babchuk,
	xen-devel@lists.xenproject.org

Hello Jan,

On 8/8/25 08:20, Jan Beulich wrote:
> On 07.08.2025 18:59, Milan Djokic wrote:
>> --- a/docs/misc/xen-command-line.pandoc
>> +++ b/docs/misc/xen-command-line.pandoc
>> @@ -2058,6 +2058,13 @@ This option can be specified more than once (up to 8 times at present).
>>   
>>   Flag to enable or disable support for PCI passthrough
>>   
>> +### viommu (arm)
>> +> `= <boolean>`
>> +
>> +> Default: `false`
>> +
>> +Flag to enable or disable support for Virtual IOMMU for guests.
> 
> Isn't this too coarse-grained? IOW don't you want to specify the kind of IOMMU
> to use, even if right now only one is being implemented? Also - why on the
> command line? Isn't this rather a per-guest setting?
>
Yes, it would make more sense to specify the IOMMU kind instead of a
plain enable/disable parameter. This is updated in the new version of
the patch series.
As for why this parameter is on the Xen command line: it is needed for
the Xen SMMUv3 driver initialization, where the vIOMMU is initialized
only if 'viommu' is enabled. Also, for dom0less mode and for dom0
itself, the vIOMMU is enabled at domain creation based on this
argument. We do have a per-guest vIOMMU option for guests created with
xl, but it does not cover dom0 (or dom0less domains).

>>   ### pcid (x86)
>>   > `= <boolean> | xpti=<bool>`
> 
> Also, seeing this adjacent entry, you're inserting in the wrong place. The
> options are supposed to be alphabetically sorted.
>
We'll fix this in the new version, together with the other suggested
changes.

> Jan

BR,
Milan




* [PATCH 08/20] xen/arm: vsmmuv3: Add support for registers emulation
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (6 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 07/20] xen/arm: vIOMMU: Add cmdline boot option "viommu = <boolean>" Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 09/20] xen/arm: vsmmuv3: Add support for cmdqueue handling Milan Djokic
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

Add initial support for emulating various virtual SMMUv3 registers for
guests, and add support for the virtual cmdq and eventq.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
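Illustration only, not part of the patch: the write handler keeps a
shadow copy of each register and merges guest writes into it. The
sketch below models two pieces of that in plain C: merging a 32-bit
access into a 64-bit shadow register, and deriving the clamped command
queue size from a queue base value. It is not the Xen implementation.
model_reg64_write32() and model_cmdq_shift() are hypothetical helpers,
and the assumption that the log2 size field occupies bits [4:0] of the
queue base register is mine. SMMU_CMDQS is taken from the patch.

```c
#include <stdint.h>

#define SMMU_CMDQS 19 /* from the patch: maximum log2 queue size */

/* Assumption for this sketch: LOG2SIZE lives in bits [4:0] of Q_BASE. */
#define MODEL_Q_BASE_LOG2SIZE_MASK 0x1fULL

/*
 * Merge a 32-bit write into a 64-bit shadow register.
 * 'offset' is the byte offset of the access within the register:
 * 0 selects the low word, 4 selects the high word.
 */
static uint64_t model_reg64_write32(uint64_t reg, uint32_t val,
                                    unsigned int offset)
{
    unsigned int shift = (offset & 4) ? 32 : 0;

    reg &= ~(0xffffffffULL << shift);
    reg |= (uint64_t)val << shift;

    return reg;
}

/* Extract the log2 queue size and clamp it to the emulated maximum. */
static uint8_t model_cmdq_shift(uint64_t q_base)
{
    uint8_t shift = (uint8_t)(q_base & MODEL_Q_BASE_LOG2SIZE_MASK);

    return shift > SMMU_CMDQS ? SMMU_CMDQS : shift;
}
```

For example, a guest that programs a queue base with a size field of 31
would be limited to 2^19 entries in this model, while a size field of 8
is kept as is.
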
 xen/drivers/passthrough/arm/smmu-v3.h  |   6 +
 xen/drivers/passthrough/arm/vsmmu-v3.c | 286 +++++++++++++++++++++++++
 2 files changed, 292 insertions(+)

diff --git a/xen/drivers/passthrough/arm/smmu-v3.h b/xen/drivers/passthrough/arm/smmu-v3.h
index 3411edc47f..d54f0a79f2 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.h
+++ b/xen/drivers/passthrough/arm/smmu-v3.h
@@ -60,6 +60,12 @@
 #define IDR5_VAX			GENMASK(11, 10)
 #define IDR5_VAX_52_BIT			1
 
+#define ARM_SMMU_IIDR			0x18
+#define IIDR_PRODUCTID			GENMASK(31, 20)
+#define IIDR_VARIANT			GENMASK(19, 16)
+#define IIDR_REVISION			GENMASK(15, 12)
+#define IIDR_IMPLEMENTER		GENMASK(11, 0)
+
 #define ARM_SMMU_CR0			0x20
 #define CR0_ATSCHK			(1 << 4)
 #define CR0_CMDQEN			(1 << 3)
diff --git a/xen/drivers/passthrough/arm/vsmmu-v3.c b/xen/drivers/passthrough/arm/vsmmu-v3.c
index e36f200ba5..3ae1e62a50 100644
--- a/xen/drivers/passthrough/arm/vsmmu-v3.c
+++ b/xen/drivers/passthrough/arm/vsmmu-v3.c
@@ -3,25 +3,307 @@
 #include <xen/param.h>
 #include <xen/sched.h>
 #include <asm/mmio.h>
+#include <asm/vgic-emul.h>
 #include <asm/viommu.h>
+#include <asm/vreg.h>
+
+#include "smmu-v3.h"
+
+/* Register Definition */
+#define ARM_SMMU_IDR2       0x8
+#define ARM_SMMU_IDR3       0xc
+#define ARM_SMMU_IDR4       0x10
+#define IDR0_TERM_MODEL     (1 << 26)
+#define IDR3_RIL            (1 << 10)
+#define CR0_RESERVED        0xFFFFFC20
+#define SMMU_IDR1_SIDSIZE   16
+#define SMMU_CMDQS          19
+#define SMMU_EVTQS          19
+#define DWORDS_BYTES        8
+#define ARM_SMMU_IIDR_VAL   0x12
 
 /* Struct to hold the vIOMMU ops and vIOMMU type */
 extern const struct viommu_desc __read_mostly *cur_viommu;
 
+/* virtual smmu queue */
+struct arm_vsmmu_queue {
+    uint64_t    q_base; /* base register */
+    uint32_t    prod;
+    uint32_t    cons;
+    uint8_t     ent_size;
+    uint8_t     max_n_shift;
+};
+
 struct virt_smmu {
     struct      domain *d;
     struct      list_head viommu_list;
+    uint8_t     sid_split;
+    uint32_t    features;
+    uint32_t    cr[3];
+    uint32_t    cr0ack;
+    uint32_t    gerror;
+    uint32_t    gerrorn;
+    uint32_t    strtab_base_cfg;
+    uint64_t    strtab_base;
+    uint32_t    irq_ctrl;
+    uint64_t    gerror_irq_cfg0;
+    uint64_t    evtq_irq_cfg0;
+    struct      arm_vsmmu_queue evtq, cmdq;
 };
 
 static int vsmmuv3_mmio_write(struct vcpu *v, mmio_info_t *info,
                               register_t r, void *priv)
 {
+    struct virt_smmu *smmu = priv;
+    uint64_t reg;
+    uint32_t reg32;
+
+    switch ( info->gpa & 0xffff )
+    {
+    case VREG32(ARM_SMMU_CR0):
+        reg32 = smmu->cr[0];
+        vreg_reg32_update(&reg32, r, info);
+        smmu->cr[0] = reg32;
+        smmu->cr0ack = reg32 & ~CR0_RESERVED;
+        break;
+
+    case VREG32(ARM_SMMU_CR1):
+        reg32 = smmu->cr[1];
+        vreg_reg32_update(&reg32, r, info);
+        smmu->cr[1] = reg32;
+        break;
+
+    case VREG32(ARM_SMMU_CR2):
+        reg32 = smmu->cr[2];
+        vreg_reg32_update(&reg32, r, info);
+        smmu->cr[2] = reg32;
+        break;
+
+    case VREG64(ARM_SMMU_STRTAB_BASE):
+        reg = smmu->strtab_base;
+        vreg_reg64_update(&reg, r, info);
+        smmu->strtab_base = reg;
+        break;
+
+    case VREG32(ARM_SMMU_STRTAB_BASE_CFG):
+        reg32 = smmu->strtab_base_cfg;
+        vreg_reg32_update(&reg32, r, info);
+        smmu->strtab_base_cfg = reg32;
+
+        smmu->sid_split = FIELD_GET(STRTAB_BASE_CFG_SPLIT, reg32);
+        smmu->features |= STRTAB_BASE_CFG_FMT_2LVL;
+        break;
+
+    case VREG64(ARM_SMMU_CMDQ_BASE):
+        reg = smmu->cmdq.q_base;
+        vreg_reg64_update(&reg, r, info);
+        smmu->cmdq.q_base = reg;
+        smmu->cmdq.max_n_shift = FIELD_GET(Q_BASE_LOG2SIZE, smmu->cmdq.q_base);
+        if ( smmu->cmdq.max_n_shift > SMMU_CMDQS )
+            smmu->cmdq.max_n_shift = SMMU_CMDQS;
+        break;
+
+    case VREG32(ARM_SMMU_CMDQ_PROD):
+        reg32 = smmu->cmdq.prod;
+        vreg_reg32_update(&reg32, r, info);
+        smmu->cmdq.prod = reg32;
+        break;
+
+    case VREG32(ARM_SMMU_CMDQ_CONS):
+        reg32 = smmu->cmdq.cons;
+        vreg_reg32_update(&reg32, r, info);
+        smmu->cmdq.cons = reg32;
+        break;
+
+    case VREG64(ARM_SMMU_EVTQ_BASE):
+        reg = smmu->evtq.q_base;
+        vreg_reg64_update(&reg, r, info);
+        smmu->evtq.q_base = reg;
+        smmu->evtq.max_n_shift = FIELD_GET(Q_BASE_LOG2SIZE, smmu->evtq.q_base);
+        if ( smmu->evtq.max_n_shift > SMMU_EVTQS )
+            smmu->evtq.max_n_shift = SMMU_EVTQS;
+        break;
+
+    case VREG32(ARM_SMMU_EVTQ_PROD):
+        reg32 = smmu->evtq.prod;
+        vreg_reg32_update(&reg32, r, info);
+        smmu->evtq.prod = reg32;
+        break;
+
+    case VREG32(ARM_SMMU_EVTQ_CONS):
+        reg32 = smmu->evtq.cons;
+        vreg_reg32_update(&reg32, r, info);
+        smmu->evtq.cons = reg32;
+        break;
+
+    case VREG32(ARM_SMMU_IRQ_CTRL):
+        reg32 = smmu->irq_ctrl;
+        vreg_reg32_update(&reg32, r, info);
+        smmu->irq_ctrl = reg32;
+        break;
+
+    case VREG64(ARM_SMMU_GERROR_IRQ_CFG0):
+        reg = smmu->gerror_irq_cfg0;
+        vreg_reg64_update(&reg, r, info);
+        smmu->gerror_irq_cfg0 = reg;
+        break;
+
+    case VREG64(ARM_SMMU_EVTQ_IRQ_CFG0):
+        reg = smmu->evtq_irq_cfg0;
+        vreg_reg64_update(&reg, r, info);
+        smmu->evtq_irq_cfg0 = reg;
+        break;
+
+    case VREG32(ARM_SMMU_GERRORN):
+        reg32 = smmu->gerrorn;
+        vreg_reg32_update(&reg32, r, info);
+        smmu->gerrorn = reg32;
+        break;
+
+    default:
+        printk(XENLOG_G_ERR
+               "%pv: vSMMUv3: unhandled write r%d offset %"PRIpaddr"\n",
+               v, info->dabt.reg, (unsigned long)info->gpa & 0xffff);
+        return IO_ABORT;
+    }
+
     return IO_HANDLED;
 }
 
 static int vsmmuv3_mmio_read(struct vcpu *v, mmio_info_t *info,
                              register_t *r, void *priv)
 {
+    struct virt_smmu *smmu = priv;
+    uint64_t reg;
+
+    switch ( info->gpa & 0xffff )
+    {
+    case VREG32(ARM_SMMU_IDR0):
+        reg  = FIELD_PREP(IDR0_S1P, 1) | FIELD_PREP(IDR0_TTF, 2) |
+            FIELD_PREP(IDR0_COHACC, 0) | FIELD_PREP(IDR0_ASID16, 1) |
+            FIELD_PREP(IDR0_TTENDIAN, 0) | FIELD_PREP(IDR0_STALL_MODEL, 1) |
+            FIELD_PREP(IDR0_ST_LVL, 1) | FIELD_PREP(IDR0_TERM_MODEL, 1);
+        *r = vreg_reg32_extract(reg, info);
+        break;
+
+    case VREG32(ARM_SMMU_IDR1):
+        reg  = FIELD_PREP(IDR1_SIDSIZE, SMMU_IDR1_SIDSIZE) |
+            FIELD_PREP(IDR1_CMDQS, SMMU_CMDQS) |
+            FIELD_PREP(IDR1_EVTQS, SMMU_EVTQS);
+        *r = vreg_reg32_extract(reg, info);
+        break;
+
+    case VREG32(ARM_SMMU_IDR2):
+        goto read_reserved;
+
+    case VREG32(ARM_SMMU_IDR3):
+        reg  = FIELD_PREP(IDR3_RIL, 0);
+        *r = vreg_reg32_extract(reg, info);
+        break;
+
+    case VREG32(ARM_SMMU_IDR4):
+        goto read_impl_defined;
+
+    case VREG32(ARM_SMMU_IDR5):
+        reg  = FIELD_PREP(IDR5_GRAN4K, 1) | FIELD_PREP(IDR5_GRAN16K, 1) |
+            FIELD_PREP(IDR5_GRAN64K, 1) | FIELD_PREP(IDR5_OAS, IDR5_OAS_48_BIT);
+        *r = vreg_reg32_extract(reg, info);
+        break;
+
+    case VREG32(ARM_SMMU_IIDR):
+        *r = vreg_reg32_extract(ARM_SMMU_IIDR_VAL, info);
+        break;
+
+    case VREG32(ARM_SMMU_CR0):
+        *r = vreg_reg32_extract(smmu->cr[0], info);
+        break;
+
+    case VREG32(ARM_SMMU_CR0ACK):
+        *r = vreg_reg32_extract(smmu->cr0ack, info);
+        break;
+
+    case VREG32(ARM_SMMU_CR1):
+        *r = vreg_reg32_extract(smmu->cr[1], info);
+        break;
+
+    case VREG32(ARM_SMMU_CR2):
+        *r = vreg_reg32_extract(smmu->cr[2], info);
+        break;
+
+    case VREG64(ARM_SMMU_STRTAB_BASE):
+        *r = vreg_reg64_extract(smmu->strtab_base, info);
+        break;
+
+    case VREG32(ARM_SMMU_STRTAB_BASE_CFG):
+        *r = vreg_reg32_extract(smmu->strtab_base_cfg, info);
+        break;
+
+    case VREG64(ARM_SMMU_CMDQ_BASE):
+        *r = vreg_reg64_extract(smmu->cmdq.q_base, info);
+        break;
+
+    case VREG32(ARM_SMMU_CMDQ_PROD):
+        *r = vreg_reg32_extract(smmu->cmdq.prod, info);
+        break;
+
+    case VREG32(ARM_SMMU_CMDQ_CONS):
+        *r = vreg_reg32_extract(smmu->cmdq.cons, info);
+        break;
+
+    case VREG64(ARM_SMMU_EVTQ_BASE):
+        *r = vreg_reg64_extract(smmu->evtq.q_base, info);
+        break;
+
+    case VREG32(ARM_SMMU_EVTQ_PROD):
+        *r = vreg_reg32_extract(smmu->evtq.prod, info);
+        break;
+
+    case VREG32(ARM_SMMU_EVTQ_CONS):
+        *r = vreg_reg32_extract(smmu->evtq.cons, info);
+        break;
+
+    case VREG32(ARM_SMMU_IRQ_CTRL):
+    case VREG32(ARM_SMMU_IRQ_CTRLACK):
+        *r = vreg_reg32_extract(smmu->irq_ctrl, info);
+        break;
+
+    case VREG64(ARM_SMMU_GERROR_IRQ_CFG0):
+        *r = vreg_reg64_extract(smmu->gerror_irq_cfg0, info);
+        break;
+
+    case VREG64(ARM_SMMU_EVTQ_IRQ_CFG0):
+        *r = vreg_reg64_extract(smmu->evtq_irq_cfg0, info);
+        break;
+
+    case VREG32(ARM_SMMU_GERROR):
+        *r = vreg_reg32_extract(smmu->gerror, info);
+        break;
+
+    case VREG32(ARM_SMMU_GERRORN):
+        *r = vreg_reg32_extract(smmu->gerrorn, info);
+        break;
+
+    default:
+        printk(XENLOG_G_ERR
+               "%pv: vSMMUv3: unhandled read r%d offset %"PRIpaddr"\n",
+               v, info->dabt.reg, (unsigned long)info->gpa & 0xffff);
+        return IO_ABORT;
+    }
+
+    return IO_HANDLED;
+
+ read_impl_defined:
+    printk(XENLOG_G_DEBUG
+           "%pv: vSMMUv3: RAZ on implementation defined register offset %"PRIpaddr"\n",
+           v, info->gpa & 0xffff);
+    *r = 0;
+    return IO_HANDLED;
+
+ read_reserved:
+    printk(XENLOG_G_DEBUG
+           "%pv: vSMMUv3: RAZ on reserved register offset %"PRIpaddr"\n",
+           v, info->gpa & 0xffff);
+    *r = 0;
     return IO_HANDLED;
 }
 
@@ -39,6 +321,10 @@ static int vsmmuv3_init_single(struct domain *d, paddr_t addr, paddr_t size)
         return -ENOMEM;
 
     smmu->d = d;
+    smmu->cmdq.q_base = FIELD_PREP(Q_BASE_LOG2SIZE, SMMU_CMDQS);
+    smmu->cmdq.ent_size = CMDQ_ENT_DWORDS * DWORDS_BYTES;
+    smmu->evtq.q_base = FIELD_PREP(Q_BASE_LOG2SIZE, SMMU_EVTQS);
+    smmu->evtq.ent_size = EVTQ_ENT_DWORDS * DWORDS_BYTES;
 
     register_mmio_handler(d, &vsmmuv3_mmio_handler, addr, size, smmu);
 
-- 
2.43.0



* [PATCH 09/20] xen/arm: vsmmuv3: Add support for cmdqueue handling
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (7 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 08/20] xen/arm: vsmmuv3: Add support for registers emulation Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 10/20] xen/arm: vsmmuv3: Add support for command CMD_CFGI_STE Milan Djokic
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

Add support for virtual command queue handling for guests.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
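For illustration only, the index arithmetic behind queue_empty() and
queue_inc_cons() can be sketched as standalone C. All sketch_* names are
hypothetical and are not Xen identifiers. The layout used here (entry index
in the low max_n_shift bits, a single wrap bit at max_n_shift, an overflow
flag in bit 31) is an assumption intended to mirror the Q_IDX/Q_WRP/Q_OVF
macros from smmu-v3.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct arm_vsmmu_queue (illustration only). */
struct sketch_queue {
    uint32_t prod;
    uint32_t cons;
    uint8_t  max_n_shift;          /* log2(number of entries) */
};

#define SKETCH_OVF_FLAG (1u << 31) /* assumed position of the overflow flag */

/* Entry index: the low max_n_shift bits of a prod/cons value. */
static uint32_t sketch_idx(const struct sketch_queue *q, uint32_t p)
{
    assert(q->max_n_shift < 31);
    return p & ((1u << q->max_n_shift) - 1);
}

/* Wrap bit: toggles each time a pointer passes the end of the ring. */
static uint32_t sketch_wrp(const struct sketch_queue *q, uint32_t p)
{
    assert(q->max_n_shift < 31);
    return p & (1u << q->max_n_shift);
}

/* Empty: same index and same wrap bit. */
static bool sketch_queue_empty(const struct sketch_queue *q)
{
    return sketch_idx(q, q->prod) == sketch_idx(q, q->cons) &&
           sketch_wrp(q, q->prod) == sketch_wrp(q, q->cons);
}

/* Full: same index, but the producer is one lap ahead. */
static bool sketch_queue_full(const struct sketch_queue *q)
{
    return sketch_idx(q, q->prod) == sketch_idx(q, q->cons) &&
           sketch_wrp(q, q->prod) != sketch_wrp(q, q->cons);
}

/* Advance cons by one entry, carrying into the wrap bit and keeping OVF. */
static void sketch_queue_inc_cons(struct sketch_queue *q)
{
    uint32_t cons = (sketch_wrp(q, q->cons) | sketch_idx(q, q->cons)) + 1;

    q->cons = (q->cons & SKETCH_OVF_FLAG) | sketch_wrp(q, cons) |
              sketch_idx(q, cons);
}
```

With a four-entry queue (max_n_shift = 2), prod = 4 and cons = 0 is the full
state; four calls to sketch_queue_inc_cons() bring cons to 4, at which point
the queue is empty again.
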
 xen/drivers/passthrough/arm/vsmmu-v3.c | 101 +++++++++++++++++++++++++
 1 file changed, 101 insertions(+)

diff --git a/xen/drivers/passthrough/arm/vsmmu-v3.c b/xen/drivers/passthrough/arm/vsmmu-v3.c
index 3ae1e62a50..84bd1f7327 100644
--- a/xen/drivers/passthrough/arm/vsmmu-v3.c
+++ b/xen/drivers/passthrough/arm/vsmmu-v3.c
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: (GPL-2.0-or-later OR BSD-2-Clause) */
 
+#include <xen/guest_access.h>
 #include <xen/param.h>
 #include <xen/sched.h>
 #include <asm/mmio.h>
@@ -25,6 +26,26 @@
 /* Struct to hold the vIOMMU ops and vIOMMU type */
 extern const struct viommu_desc __read_mostly *cur_viommu;
 
+/* SMMUv3 command definitions */
+#define CMDQ_OP_PREFETCH_CFG    0x1
+#define CMDQ_OP_CFGI_STE        0x3
+#define CMDQ_OP_CFGI_ALL        0x4
+#define CMDQ_OP_CFGI_CD         0x5
+#define CMDQ_OP_CFGI_CD_ALL     0x6
+#define CMDQ_OP_TLBI_NH_ASID    0x11
+#define CMDQ_OP_TLBI_NH_VA      0x12
+#define CMDQ_OP_TLBI_NSNH_ALL   0x30
+#define CMDQ_OP_CMD_SYNC        0x46
+
+/* Queue Handling */
+#define Q_BASE(q)       ((q)->q_base & Q_BASE_ADDR_MASK)
+#define Q_CONS_ENT(q)   (Q_BASE(q) + Q_IDX(q, (q)->cons) * (q)->ent_size)
+#define Q_PROD_ENT(q)   (Q_BASE(q) + Q_IDX(q, (q)->prod) * (q)->ent_size)
+
+/* Helper Macros */
+#define smmu_get_cmdq_enabled(x)    FIELD_GET(CR0_CMDQEN, x)
+#define smmu_cmd_get_command(x)     FIELD_GET(CMDQ_0_OP, x)
+
 /* virtual smmu queue */
 struct arm_vsmmu_queue {
     uint64_t    q_base; /* base register */
@@ -49,8 +70,80 @@ struct virt_smmu {
     uint64_t    gerror_irq_cfg0;
     uint64_t    evtq_irq_cfg0;
     struct      arm_vsmmu_queue evtq, cmdq;
+    spinlock_t  cmd_queue_lock;
 };
 
+/* Queue manipulation functions */
+static bool queue_empty(struct arm_vsmmu_queue *q)
+{
+    return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
+           Q_WRP(q, q->prod) == Q_WRP(q, q->cons);
+}
+
+static void queue_inc_cons(struct arm_vsmmu_queue *q)
+{
+    uint32_t cons = (Q_WRP(q, q->cons) | Q_IDX(q, q->cons)) + 1;
+    q->cons = Q_OVF(q->cons) | Q_WRP(q, cons) | Q_IDX(q, cons);
+}
+
+static void dump_smmu_command(uint64_t *command)
+{
+    gdprintk(XENLOG_ERR, "cmd 0x%02llx: %016lx %016lx\n",
+             smmu_cmd_get_command(command[0]), command[0], command[1]);
+}
+static int arm_vsmmu_handle_cmds(struct virt_smmu *smmu)
+{
+    struct arm_vsmmu_queue *q = &smmu->cmdq;
+    struct domain *d = smmu->d;
+    uint64_t command[CMDQ_ENT_DWORDS];
+    paddr_t addr;
+
+    if ( !smmu_get_cmdq_enabled(smmu->cr[0]) )
+        return 0;
+
+    while ( !queue_empty(q) )
+    {
+        int ret;
+
+        addr = Q_CONS_ENT(q);
+        ret = access_guest_memory_by_ipa(d, addr, command,
+                                         sizeof(command), false);
+        if ( ret )
+            return ret;
+
+        switch ( smmu_cmd_get_command(command[0]) )
+        {
+        case CMDQ_OP_CFGI_STE:
+            break;
+        case CMDQ_OP_PREFETCH_CFG:
+        case CMDQ_OP_CFGI_CD:
+        case CMDQ_OP_CFGI_CD_ALL:
+        case CMDQ_OP_CFGI_ALL:
+        case CMDQ_OP_CMD_SYNC:
+            break;
+        case CMDQ_OP_TLBI_NH_ASID:
+        case CMDQ_OP_TLBI_NSNH_ALL:
+        case CMDQ_OP_TLBI_NH_VA:
+            if ( !iommu_iotlb_flush_all(smmu->d, 1) )
+                break;
+        default:
+            gdprintk(XENLOG_ERR, "vSMMUv3: unhandled command\n");
+            dump_smmu_command(command);
+            break;
+        }
+
+        if ( ret )
+        {
+            gdprintk(XENLOG_ERR,
+                     "vSMMUv3: command error %d while handling command\n",
+                     ret);
+            dump_smmu_command(command);
+        }
+        queue_inc_cons(q);
+    }
+    return 0;
+}
+
 static int vsmmuv3_mmio_write(struct vcpu *v, mmio_info_t *info,
                               register_t r, void *priv)
 {
@@ -104,9 +197,15 @@ static int vsmmuv3_mmio_write(struct vcpu *v, mmio_info_t *info,
         break;
 
     case VREG32(ARM_SMMU_CMDQ_PROD):
+        spin_lock(&smmu->cmd_queue_lock);
         reg32 = smmu->cmdq.prod;
         vreg_reg32_update(&reg32, r, info);
         smmu->cmdq.prod = reg32;
+
+        if ( arm_vsmmu_handle_cmds(smmu) )
+            gdprintk(XENLOG_ERR, "error handling vSMMUv3 commands\n");
+
+        spin_unlock(&smmu->cmd_queue_lock);
         break;
 
     case VREG32(ARM_SMMU_CMDQ_CONS):
@@ -326,6 +425,8 @@ static int vsmmuv3_init_single(struct domain *d, paddr_t addr, paddr_t size)
     smmu->evtq.q_base = FIELD_PREP(Q_BASE_LOG2SIZE, SMMU_EVTQS);
     smmu->evtq.ent_size = EVTQ_ENT_DWORDS * DWORDS_BYTES;
 
+    spin_lock_init(&smmu->cmd_queue_lock);
+
     register_mmio_handler(d, &vsmmuv3_mmio_handler, addr, size, smmu);
 
     /* Register the vIOMMU to be able to clean it up later. */
-- 
2.43.0



* [PATCH 10/20] xen/arm: vsmmuv3: Add support for command CMD_CFGI_STE
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (8 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 09/20] xen/arm: vsmmuv3: Add support for cmdqueue handling Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 11/20] xen/arm: vsmmuv3: Attach Stage-1 configuration to SMMUv3 hardware Milan Djokic
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

CMD_CFGI_STE is used to invalidate/validate the STE. The emulated vSMMUv3
driver in Xen reads the STE from guest memory and captures the Stage-1
configuration required to support nested translation.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
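For illustration only, the address arithmetic of a stream table walk can be
sketched as standalone C. The sketch_* names are hypothetical and are not Xen
identifiers. The sizes used are assumptions that correspond to
STRTAB_L1_DESC_DWORDS = 1 and STRTAB_STE_DWORDS = 8 (one and eight 64-bit
words). Validity checks such as the L1 descriptor span are deliberately left
out.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_L1_DESC_BYTES 8u    /* assumed: one 64-bit L1 descriptor */
#define SKETCH_STE_BYTES     64u   /* assumed: eight 64-bit words per STE */

/* 2-level table: the upper SID bits select the level-1 descriptor. */
static uint64_t sketch_l1_desc_addr(uint64_t strtab_base, uint32_t sid,
                                    unsigned int split)
{
    assert(split < 32);
    return strtab_base + (uint64_t)(sid >> split) * SKETCH_L1_DESC_BYTES;
}

/* 2-level table: the low 'split' SID bits select the STE in the L2 table. */
static uint64_t sketch_l2_ste_addr(uint64_t l2ptr, uint32_t sid,
                                   unsigned int split)
{
    uint32_t idx;

    assert(split < 32);
    idx = sid & ((1u << split) - 1);

    return l2ptr + (uint64_t)idx * SKETCH_STE_BYTES;
}

/* Linear table: the SID indexes the STE array directly. */
static uint64_t sketch_linear_ste_addr(uint64_t strtab_base, uint32_t sid)
{
    return strtab_base + (uint64_t)sid * SKETCH_STE_BYTES;
}
```

For example, with split = 8 and sid = 0x1234 the level-1 descriptor sits at
offset 0x12 * 8 = 0x90 from the table base, and the STE at offset
0x34 * 64 = 0xd00 from the level-2 table pointer.
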
 xen/drivers/passthrough/arm/vsmmu-v3.c | 148 +++++++++++++++++++++++++
 1 file changed, 148 insertions(+)

diff --git a/xen/drivers/passthrough/arm/vsmmu-v3.c b/xen/drivers/passthrough/arm/vsmmu-v3.c
index 84bd1f7327..3ecbe4861b 100644
--- a/xen/drivers/passthrough/arm/vsmmu-v3.c
+++ b/xen/drivers/passthrough/arm/vsmmu-v3.c
@@ -45,6 +45,21 @@ extern const struct viommu_desc __read_mostly *cur_viommu;
 /* Helper Macros */
 #define smmu_get_cmdq_enabled(x)    FIELD_GET(CR0_CMDQEN, x)
 #define smmu_cmd_get_command(x)     FIELD_GET(CMDQ_0_OP, x)
+#define smmu_cmd_get_sid(x)         FIELD_GET(CMDQ_PREFETCH_0_SID, x)
+#define smmu_get_ste_s1cdmax(x)     FIELD_GET(STRTAB_STE_0_S1CDMAX, x)
+#define smmu_get_ste_s1fmt(x)       FIELD_GET(STRTAB_STE_0_S1FMT, x)
+#define smmu_get_ste_s1stalld(x)    FIELD_GET(STRTAB_STE_1_S1STALLD, x)
+#define smmu_get_ste_s1ctxptr(x)    FIELD_PREP(STRTAB_STE_0_S1CTXPTR_MASK, \
+                                    FIELD_GET(STRTAB_STE_0_S1CTXPTR_MASK, x))
+
+/* stage-1 translation configuration */
+struct arm_vsmmu_s1_trans_cfg {
+    paddr_t s1ctxptr;
+    uint8_t s1fmt;
+    uint8_t s1cdmax;
+    bool    bypassed;             /* translation is bypassed */
+    bool    aborted;              /* translation is aborted */
+};
 
 /* virtual smmu queue */
 struct arm_vsmmu_queue {
@@ -91,6 +106,138 @@ static void dump_smmu_command(uint64_t *command)
     gdprintk(XENLOG_ERR, "cmd 0x%02llx: %016lx %016lx\n",
              smmu_cmd_get_command(command[0]), command[0], command[1]);
 }
+static int arm_vsmmu_find_ste(struct virt_smmu *smmu, uint32_t sid,
+                              uint64_t *ste)
+{
+    paddr_t addr, strtab_base;
+    struct domain *d = smmu->d;
+    uint32_t log2size;
+    int strtab_size_shift;
+    int ret;
+
+    log2size = FIELD_GET(STRTAB_BASE_CFG_LOG2SIZE, smmu->strtab_base_cfg);
+
+    if ( sid >= (1 << MIN(log2size, SMMU_IDR1_SIDSIZE)) )
+        return -EINVAL;
+
+    if ( smmu->features & STRTAB_BASE_CFG_FMT_2LVL )
+    {
+        int idx, max_l2_ste, span;
+        paddr_t l1ptr, l2ptr;
+        uint64_t l1std;
+
+        strtab_size_shift = MAX(5, (int)log2size - smmu->sid_split - 1 + 3);
+        strtab_base = smmu->strtab_base & STRTAB_BASE_ADDR_MASK &
+                        ~GENMASK_ULL(strtab_size_shift, 0);
+        idx = (sid >> STRTAB_SPLIT) * STRTAB_L1_DESC_DWORDS;
+        l1ptr = (paddr_t)(strtab_base + idx * sizeof(l1std));
+
+        ret = access_guest_memory_by_ipa(d, l1ptr, &l1std,
+                                         sizeof(l1std), false);
+        if ( ret )
+        {
+            gdprintk(XENLOG_ERR,
+                     "Could not read L1PTR at 0x%"PRIx64"\n", l1ptr);
+            return ret;
+        }
+
+        span = FIELD_GET(STRTAB_L1_DESC_SPAN, l1std);
+        if ( !span )
+        {
+            gdprintk(XENLOG_ERR, "Bad StreamID span\n");
+            return -EINVAL;
+        }
+
+        max_l2_ste = (1 << span) - 1;
+        l2ptr = FIELD_PREP(STRTAB_L1_DESC_L2PTR_MASK,
+                    FIELD_GET(STRTAB_L1_DESC_L2PTR_MASK, l1std));
+        idx = sid & ((1 << smmu->sid_split) - 1);
+        if ( idx > max_l2_ste )
+        {
+            gdprintk(XENLOG_ERR, "idx=%d > max_l2_ste=%d\n",
+                     idx, max_l2_ste);
+            return -EINVAL;
+        }
+        addr = l2ptr + idx * sizeof(*ste) * STRTAB_STE_DWORDS;
+    }
+    else
+    {
+        strtab_size_shift = log2size + 5;
+        strtab_base = smmu->strtab_base & STRTAB_BASE_ADDR_MASK &
+                      ~GENMASK_ULL(strtab_size_shift, 0);
+        addr = strtab_base + sid * sizeof(*ste) * STRTAB_STE_DWORDS;
+    }
+    ret = access_guest_memory_by_ipa(d, addr, ste, sizeof(*ste), false);
+    if ( ret )
+    {
+        gdprintk(XENLOG_ERR,
+                "Cannot fetch STE at address=0x%"PRIx64"\n", addr);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static int arm_vsmmu_decode_ste(struct virt_smmu *smmu, uint32_t sid,
+                                struct arm_vsmmu_s1_trans_cfg *cfg,
+                                uint64_t *ste)
+{
+    uint64_t val = ste[0];
+
+    if ( !(val & STRTAB_STE_0_V) )
+        return -EAGAIN;
+
+    switch ( FIELD_GET(STRTAB_STE_0_CFG, val) )
+    {
+    case STRTAB_STE_0_CFG_BYPASS:
+        cfg->bypassed = true;
+        return 0;
+    case STRTAB_STE_0_CFG_ABORT:
+        cfg->aborted = true;
+        return 0;
+    case STRTAB_STE_0_CFG_S1_TRANS:
+        break;
+    case STRTAB_STE_0_CFG_S2_TRANS:
+        gdprintk(XENLOG_ERR, "vSMMUv3 does not support stage 2 yet\n");
+        goto bad_ste;
+    default:
+        goto bad_ste; /* unsupported Config value in a guest-written STE */
+    }
+
+    cfg->s1ctxptr = smmu_get_ste_s1ctxptr(val);
+    cfg->s1fmt = smmu_get_ste_s1fmt(val);
+    cfg->s1cdmax = smmu_get_ste_s1cdmax(val);
+    if ( cfg->s1cdmax != 0 )
+    {
+        gdprintk(XENLOG_ERR,
+                 "vSMMUv3 does not support multiple context descriptors\n");
+        goto bad_ste;
+    }
+
+    return 0;
+
+bad_ste:
+    return -EINVAL;
+}
+
+static int arm_vsmmu_handle_cfgi_ste(struct virt_smmu *smmu, uint64_t *cmdptr)
+{
+    int ret;
+    uint64_t ste[STRTAB_STE_DWORDS];
+    struct arm_vsmmu_s1_trans_cfg s1_cfg = {0};
+    uint32_t sid = smmu_cmd_get_sid(cmdptr[0]);
+
+    ret = arm_vsmmu_find_ste(smmu, sid, ste);
+    if ( ret )
+        return ret;
+
+    ret = arm_vsmmu_decode_ste(smmu, sid, &s1_cfg, ste);
+    if ( ret )
+        return (ret == -EAGAIN ) ? 0 : ret;
+
+    return 0;
+}
+
 static int arm_vsmmu_handle_cmds(struct virt_smmu *smmu)
 {
     struct arm_vsmmu_queue *q = &smmu->cmdq;
@@ -114,6 +261,7 @@ static int arm_vsmmu_handle_cmds(struct virt_smmu *smmu)
         switch ( smmu_cmd_get_command(command[0]) )
         {
         case CMDQ_OP_CFGI_STE:
+            ret = arm_vsmmu_handle_cfgi_ste(smmu, command);
             break;
         case CMDQ_OP_PREFETCH_CFG:
         case CMDQ_OP_CFGI_CD:
-- 
2.43.0



* [PATCH 11/20] xen/arm: vsmmuv3: Attach Stage-1 configuration to SMMUv3 hardware
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (9 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 10/20] xen/arm: vsmmuv3: Add support for command CMD_CFGI_STE Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-08  6:22   ` Jan Beulich
  2025-08-07 16:59 ` [PATCH 12/20] xen/arm: vsmmuv3: Add support for event queue and global error Milan Djokic
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Bertrand Marquis, Stefano Stabellini, Julien Grall,
	Michal Orzel, Volodymyr Babchuk, Jan Beulich,
	Roger Pau Monné, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

Attach the Stage-1 configuration to the device STE to support nested
translation for guests.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
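For illustration only, the way a guest STE can be classified before its
Stage-1 configuration is handed to the host driver is sketched below as
standalone C. The sketch_* and SKETCH_* names are hypothetical and are not
Xen identifiers. The bit layout (valid bit 0, Config in bits 3:1, values 0
for abort, 4 for bypass, 5 for stage-1 translation) is an assumption
intended to mirror the STRTAB_STE_0_* definitions in smmu-v3.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of a guest STE (illustration only). */
enum sketch_cfg {
    SKETCH_CFG_INVALID,     /* V bit clear: nothing to attach */
    SKETCH_CFG_ABORT,
    SKETCH_CFG_BYPASS,
    SKETCH_CFG_NESTED,      /* guest stage 1 on top of the host stage 2 */
    SKETCH_CFG_UNSUPPORTED,
};

#define SKETCH_STE_V         (1ull << 0) /* assumed valid bit */
#define SKETCH_STE_CFG_SHIFT 1           /* assumed Config field position */
#define SKETCH_STE_CFG_MASK  0x7ull

static enum sketch_cfg sketch_classify_ste(uint64_t ste0)
{
    if ( !(ste0 & SKETCH_STE_V) )
        return SKETCH_CFG_INVALID;

    switch ( (ste0 >> SKETCH_STE_CFG_SHIFT) & SKETCH_STE_CFG_MASK )
    {
    case 0:
        return SKETCH_CFG_ABORT;
    case 4:
        return SKETCH_CFG_BYPASS;
    case 5:
        /* Guest asked for stage-1 translation: host side becomes nested. */
        return SKETCH_CFG_NESTED;
    default:
        /* Stage-2 or any other guest-supplied value is not accepted. */
        return SKETCH_CFG_UNSUPPORTED;
    }
}
```

In this sketch a guest request for stage-1 translation maps to a nested
configuration on the host side, since the host keeps ownership of stage 2.
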
 xen/drivers/passthrough/arm/smmu-v3.c  | 79 ++++++++++++++++++++++++++
 xen/drivers/passthrough/arm/smmu-v3.h  |  1 +
 xen/drivers/passthrough/arm/vsmmu-v3.c | 18 ++++++
 xen/include/xen/iommu.h                | 14 +++++
 4 files changed, 112 insertions(+)

diff --git a/xen/drivers/passthrough/arm/smmu-v3.c b/xen/drivers/passthrough/arm/smmu-v3.c
index 193c892fcd..91bf72d420 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -2791,6 +2791,37 @@ static struct arm_smmu_device *arm_smmu_get_by_dev(const struct device *dev)
 	return NULL;
 }
 
+static struct iommu_domain *arm_smmu_get_domain_by_sid(struct domain *d,
+				u32 sid)
+{
+	int i;
+	unsigned long flags;
+	struct iommu_domain *io_domain;
+	struct arm_smmu_domain *smmu_domain;
+	struct arm_smmu_master *master;
+	struct arm_smmu_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
+
+	/*
+	 * Loop through the &xen_domain->contexts to locate a context
+	 * assigned to this SMMU
+	 */
+	list_for_each_entry(io_domain, &xen_domain->contexts, list) {
+		smmu_domain = to_smmu_domain(io_domain);
+
+		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+		list_for_each_entry(master, &smmu_domain->devices, domain_head) {
+			for (i = 0; i < master->num_streams; i++) {
+				if (sid != master->streams[i].id)
+					continue;
+				spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+				return io_domain;
+			}
+		}
+		spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+	}
+	return NULL;
+}
+
 static struct iommu_domain *arm_smmu_get_domain(struct domain *d,
 				struct device *dev)
 {
@@ -3003,6 +3034,53 @@ static void arm_smmu_iommu_xen_domain_teardown(struct domain *d)
 	xfree(xen_domain);
 }
 
+static int arm_smmu_attach_guest_config(struct domain *d, u32 sid,
+		struct iommu_guest_config *cfg)
+{
+	int ret = -EINVAL;
+	unsigned long flags;
+	struct arm_smmu_master *master;
+	struct arm_smmu_domain *smmu_domain;
+	struct arm_smmu_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
+	struct iommu_domain *io_domain = arm_smmu_get_domain_by_sid(d, sid);
+
+	if (!io_domain)
+		return -ENODEV;
+
+	smmu_domain = to_smmu_domain(io_domain);
+
+	spin_lock(&xen_domain->lock);
+
+	switch (cfg->config) {
+	case ARM_SMMU_DOMAIN_ABORT:
+		smmu_domain->abort = true;
+		break;
+	case ARM_SMMU_DOMAIN_BYPASS:
+		smmu_domain->abort = false;
+		break;
+	case ARM_SMMU_DOMAIN_NESTED:
+		/* Enable Nested stage translation. */
+		smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED;
+		smmu_domain->s1_cfg.s1ctxptr = cfg->s1ctxptr;
+		smmu_domain->s1_cfg.s1fmt = cfg->s1fmt;
+		smmu_domain->s1_cfg.s1cdmax = cfg->s1cdmax;
+		smmu_domain->abort = false;
+		break;
+	default:
+		goto out;
+	}
+
+	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+	list_for_each_entry(master, &smmu_domain->devices, domain_head)
+		arm_smmu_install_ste_for_dev(master);
+	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+
+	ret = 0;
+out:
+	spin_unlock(&xen_domain->lock);
+	return ret;
+}
+
 static const struct iommu_ops arm_smmu_iommu_ops = {
 	.page_sizes		= PAGE_SIZE_4K,
 	.init			= arm_smmu_iommu_xen_domain_init,
@@ -3015,6 +3093,7 @@ static const struct iommu_ops arm_smmu_iommu_ops = {
 	.unmap_page		= arm_iommu_unmap_page,
 	.dt_xlate		= arm_smmu_dt_xlate,
 	.add_device		= arm_smmu_add_device,
+	.attach_guest_config	= arm_smmu_attach_guest_config,
 };
 
 static __init int arm_smmu_dt_init(struct dt_device_node *dev,
diff --git a/xen/drivers/passthrough/arm/smmu-v3.h b/xen/drivers/passthrough/arm/smmu-v3.h
index d54f0a79f2..3e3a6cd080 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.h
+++ b/xen/drivers/passthrough/arm/smmu-v3.h
@@ -398,6 +398,7 @@ enum arm_smmu_domain_stage {
 	ARM_SMMU_DOMAIN_S2,
 	ARM_SMMU_DOMAIN_NESTED,
 	ARM_SMMU_DOMAIN_BYPASS,
+	ARM_SMMU_DOMAIN_ABORT,
 };
 
 /* Xen specific code. */
diff --git a/xen/drivers/passthrough/arm/vsmmu-v3.c b/xen/drivers/passthrough/arm/vsmmu-v3.c
index 3ecbe4861b..3b073b9dac 100644
--- a/xen/drivers/passthrough/arm/vsmmu-v3.c
+++ b/xen/drivers/passthrough/arm/vsmmu-v3.c
@@ -224,8 +224,11 @@ static int arm_vsmmu_handle_cfgi_ste(struct virt_smmu *smmu, uint64_t *cmdptr)
 {
     int ret;
     uint64_t ste[STRTAB_STE_DWORDS];
+    struct domain *d = smmu->d;
+    struct domain_iommu *hd = dom_iommu(d);
     struct arm_vsmmu_s1_trans_cfg s1_cfg = {0};
     uint32_t sid = smmu_cmd_get_sid(cmdptr[0]);
+    struct iommu_guest_config guest_cfg = {0};
 
     ret = arm_vsmmu_find_ste(smmu, sid, ste);
     if ( ret )
@@ -235,6 +238,21 @@ static int arm_vsmmu_handle_cfgi_ste(struct virt_smmu *smmu, uint64_t *cmdptr)
     if ( ret )
         return (ret == -EAGAIN ) ? 0 : ret;
 
+    guest_cfg.s1ctxptr = s1_cfg.s1ctxptr;
+    guest_cfg.s1fmt = s1_cfg.s1fmt;
+    guest_cfg.s1cdmax = s1_cfg.s1cdmax;
+
+    if ( s1_cfg.bypassed )
+        guest_cfg.config = ARM_SMMU_DOMAIN_BYPASS;
+    else if ( s1_cfg.aborted )
+        guest_cfg.config = ARM_SMMU_DOMAIN_ABORT;
+    else
+        guest_cfg.config = ARM_SMMU_DOMAIN_NESTED;
+
+    ret = hd->platform_ops->attach_guest_config(d, sid, &guest_cfg);
+    if ( ret )
+        return ret;
+
     return 0;
 }
 
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 37c4a1dc82..21f905d44f 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -311,6 +311,15 @@ static inline int iommu_add_dt_pci_sideband_ids(struct pci_dev *pdev)
 
 #endif /* HAS_DEVICE_TREE_DISCOVERY */
 
+#ifdef CONFIG_ARM
+struct iommu_guest_config {
+    paddr_t     s1ctxptr;
+    uint8_t     config;
+    uint8_t     s1fmt;
+    uint8_t     s1cdmax;
+};
+#endif /* CONFIG_ARM */
+
 struct page_info;
 
 /*
@@ -387,6 +396,11 @@ struct iommu_ops {
 #endif
     /* Inhibit all interrupt generation, to be used at shutdown. */
     void (*quiesce)(void);
+
+#ifdef CONFIG_ARM
+    int (*attach_guest_config)(struct domain *d, u32 sid,
+                               struct iommu_guest_config *cfg);
+#endif
 };
 
 /*
-- 
2.43.0



* Re: [PATCH 11/20] xen/arm: vsmmuv3: Attach Stage-1 configuration to SMMUv3 hardware
  2025-08-07 16:59 ` [PATCH 11/20] xen/arm: vsmmuv3: Attach Stage-1 configuration to SMMUv3 hardware Milan Djokic
@ 2025-08-08  6:22   ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-08-08  6:22 UTC (permalink / raw)
  To: Milan Djokic
  Cc: Rahul Singh, Bertrand Marquis, Stefano Stabellini, Julien Grall,
	Michal Orzel, Volodymyr Babchuk, Roger Pau Monné,
	xen-devel@lists.xenproject.org

On 07.08.2025 18:59, Milan Djokic wrote:
> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -311,6 +311,15 @@ static inline int iommu_add_dt_pci_sideband_ids(struct pci_dev *pdev)
>  
>  #endif /* HAS_DEVICE_TREE_DISCOVERY */
>  
> +#ifdef CONFIG_ARM
> +struct iommu_guest_config {
> +    paddr_t     s1ctxptr;
> +    uint8_t     config;
> +    uint8_t     s1fmt;
> +    uint8_t     s1cdmax;
> +};
> +#endif /* CONFIG_ARM */

This looks like it should rather go into Arm's asm/iommu.h.

Jan



* [PATCH 12/20] xen/arm: vsmmuv3: Add support for event queue and global error
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (10 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 11/20] xen/arm: vsmmuv3: Attach Stage-1 configuration to SMMUv3 hardware Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 13/20] xen/arm: vsmmuv3: Add "iommus" property node for dom0 devices Milan Djokic
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

The event queue is used to report events and faults to the guest. Add
event queue support so that events can be delivered to the guest.

Global errors in the SMMUv3 hardware are reported through the smmu_gerror
and smmu_gerrorn registers. Add support for the global error registers so
that global errors can be sent to the guest.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
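For illustration only, the GERROR/GERRORN handshake can be sketched as
standalone C. The sketch_* names are hypothetical and this is not the code
added by this patch. It assumes the usual SMMUv3 toggle protocol: an error
is active while the corresponding bits of GERROR and GERRORN differ, the
SMMU side toggles the GERROR bit to raise it, and the driver acknowledges by
toggling the same bit in GERRORN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits that differ between GERROR and GERRORN are active errors. */
static uint32_t sketch_gerror_pending(uint32_t gerror, uint32_t gerrorn)
{
    return gerror ^ gerrorn;
}

/*
 * Raise one global error bit on behalf of the emulated SMMU.
 * Returns false if that error is still unacknowledged, true if GERROR was
 * toggled (the caller would then inject the global error interrupt).
 */
static bool sketch_gerror_raise(uint32_t *gerror, uint32_t gerrorn,
                                uint32_t bit)
{
    if ( sketch_gerror_pending(*gerror, gerrorn) & bit )
        return false;

    *gerror ^= bit;

    return true;
}
```

Because the state is a toggle rather than a level, raising the same error a
second time after the guest has acknowledged the first one flips the GERROR
bit back to 0 while still leaving it different from GERRORN.
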
 xen/drivers/passthrough/arm/smmu-v3.h  |  20 +++
 xen/drivers/passthrough/arm/vsmmu-v3.c | 169 ++++++++++++++++++++++++-
 xen/include/public/arch-arm.h          |   5 +-
 3 files changed, 186 insertions(+), 8 deletions(-)

diff --git a/xen/drivers/passthrough/arm/smmu-v3.h b/xen/drivers/passthrough/arm/smmu-v3.h
index 3e3a6cd080..820271a413 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.h
+++ b/xen/drivers/passthrough/arm/smmu-v3.h
@@ -354,6 +354,26 @@
 
 #define EVTQ_0_ID			GENMASK_ULL(7, 0)
 
+#define EVT_ID_BAD_STREAMID		0x02
+#define EVT_ID_BAD_STE			0x04
+#define EVT_ID_TRANSLATION_FAULT	0x10
+#define EVT_ID_ADDR_SIZE_FAULT		0x11
+#define EVT_ID_ACCESS_FAULT		0x12
+#define EVT_ID_PERMISSION_FAULT		0x13
+
+#define EVTQ_0_SSV			(1UL << 11)
+#define EVTQ_0_SSID			GENMASK_ULL(31, 12)
+#define EVTQ_0_SID			GENMASK_ULL(63, 32)
+#define EVTQ_1_STAG			GENMASK_ULL(15, 0)
+#define EVTQ_1_STALL			(1UL << 31)
+#define EVTQ_1_PnU			(1UL << 33)
+#define EVTQ_1_InD			(1UL << 34)
+#define EVTQ_1_RnW			(1UL << 35)
+#define EVTQ_1_S2			(1UL << 39)
+#define EVTQ_1_CLASS			GENMASK_ULL(41, 40)
+#define EVTQ_1_TT_READ			(1UL << 44)
+#define EVTQ_2_ADDR			GENMASK_ULL(63, 0)
+#define EVTQ_3_IPA			GENMASK_ULL(51, 12)
 /* PRI queue */
 #define PRIQ_ENT_SZ_SHIFT		4
 #define PRIQ_ENT_DWORDS			((1 << PRIQ_ENT_SZ_SHIFT) >> 3)
diff --git a/xen/drivers/passthrough/arm/vsmmu-v3.c b/xen/drivers/passthrough/arm/vsmmu-v3.c
index 3b073b9dac..7a6c18df53 100644
--- a/xen/drivers/passthrough/arm/vsmmu-v3.c
+++ b/xen/drivers/passthrough/arm/vsmmu-v3.c
@@ -44,6 +44,7 @@ extern const struct viommu_desc __read_mostly *cur_viommu;
 
 /* Helper Macros */
 #define smmu_get_cmdq_enabled(x)    FIELD_GET(CR0_CMDQEN, x)
+#define smmu_get_evtq_enabled(x)    FIELD_GET(CR0_EVTQEN, x)
 #define smmu_cmd_get_command(x)     FIELD_GET(CMDQ_0_OP, x)
 #define smmu_cmd_get_sid(x)         FIELD_GET(CMDQ_PREFETCH_0_SID, x)
 #define smmu_get_ste_s1cdmax(x)     FIELD_GET(STRTAB_STE_0_S1CDMAX, x)
@@ -52,6 +53,35 @@ extern const struct viommu_desc __read_mostly *cur_viommu;
 #define smmu_get_ste_s1ctxptr(x)    FIELD_PREP(STRTAB_STE_0_S1CTXPTR_MASK, \
                                     FIELD_GET(STRTAB_STE_0_S1CTXPTR_MASK, x))
 
+/* event queue entry */
+struct arm_smmu_evtq_ent {
+    /* Common fields */
+    uint8_t     opcode;
+    uint32_t    sid;
+
+    /* Event-specific fields */
+    union {
+        struct {
+            uint32_t ssid;
+            bool ssv;
+        } c_bad_ste_streamid;
+
+        struct {
+            bool stall;
+            uint16_t stag;
+            uint32_t ssid;
+            bool ssv;
+            bool s2;
+            uint64_t addr;
+            bool rnw;
+            bool pnu;
+            bool ind;
+            uint8_t class;
+            uint64_t addr2;
+        } f_translation;
+    };
+};
+
 /* stage-1 translation configuration */
 struct arm_vsmmu_s1_trans_cfg {
     paddr_t s1ctxptr;
@@ -82,6 +112,7 @@ struct virt_smmu {
     uint32_t    strtab_base_cfg;
     uint64_t    strtab_base;
     uint32_t    irq_ctrl;
+    uint32_t    virq;
     uint64_t    gerror_irq_cfg0;
     uint64_t    evtq_irq_cfg0;
     struct      arm_vsmmu_queue evtq, cmdq;
@@ -89,6 +120,12 @@ struct virt_smmu {
 };
 
 /* Queue manipulation functions */
+static bool queue_full(struct arm_vsmmu_queue *q)
+{
+    return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
+           Q_WRP(q, q->prod) != Q_WRP(q, q->cons);
+}
+
 static bool queue_empty(struct arm_vsmmu_queue *q)
 {
     return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
@@ -101,11 +138,105 @@ static void queue_inc_cons(struct arm_vsmmu_queue *q)
     q->cons = Q_OVF(q->cons) | Q_WRP(q, cons) | Q_IDX(q, cons);
 }
 
+static void queue_inc_prod(struct arm_vsmmu_queue *q)
+{
+    u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + 1;
+    q->prod = Q_OVF(q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
+}
+
 static void dump_smmu_command(uint64_t *command)
 {
     gdprintk(XENLOG_ERR, "cmd 0x%02llx: %016lx %016lx\n",
              smmu_cmd_get_command(command[0]), command[0], command[1]);
 }
+
+static void arm_vsmmu_inject_irq(struct virt_smmu *smmu, bool is_gerror,
+                                uint32_t gerror_err)
+{
+    uint32_t new_gerrors, pending;
+
+    if ( is_gerror )
+    {
+        /* trigger global error irq to guest */
+        pending = smmu->gerror ^ smmu->gerrorn;
+        new_gerrors = ~pending & gerror_err;
+
+        /* only toggle non pending errors */
+        if (!new_gerrors)
+            return;
+
+        smmu->gerror ^= new_gerrors;
+    }
+
+    vgic_inject_irq(smmu->d, NULL, smmu->virq, true);
+}
+
+static int arm_vsmmu_write_evtq(struct virt_smmu *smmu, uint64_t *evt)
+{
+    struct arm_vsmmu_queue *q = &smmu->evtq;
+    struct domain *d = smmu->d;
+    paddr_t addr;
+    int ret;
+
+    if ( !smmu_get_evtq_enabled(smmu->cr[0]) )
+        return -EINVAL;
+
+    if ( queue_full(q) )
+        return -EINVAL;
+
+    addr = Q_PROD_ENT(q);
+    ret = access_guest_memory_by_gpa(d, addr, evt,
+                                     sizeof(*evt) * EVTQ_ENT_DWORDS, true);
+    if ( ret )
+        return ret;
+
+    queue_inc_prod(q);
+
+    /* trigger eventq irq to guest */
+    if ( !queue_empty(q) )
+        arm_vsmmu_inject_irq(smmu, false, 0);
+
+    return 0;
+}
+
+void arm_vsmmu_send_event(struct virt_smmu *smmu,
+                          struct arm_smmu_evtq_ent *ent)
+{
+    uint64_t evt[EVTQ_ENT_DWORDS];
+    int ret;
+
+    memset(evt, 0, 1 << EVTQ_ENT_SZ_SHIFT);
+
+    if ( !smmu_get_evtq_enabled(smmu->cr[0]) )
+        return;
+
+    evt[0] |= FIELD_PREP(EVTQ_0_ID, ent->opcode);
+    evt[0] |= FIELD_PREP(EVTQ_0_SID, ent->sid);
+
+    switch (ent->opcode)
+    {
+    case EVT_ID_BAD_STREAMID:
+    case EVT_ID_BAD_STE:
+        evt[0] |= FIELD_PREP(EVTQ_0_SSID, ent->c_bad_ste_streamid.ssid);
+        evt[0] |= FIELD_PREP(EVTQ_0_SSV, ent->c_bad_ste_streamid.ssv);
+        break;
+    case EVT_ID_TRANSLATION_FAULT:
+    case EVT_ID_ADDR_SIZE_FAULT:
+    case EVT_ID_ACCESS_FAULT:
+    case EVT_ID_PERMISSION_FAULT:
+        break;
+    default:
+        gdprintk(XENLOG_WARNING, "vSMMUv3: event opcode is bad\n");
+        break;
+    }
+
+    ret = arm_vsmmu_write_evtq(smmu, evt);
+    if ( ret )
+        arm_vsmmu_inject_irq(smmu, true, GERROR_EVTQ_ABT_ERR);
+
+    return;
+}
+
 static int arm_vsmmu_find_ste(struct virt_smmu *smmu, uint32_t sid,
                               uint64_t *ste)
 {
@@ -114,11 +245,22 @@ static int arm_vsmmu_find_ste(struct virt_smmu *smmu, uint32_t sid,
     uint32_t log2size;
     int strtab_size_shift;
     int ret;
+    struct arm_smmu_evtq_ent ent = {
+        .sid = sid,
+        .c_bad_ste_streamid = {
+            .ssid = 0,
+            .ssv = false,
+        },
+    };
 
     log2size = FIELD_GET(STRTAB_BASE_CFG_LOG2SIZE, smmu->strtab_base_cfg);
 
     if ( sid >= (1 << MIN(log2size, SMMU_IDR1_SIDSIZE)) )
+    {
+        ent.opcode = EVT_ID_BAD_STE;
+        arm_vsmmu_send_event(smmu, &ent);
         return -EINVAL;
+    }
 
     if ( smmu->features & STRTAB_BASE_CFG_FMT_2LVL )
     {
@@ -132,7 +274,7 @@ static int arm_vsmmu_find_ste(struct virt_smmu *smmu, uint32_t sid,
         idx = (sid >> STRTAB_SPLIT) * STRTAB_L1_DESC_DWORDS;
         l1ptr = (paddr_t)(strtab_base + idx * sizeof(l1std));
 
-        ret = access_guest_memory_by_ipa(d, l1ptr, &l1std,
+        ret = access_guest_memory_by_gpa(d, l1ptr, &l1std,
                                          sizeof(l1std), false);
         if ( ret )
         {
@@ -156,6 +298,8 @@ static int arm_vsmmu_find_ste(struct virt_smmu *smmu, uint32_t sid,
         {
             gdprintk(XENLOG_ERR, "idx=%d > max_l2_ste=%d\n",
                      idx, max_l2_ste);
+            ent.opcode = EVT_ID_BAD_STREAMID;
+            arm_vsmmu_send_event(smmu, &ent);
             return -EINVAL;
         }
         addr = l2ptr + idx * sizeof(*ste) * STRTAB_STE_DWORDS;
@@ -167,7 +311,7 @@ static int arm_vsmmu_find_ste(struct virt_smmu *smmu, uint32_t sid,
                       ~GENMASK_ULL(strtab_size_shift, 0);
         addr = strtab_base + sid * sizeof(*ste) * STRTAB_STE_DWORDS;
     }
-    ret = access_guest_memory_by_ipa(d, addr, ste, sizeof(*ste), false);
+    ret = access_guest_memory_by_gpa(d, addr, ste, sizeof(*ste), false);
     if ( ret )
     {
         gdprintk(XENLOG_ERR,
@@ -183,6 +327,14 @@ static int arm_vsmmu_decode_ste(struct virt_smmu *smmu, uint32_t sid,
                                 uint64_t *ste)
 {
     uint64_t val = ste[0];
+    struct arm_smmu_evtq_ent ent = {
+        .opcode = EVT_ID_BAD_STE,
+        .sid = sid,
+        .c_bad_ste_streamid = {
+            .ssid = 0,
+            .ssv = false,
+        },
+    };
 
     if ( !(val & STRTAB_STE_0_V) )
         return -EAGAIN;
@@ -217,6 +369,7 @@ static int arm_vsmmu_decode_ste(struct virt_smmu *smmu, uint32_t sid,
     return 0;
 
 bad_ste:
+    arm_vsmmu_send_event(smmu, &ent);
     return -EINVAL;
 }
 
@@ -271,7 +424,7 @@ static int arm_vsmmu_handle_cmds(struct virt_smmu *smmu)
         int ret;
 
         addr = Q_CONS_ENT(q);
-        ret = access_guest_memory_by_ipa(d, addr, command,
+        ret = access_guest_memory_by_gpa(d, addr, command,
                                          sizeof(command), false);
         if ( ret )
             return ret;
@@ -577,7 +730,8 @@ static const struct mmio_handler_ops vsmmuv3_mmio_handler = {
     .write = vsmmuv3_mmio_write,
 };
 
-static int vsmmuv3_init_single(struct domain *d, paddr_t addr, paddr_t size)
+static int vsmmuv3_init_single(struct domain *d, paddr_t addr,
+                               paddr_t size, uint32_t virq)
 {
     struct virt_smmu *smmu;
 
@@ -586,6 +740,7 @@ static int vsmmuv3_init_single(struct domain *d, paddr_t addr, paddr_t size)
         return -ENOMEM;
 
     smmu->d = d;
+    smmu->virq = virq;
     smmu->cmdq.q_base = FIELD_PREP(Q_BASE_LOG2SIZE, SMMU_CMDQS);
     smmu->cmdq.ent_size = CMDQ_ENT_DWORDS * DWORDS_BYTES;
     smmu->evtq.q_base = FIELD_PREP(Q_BASE_LOG2SIZE, SMMU_EVTQS);
@@ -612,14 +767,16 @@ int domain_vsmmuv3_init(struct domain *d)
 
         list_for_each_entry(hw_iommu, &host_iommu_list, entry)
         {
-            ret = vsmmuv3_init_single(d, hw_iommu->addr, hw_iommu->size);
+            ret = vsmmuv3_init_single(d, hw_iommu->addr, hw_iommu->size,
+                                      hw_iommu->irq);
             if ( ret )
                 return ret;
         }
     }
     else
     {
-        ret = vsmmuv3_init_single(d, GUEST_VSMMUV3_BASE, GUEST_VSMMUV3_SIZE);
+        ret = vsmmuv3_init_single(d, GUEST_VSMMUV3_BASE, GUEST_VSMMUV3_SIZE,
+                                  GUEST_VSMMU_SPI);
         if ( ret )
             return ret;
     }
diff --git a/xen/include/public/arch-arm.h b/xen/include/public/arch-arm.h
index d6201df06a..1b03222b77 100644
--- a/xen/include/public/arch-arm.h
+++ b/xen/include/public/arch-arm.h
@@ -521,9 +521,10 @@ typedef uint64_t xen_callback_t;
 #define GUEST_EVTCHN_PPI        31
 
 #define GUEST_VPL011_SPI        32
+#define GUEST_VSMMU_SPI         33
 
-#define GUEST_VIRTIO_MMIO_SPI_FIRST   33
-#define GUEST_VIRTIO_MMIO_SPI_LAST    43
+#define GUEST_VIRTIO_MMIO_SPI_FIRST   34
+#define GUEST_VIRTIO_MMIO_SPI_LAST    44
 
 /*
  * SGI is the preferred delivery mechanism of FF-A pending notifications or
-- 
2.43.0



* [PATCH 13/20] xen/arm: vsmmuv3: Add "iommus" property node for dom0 devices
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (11 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 12/20] xen/arm: vsmmuv3: Add support for event queue and global error Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 14/20] xen/arm: vIOMMU: IOMMU device tree node for dom0 Milan Djokic
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

The "iommus" property is added to dom0 device nodes, referencing the
virtual IOMMU node, so that the dom0 Linux kernel can configure the
IOMMU.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
 xen/arch/arm/domain_build.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 50e4f5fee4..9d2b0ea928 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -537,9 +537,12 @@ static int __init write_properties(struct domain *d, struct kernel_info *kinfo,
             continue;
         }
 
-        if ( iommu_node )
+        /*
+         * Expose IOMMU specific properties to hwdom when vIOMMU is
+         * enabled.
+         */
+        if ( iommu_node && !is_viommu_enabled() )
         {
-            /* Don't expose IOMMU specific properties to hwdom */
             if ( dt_property_name_is_equal(prop, "iommus") )
                 continue;
 
-- 
2.43.0



* [PATCH 14/20] xen/arm: vIOMMU: IOMMU device tree node for dom0
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (12 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 13/20] xen/arm: vsmmuv3: Add "iommus" property node for dom0 devices Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 15/20] xen/arm: vsmmuv3: Emulated SMMUv3 device tree node for dom0less Milan Djokic
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

Xen creates an IOMMU device tree node so that dom0 can discover the
virtual SMMUv3 during boot. The IOMMU device tree node is only created
when the viommu command-line option is enabled.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
 xen/arch/arm/domain_build.c       | 94 +++++++++++++++++++++++++++++++
 xen/arch/arm/include/asm/viommu.h |  1 +
 2 files changed, 95 insertions(+)

diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 9d2b0ea928..aa2b3bdfcc 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -1559,6 +1559,95 @@ int __init make_chosen_node(const struct kernel_info *kinfo)
     return res;
 }
 
+#ifdef CONFIG_VIRTUAL_IOMMU
+static int make_hwdom_viommu_node(const struct kernel_info *kinfo)
+{
+    uint32_t len;
+    int res;
+    char buf[24];
+    void *fdt = kinfo->fdt;
+    const void *prop = NULL;
+    const struct dt_device_node *iommu = NULL;
+    struct host_iommu *iommu_data;
+    gic_interrupt_t intr;
+
+    if ( list_empty(&host_iommu_list) )
+        return 0;
+
+    list_for_each_entry( iommu_data, &host_iommu_list, entry )
+    {
+        if ( iommu_data->hwdom_node_created )
+            return 0;
+
+        iommu = iommu_data->dt_node;
+
+        snprintf(buf, sizeof(buf), "iommu@%"PRIx64, iommu_data->addr);
+
+        res = fdt_begin_node(fdt, buf);
+        if ( res )
+            return res;
+
+        prop = dt_get_property(iommu, "compatible", &len);
+        if ( !prop )
+        {
+            res = -FDT_ERR_XEN(ENOENT);
+            return res;
+        }
+
+        res = fdt_property(fdt, "compatible", prop, len);
+        if ( res )
+            return res;
+
+        if ( iommu->phandle )
+        {
+            res = fdt_property_cell(fdt, "phandle", iommu->phandle);
+            if ( res )
+                return res;
+        }
+
+        /* Use the same reg regions as the IOMMU node in host DTB. */
+        prop = dt_get_property(iommu, "reg", &len);
+        if ( !prop )
+        {
+            printk(XENLOG_ERR "vIOMMU: Can't find IOMMU reg property.\n");
+            res = -FDT_ERR_XEN(ENOENT);
+            return res;
+        }
+
+        res = fdt_property(fdt, "reg", prop, len);
+        if ( res )
+            return res;
+
+        prop = dt_get_property(iommu, "#iommu-cells", &len);
+        if ( !prop )
+        {
+            res = -FDT_ERR_XEN(ENOENT);
+            return res;
+        }
+
+        res = fdt_property(fdt, "#iommu-cells", prop, len);
+        if ( res )
+            return res;
+
+        res = fdt_property_string(fdt, "interrupt-names", "combined");
+        if ( res )
+            return res;
+
+        set_interrupt(intr, iommu_data->irq, 0xf, DT_IRQ_TYPE_LEVEL_HIGH);
+
+        res = fdt_property_interrupts(kinfo, &intr, 1);
+        if ( res )
+            return res;
+
+        iommu_data->hwdom_node_created = true;
+
+        fdt_end_node(fdt);
+    }
+
+    return res;
+}
+#endif
+
 static int __init handle_node(struct domain *d, struct kernel_info *kinfo,
                               struct dt_device_node *node,
                               p2m_type_t p2mt)
@@ -1627,6 +1716,11 @@ static int __init handle_node(struct domain *d, struct kernel_info *kinfo,
     if ( dt_match_node(timer_matches, node) )
         return make_timer_node(kinfo);
 
+#ifdef CONFIG_VIRTUAL_IOMMU
+    if ( device_get_class(node) == DEVICE_IOMMU && is_viommu_enabled() )
+        return make_hwdom_viommu_node(kinfo);
+#endif
+
     /* Skip nodes used by Xen */
     if ( dt_device_used_by(node) == DOMID_XEN )
     {
diff --git a/xen/arch/arm/include/asm/viommu.h b/xen/arch/arm/include/asm/viommu.h
index 4de4cceeda..e6018f435b 100644
--- a/xen/arch/arm/include/asm/viommu.h
+++ b/xen/arch/arm/include/asm/viommu.h
@@ -19,6 +19,7 @@ struct host_iommu {
     paddr_t addr;
     paddr_t size;
     uint32_t irq;
+    bool hwdom_node_created;
 };
 
 struct viommu_ops {
-- 
2.43.0



* [PATCH 15/20] xen/arm: vsmmuv3: Emulated SMMUv3 device tree node for dom0less
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (13 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 14/20] xen/arm: vIOMMU: IOMMU device tree node for dom0 Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 16/20] arm/libxl: vsmmuv3: Emulated SMMUv3 device tree node in libxl Milan Djokic
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Andrew Cooper, Anthony PERARD,
	Jan Beulich, Roger Pau Monné, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

Xen creates an emulated SMMUv3 device tree node so that dom0less domains
can discover the virtual SMMUv3 during boot. The emulated SMMUv3 device
tree node is only created when the cmdline option vsmmuv3 is enabled.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
 xen/arch/arm/dom0less-build.c         | 53 +++++++++++++++++++++++++++
 xen/include/public/device_tree_defs.h |  1 +
 2 files changed, 54 insertions(+)

diff --git a/xen/arch/arm/dom0less-build.c b/xen/arch/arm/dom0less-build.c
index 1fa6df214b..1e526c10de 100644
--- a/xen/arch/arm/dom0less-build.c
+++ b/xen/arch/arm/dom0less-build.c
@@ -217,10 +217,63 @@ static int __init make_vpl011_uart_node(struct kernel_info *kinfo)
 }
 #endif
 
+#ifdef CONFIG_VIRTUAL_ARM_SMMU_V3
+static int __init make_vsmmuv3_node(const struct kernel_info *kinfo)
+{
+    int res;
+    char buf[24];
+    __be32 reg[GUEST_ROOT_ADDRESS_CELLS + GUEST_ROOT_SIZE_CELLS];
+    __be32 *cells;
+    void *fdt = kinfo->fdt;
+
+    snprintf(buf, sizeof(buf), "iommu@%llx", GUEST_VSMMUV3_BASE);
+
+    res = fdt_begin_node(fdt, buf);
+    if ( res )
+        return res;
+
+    res = fdt_property_string(fdt, "compatible", "arm,smmu-v3");
+    if ( res )
+        return res;
+
+    /* Create reg property */
+    cells = &reg[0];
+    dt_child_set_range(&cells, GUEST_ROOT_ADDRESS_CELLS, GUEST_ROOT_SIZE_CELLS,
+                       GUEST_VSMMUV3_BASE, GUEST_VSMMUV3_SIZE);
+    res = fdt_property(fdt, "reg", reg,
+                       (GUEST_ROOT_ADDRESS_CELLS +
+                       GUEST_ROOT_SIZE_CELLS) * sizeof(*reg));
+    if ( res )
+        return res;
+
+    res = fdt_property_cell(fdt, "phandle", GUEST_PHANDLE_VSMMUV3);
+    if ( res )
+        return res;
+
+    res = fdt_property_cell(fdt, "#iommu-cells", 1);
+    if ( res )
+        return res;
+
+    res = fdt_end_node(fdt);
+
+    return res;
+}
+#endif
+
 int __init make_arch_nodes(struct kernel_info *kinfo)
 {
     int ret;
 
+
+#ifdef CONFIG_VIRTUAL_ARM_SMMU_V3
+    if ( is_viommu_enabled() )
+    {
+        ret = make_vsmmuv3_node(kinfo);
+        if ( ret )
+            return -EINVAL;
+    }
+#endif
+
     ret = make_psci_node(kinfo->fdt);
     if ( ret )
         return -EINVAL;
diff --git a/xen/include/public/device_tree_defs.h b/xen/include/public/device_tree_defs.h
index 9e80d0499d..7846a0425c 100644
--- a/xen/include/public/device_tree_defs.h
+++ b/xen/include/public/device_tree_defs.h
@@ -14,6 +14,7 @@
  */
 #define GUEST_PHANDLE_GIC (65000)
 #define GUEST_PHANDLE_IOMMU (GUEST_PHANDLE_GIC + 1)
+#define GUEST_PHANDLE_VSMMUV3 (GUEST_PHANDLE_IOMMU + 1)
 
 #define GUEST_ROOT_ADDRESS_CELLS 2
 #define GUEST_ROOT_SIZE_CELLS 2
-- 
2.43.0



* [PATCH 16/20] arm/libxl: vsmmuv3: Emulated SMMUv3 device tree node in libxl
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (14 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 15/20] xen/arm: vsmmuv3: Emulated SMMUv3 device tree node for dom0less Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 17/20] xen/arm: vsmmuv3: Alloc virq for virtual SMMUv3 Milan Djokic
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Anthony PERARD, Juergen Gross, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

libxl creates an emulated SMMUv3 device tree node so that the guest OS
can discover the virtual SMMUv3 during guest boot.

The emulated SMMUv3 device tree node is only created when
"viommu=smmuv3" is set in the xl domain configuration.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
 tools/libs/light/libxl_arm.c | 37 ++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
index 44b7bcd672..c8255d8d4f 100644
--- a/tools/libs/light/libxl_arm.c
+++ b/tools/libs/light/libxl_arm.c
@@ -887,6 +887,36 @@ static int make_vpl011_uart_node(libxl__gc *gc, void *fdt,
     return 0;
 }
 
+static int make_vsmmuv3_node(libxl__gc *gc, void *fdt,
+                             const struct arch_info *ainfo,
+                             struct xc_dom_image *dom)
+{
+    int res;
+    const char *name = GCSPRINTF("iommu@%llx", GUEST_VSMMUV3_BASE);
+
+    res = fdt_begin_node(fdt, name);
+    if (res) return res;
+
+    res = fdt_property_compat(gc, fdt, 1, "arm,smmu-v3");
+    if (res) return res;
+
+    res = fdt_property_regs(gc, fdt, GUEST_ROOT_ADDRESS_CELLS,
+                            GUEST_ROOT_SIZE_CELLS, 1, GUEST_VSMMUV3_BASE,
+                            GUEST_VSMMUV3_SIZE);
+    if (res) return res;
+
+    res = fdt_property_cell(fdt, "phandle", GUEST_PHANDLE_VSMMUV3);
+    if (res) return res;
+
+    res = fdt_property_cell(fdt, "#iommu-cells", 1);
+    if (res) return res;
+
+    res = fdt_end_node(fdt);
+    if (res) return res;
+
+    return 0;
+}
+
 static int make_vpci_node(libxl__gc *gc, void *fdt,
                           const struct arch_info *ainfo,
                           struct xc_dom_image *dom)
@@ -928,6 +958,10 @@ static int make_vpci_node(libxl__gc *gc, void *fdt,
         GUEST_VPCI_PREFETCH_MEM_SIZE);
     if (res) return res;
 
+    res = fdt_property_values(gc, fdt, "iommu-map", 4, 0,
+                              GUEST_PHANDLE_VSMMUV3, 0, 0x10000);
+    if (res) return res;
+
     res = fdt_end_node(fdt);
     if (res) return res;
 
@@ -1394,6 +1428,9 @@ next_resize:
         if (d_config->num_pcidevs)
             FDT( make_vpci_node(gc, fdt, ainfo, dom) );
 
+        if (info->arch_arm.viommu_type == LIBXL_VIOMMU_TYPE_SMMUV3)
+            FDT( make_vsmmuv3_node(gc, fdt, ainfo, dom) );
+
         for (i = 0; i < d_config->num_disks; i++) {
             libxl_device_disk *disk = &d_config->disks[i];
 
-- 
2.43.0



* [PATCH 17/20] xen/arm: vsmmuv3: Alloc virq for virtual SMMUv3
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (15 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 16/20] arm/libxl: vsmmuv3: Emulated SMMUv3 device tree node in libxl Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 18/20] xen/arm: vsmmuv3: Add support to send stage-1 event to guest Milan Djokic
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Anthony PERARD, Juergen Gross, Stefano Stabellini,
	Julien Grall, Bertrand Marquis, Michal Orzel, Volodymyr Babchuk,
	Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

Allocate and reserve a virq for the event queue and global error so that
events can be sent to guests. Also modify libxl to accommodate the newly
defined virq.
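
For illustration only (not part of this patch): the libxl change below
grows nr_spis by "(GUEST_VSMMU_SPI - 32) + 1". A sketch of that
arithmetic, assuming SPIs start at interrupt ID 32 and using the value
33 that patch 12 assigns to GUEST_VSMMU_SPI. The macro and function
names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FIRST_SPI  32u  /* first SPI interrupt ID on the GIC */
#define VSMMU_SPI  33u  /* value of GUEST_VSMMU_SPI introduced in patch 12 */

/* Number of SPIs a domain needs so that interrupt ID 'irq' is usable. */
static uint32_t spis_needed_for(uint32_t irq)
{
    assert(irq >= FIRST_SPI);  /* only SPIs count towards nr_spis */
    return (irq - FIRST_SPI) + 1;
}
```

With these values the vSMMUv3 interrupt needs two SPIs to be available
in the domain (IDs 32 and 33).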

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
 tools/libs/light/libxl_arm.c           | 28 ++++++++++++++++++++++++--
 xen/arch/arm/dom0less-build.c          | 17 ++++++++++++++++
 xen/drivers/passthrough/arm/vsmmu-v3.c | 13 ++++++++++++
 3 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
index c8255d8d4f..55beda8c0e 100644
--- a/tools/libs/light/libxl_arm.c
+++ b/tools/libs/light/libxl_arm.c
@@ -86,8 +86,8 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
 {
     uint32_t nr_spis = 0, cfg_nr_spis = d_config->b_info.arch_arm.nr_spis;
     unsigned int i;
-    uint32_t vuart_irq, virtio_irq = 0;
-    bool vuart_enabled = false, virtio_enabled = false;
+    uint32_t vuart_irq, virtio_irq = 0, vsmmu_irq = 0;
+    bool vuart_enabled = false, virtio_enabled = false, vsmmu_enabled = false;
     uint64_t virtio_mmio_base = GUEST_VIRTIO_MMIO_BASE;
     uint32_t virtio_mmio_irq = GUEST_VIRTIO_MMIO_SPI_FIRST;
     int rc;
@@ -102,6 +102,16 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
         vuart_enabled = true;
     }
 
+    /*
+     * If smmuv3 viommu is enabled then increment the nr_spis to allow allocation
+     * of SPI VIRQ for VSMMU.
+     */
+    if (d_config->b_info.arch_arm.viommu_type == LIBXL_VIOMMU_TYPE_SMMUV3) {
+        nr_spis += (GUEST_VSMMU_SPI - 32) + 1;
+        vsmmu_irq = GUEST_VSMMU_SPI;
+        vsmmu_enabled = true;
+    }
+
     for (i = 0; i < d_config->num_disks; i++) {
         libxl_device_disk *disk = &d_config->disks[i];
 
@@ -170,6 +180,11 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
             return ERROR_FAIL;
         }
 
+        if (vsmmu_enabled && irq == vsmmu_irq) {
+            LOG(ERROR, "Physical IRQ %u conflicting with vSMMUv3 SPI\n", irq);
+            return ERROR_FAIL;
+        }
+
         if (irq < 32)
             continue;
 
@@ -893,6 +908,7 @@ static int make_vsmmuv3_node(libxl__gc *gc, void *fdt,
 {
     int res;
     const char *name = GCSPRINTF("iommu@%llx", GUEST_VSMMUV3_BASE);
+    gic_interrupt intr;
 
     res = fdt_begin_node(fdt, name);
     if (res) return res;
@@ -911,6 +927,14 @@ static int make_vsmmuv3_node(libxl__gc *gc, void *fdt,
     res = fdt_property_cell(fdt, "#iommu-cells", 1);
     if (res) return res;
 
+    res = fdt_property_string(fdt, "interrupt-names", "combined");
+    if (res) return res;
+
+    set_interrupt(intr, GUEST_VSMMU_SPI, 0xf, DT_IRQ_TYPE_LEVEL_HIGH);
+
+    res = fdt_property_interrupts(gc, fdt, &intr, 1);
+    if (res) return res;
+
     res = fdt_end_node(fdt);
     if (res) return res;
 
diff --git a/xen/arch/arm/dom0less-build.c b/xen/arch/arm/dom0less-build.c
index 1e526c10de..66ef3dfa1f 100644
--- a/xen/arch/arm/dom0less-build.c
+++ b/xen/arch/arm/dom0less-build.c
@@ -224,6 +224,7 @@ static int __init make_vsmmuv3_node(const struct kernel_info *kinfo)
     char buf[24];
     __be32 reg[GUEST_ROOT_ADDRESS_CELLS + GUEST_ROOT_SIZE_CELLS];
     __be32 *cells;
+    gic_interrupt_t intr;
     void *fdt = kinfo->fdt;
 
     snprintf(buf, sizeof(buf), "iommu@%llx", GUEST_VSMMUV3_BASE);
@@ -254,6 +255,22 @@ static int __init make_vsmmuv3_node(const struct kernel_info *kinfo)
     if ( res )
         return res;
 
+    res = fdt_property_string(fdt, "interrupt-names", "combined");
+    if ( res )
+        return res;
+
+    set_interrupt(intr, GUEST_VSMMU_SPI, 0xf, DT_IRQ_TYPE_LEVEL_HIGH);
+
+    res = fdt_property(kinfo->fdt, "interrupts",
+                       intr, sizeof(intr));
+    if ( res )
+        return res;
+
+    res = fdt_property_cell(kinfo->fdt, "interrupt-parent",
+                            kinfo->phandle_intc);
+    if ( res )
+        return res;
+
     res = fdt_end_node(fdt);
 
     return res;
diff --git a/xen/drivers/passthrough/arm/vsmmu-v3.c b/xen/drivers/passthrough/arm/vsmmu-v3.c
index 7a6c18df53..a5b9700369 100644
--- a/xen/drivers/passthrough/arm/vsmmu-v3.c
+++ b/xen/drivers/passthrough/arm/vsmmu-v3.c
@@ -733,6 +733,7 @@ static const struct mmio_handler_ops vsmmuv3_mmio_handler = {
 static int vsmmuv3_init_single(struct domain *d, paddr_t addr,
                                paddr_t size, uint32_t virq)
 {
+    int ret;
     struct virt_smmu *smmu;
 
     smmu = xzalloc(struct virt_smmu);
@@ -748,12 +749,24 @@ static int vsmmuv3_init_single(struct domain *d, paddr_t addr,
 
     spin_lock_init(&smmu->cmd_queue_lock);
 
+    ret = vgic_reserve_virq(d, virq);
+    if ( !ret )
+    {
+        ret = -EINVAL;
+        goto out;
+    }
+
     register_mmio_handler(d, &vsmmuv3_mmio_handler, addr, size, smmu);
 
     /* Register the vIOMMU to be able to clean it up later. */
     list_add_tail(&smmu->viommu_list, &d->arch.viommu_list);
 
     return 0;
+
+out:
+    xfree(smmu);
+    vgic_free_virq(d, virq);
+    return ret;
 }
 
 int domain_vsmmuv3_init(struct domain *d)
-- 
2.43.0



* [PATCH 18/20] xen/arm: vsmmuv3: Add support to send stage-1 event to guest
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (16 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 17/20] xen/arm: vsmmuv3: Alloc virq for virtual SMMUv3 Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 19/20] libxl/arm: vIOMMU: Modify the partial device tree for iommus Milan Djokic
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Bertrand Marquis, Stefano Stabellini, Julien Grall,
	Michal Orzel, Volodymyr Babchuk, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

Stage-1 translation is handled by the guest, so stage-1 faults have to
be forwarded to the guest.
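
For illustration only (not part of this patch): a stand-alone sketch of
the filter that arm_smmu_handle_evt() applies in the diff below. Only
the four fault event IDs (0x10 to 0x13) are candidates for forwarding,
and records with the S2 bit set are stage-2 events that stay in Xen. The
bit positions follow the EVTQ_* definitions added in patch 12; the macro
and function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EVT0_ID_MASK    0xffull       /* event ID, bits [7:0] of word 0 */
#define EVT0_SID_SHIFT  32            /* StreamID, bits [63:32] of word 0 */
#define EVT1_S2         (1ull << 39)  /* set for stage-2 faults, word 1 */

/* True when the record is a translation-type fault raised by stage 1. */
static bool evt_is_stage1_fault(const uint64_t *evt)
{
    assert(evt);
    uint64_t id = evt[0] & EVT0_ID_MASK;

    /* 0x10..0x13: translation, address size, access, permission fault. */
    if (id < 0x10 || id > 0x13)
        return false;

    return !(evt[1] & EVT1_S2);
}

/* StreamID of the device that triggered the event. */
static uint32_t evt_sid(const uint64_t *evt)
{
    return (uint32_t)(evt[0] >> EVT0_SID_SHIFT);
}
```

The StreamID is then what allows the handler to look up the master
device and, through it, the domain that should receive the event.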

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
 xen/drivers/passthrough/arm/smmu-v3.c  | 48 ++++++++++++++++++++++++--
 xen/drivers/passthrough/arm/vsmmu-v3.c | 45 ++++++++++++++++++++++++
 xen/drivers/passthrough/arm/vsmmu-v3.h | 12 +++++++
 3 files changed, 103 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/arm/smmu-v3.c b/xen/drivers/passthrough/arm/smmu-v3.c
index 91bf72d420..4ea3e90c11 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -853,7 +853,6 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
 	return 0;
 }
 
-__maybe_unused
 static struct arm_smmu_master *
 arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
 {
@@ -874,10 +873,51 @@ arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
 	return NULL;
 }
 
+static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt)
+{
+	int ret;
+	struct arm_smmu_master *master;
+	u32 sid = FIELD_GET(EVTQ_0_SID, evt[0]);
+
+	switch (FIELD_GET(EVTQ_0_ID, evt[0])) {
+	case EVT_ID_TRANSLATION_FAULT:
+		break;
+	case EVT_ID_ADDR_SIZE_FAULT:
+		break;
+	case EVT_ID_ACCESS_FAULT:
+		break;
+	case EVT_ID_PERMISSION_FAULT:
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	/* Stage-2 event */
+	if (evt[1] & EVTQ_1_S2)
+		return -EFAULT;
+
+	mutex_lock(&smmu->streams_mutex);
+	master = arm_smmu_find_master(smmu, sid);
+	if (!master) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	ret = arm_vsmmu_handle_evt(master->domain->d, smmu->dev, evt);
+	if (ret) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+out_unlock:
+	mutex_unlock(&smmu->streams_mutex);
+	return ret;
+}
+
 /* IRQ and event handlers */
 static void arm_smmu_evtq_tasklet(void *dev)
 {
-	int i;
+	int i, ret;
 	struct arm_smmu_device *smmu = dev;
 	struct arm_smmu_queue *q = &smmu->evtq.q;
 	struct arm_smmu_ll_queue *llq = &q->llq;
@@ -887,6 +927,10 @@ static void arm_smmu_evtq_tasklet(void *dev)
 		while (!queue_remove_raw(q, evt)) {
 			u8 id = FIELD_GET(EVTQ_0_ID, evt[0]);
 
+			ret = arm_smmu_handle_evt(smmu, evt);
+			if (!ret)
+				continue;
+
 			dev_info(smmu->dev, "event 0x%02x received:\n", id);
 			for (i = 0; i < ARRAY_SIZE(evt); ++i)
 				dev_info(smmu->dev, "\t0x%016llx\n",
diff --git a/xen/drivers/passthrough/arm/vsmmu-v3.c b/xen/drivers/passthrough/arm/vsmmu-v3.c
index a5b9700369..5d0dabd2b2 100644
--- a/xen/drivers/passthrough/arm/vsmmu-v3.c
+++ b/xen/drivers/passthrough/arm/vsmmu-v3.c
@@ -103,6 +103,7 @@ struct arm_vsmmu_queue {
 struct virt_smmu {
     struct      domain *d;
     struct      list_head viommu_list;
+    paddr_t     addr;
     uint8_t     sid_split;
     uint32_t    features;
     uint32_t    cr[3];
@@ -237,6 +238,49 @@ void arm_vsmmu_send_event(struct virt_smmu *smmu,
     return;
 }
 
+static struct virt_smmu *vsmmuv3_find_by_addr(struct domain *d, paddr_t paddr)
+{
+    struct virt_smmu *smmu;
+
+    list_for_each_entry( smmu, &d->arch.viommu_list, viommu_list )
+    {
+        if ( smmu->addr == paddr )
+            return smmu;
+    }
+
+    return NULL;
+}
+
+int arm_vsmmu_handle_evt(struct domain *d, struct device *dev, uint64_t *evt)
+{
+    int ret;
+    struct virt_smmu *smmu;
+
+    if ( is_hardware_domain(d) )
+    {
+        paddr_t paddr;
+        /* Base address */
+        ret = dt_device_get_address(dev_to_dt(dev), 0, &paddr, NULL);
+        if ( ret )
+            return -EINVAL;
+
+        smmu = vsmmuv3_find_by_addr(d, paddr);
+        if ( !smmu )
+            return -ENODEV;
+    }
+    else
+    {
+        smmu = list_entry(d->arch.viommu_list.next,
+                          struct virt_smmu, viommu_list);
+    }
+
+    ret = arm_vsmmu_write_evtq(smmu, evt);
+    if ( ret )
+        arm_vsmmu_inject_irq(smmu, true, GERROR_EVTQ_ABT_ERR);
+
+    return 0;
+}
+
 static int arm_vsmmu_find_ste(struct virt_smmu *smmu, uint32_t sid,
                               uint64_t *ste)
 {
@@ -742,6 +786,7 @@ static int vsmmuv3_init_single(struct domain *d, paddr_t addr,
 
     smmu->d = d;
     smmu->virq = virq;
+    smmu->addr = addr;
     smmu->cmdq.q_base = FIELD_PREP(Q_BASE_LOG2SIZE, SMMU_CMDQS);
     smmu->cmdq.ent_size = CMDQ_ENT_DWORDS * DWORDS_BYTES;
     smmu->evtq.q_base = FIELD_PREP(Q_BASE_LOG2SIZE, SMMU_EVTQS);
diff --git a/xen/drivers/passthrough/arm/vsmmu-v3.h b/xen/drivers/passthrough/arm/vsmmu-v3.h
index e11f85b431..c7bfd3fb59 100644
--- a/xen/drivers/passthrough/arm/vsmmu-v3.h
+++ b/xen/drivers/passthrough/arm/vsmmu-v3.h
@@ -8,6 +8,12 @@
 
 void vsmmuv3_set_type(void);
 
+static inline int arm_vsmmu_handle_evt(struct domain *d,
+                                       struct device *dev, uint64_t *evt)
+{
+    return -EINVAL;
+}
+
 #else
 
 static inline void vsmmuv3_set_type(void)
@@ -15,6 +21,12 @@ static inline void vsmmuv3_set_type(void)
     return;
 }
 
+static inline int arm_vsmmu_handle_evt(struct domain *d,
+                                       struct device *dev, uint64_t *evt)
+{
+    return -EINVAL;
+}
+
 #endif /* CONFIG_VIRTUAL_ARM_SMMU_V3 */
 
 #endif /* __ARCH_ARM_VSMMU_V3_H__ */
-- 
2.43.0
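
The filtering done by arm_smmu_handle_evt() in the patch above can be
sketched in isolation. This is a minimal standalone sketch, not the driver
code: the field masks and event IDs are assumed to match the Linux-derived
SMMUv3 driver that smmu-v3.c is based on, and classify_evt(), evt_sid() and
enum evt_action are hypothetical names introduced only for illustration.

```c
#include <stdint.h>

/* Assumed event record layout (4 x 64-bit words), modelled on the driver. */
#define EVTQ_0_ID_MASK   0xffULL        /* event ID: bits [7:0] of evt[0] */
#define EVTQ_0_SID_SHIFT 32             /* StreamID: bits [63:32] of evt[0] */
#define EVTQ_1_S2        (1ULL << 39)   /* set when the fault is a stage-2 fault */

/* Assumed event IDs for the four fault types the patch forwards. */
#define EVT_ID_TRANSLATION_FAULT 0x10
#define EVT_ID_ADDR_SIZE_FAULT   0x11
#define EVT_ID_ACCESS_FAULT      0x12
#define EVT_ID_PERMISSION_FAULT  0x13

/* Hypothetical result type: what should happen to an event record. */
enum evt_action { EVT_LOG_IN_XEN, EVT_FORWARD_TO_GUEST };

/* Extract the StreamID that identifies the faulting device. */
static inline uint32_t evt_sid(const uint64_t *evt)
{
    return (uint32_t)(evt[0] >> EVTQ_0_SID_SHIFT);
}

/* Decide whether an event is a stage-1 fault that belongs to the guest. */
static inline enum evt_action classify_evt(const uint64_t *evt)
{
    switch (evt[0] & EVTQ_0_ID_MASK) {
    case EVT_ID_TRANSLATION_FAULT:
    case EVT_ID_ADDR_SIZE_FAULT:
    case EVT_ID_ACCESS_FAULT:
    case EVT_ID_PERMISSION_FAULT:
        break;
    default:
        /* Not a translation-related fault: keep it in Xen. */
        return EVT_LOG_IN_XEN;
    }

    /* Stage-2 tables are owned by Xen, so stage-2 faults are not forwarded. */
    if (evt[1] & EVTQ_1_S2)
        return EVT_LOG_IN_XEN;

    return EVT_FORWARD_TO_GUEST;
}
```

Only stage-1 faults of the four listed types are candidates for forwarding;
everything else stays on the existing dev_info() logging path in the tasklet.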



* [PATCH 19/20] libxl/arm: vIOMMU: Modify the partial device tree for iommus
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (17 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 18/20] xen/arm: vsmmuv3: Add support to send stage-1 event to guest Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 16:59 ` [PATCH 20/20] xen/arm: vIOMMU: Modify the partial device tree for dom0less Milan Djokic
  2025-08-07 17:58 ` [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Julien Grall
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Anthony PERARD, Juergen Gross, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

To configure the IOMMU in a guest for passthrough devices, the user needs
to copy the unmodified "iommus" property from the host device tree to the
partial device tree. To enable the dom0 Linux kernel to configure the IOMMU
correctly, replace the phandle in the partial device tree with the virtual
IOMMU phandle when the "iommus" property is set.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
 tools/libs/light/libxl_arm.c | 47 +++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
index 55beda8c0e..d99f49b831 100644
--- a/tools/libs/light/libxl_arm.c
+++ b/tools/libs/light/libxl_arm.c
@@ -1312,6 +1312,41 @@ static int copy_partial_fdt(libxl__gc *gc, void *fdt, void *pfdt)
     return 0;
 }
 
+static int modify_partial_fdt(libxl__gc *gc, void *pfdt)
+{
+    int nodeoff, proplen, i, r;
+    const fdt32_t *prop;
+    fdt32_t *prop_c;
+
+    nodeoff = fdt_path_offset(pfdt, "/passthrough");
+    if (nodeoff < 0)
+        return nodeoff;
+
+    for (nodeoff = fdt_first_subnode(pfdt, nodeoff);
+         nodeoff >= 0;
+         nodeoff = fdt_next_subnode(pfdt, nodeoff)) {
+
+        prop = fdt_getprop(pfdt, nodeoff, "iommus", &proplen);
+        if (!prop)
+            continue;
+
+        prop_c = libxl__zalloc(gc, proplen);
+
+        for (i = 0; i < proplen / 8; ++i) {
+            prop_c[i * 2] = cpu_to_fdt32(GUEST_PHANDLE_VSMMUV3);
+            prop_c[i * 2 + 1] = prop[i * 2 + 1];
+        }
+
+        r = fdt_setprop(pfdt, nodeoff, "iommus", prop_c, proplen);
+        if (r) {
+            LOG(ERROR, "Can't set the iommus property in partial FDT");
+            return r;
+        }
+    }
+
+    return 0;
+}
+
 #else
 
 static int check_partial_fdt(libxl__gc *gc, void *fdt, size_t size)
@@ -1330,6 +1365,13 @@ static int copy_partial_fdt(libxl__gc *gc, void *fdt, void *pfdt)
     return -FDT_ERR_INTERNAL;
 }
 
+static int modify_partial_fdt(libxl__gc *gc, void *pfdt)
+{
+    LOG(ERROR, "partial device tree not supported");
+
+    return ERROR_FAIL;
+}
+
 #endif /* ENABLE_PARTIAL_DEVICE_TREE */
 
 #define FDT_MAX_SIZE (1<<20)
@@ -1452,8 +1494,11 @@ next_resize:
         if (d_config->num_pcidevs)
             FDT( make_vpci_node(gc, fdt, ainfo, dom) );
 
-        if (info->arch_arm.viommu_type == LIBXL_VIOMMU_TYPE_SMMUV3)
+        if (info->arch_arm.viommu_type == LIBXL_VIOMMU_TYPE_SMMUV3) {
             FDT( make_vsmmuv3_node(gc, fdt, ainfo, dom) );
+            if (pfdt)
+                FDT( modify_partial_fdt(gc, pfdt) );
+        }
 
         for (i = 0; i < d_config->num_disks; i++) {
             libxl_device_disk *disk = &d_config->disks[i];
-- 
2.43.0
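
The property rewrite performed by modify_partial_fdt() can be illustrated
without libfdt. This is a hedged sketch that assumes, as the patch does, that
each "iommus" entry is a (phandle, StreamID) pair of two big-endian 32-bit
cells, i.e. #iommu-cells = <1>. rewrite_iommus(), put_be32(), get_be32() and
EXAMPLE_VSMMU_PHANDLE are hypothetical names; the real phandle value is
GUEST_PHANDLE_VSMMUV3 from Xen's public headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholder for the virtual SMMUv3 phandle. */
#define EXAMPLE_VSMMU_PHANDLE 0xfdeaU

/* Store a 32-bit value in big-endian byte order, as FDT cells require. */
static inline void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Load a big-endian 32-bit FDT cell. */
static inline uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Rewrite an "iommus" property in place: replace the phandle cell of every
 * (phandle, StreamID) pair with vphandle and leave the StreamID untouched.
 * Returns the number of entries rewritten, or -1 if the length is not a
 * whole number of 8-byte entries.
 */
static inline int rewrite_iommus(uint8_t *prop, size_t len, uint32_t vphandle)
{
    size_t off;

    if (len % 8 != 0)
        return -1;

    for (off = 0; off < len; off += 8)
        put_be32(prop + off, vphandle);   /* StreamID at off + 4 is kept */

    return (int)(len / 8);
}
```

Unlike the loop in the patch, which leaves any trailing bytes zeroed when the
length is not a multiple of 8, this sketch rejects such a property outright;
which behaviour is preferable is a design choice for the series.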



* [PATCH 20/20] xen/arm: vIOMMU: Modify the partial device tree for dom0less
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (18 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 19/20] libxl/arm: vIOMMU: Modify the partial device tree for iommus Milan Djokic
@ 2025-08-07 16:59 ` Milan Djokic
  2025-08-07 17:58 ` [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Julien Grall
  20 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2025-08-07 16:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Rahul Singh, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Milan Djokic

From: Rahul Singh <rahul.singh@arm.com>

To configure the IOMMU in a guest for passthrough devices, the user needs
to copy the unmodified "iommus" property from the host device tree to the
partial device tree. To enable the dom0 Linux kernel to configure the IOMMU
correctly, replace the phandle in the partial device tree with the virtual
IOMMU phandle when the "iommus" property is set.

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Milan Djokic <milan_djokic@epam.com>
---
 xen/common/device-tree/dom0less-build.c | 31 ++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/xen/common/device-tree/dom0less-build.c b/xen/common/device-tree/dom0less-build.c
index 6bb038111d..3cfffa6581 100644
--- a/xen/common/device-tree/dom0less-build.c
+++ b/xen/common/device-tree/dom0less-build.c
@@ -314,7 +314,35 @@ static int __init handle_prop_pfdt(struct kernel_info *kinfo,
     return ( propoff != -FDT_ERR_NOTFOUND ) ? propoff : 0;
 }
 
-static int __init scan_pfdt_node(struct kernel_info *kinfo, const void *pfdt,
+static void modify_pfdt_node(void *pfdt, int nodeoff)
+{
+    int proplen, i, rc;
+    const fdt32_t *prop;
+    fdt32_t *prop_c;
+
+    prop = fdt_getprop(pfdt, nodeoff, "iommus", &proplen);
+    if ( !prop )
+        return;
+
+    prop_c = xzalloc_bytes(proplen);
+
+    for ( i = 0; i < proplen / 8; ++i )
+    {
+        prop_c[i * 2] = cpu_to_fdt32(GUEST_PHANDLE_VSMMUV3);
+        prop_c[i * 2 + 1] = prop[i * 2 + 1];
+    }
+
+    rc = fdt_setprop(pfdt, nodeoff, "iommus", prop_c, proplen);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR, "Can't set the iommus property in partial FDT\n");
+        return;
+    }
+
+    return;
+}
+
+static int __init scan_pfdt_node(struct kernel_info *kinfo, void *pfdt,
                                  int nodeoff,
                                  uint32_t address_cells, uint32_t size_cells,
                                  bool scan_passthrough_prop)
@@ -340,6 +368,7 @@ static int __init scan_pfdt_node(struct kernel_info *kinfo, const void *pfdt,
     node_next = fdt_first_subnode(pfdt, nodeoff);
     while ( node_next > 0 )
     {
+        modify_pfdt_node(pfdt, node_next);
         rc = scan_pfdt_node(kinfo, pfdt, node_next, address_cells, size_cells,
                             scan_passthrough_prop);
         if ( rc )
-- 
2.43.0



* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-08-07 16:55 [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Milan Djokic
                   ` (19 preceding siblings ...)
  2025-08-07 16:59 ` [PATCH 20/20] xen/arm: vIOMMU: Modify the partial device tree for dom0less Milan Djokic
@ 2025-08-07 17:58 ` Julien Grall
  2025-08-13 10:04   ` Milan Djokic
  20 siblings, 1 reply; 40+ messages in thread
From: Julien Grall @ 2025-08-07 17:58 UTC (permalink / raw)
  To: Milan Djokic
  Cc: xen-devel@lists.xenproject.org, Bertrand Marquis, Rahul Singh,
	Stefano Stabellini, Michal Orzel, Volodymyr Babchuk, Jan Beulich,
	Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

Hi Milan,

On Thu, 7 Aug 2025 at 17:55, Milan Djokic <milan_djokic@epam.com> wrote:

> This patch series represents a rebase of an older patch series implemented
> and submitted by Rahul Singh as an RFC:
> https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/.
> Original patch series content is aligned with the latest xen structure in
> terms of common/arch-specific code structuring.
> Some minor bugfixes are also applied:
> - Sanity checks / error handling
> - Non-pci devices support for emulated iommu


>
> Overall description of stage-1 support is available in the original
> patch series cover letter. Original commits structure with detailed
> explanation for each commit
> functionality is maintained.


I am a bit surprised not much has changed. Last time we asked for a document to
explain the overall design of the vSMMU, including some details on the
security posture. I can’t remember if this was ever posted.

If not, then you need to start with that. Otherwise, it is going to be
pretty difficult to review this series.

Cheers,


>
> Patch series testing is performed in qemu arm environment. Additionally,
> stage-1 translation for non-pci devices is verified on a Renesas platform.
>
> Jean-Philippe Brucker (1):
>   xen/arm: smmuv3: Maintain a SID->device structure
>
> Rahul Singh (19):
>   xen/arm: smmuv3: Add support for stage-1 and nested stage translation
>   xen/arm: smmuv3: Alloc io_domain for each device
>   xen/arm: vIOMMU: add generic vIOMMU framework
>   xen/arm: vsmmuv3: Add dummy support for virtual SMMUv3 for guests
>   xen/domctl: Add XEN_DOMCTL_CONFIG_VIOMMU_* and viommu config param
>   xen/arm: vIOMMU: Add cmdline boot option "viommu = <boolean>"
>   xen/arm: vsmmuv3: Add support for registers emulation
>   xen/arm: vsmmuv3: Add support for cmdqueue handling
>   xen/arm: vsmmuv3: Add support for command CMD_CFGI_STE
>   xen/arm: vsmmuv3: Attach Stage-1 configuration to SMMUv3 hardware
>   xen/arm: vsmmuv3: Add support for event queue and global error
>   xen/arm: vsmmuv3: Add "iommus" property node for dom0 devices
>   xen/arm: vIOMMU: IOMMU device tree node for dom0
>   xen/arm: vsmmuv3: Emulated SMMUv3 device tree node for dom0less
>   arm/libxl: vsmmuv3: Emulated SMMUv3 device tree node in libxl
>   xen/arm: vsmmuv3: Alloc virq for virtual SMMUv3
>   xen/arm: vsmmuv3: Add support to send stage-1 event to guest
>   libxl/arm: vIOMMU: Modify the partial device tree for iommus
>   xen/arm: vIOMMU: Modify the partial device tree for dom0less
>
>  docs/man/xl.cfg.5.pod.in                |  13 +
>  docs/misc/xen-command-line.pandoc       |   7 +
>  tools/golang/xenlight/helpers.gen.go    |   2 +
>  tools/golang/xenlight/types.gen.go      |   1 +
>  tools/include/libxl.h                   |   5 +
>  tools/libs/light/libxl_arm.c            | 123 +++-
>  tools/libs/light/libxl_types.idl        |   6 +
>  tools/xl/xl_parse.c                     |  10 +
>  xen/arch/arm/dom0less-build.c           |  72 ++
>  xen/arch/arm/domain.c                   |  26 +
>  xen/arch/arm/domain_build.c             | 103 ++-
>  xen/arch/arm/include/asm/domain.h       |   4 +
>  xen/arch/arm/include/asm/viommu.h       | 102 +++
>  xen/common/device-tree/dom0less-build.c |  31 +-
>  xen/drivers/passthrough/Kconfig         |  14 +
>  xen/drivers/passthrough/arm/Makefile    |   2 +
>  xen/drivers/passthrough/arm/smmu-v3.c   | 369 +++++++++-
>  xen/drivers/passthrough/arm/smmu-v3.h   |  49 +-
>  xen/drivers/passthrough/arm/viommu.c    |  87 +++
>  xen/drivers/passthrough/arm/vsmmu-v3.c  | 895 ++++++++++++++++++++++++
>  xen/drivers/passthrough/arm/vsmmu-v3.h  |  32 +
>  xen/include/public/arch-arm.h           |  14 +-
>  xen/include/public/device_tree_defs.h   |   1 +
>  xen/include/xen/iommu.h                 |  14 +
>  24 files changed, 1935 insertions(+), 47 deletions(-)
>  create mode 100644 xen/arch/arm/include/asm/viommu.h
>  create mode 100644 xen/drivers/passthrough/arm/viommu.c
>  create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.c
>  create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.h
>
> --
> 2.43.0
>


* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-08-07 17:58 ` [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests Julien Grall
@ 2025-08-13 10:04   ` Milan Djokic
  2025-08-13 12:11     ` Julien Grall
  0 siblings, 1 reply; 40+ messages in thread
From: Milan Djokic @ 2025-08-13 10:04 UTC (permalink / raw)
  To: Julien Grall
  Cc: xen-devel@lists.xenproject.org, Bertrand Marquis, Rahul Singh,
	Stefano Stabellini, Michal Orzel, Volodymyr Babchuk, Jan Beulich,
	Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

On 8/7/25 19:58, Julien Grall wrote:
> Hi Milan,
> 
> On Thu, 7 Aug 2025 at 17:55, Milan Djokic <milan_djokic@epam.com 
> <mailto:milan_djokic@epam.com>> wrote:
> 
>     This patch series represents a rebase of an older patch series
>     implemented and
>     submitted by Rahul Singh as an RFC:
>     https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/.
>     Original patch series content is aligned with the latest xen
>     structure in terms of common/arch-specific code structuring.
>     Some minor bugfixes are also applied:
>     - Sanity checks / error handling
>     - Non-pci devices support for emulated iommu
> 
> 
> 
>     Overall description of stage-1 support is available in the original
>     patch series cover letter. Original commits structure with detailed
>     explanation for each commit
>     functionality is maintained.
> 
> 
> I am a bit surprised not much has changed. Last time we asked for a document 
> to explain the overall design of the vSMMU, including some details on the 
> security posture. I can’t remember if this was ever posted.
> 
> If not, then you need to start with that. Otherwise, it is going to be 
> pretty difficult to review this series.
> 
> Cheers,
Hello Julien,

We have prepared a design document, and it will be part of the updated
patch series (added in docs/design). I'll also extend the cover letter with
details on the implementation structure to make review easier.
The following is the design document content that will be provided in the
updated patch series:

Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

Author: Milan Djokic <milan_djokic@epam.com>
Date:   2025-08-07
Status: Draft

Introduction
------------

The SMMUv3 supports two stages of translation. Each stage of translation
can be independently enabled. An incoming address is logically
translated from VA to IPA in stage 1, then the IPA is input to stage 2
which translates the IPA to the output PA. Stage 1 translation support
is required to provide isolation between different devices within the OS.

Xen already supports Stage 2 translation but there is no support for
Stage 1 translation. This design proposal outlines the introduction of
Stage-1 SMMUv3 support in Xen for ARM guests.
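
As an illustration of the VA -> IPA -> PA flow described above, the following
is a toy sketch of nested translation. It is not SMMUv3 page-table walking:
each stage is modelled as a flat single-level table, and struct stage_table,
stage_translate(), nested_translate() and the EX_* constants are hypothetical
names introduced only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_PAGE_SHIFT 12            /* 4 KiB pages, assumed for the example */
#define EX_NPAGES     4             /* tiny address space for illustration */
#define EX_INVALID    UINT64_MAX    /* marks an unmapped input page */

/* One translation stage: input page number -> output page number. */
struct stage_table {
    uint64_t out_pfn[EX_NPAGES];
};

/* Translate one address through a single stage; false means a fault. */
static inline bool stage_translate(const struct stage_table *t,
                                   uint64_t in, uint64_t *out)
{
    uint64_t pfn = in >> EX_PAGE_SHIFT;
    uint64_t offset = in & ((1ULL << EX_PAGE_SHIFT) - 1);

    if (pfn >= EX_NPAGES || t->out_pfn[pfn] == EX_INVALID)
        return false;

    *out = (t->out_pfn[pfn] << EX_PAGE_SHIFT) | offset;
    return true;
}

/*
 * Nested translation: stage 1 (guest-owned) maps VA to IPA, then
 * stage 2 (hypervisor-owned) maps IPA to PA. A fault in either stage
 * fails the whole translation.
 */
static inline bool nested_translate(const struct stage_table *s1,
                                    const struct stage_table *s2,
                                    uint64_t va, uint64_t *pa)
{
    uint64_t ipa;

    return stage_translate(s1, va, &ipa) && stage_translate(s2, ipa, pa);
}
```

The point of the split is ownership: the guest can change only the stage-1
table, so whatever IPA it produces is still confined by the stage-2 table
that Xen controls.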

Motivation
----------

ARM systems utilizing SMMUv3 require Stage-1 address translation to
ensure correct and secure DMA behavior inside guests.

This feature enables:
- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation

Design Overview
---------------

These changes provide emulated SMMUv3 support:

- SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
   SMMUv3 driver
- vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
- Register/Command Emulation: SMMUv3 register emulation and command
   queue handling
- Device Tree Extensions: adds iommus and virtual SMMUv3 nodes to
   device trees for dom0 and dom0less scenarios
- Runtime Configuration: introduces a 'viommu' boot parameter for
   dynamic enablement

Security Considerations
------------------------

viommu security benefits:
- Stage-1 translation ensures guest devices cannot perform unauthorized
   DMA
- Emulated SMMUv3 for domains removes dependency on host hardware while
   maintaining isolation

Observations and Potential Risks
--------------------------------

1. Observation:
Support for Stage-1 translation introduces new data structures
(s1_cfg and s2_cfg) and logic to write both Stage-1 and Stage-2 entries
in the Stream Table Entry (STE), including an abort field for partial
config states.

Risk:
A partially applied Stage-1 configuration might leave guest DMA
mappings in an inconsistent state, enabling unauthorized access or
cross-domain interference.

Mitigation (Handled by design):
Both s1_cfg and s2_cfg are written atomically. The abort field ensures
Stage-1 config is only used when fully applied. Incomplete configs are
ignored by the hypervisor.

2. Observation:
Guests can now issue Stage-1 cache invalidations.

Risk:
Failure to propagate invalidations could leave stale mappings, enabling
data leakage or misrouting.

Mitigation (Handled by design):
Guest invalidations are forwarded to the hardware to ensure IOMMU
coherency.

3. Observation:
The feature introduces large functional changes including the vIOMMU
framework, vsmmuv3 devices, command queues, event queues, domain
handling, and Device Tree modifications.

Risk:
Increased attack surface with risk of race conditions, malformed
commands, or misconfiguration via the device tree.

Mitigation:
- Improved sanity checks and error handling
- Feature is marked as Tech Preview and self-contained to reduce risk
   to unrelated code

4. Observation:
The implementation supports nested and standard translation modes,
using guest command queues (e.g. CMD_CFGI_STE) and events.

Risk:
Malicious commands could bypass validation and corrupt SMMUv3 state or
destabilize dom0.

Mitigation (Handled by design):
Command queues are validated, and only permitted configuration changes
are accepted. Handled in vsmmuv3 and cmdqueue logic.

5. Observation:
Device Tree changes inject iommus and vsmmuv3 nodes via libxl.

Risk:
Malicious or incorrect DT fragments could result in wrong device
assignments or hardware access.

Mitigation:
Only vetted and sanitized DT fragments are allowed. libxl limits what
guests can inject.

6. Observation:
The feature is enabled per-guest via viommu setting.

Risk:
Guests without viommu may behave differently, potentially causing
confusion, privilege drift, or accidental exposure.

Mitigation:
Ensure downgrade paths are safe. Perform isolation audits in
multi-guest environments to ensure correct behavior.

Performance Impact
------------------

Hardware-managed translations are expected to have minimal overhead.
Emulated vIOMMU may introduce some latency during initialization or
event processing.

Testing
-------

- QEMU-based testing for Stage-1 and nested translation
- Hardware testing on Renesas SMMUv3-enabled ARM systems
- Unit tests for translation accuracy (not yet implemented)

Migration and Compatibility
---------------------------

This feature is optional and disabled by default (viommu="") to ensure
backward compatibility.

References
----------

- Original implementation by Rahul Singh:
   https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
- ARM SMMUv3 architecture documentation
- Existing vIOMMU code in Xen

BR,
Milan


* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-08-13 10:04   ` Milan Djokic
@ 2025-08-13 12:11     ` Julien Grall
  2025-08-14 16:26       ` Milan Djokic
  0 siblings, 1 reply; 40+ messages in thread
From: Julien Grall @ 2025-08-13 12:11 UTC (permalink / raw)
  To: Milan Djokic, Julien Grall
  Cc: xen-devel@lists.xenproject.org, Bertrand Marquis, Rahul Singh,
	Stefano Stabellini, Michal Orzel, Volodymyr Babchuk, Jan Beulich,
	Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

On 13/08/2025 11:04, Milan Djokic wrote:
> Hello Julien,

Hi Milan,

> 
> We have prepared a design document and it will be part of the updated 
> patch series (added in docs/design). I'll also extend cover letter with 
> details on implementation structure to make review easier.

I would suggest just iterating on the design document for now.

> Following is the design document content which will be provided in 
> updated patch series:
> 
> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
> ==========================================================
> 
> Author: Milan Djokic <milan_djokic@epam.com>
> Date:   2025-08-07
> Status: Draft
> 
> Introduction
> ------------
> 
> The SMMUv3 supports two stages of translation. Each stage of translation
> can be independently enabled. An incoming address is logically
> translated from VA to IPA in stage 1, then the IPA is input to stage 2
> which translates the IPA to the output PA. Stage 1 translation support
> is required to provide isolation between different devices within the OS.
> 
> Xen already supports Stage 2 translation but there is no support for
> Stage 1 translation. This design proposal outlines the introduction of
> Stage-1 SMMUv3 support in Xen for ARM guests.
> 
> Motivation
> ----------
> 
> ARM systems utilizing SMMUv3 require Stage-1 address translation to
> ensure correct and secure DMA behavior inside guests.

Can you clarify what you mean by "correct"? DMA would still work without 
stage-1.
> 
> This feature enables:
> - Stage-1 translation in guest domain
> - Safe device passthrough under secure memory translation
> 
> Design Overview
> ---------------
> 
> These changes provide emulated SMMUv3 support:
> 
> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>    SMMUv3 driver
> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling

So what are you planning to expose to a guest? Is it one vIOMMU per 
pIOMMU? Or a single one?

Have you considered the pros/cons for both?
> - Register/Command Emulation: SMMUv3 register emulation and command
>    queue handling

For each pSMMU, we have a single command queue that will receive commands 
from all the guests. How do you plan to prevent a guest from hogging the 
command queue?

In addition to that, AFAIU, the size of the virtual command queue is 
fixed by the guest rather than Xen. If a guest is filling up the queue 
with commands before notifying Xen, how do you plan to ensure we don't 
spend too much time in Xen (which is not preemptible)?

Lastly, what do you plan to expose? Is it a full vIOMMU (including event 
forwarding)?
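
To make the non-preemptible processing concern concrete, one common way to
bound the work done per trap is to consume a fixed budget of commands and
defer the rest. The sketch below is an assumption for discussion, not what
the series implements: struct vcmdq, vcmdq_process_bounded() and
MAX_CMDS_PER_PASS are hypothetical, and the ring is simplified (no wrap bits,
so prod == cons means empty).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-invocation budget for emulated commands. */
#define MAX_CMDS_PER_PASS 8U

/* Simplified virtual command queue indices. */
struct vcmdq {
    uint32_t prod;    /* next slot the guest will write */
    uint32_t cons;    /* next slot the emulator will read */
    uint32_t nents;   /* number of slots, must be a power of two */
};

/*
 * Consume at most MAX_CMDS_PER_PASS commands, calling handle() (if not
 * NULL) with the index of each one. Returns the number of commands still
 * pending, so the caller can reschedule the remainder instead of looping
 * for an unbounded time.
 */
static inline uint32_t vcmdq_process_bounded(struct vcmdq *q,
                                             void (*handle)(uint32_t idx))
{
    uint32_t budget = MAX_CMDS_PER_PASS;

    while (q->cons != q->prod && budget--) {
        if (handle)
            handle(q->cons);
        q->cons = (q->cons + 1) & (q->nents - 1);
    }

    return (q->prod - q->cons) & (q->nents - 1);
}
```

A non-zero return value would be the cue to continue later, for example from
a softirq or on the next guest entry; how the remainder is rescheduled, and
how fairness on the shared physical queue is enforced, is exactly the open
design question.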

> - Device Tree Extensions: adds iommus and virtual SMMUv3 nodes to
>    device trees for dom0 and dom0less scenarios
> - Runtime Configuration: introduces a 'viommu' boot parameter for
>    dynamic enablement
> 
> Security Considerations
> ------------------------
> 
> viommu security benefits:
> - Stage-1 translation ensures guest devices cannot perform unauthorized
>    DMA
> - Emulated SMMUv3 for domains removes dependency on host hardware while
>    maintaining isolation

I don't understand this sentence.

> 
> Observations and Potential Risks
> --------------------------------
> 
> 1. Observation:
> Support for Stage-1 translation introduces new data structures
> (s1_cfg and s2_cfg) and logic to write both Stage-1 and Stage-2 entries
> in the Stream Table Entry (STE), including an abort field for partial
> config states.
> 
> Risk:
> A partially applied Stage-1 configuration might leave guest DMA
> mappings in an inconsistent state, enabling unauthorized access or
> cross-domain interference.

I don't understand how a misconfigured stage-1 could lead to 
cross-domain interference. Can you clarify?

> 
> Mitigation (Handled by design):
> Both s1_cfg and s2_cfg are written atomically. The abort field ensures
> Stage-1 config is only used when fully applied. Incomplete configs are
> ignored by the hypervisor.
> 
> 2. Observation:
> Guests can now issue Stage-1 cache invalidations.
> 
> Risk:
> Failure to propagate invalidations could leave stale mappings, enabling
> data leakage or misrouting.

This is a risk from the guest PoV, right? IOW, this would not open up a 
security hole in Xen.

> 
> Mitigation (Handled by design):
> Guest invalidations are forwarded to the hardware to ensure IOMMU
> coherency.
> 
> 3. Observation:
> The feature introduces large functional changes including the vIOMMU
> framework, vsmmuv3 devices, command queues, event queues, domain
> handling, and Device Tree modifications.
> 
> Risk:
> Increased attack surface with risk of race conditions, malformed
> commands, or misconfiguration via the device tree.
> 
> Mitigation:
> - Improved sanity checks and error handling
> - Feature is marked as Tech Preview and self-contained to reduce risk
>    to unrelated code

Surely, you will want to use the code in production... No?

> 
> 4. Observation:
> The implementation supports nested and standard translation modes,
> using guest command queues (e.g. CMD_CFGI_STE) and events.
> 
> Risk:
> Malicious commands could bypass validation and corrupt SMMUv3 state or
> destabilize dom0.
> 
> Mitigation (Handled by design):
> Command queues are validated, and only permitted configuration changes
> are accepted. Handled in vsmmuv3 and cmdqueue logic.

I didn't mention anything in observation 1 but now I have to say it... 
The observations you wrote are what I would expect to be handled in any 
submission to our code base. This is the bare minimum to have the code 
secure. But you don't seem to address the more subtle ones, which are 
more related to scheduling issues (see some above). They require some 
design and discussion.

> 
> 5. Observation:
> Device Tree changes inject iommus and vsmmuv3 nodes via libxl.
> 
> Risk:
> Malicious or incorrect DT fragments could result in wrong device
> assignments or hardware access.
> 
> Mitigation:
> Only vetted and sanitized DT fragments are allowed. libxl limits what
> guests can inject.

Today, libxl doesn't do any sanitisation on the DT. In fact, this is 
pretty much impossible because libfdt expects trusted DT. Is this 
something you are planning to change?
> 
> 6. Observation:
> The feature is enabled per-guest via viommu setting.
> 
> Risk:
> Guests without viommu may behave differently, potentially causing
> confusion, privilege drift, or accidental exposure.
> 
> Mitigation:
> Ensure downgrade paths are safe. Perform isolation audits in
> multi-guest environments to ensure correct behavior.
> 
> Performance Impact
> ------------------
> 
> Hardware-managed translations are expected to have minimal overhead.
> Emulated vIOMMU may introduce some latency during initialization or
> event processing.

Latency to who? We still expect isolation between guests and a guest 
will not go over its time slice.

For the guest itself, the main performance impact will be TLB flushes 
because they are commands that will end up being emulated by Xen. 
Depending on your Linux configuration (I haven't checked others), this 
will either happen on every unmap operation or they will be batched. The 
performance of the latter will be the worse one.

Have you done any benchmarks to confirm the impact? Just to note, I would 
not gate the work for virtual SMMUv3 based on the performance. I think 
it is ok to offer the support if the user wants extra security and 
doesn't care about performance. But it would be good to outline them as 
I expect them to be pretty bad...

Cheers,

-- 
Julien Grall




* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-08-13 12:11     ` Julien Grall
@ 2025-08-14 16:26       ` Milan Djokic
  2025-08-26 23:28         ` Volodymyr Babchuk
  0 siblings, 1 reply; 40+ messages in thread
From: Milan Djokic @ 2025-08-14 16:26 UTC (permalink / raw)
  To: Julien Grall, Julien Grall
  Cc: xen-devel@lists.xenproject.org, Bertrand Marquis, Rahul Singh,
	Stefano Stabellini, Michal Orzel, Volodymyr Babchuk, Jan Beulich,
	Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

Hello Julien,

On 8/13/25 14:11, Julien Grall wrote:
> On 13/08/2025 11:04, Milan Djokic wrote:
>> Hello Julien,
> 
> Hi Milan,
> 
>>
>> We have prepared a design document and it will be part of the updated
>> patch series (added in docs/design). I'll also extend cover letter with
>> details on implementation structure to make review easier.
> 
> I would suggest to just iterate on the design document for now.
> 
>> Following is the design document content which will be provided in
>> updated patch series:
>>
>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>> ==========================================================
>>
>> Author: Milan Djokic <milan_djokic@epam.com>
>> Date:   2025-08-07
>> Status: Draft
>>
>> Introduction
>> ------------
>>
>> The SMMUv3 supports two stages of translation. Each stage of translation
>> can be independently enabled. An incoming address is logically
>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>> which translates the IPA to the output PA. Stage 1 translation support
>> is required to provide isolation between different devices within the OS.
>>
>> Xen already supports Stage 2 translation but there is no support for
>> Stage 1 translation. This design proposal outlines the introduction of
>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>
>> Motivation
>> ----------
>>
>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>> ensure correct and secure DMA behavior inside guests.
> 
> Can you clarify what you mean by "correct"? DMA would still work without
> stage-1.

Correct in terms of working with guest managed I/O space. I'll rephrase 
this statement, it seems ambiguous.

>>
>> This feature enables:
>> - Stage-1 translation in guest domain
>> - Safe device passthrough under secure memory translation
>>
>> Design Overview
>> ---------------
>>
>> These changes provide emulated SMMUv3 support:
>>
>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>     SMMUv3 driver
>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
> 
> So what are you planning to expose to a guest? Is it one vIOMMU per
> pIOMMU? Or a single one?

Single vIOMMU model is used in this design.

> 
> Have you considered the pros/cons for both?
>> - Register/Command Emulation: SMMUv3 register emulation and command
>>     queue handling
> 

That's a point for consideration.
A single vIOMMU wins in terms of a less complex implementation and a 
simple guest IOMMU model: a single vIOMMU node, one interrupt path, one 
event queue, a single set of trap handlers for emulation, etc.
Cons of the single vIOMMU model could be a less accurate hw representation 
and a potential bottleneck with one emulated queue and interrupt path.
On the other hand, a vIOMMU per pIOMMU provides more accurate hw modeling 
and scales better when there are many IOMMUs in the system, but 
this comes with more complex emulation logic and device tree, plus 
handling of multiple vIOMMUs on the guest side.
IMO, the single vIOMMU model seems like the better option, mostly because 
it's less complex and easier to maintain and debug. Of course, this 
decision can and should be discussed.

> For each pSMMU, we have a single command queue that will receive command
> from all the guests. How do you plan to prevent a guest hogging the
> command queue?
> 
> In addition to that, AFAIU, the size of the virtual command queue is
> fixed by the guest rather than Xen. If a guest is filling up the queue
> with commands before notifying Xen, how do you plan to ensure we don't
> spend too much time in Xen (which is not preemptible)?
> 

We'll have to do a detailed analysis of these scenarios; they are not 
covered by the design (nor are some others, which is clear after your 
comments). I'll come back with an updated design.

> Lastly, what do you plan to expose? Is it a full vIOMMU (including event
> forwarding)?
> 

Yes, implementation provides full vIOMMU functionality, with stage-1 
event forwarding to guest.

>> - Device Tree Extensions: adds iommus and virtual SMMUv3 nodes to
>>     device trees for dom0 and dom0less scenarios
>> - Runtime Configuration: introduces a 'viommu' boot parameter for
>>     dynamic enablement
>>
>> Security Considerations
>> ------------------------
>>
>> viommu security benefits:
>> - Stage-1 translation ensures guest devices cannot perform unauthorized
>>     DMA
>> - Emulated SMMUv3 for domains removes dependency on host hardware while
>>     maintaining isolation
> 
> I don't understand this sentence.
> 

Current implementation emulates IOMMU with predefined capabilities, 
exposed as a single vIOMMU to guest. That's where "removes dependency on 
host hardware" came from. I'll rephrase this part to be more clear.


>>
>> Observations and Potential Risks
>> --------------------------------
>>
>> 1. Observation:
>> Support for Stage-1 translation introduces new data structures
>> (s1_cfg and s2_cfg) and logic to write both Stage-1 and Stage-2 entries
>> in the Stream Table Entry (STE), including an abort field for partial
>> config states.
>>
>> Risk:
>> A partially applied Stage-1 configuration might leave guest DMA
>> mappings in an inconsistent state, enabling unauthorized access or
>> cross-domain interference.
> 
> I don't understand how a misconfigured stage-1 could lead to
> cross-domain interference. Can you clarify?
> 

For stage-1 support, SID-to-device mapping and per-device io_domain 
allocation are introduced in the Xen SMMU driver, and we have to take care 
that these mappings are valid all the time. If these are not properly 
managed, structures and SIDs could be mapped to the wrong device (and 
consequently the wrong guest) in some extreme cases.
This is covered by the design, but listed as a risk anyway for eventual 
future updates in this area.


>>
>> Mitigation (Handled by design):
>> Both s1_cfg and s2_cfg are written atomically. The abort field ensures
>> Stage-1 config is only used when fully applied. Incomplete configs are
>> ignored by the hypervisor.
>>
>> 2. Observation:
>> Guests can now issue Stage-1 cache invalidations.
>>
>> Risk:
>> Failure to propagate invalidations could leave stale mappings, enabling
>> data leakage or misrouting.
> 
> This is a risk from the guest PoV, right? IOW, this would not open up a
> security hole in Xen.
> 

Yes, this is guest PoV, although still related to vIOMMU.

>>
>> Mitigation (Handled by design):
>> Guest invalidations are forwarded to the hardware to ensure IOMMU
>> coherency.
>>
>> 3. Observation:
>> The feature introduces large functional changes including the vIOMMU
>> framework, vsmmuv3 devices, command queues, event queues, domain
>> handling, and Device Tree modifications.
>>
>> Risk:
>> Increased attack surface with risk of race conditions, malformed
>> commands, or misconfiguration via the device tree.
>>
>> Mitigation:
>> - Improved sanity checks and error handling
>> - Feature is marked as Tech Preview and self-contained to reduce risk
>>     to unrelated code
> 
> Surely, you will want to use the code in production... No?
> 

Yes, it is planned for production usage. At the moment, it is optionally 
enabled (grouped under unsupported features), needs community feedback, 
complete security analysis and performance benchmarking/optimization. 
That's the reason it's marked as a Tech Preview at this point.


>>
>> 4. Observation:
>> The implementation supports nested and standard translation modes,
>> using guest command queues (e.g. CMD_CFGI_STE) and events.
>>
>> Risk:
>> Malicious commands could bypass validation and corrupt SMMUv3 state or
>> destabilize dom0.
>>
>> Mitigation (Handled by design):
>> Command queues are validated, and only permitted configuration changes
>> are accepted. Handled in vsmmuv3 and cmdqueue logic.
> 
> I didn't mention anything in observation 1 but now I have to say it...
> The observations you wrote are what I would expect to be handled in any
> submission to our code base. This is the bare minimum to have the code
> secure. But you don't seem to address the more subtle ones which are
> more related to scheduling issue (see some above). They require some
> design and discussion.
> 

Yes, it's clear to me after your comments that some important 
observations are missing. We'll do additional analysis and come back 
with a more complete design.

>>
>> 5. Observation:
>> Device Tree changes inject iommus and vsmmuv3 nodes via libxl.
>>
>> Risk:
>> Malicious or incorrect DT fragments could result in wrong device
>> assignments or hardware access.
>>
>> Mitigation:
>> Only vetted and sanitized DT fragments are allowed. libxl limits what
>> guests can inject.
> 
> Today, libxl doesn't do any sanitisation on the DT. In fact, this is
> pretty much impossible because libfdt expects trusted DT. Is this
> something you are planning to change?

I've referred to libxl parsing only supported fragments/nodes from DT, 
but yes, that's not actual sanitization. I'll update these statements.

>>
>> 6. Observation:
>> The feature is enabled per-guest via viommu setting.
>>
>> Risk:
>> Guests without viommu may behave differently, potentially causing
>> confusion, privilege drift, or accidental exposure.
>>
>> Mitigation:
>> Ensure downgrade paths are safe. Perform isolation audits in
>> multi-guest environments to ensure correct behavior.
>>
>> Performance Impact
>> ------------------
>>
>> Hardware-managed translations are expected to have minimal overhead.
>> Emulated vIOMMU may introduce some latency during initialization or
>> event processing.
> 
> Latency to who? We still expect isolation between guests and a guest
> will not go over its time slice.
> 

This is more related to comparison of emulated vs hw translation, and 
overall overhead introduced with these mechanisms. I'll rephrase this 
part to be more clear.

> For the guest itself, the main performance impact will be TLB flushes
> because they are commands that will end up being emulated by Xen.
> Depending on your Linux configuration (I haven't checked others), this
> will either happen on every unmap operation or they will be batched. The
> performance of the latter will be the worse one.
> 
> Have you done any benchmark to confirm the impact? Just to note, I would
> not gate the work for virtual SMMUv3 based on the performance. I think
> it is ok to offer the support if the user wants extra security and
> doesn't care about performance. But it would be good to outline them as
> I expect them to be pretty bad...
> 

We haven't performed detailed benchmarking, just a measurement of boot 
time and our domU application execution rate with and without viommu. We 
could perform some measurements for viommu operations and add results in 
this section.

Thank you for your feedback, I'll come back with an updated design 
document for further review.

BR,
Milan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-08-14 16:26       ` Milan Djokic
@ 2025-08-26 23:28         ` Volodymyr Babchuk
  2025-08-28 11:59           ` Milan Djokic
  0 siblings, 1 reply; 40+ messages in thread
From: Volodymyr Babchuk @ 2025-08-26 23:28 UTC (permalink / raw)
  To: Milan Djokic
  Cc: Julien Grall, Julien Grall, xen-devel@lists.xenproject.org,
	Bertrand Marquis, Rahul Singh, Stefano Stabellini, Michal Orzel,
	Jan Beulich, Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper


Hi Milan,

Milan Djokic <milan_djokic@epam.com> writes:

> Hello Julien,
>
> On 8/13/25 14:11, Julien Grall wrote:
>> On 13/08/2025 11:04, Milan Djokic wrote:
>>> Hello Julien,
>> Hi Milan,
>> 
>>>
>>> We have prepared a design document and it will be part of the updated
>>> patch series (added in docs/design). I'll also extend cover letter with
>>> details on implementation structure to make review easier.
>> I would suggest to just iterate on the design document for now.
>> 
>>> Following is the design document content which will be provided in
>>> updated patch series:
>>>
>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>> ==========================================================
>>>
>>> Author: Milan Djokic <milan_djokic@epam.com>
>>> Date:   2025-08-07
>>> Status: Draft
>>>
>>> Introduction
>>> ------------
>>>
>>> The SMMUv3 supports two stages of translation. Each stage of translation
>>> can be independently enabled. An incoming address is logically
>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>>> which translates the IPA to the output PA. Stage 1 translation support
>>> is required to provide isolation between different devices within the OS.
>>>
>>> Xen already supports Stage 2 translation but there is no support for
>>> Stage 1 translation. This design proposal outlines the introduction of
>>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>>
>>> Motivation
>>> ----------
>>>
>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>>> ensure correct and secure DMA behavior inside guests.
>> Can you clarify what you mean by "correct"? DMA would still work
>> without
>> stage-1.
>
> Correct in terms of working with guest managed I/O space. I'll
> rephrase this statement, it seems ambiguous.
>
>>>
>>> This feature enables:
>>> - Stage-1 translation in guest domain
>>> - Safe device passthrough under secure memory translation
>>>
>>> Design Overview
>>> ---------------
>>>
>>> These changes provide emulated SMMUv3 support:
>>>
>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>>     SMMUv3 driver
>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>> So what are you planning to expose to a guest? Is it one vIOMMU per
>> pIOMMU? Or a single one?
>
> Single vIOMMU model is used in this design.
>
>> Have you considered the pros/cons for both?
>>> - Register/Command Emulation: SMMUv3 register emulation and command
>>>     queue handling
>> 
>
> That's a point for consideration.
> single vIOMMU prevails in terms of less complex implementation and a
> simple guest iommu model - single vIOMMU node, one interrupt path,
> event queue, single set of trap handlers for emulation, etc.
> Cons for a single vIOMMU model could be less accurate hw
> representation and a potential bottleneck with one emulated queue and
> interrupt path.
> On the other hand, vIOMMU per pIOMMU provides more accurate hw
> modeling and offers better scalability in case of many IOMMUs in the
> system, but this comes with more complex emulation logic and device
> tree, also handling multiple vIOMMUs on guest side.
> IMO, single vIOMMU model seems like a better option mostly because
> it's less complex, easier to maintain and debug. Of course, this
> decision can and should be discussed.
>

Well, I am not sure that this is possible, because of StreamID
allocation. The biggest offender is of course PCI, as each Root PCI
bridge will require its own SMMU instance with its own StreamID space. But
even without PCI you'll need some mechanism to map vStreamID to
<pSMMU, pStreamID>, because there will be overlaps in SID space.


Actually, PCI/vPCI with vSMMU is its own can of worms...

>> For each pSMMU, we have a single command queue that will receive command
>> from all the guests. How do you plan to prevent a guest hogging the
>> command queue?
>> In addition to that, AFAIU, the size of the virtual command queue is
>> fixed by the guest rather than Xen. If a guest is filling up the queue
>> with commands before notifying Xen, how do you plan to ensure we don't
>> spend too much time in Xen (which is not preemptible)?
>> 
>
> We'll have to do a detailed analysis on these scenarios, they are not
> covered by the design (as well as some others which is clear after
> your comments). I'll come back with an updated design.

I think that can be handled akin to hypercall continuation, which is
used in similar places, like P2M code

[...]

-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-08-26 23:28         ` Volodymyr Babchuk
@ 2025-08-28 11:59           ` Milan Djokic
  2025-08-29 16:27             ` Volodymyr Babchuk
  0 siblings, 1 reply; 40+ messages in thread
From: Milan Djokic @ 2025-08-28 11:59 UTC (permalink / raw)
  To: Volodymyr Babchuk, Julien Grall, Julien Grall
  Cc: xen-devel@lists.xenproject.org, Bertrand Marquis, Rahul Singh,
	Stefano Stabellini, Michal Orzel, Jan Beulich,
	Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

Hello Julien, Volodymyr

On 8/27/25 01:28, Volodymyr Babchuk wrote:
> 
> Hi Milan,
> 
> Milan Djokic <milan_djokic@epam.com> writes:
> 
>> Hello Julien,
>>
>> On 8/13/25 14:11, Julien Grall wrote:
>>> On 13/08/2025 11:04, Milan Djokic wrote:
>>>> Hello Julien,
>>> Hi Milan,
>>>
>>>>
>>>> We have prepared a design document and it will be part of the updated
>>>> patch series (added in docs/design). I'll also extend cover letter with
>>>> details on implementation structure to make review easier.
>>> I would suggest to just iterate on the design document for now.
>>>
>>>> Following is the design document content which will be provided in
>>>> updated patch series:
>>>>
>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>>> ==========================================================
>>>>
>>>> Author: Milan Djokic <milan_djokic@epam.com>
>>>> Date:   2025-08-07
>>>> Status: Draft
>>>>
>>>> Introduction
>>>> ------------
>>>>
>>>> The SMMUv3 supports two stages of translation. Each stage of translation
>>>> can be independently enabled. An incoming address is logically
>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>>>> which translates the IPA to the output PA. Stage 1 translation support
>>>> is required to provide isolation between different devices within the OS.
>>>>
>>>> Xen already supports Stage 2 translation but there is no support for
>>>> Stage 1 translation. This design proposal outlines the introduction of
>>>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>>>
>>>> Motivation
>>>> ----------
>>>>
>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>>>> ensure correct and secure DMA behavior inside guests.
>>> Can you clarify what you mean by "correct"? DMA would still work
>>> without
>>> stage-1.
>>
>> Correct in terms of working with guest managed I/O space. I'll
>> rephrase this statement, it seems ambiguous.
>>
>>>>
>>>> This feature enables:
>>>> - Stage-1 translation in guest domain
>>>> - Safe device passthrough under secure memory translation
>>>>
>>>> Design Overview
>>>> ---------------
>>>>
>>>> These changes provide emulated SMMUv3 support:
>>>>
>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>>>      SMMUv3 driver
>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>>> So what are you planning to expose to a guest? Is it one vIOMMU per
>>> pIOMMU? Or a single one?
>>
>> Single vIOMMU model is used in this design.
>>
>>> Have you considered the pros/cons for both?
>>>> - Register/Command Emulation: SMMUv3 register emulation and command
>>>>      queue handling
>>>
>>
>> That's a point for consideration.
>> single vIOMMU prevails in terms of less complex implementation and a
>> simple guest iommu model - single vIOMMU node, one interrupt path,
>> event queue, single set of trap handlers for emulation, etc.
>> Cons for a single vIOMMU model could be less accurate hw
>> representation and a potential bottleneck with one emulated queue and
>> interrupt path.
>> On the other hand, vIOMMU per pIOMMU provides more accurate hw
>> modeling and offers better scalability in case of many IOMMUs in the
>> system, but this comes with more complex emulation logic and device
>> tree, also handling multiple vIOMMUs on guest side.
>> IMO, single vIOMMU model seems like a better option mostly because
>> it's less complex, easier to maintain and debug. Of course, this
>> decision can and should be discussed.
>>
> 
> Well, I am not sure that this is possible, because of StreamID
> allocation. The biggest offender is of course PCI, as each Root PCI
> bridge will require own SMMU instance with own StreamID space. But even
> without PCI you'll need some mechanism to map vStreamID to
> <pSMMU, pStreamID>, because there will be overlaps in SID space.
> 
> 
> Actually, PCI/vPCI with vSMMU is its own can of worms...
> 
>>> For each pSMMU, we have a single command queue that will receive command
>>> from all the guests. How do you plan to prevent a guest hogging the
>>> command queue?
>>> In addition to that, AFAIU, the size of the virtual command queue is
>>> fixed by the guest rather than Xen. If a guest is filling up the queue
>>> with commands before notifying Xen, how do you plan to ensure we don't
>>> spend too much time in Xen (which is not preemptible)?
>>>
>>
>> We'll have to do a detailed analysis on these scenarios, they are not
>> covered by the design (as well as some others which is clear after
>> your comments). I'll come back with an updated design.
> 
> I think that can be handled akin to hypercall continuation, which is
> used in similar places, like P2M code
> 
> [...]
> 

I have updated vIOMMU design document with additional security topics 
covered and performance impact results. Also added some additional 
explanations for vIOMMU components following your comments.
Updated document content:

==========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

:Author:     Milan Djokic <milan_djokic@epam.com>
:Date:       2025-08-07
:Status:     Draft

Introduction
============

The SMMUv3 supports two stages of translation. Each stage of translation can be
independently enabled. An incoming address is logically translated from VA to
IPA in stage 1, then the IPA is input to stage 2 which translates the IPA to
the output PA. Stage 1 translation support is required to provide isolation
between different devices within the OS. Xen already supports Stage 2
translation but there is no support for Stage 1 translation.
This design proposal outlines the introduction of Stage-1 SMMUv3 support in
Xen for ARM guests.

Motivation
==========

ARM systems utilizing SMMUv3 require stage-1 address translation to ensure
secure DMA and guest-managed I/O memory mappings.
This feature enables:

- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation

Design Overview
===============

These changes provide emulated SMMUv3 support:

- **SMMUv3 Stage-1 Translation**: Stage-1 and nested translation support in
  the SMMUv3 driver.
- **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 handling.
- **Register/Command Emulation**: SMMUv3 register emulation and command queue
  handling.
- **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to device
  trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: Introduces a `viommu` boot parameter for dynamic
  enablement.

The vIOMMU is exposed to the guest as a single device with a predefined set of
capabilities and supported commands. The single vIOMMU model abstracts the
details of the actual IOMMU hardware, simplifying usage from the guest point of
view. The guest OS handles only a single IOMMU, even if multiple IOMMU units
are available on the host system.
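With a single vIOMMU in front of several physical SMMUs, StreamIDs seen by the
guest have to be translated to a physical SMMU and a StreamID on it, since
physical StreamID spaces can overlap. The following is a minimal sketch of
such a lookup, not the series' implementation; `struct vsid_map_entry` and
`vsid_lookup()` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one entry per device assigned to the guest. */
struct vsid_map_entry {
    uint32_t vsid;       /* StreamID as seen by the guest */
    unsigned int psmmu;  /* index of the physical SMMU behind the device */
    uint32_t psid;       /* StreamID on that physical SMMU */
};

/*
 * Translate a guest StreamID. Returns the matching entry, or NULL when the
 * guest uses a StreamID that was never assigned to it.
 */
static const struct vsid_map_entry *
vsid_lookup(const struct vsid_map_entry *map, size_t nr, uint32_t vsid)
{
    assert(map != NULL || nr == 0);

    for (size_t i = 0; i < nr; i++)
        if (map[i].vsid == vsid)
            return &map[i];

    return NULL;
}
```

A linear scan is enough for the handful of devices a guest typically owns; a
command that names an unknown vStreamID would simply be rejected.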

Security Considerations
=======================

**viommu security benefits:**

- Stage-1 translation ensures guest devices cannot perform unauthorized DMA.
- Emulated IOMMU removes guest dependency on IOMMU hardware while maintaining
  isolation between domains.


1. Observation:
---------------
Support for Stage-1 translation in SMMUv3 introduces new data structures
(`s1_cfg` alongside `s2_cfg`) and logic to write both Stage-1 and Stage-2
entries in the Stream Table Entry (STE), including an `abort` field to handle
partial configuration states.

**Risk:**
Without proper handling, a partially applied Stage-1 configuration might leave
guest DMA mappings in an inconsistent state, potentially enabling unauthorized
access or causing cross-domain interference.

**Mitigation:** *(Handled by design)*
This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to the STE
and manages the `abort` field, considering the Stage-1 configuration only if it
is fully attached. This ensures incomplete or invalid guest configurations are
safely ignored by the hypervisor.
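The selection rule described above can be sketched as follows. This is an
illustration under assumptions, not the driver's code: `struct dev_cfg` and
`ste_pick_config()` are hypothetical, and the sketch assumes that a guest
device without a valid Stage-2 configuration must abort. The enum values are
the STE.Config encodings from the SMMUv3 architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* STE.Config[2:0] encodings defined by the SMMUv3 architecture. */
enum ste_config {
    STE_CFG_ABORT  = 0x0, /* terminate all traffic from the stream */
    STE_CFG_BYPASS = 0x4, /* no translation */
    STE_CFG_S1     = 0x5, /* stage 1 only */
    STE_CFG_S2     = 0x6, /* stage 2 only */
    STE_CFG_NESTED = 0x7, /* stage 1 followed by stage 2 */
};

/* Hypothetical per-device state, named after the fields in the text. */
struct dev_cfg {
    bool abort;     /* set while a configuration is only partially applied */
    bool s1_valid;  /* guest Stage-1 configuration is fully attached */
    bool s2_valid;  /* Xen-owned Stage-2 configuration is present */
};

/* Pick the STE configuration; anything incomplete falls back to abort. */
static enum ste_config ste_pick_config(const struct dev_cfg *c)
{
    assert(c != NULL);

    if (c->abort || !c->s2_valid)
        return STE_CFG_ABORT;

    return c->s1_valid ? STE_CFG_NESTED : STE_CFG_S2;
}
```

The point of the sketch is that Stage-1 only ever appears combined with
Stage-2, so a guest-controlled Stage-1 table cannot reach memory outside the
guest's own IPA space.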

2. Observation:
---------------
Guests can now invalidate Stage-1 caches; invalidation needs forwarding to
SMMUv3 hardware to maintain coherence.

**Risk:**
Failing to propagate cache invalidation could allow stale mappings, enabling
access to old mappings and possibly data leakage or misrouting.

**Mitigation:** *(Handled by design)*
This feature ensures that guest-initiated invalidations are correctly forwarded
to the hardware, preserving IOMMU coherency.

3. Observation:
---------------
This design introduces substantial new functionality, including the `vIOMMU`
framework, virtual SMMUv3 devices (`vsmmuv3`), command queues, event queues,
domain management, and Device Tree modifications (e.g., `iommus` nodes and
`libxl` integration).

**Risk:**
Large feature expansions increase the attack surface: potential for race
conditions, unchecked command inputs, or Device Tree-based misconfigurations.

**Mitigation:**

- Sanity checks and error-handling improvements have been introduced in this
  feature.
- Further audits have to be performed for this feature and its dependencies in
  this area. Currently, the feature is marked as *Tech Preview* and is
  self-contained, reducing the risk to unrelated components.

4. Observation:
---------------
The code includes transformations to handle nested translation versus standard
modes and uses guest-configured command queues (e.g., `CMD_CFGI_STE`) and event
notifications.

**Risk:**
Malicious or malformed queue commands from guests could bypass validation,
manipulate SMMUv3 state, or cause Dom0 instability.

**Mitigation:** *(Handled by design)*
Built-in validation of command queue entries and sanitization mechanisms ensure
only permitted configurations are applied. This is supported via additions in
`vsmmuv3` and `cmdqueue` handling code.
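One way to picture the validation step is an opcode allowlist applied to every
guest command before it is emulated. The sketch below is an assumption-laden
illustration: `vcmd_is_permitted()` is a hypothetical helper, the set of
allowed commands is only an example, and the opcode values are the ones used
by the Linux SMMUv3 driver as far as I recall, so they should be checked
against the architecture specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode values as in the Linux SMMUv3 driver; verify against the spec. */
#define CMD_OP_CFGI_STE      0x03
#define CMD_OP_CFGI_CD       0x05
#define CMD_OP_CFGI_CD_ALL   0x06
#define CMD_OP_TLBI_NH_ASID  0x11
#define CMD_OP_TLBI_NH_VA    0x12
#define CMD_OP_TLBI_NSNH_ALL 0x30
#define CMD_OP_SYNC          0x46

/* A command is 16 bytes; the opcode is in bits [7:0] of the first word. */
static uint8_t cmd_opcode(const uint64_t cmd[2])
{
    return (uint8_t)(cmd[0] & 0xff);
}

/*
 * Accept only stage-1 related commands. Everything else, including
 * stage-2 invalidations which belong to Xen, is rejected.
 */
static bool vcmd_is_permitted(const uint64_t cmd[2])
{
    assert(cmd != NULL);

    switch (cmd_opcode(cmd)) {
    case CMD_OP_CFGI_STE:
    case CMD_OP_CFGI_CD:
    case CMD_OP_CFGI_CD_ALL:
    case CMD_OP_TLBI_NH_ASID:
    case CMD_OP_TLBI_NH_VA:
    case CMD_OP_TLBI_NSNH_ALL:
    case CMD_OP_SYNC:
        return true;
    default:
        return false;
    }
}
```

A rejected command would typically be reported back to the guest as a command
queue error rather than silently dropped.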

5. Observation:
---------------
Device Tree modifications enable device assignment and configuration: guest DT
fragments (e.g., `iommus`) are added via `libxl`.

**Risk:**
Erroneous or malicious Device Tree injection could result in device misbinding
or guest access to unauthorized hardware.

**Mitigation:**

- `libxl` performs checks of the guest configuration and parses only predefined
  DT fragments and nodes, reducing risk.
- The system integrator must ensure correct resource mapping in the guest
  Device Tree (DT) fragments.

6. Observation:
---------------
Introducing optional per-guest enabled features (`viommu` argument in xl guest
config) means some guests may opt out.

**Risk:**
Differences between guests with and without `viommu` may cause unexpected
behavior or privilege drift.

**Mitigation:**
Verify that downgrade paths are safe and well-isolated; ensure missing support
doesn't cause security issues. Additional audits on emulation paths and
inter-domain interference need to be performed in a multi-guest environment.

7. Observation:
---------------
Guests have the ability to issue Stage-1 IOMMU commands like cache
invalidation, stream table entry configuration, etc. An adversarial guest may
issue a high volume of commands in rapid succession.

**Risk:**
Excessive command requests can cause high hypervisor CPU consumption and
disrupt scheduling, leading to degraded system responsiveness and potential
denial-of-service scenarios.

**Mitigation:**

- The Xen credit scheduler limits guest vCPU execution time, providing basic
  guest rate-limiting.
- Batch multiple commands of the same type to reduce overhead on the virtual
  SMMUv3 hardware emulation.
- Implement restart and continuation support for vIOMMU command execution.
8. Observation:
---------------
Some guest commands issued towards the vIOMMU are propagated to the pIOMMU
command queue (e.g. TLB invalidate). For each pIOMMU, only one command queue is
available for all domains.

**Risk:**
Excessive command requests from an abusive guest can flood the physical IOMMU
command queue, leading to degraded pIOMMU responsiveness for commands issued
by other guests.

**Mitigation:**

- The Xen credit scheduler limits guest vCPU execution time, providing basic
  guest rate-limiting.
- Batch commands which should be propagated towards the pIOMMU command queue
  and enable support for batch execution pause/continuation.
- If possible, implement domain penalization by adding a per-domain cost
  counter for vIOMMU/pIOMMU usage.
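A per-domain cost counter could take the form of a simple credit budget that
is charged for every command forwarded to the physical queue and refilled
periodically. The sketch below only illustrates that idea; the structure, the
function names and the refill policy are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-domain accounting for physical command queue usage. */
struct dom_iommu_budget {
    uint32_t credits;  /* commands the domain may still forward this period */
    uint32_t refill;   /* credits restored at the start of each period */
};

/*
 * Charge `cost` credits. Returns false when the domain has exhausted its
 * budget, in which case the command should be deferred, not forwarded.
 */
static bool budget_charge(struct dom_iommu_budget *b, uint32_t cost)
{
    assert(b != NULL);

    if (b->credits < cost)
        return false;

    b->credits -= cost;
    return true;
}

/* Called once per accounting period, e.g. from a timer. */
static void budget_refill(struct dom_iommu_budget *b)
{
    assert(b != NULL);
    b->credits = b->refill;
}
```

Deferred commands stay in the guest's virtual queue, so an abusive guest
mostly delays itself while other domains keep access to the shared queue.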

9. Observation:
---------------
The vIOMMU feature includes an event queue used for forwarding IOMMU events to
the guest (e.g. translation faults, invalid stream IDs, permission errors).
A malicious guest can misconfigure its SMMU state or intentionally trigger
faults with high frequency.

**Risk:**
IOMMU events occurring with high frequency can cause Xen to flood the event
queue and disrupt scheduling with high hypervisor CPU load for event handling.

**Mitigation:**

- Implement a fail-safe state by disabling event forwarding when faults occur
  with high frequency and are not processed by the guest.
- Batch multiple events of the same type to reduce overhead on the virtual
  SMMUv3 hardware emulation.
- Consider disabling the event queue for untrusted guests.
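The fail-safe state from the first bullet could look roughly like this:
forwarding stops once too many events are outstanding and resumes only after
the guest has drained them. All names and the resume policy are assumptions
made for the sake of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical throttle state for one guest's virtual event queue. */
struct vevtq_throttle {
    uint32_t pending;  /* events forwarded but not yet consumed by the guest */
    uint32_t limit;    /* maximum outstanding events before fail-safe */
    bool forwarding;   /* false while the fail-safe state is active */
};

/* Decide whether a new event may be forwarded; may enter fail-safe. */
static bool vevtq_should_forward(struct vevtq_throttle *t)
{
    assert(t != NULL);

    if (!t->forwarding)
        return false;

    if (t->pending >= t->limit) {
        t->forwarding = false;  /* guest is not keeping up: stop forwarding */
        return false;
    }

    t->pending++;
    return true;
}

/* The guest consumed `n` events; leave fail-safe once the queue is empty. */
static void vevtq_guest_consumed(struct vevtq_throttle *t, uint32_t n)
{
    assert(t != NULL);

    t->pending -= (n < t->pending) ? n : t->pending;
    if (t->pending == 0)
        t->forwarding = true;
}
```

Events dropped while in fail-safe are lost to the guest, which mirrors what
happens on hardware when an event queue overflows.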

Performance Impact
==================

With the inclusion of IOMMU stage-1 and nested translation, performance
overhead is introduced compared to the existing stage-2 only usage in Xen.
Once mappings are established, translations should not introduce significant
overhead.
Emulated paths may introduce moderate overhead, primarily affecting device
initialization and event handling.
The performance impact depends heavily on target CPU capabilities. Testing was
performed on a Cortex-A53 based platform.
Performance is mostly impacted by emulated vIOMMU operations; results are shown
in the following table.

+-------------------------------+---------------------------------+
| vIOMMU Operation              | Execution time in guest         |
+===============================+=================================+
| Reg read                      | median: 30μs, worst-case: 250μs |
+-------------------------------+---------------------------------+
| Reg write                     | median: 35μs, worst-case: 280μs |
+-------------------------------+---------------------------------+
| Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
+-------------------------------+---------------------------------+
| Invalidate STE                | median: 450μs, worst-case: 7ms+ |
+-------------------------------+---------------------------------+

With the vIOMMU exposed to the guest, the guest OS has to initialize the IOMMU
device and configure stage-1 mappings for the devices attached to it.
The following table shows the initialization stages which impact the boot time
of a stage-1 enabled guest and compares them with a stage-1 disabled guest.

NOTE: Device probe execution time varies significantly depending on device
complexity. virtio-gpu was selected as a test case due to its extensive use of
dynamic DMA allocations and IOMMU mappings, making it a suitable candidate for
benchmarking stage-1 vIOMMU behavior.

+---------------------+-----------------------+------------------------+
| Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+=====================+=======================+========================+
| IOMMU Init          | ~25ms                 | /                      |
+---------------------+-----------------------+------------------------+
| Dev Attach / Mapping| ~220ms                | ~200ms                 |
+---------------------+-----------------------+------------------------+

For devices configured with dynamic DMA mappings, the performance of DMA 
allocate/map/unmap operations is also impacted on stage-1 enabled guests.
A dynamic DMA mapping operation invokes emulated IOMMU functions such as 
MMIO write/read and TLB invalidations.
As a reference, the following table shows performance results for runtime 
DMA operations on a virtio-gpu device.

+---------------+-------------------------+----------------------------+
| DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
+===============+=========================+============================+
| dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
+---------------+-------------------------+----------------------------+
| dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
+---------------+-------------------------+----------------------------+
| dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
+---------------+-------------------------+----------------------------+
| dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
+---------------+-------------------------+----------------------------+

Testing
=======

- QEMU-based ARM system tests for Stage-1 translation and nested 
virtualization.
- Actual hardware validation on platforms such as Renesas to ensure 
compatibility with real SMMUv3 implementations.
- Unit/Functional tests validating correct translations (not implemented).

Migration and Compatibility
===========================

This optional feature defaults to disabled (`viommu=""`) for backward 
compatibility.

References
==========

- Original feature implemented by Rahul Singh: 
https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/ 

- SMMUv3 architecture documentation
- Existing vIOMMU code patterns



* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-08-28 11:59           ` Milan Djokic
@ 2025-08-29 16:27             ` Volodymyr Babchuk
  2025-09-01 11:06               ` Milan Djokic
  0 siblings, 1 reply; 40+ messages in thread
From: Volodymyr Babchuk @ 2025-08-29 16:27 UTC (permalink / raw)
  To: Milan Djokic
  Cc: Julien Grall, Julien Grall, xen-devel@lists.xenproject.org,
	Bertrand Marquis, Rahul Singh, Stefano Stabellini, Michal Orzel,
	Jan Beulich, Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

Hi Milan,

Thanks, the "Security Considerations" section looks really good. But I have
more questions.

Milan Djokic <milan_djokic@epam.com> writes:

> Hello Julien, Volodymyr
>
> On 8/27/25 01:28, Volodymyr Babchuk wrote:
>> Hi Milan,
>> Milan Djokic <milan_djokic@epam.com> writes:
>> 
>>> Hello Julien,
>>>
>>> On 8/13/25 14:11, Julien Grall wrote:
>>>> On 13/08/2025 11:04, Milan Djokic wrote:
>>>>> Hello Julien,
>>>> Hi Milan,
>>>>
>>>>>
>>>>> We have prepared a design document and it will be part of the updated
>>>>> patch series (added in docs/design). I'll also extend cover letter with
>>>>> details on implementation structure to make review easier.
>>>> I would suggest to just iterate on the design document for now.
>>>>
>>>>> Following is the design document content which will be provided in
>>>>> updated patch series:
>>>>>
>>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>>>> ==========================================================
>>>>>
>>>>> Author: Milan Djokic <milan_djokic@epam.com>
>>>>> Date:   2025-08-07
>>>>> Status: Draft
>>>>>
>>>>> Introduction
>>>>> ------------
>>>>>
>>>>> The SMMUv3 supports two stages of translation. Each stage of translation
>>>>> can be independently enabled. An incoming address is logically
>>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>>>>> which translates the IPA to the output PA. Stage 1 translation support
>>>>> is required to provide isolation between different devices within the OS.
>>>>>
>>>>> Xen already supports Stage 2 translation but there is no support for
>>>>> Stage 1 translation. This design proposal outlines the introduction of
>>>>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>>>>
>>>>> Motivation
>>>>> ----------
>>>>>
>>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>>>>> ensure correct and secure DMA behavior inside guests.
>>>> Can you clarify what you mean by "correct"? DMA would still work
>>>> without
>>>> stage-1.
>>>
>>> Correct in terms of working with guest managed I/O space. I'll
>>> rephrase this statement, it seems ambiguous.
>>>
>>>>>
>>>>> This feature enables:
>>>>> - Stage-1 translation in guest domain
>>>>> - Safe device passthrough under secure memory translation
>>>>>
>>>>> Design Overview
>>>>> ---------------
>>>>>
>>>>> These changes provide emulated SMMUv3 support:
>>>>>
>>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>>>>      SMMUv3 driver
>>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>>>> So what are you planning to expose to a guest? Is it one vIOMMU per
>>>> pIOMMU? Or a single one?
>>>
>>> Single vIOMMU model is used in this design.
>>>
>>>> Have you considered the pros/cons for both?
>>>>> - Register/Command Emulation: SMMUv3 register emulation and command
>>>>>      queue handling
>>>>
>>>
>>> That's a point for consideration.
>>> single vIOMMU prevails in terms of less complex implementation and a
>>> simple guest IOMMU model - single vIOMMU node, one interrupt path,
>>> event queue, single set of trap handlers for emulation, etc.
>>> Cons for a single vIOMMU model could be less accurate hw
>>> representation and a potential bottleneck with one emulated queue and
>>> interrupt path.
>>> On the other hand, vIOMMU per pIOMMU provides more accurate hw
>>> modeling and offers better scalability in case of many IOMMUs in the
>>> system, but this comes with more complex emulation logic and device
>>> tree, also handling multiple vIOMMUs on guest side.
>>> IMO, single vIOMMU model seems like a better option mostly because
>>> it's less complex, easier to maintain and debug. Of course, this
>>> decision can and should be discussed.
>>>
>> Well, I am not sure that this is possible, because of StreamID
>> allocation. The biggest offender is of course PCI, as each Root PCI
>> bridge will require own SMMU instance with own StreamID space. But even
>> without PCI you'll need some mechanism to map vStreamID to
>> <pSMMU, pStreamID>, because there will be overlaps in SID space.
>> Actually, PCI/vPCI with vSMMU is its own can of worms...
>> 
>>>> For each pSMMU, we have a single command queue that will receive command
>>>> from all the guests. How do you plan to prevent a guest hogging the
>>>> command queue?
>>>> In addition to that, AFAIU, the size of the virtual command queue is
>>>> fixed by the guest rather than Xen. If a guest is filling up the queue
>>>> with commands before notifying Xen, how do you plan to ensure we don't
>>>> spend too much time in Xen (which is not preemptible)?
>>>>
>>>
>>> We'll have to do a detailed analysis on these scenarios, they are not
>>> covered by the design (as well as some others which is clear after
>>> your comments). I'll come back with an updated design.
>> I think that can be handled akin to hypercall continuation, which is
>> used in similar places, like P2M code
>> [...]
>> 
>
> I have updated vIOMMU design document with additional security topics
> covered and performance impact results. Also added some additional
> explanations for vIOMMU components following your comments.
> Updated document content:
>
> ==========================================================
> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
> ==========================================================
>
> :Author:     Milan Djokic <milan_djokic@epam.com>
> :Date:       2025-08-07
> :Status:     Draft
>
> Introduction
> ============
>
> The SMMUv3 supports two stages of translation. Each stage of
> translation can be independently enabled. An incoming address is
> logically translated from VA to IPA in stage 1, then the IPA is input
> to stage 2 which translates the IPA to the output PA. Stage 1
> translation support is required to provide isolation between different
> devices within the OS. Xen already supports Stage 2 translation but
> there is no support for Stage 1 translation.
> This design proposal outlines the introduction of Stage-1 SMMUv3
> support in Xen for ARM guests.
>
> Motivation
> ==========
>
> ARM systems utilizing SMMUv3 require stage-1 address translation to
> ensure secure DMA and guest managed I/O memory mappings.

It is unclear to me what you mean by "guest managed I/O memory mappings",
could you please provide an example?

> This feature enables:
>
> - Stage-1 translation in guest domain
> - Safe device passthrough under secure memory translation
>

As I see it, ARM specs use "secure" mostly when referring to Secure mode
(S-EL1, S-EL2, EL3) and the associated secure counterparts of architectural
devices, like secure GIC, secure Timer, etc. So I would probably avoid
this word here to reduce confusion.

> Design Overview
> ===============
>
> These changes provide emulated SMMUv3 support:
>
> - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation
>     support in SMMUv3 driver.

"Nested translation" as in "nested virtualization"? Or is this something else?

> - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
>     handling.

I think this is the big topic. You see, apart from SMMU, there is
at least the Renesas IP-MMU, which uses a completely different API. And
other IO-MMU implementations are probably possible. Right now the
vIOMMU framework handles only SMMU, which is okay, but we should
probably design it in such a way that other IO-MMUs can be supported as
well. Maybe even IO-MMUs for other architectures (RISC-V maybe?).

> - **Register/Command Emulation**: SMMUv3 register emulation and
>     command queue handling.

Continuing the previous paragraph: what about other IO-MMUs? For example,
if the platform provides only the Renesas IO-MMU, will the vIOMMU framework
still emulate SMMUv3 registers and queue handling?

> - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes
>     to device trees for dom0 and dom0less scenarios.
> - **Runtime Configuration**: Introduces a `viommu` boot parameter for
>     dynamic enablement.
>
> vIOMMU is exposed to guest as a single device with predefined
> capabilities and commands supported. Single vIOMMU model abstracts the
> details of an actual IOMMU hardware, simplifying usage from the guest
> point of view. Guest OS handles only a single IOMMU, even if multiple
> IOMMU units are available on the host system.

In the previous email I asked how you are planning to handle potential
SID overlaps, especially in the PCI use case. I want to return to this
topic. I am not saying that this is impossible, but I'd like to see this
covered in the design document.

>
> Security Considerations
> =======================
>
> **viommu security benefits:**
>
> - Stage-1 translation ensures guest devices cannot perform unauthorized DMA.
> - Emulated IOMMU removes guest dependency on IOMMU hardware while
>   maintaining domains isolation.

I am not sure that I got this paragraph. 

>
>
> 1. Observation:
> ---------------
> Support for Stage-1 translation in SMMUv3 introduces new data
> structures (`s1_cfg` alongside `s2_cfg`) and logic to write both
> Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including
> an `abort` field to handle partial configuration states.
>
> **Risk:**
> Without proper handling, a partially applied Stage-1 configuration
> might leave guest DMA mappings in an inconsistent state, potentially
> enabling unauthorized access or causing cross-domain interference.
>
> **Mitigation:** *(Handled by design)*
> This feature introduces logic that writes both `s1_cfg` and `s2_cfg`
> to the STE and manages the `abort` field, only considering the Stage-1
> configuration if it is fully attached. This ensures incomplete or
> invalid guest configurations are safely ignored by the hypervisor.
>
> 2. Observation:
> ---------------
> Guests can now invalidate Stage-1 caches; invalidation needs
> forwarding to SMMUv3 hardware to maintain coherence.
>
> **Risk:**
> Failing to propagate cache invalidation could allow stale mappings,
> enabling access to old mappings and possibly data leakage or
> misrouting.
>
> **Mitigation:** *(Handled by design)*
> This feature ensures that guest-initiated invalidations are correctly
> forwarded to the hardware, preserving IOMMU coherency.
>
> 3. Observation:
> ---------------
> This design introduces substantial new functionality, including the
> `vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command
> queues, event queues, domain management, and Device Tree modifications
> (e.g., `iommus` nodes and `libxl` integration).
>
> **Risk:**
> Large feature expansions increase the attack surface—potential for
> race conditions, unchecked command inputs, or Device Tree-based
> misconfigurations.
>
> **Mitigation:**
>
> - Sanity checks and error-handling improvements have been introduced
>   in this feature.
> - Further audits have to be performed for this feature and its
>   dependencies in this area. Currently, feature is marked as *Tech
>   Preview* and is self-contained, reducing the risk to unrelated
>  components.
>
> 4. Observation:
> ---------------
> The code includes transformations to handle nested translation versus
> standard modes and uses guest-configured command queues (e.g.,
> `CMD_CFGI_STE`) and event notifications.
>
> **Risk:**
> Malicious or malformed queue commands from guests could bypass
> validation, manipulate SMMUv3 state, or cause Dom0 instability.

Only Dom0?

>
> **Mitigation:** *(Handled by design)*
> Built-in validation of command queue entries and sanitization
> mechanisms ensure only permitted configurations are applied. This is
> supported via additions in `vsmmuv3` and `cmdqueue` handling code.
>
> 5. Observation:
> ---------------
> Device Tree modifications enable device assignment and
> configuration—guest DT fragments (e.g., `iommus`) are added via
> `libxl`.
>
> **Risk:**
> Erroneous or malicious Device Tree injection could result in device
> misbinding or guest access to unauthorized hardware.
>
> **Mitigation:**
>
> - `libxl` performs checks of the guest configuration and parses only
>   predefined DT fragments and nodes, reducing risk.
> - The system integrator must ensure correct resource mapping in the
>   guest Device Tree (DT) fragments.
>
> 6. Observation:
> ---------------
> Introducing optional per-guest enabled features (`viommu` argument in
> xl guest config) means some guests may opt-out.
>
> **Risk:**
> Differences between guests with and without `viommu` may cause
> unexpected behavior or privilege drift.
>
> **Mitigation:**
> Verify that downgrade paths are safe and well-isolated; ensure missing
> support doesn't cause security issues. Additional audits on emulation
> paths and domains interference need to be performed in a multi-guest
> environment.
>
> 7. Observation:
> ---------------
> Guests have the ability to issue Stage-1 IOMMU commands like cache
> invalidation, stream table entries configuration, etc. An adversarial
> guest may issue a high volume of commands in rapid succession.
>
> **Risk**
> Excessive command requests can cause high hypervisor CPU consumption
> and disrupt scheduling, leading to degraded system responsiveness and
> potential denial-of-service scenarios.
>
> **Mitigation**
>
> - Xen credit scheduler limits guest vCPU execution time, securing
>   basic guest rate-limiting.

I don't think this feature is available only in the credit scheduler;
AFAIK, all schedulers except the null scheduler limit vCPU execution time.

> - Batch multiple commands of same type to reduce overhead on the
>   virtual SMMUv3 hardware emulation.
> - Implement vIOMMU commands execution restart and continuation support

So, something like "hypercall continuation"?

>
> 8. Observation:
> ---------------
> Some guest commands issued towards vIOMMU are propagated to pIOMMU
> command queue (e.g. TLB invalidate). For each pIOMMU, only one command
> queue is available for all domains.
>
> **Risk**
> Excessive command requests from an abusive guest can flood the
> physical IOMMU command queue, leading to degraded pIOMMU responsiveness
> to commands issued from other guests.
>
> **Mitigation**
>
> - Xen credit scheduler limits guest vCPU execution time, securing
>   basic guest rate-limiting.
> - Batch commands which should be propagated towards pIOMMU cmd queue
>   and enable support for batch execution pause/continuation
> - If possible, implement domain penalization by adding a per-domain
>   cost counter for vIOMMU/pIOMMU usage.
>
> 9. Observation:
> ---------------
> vIOMMU feature includes event queue used for forwarding IOMMU events
> to guest (e.g. translation faults, invalid stream IDs, permission
> errors). A malicious guest can misconfigure its SMMU state or
> intentionally trigger faults with high frequency.
>
> **Risk**
> Occurrence of IOMMU events with high frequency can cause Xen to flood
> the event queue and disrupt scheduling with high hypervisor CPU load
> for event handling.
>
> **Mitigation**
>
> - Implement fail-safe state by disabling event forwarding when faults
>   occur with high frequency and are not processed by the guest.
> - Batch multiple events of same type to reduce overhead on the virtual
>   SMMUv3 hardware emulation.
> - Consider disabling event queue for untrusted guests
>
> Performance Impact
> ==================
>
> With iommu stage-1 and nested translation inclusion, performance
> overhead is introduced comparing to existing, stage-2 only usage in
> Xen.
> Once mappings are established, translations should not introduce
> significant overhead.
> Emulated paths may introduce moderate overhead, primarily affecting
> device initialization and event handling.
> Performance impact highly depends on target CPU capabilities. Testing
> is performed on cortex-a53 based platform.

Which platform exactly? While QEMU emulates SMMU to some extent, we are
observing somewhat different SMMU behavior on real HW platforms (mostly
due to cache coherence problems). Also, according to MMU-600 errata, it
can have lower than expected performance in some use-cases.

> Performance is mostly impacted by emulated vIOMMU operations, results
> shown in the following table.
>
> +-------------------------------+---------------------------------+
> | vIOMMU Operation              | Execution time in guest         |
> +===============================+=================================+
> | Reg read                      | median: 30μs, worst-case: 250μs |
> +-------------------------------+---------------------------------+
> | Reg write                     | median: 35μs, worst-case: 280μs |
> +-------------------------------+---------------------------------+
> | Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
> +-------------------------------+---------------------------------+
> | Invalidate STE                | median: 450μs, worst-case: 7ms+ |
> +-------------------------------+---------------------------------+
>
> With vIOMMU exposed to guest, guest OS has to initialize IOMMU device
> and configure stage-1 mappings for devices attached to it.
> Following table shows initialization stages which impact stage-1
> enabled guest boot time and compares it with stage-1 disabled guest.
>
> "NOTE: Device probe execution time varies significantly depending on
> device complexity. virtio-gpu was selected as a test case due to its
> extensive use of dynamic DMA allocations and IOMMU mappings, making it
> a suitable candidate for benchmarking stage-1 vIOMMU behavior."
>
> +---------------------+-----------------------+------------------------+
> | Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
> +=====================+=======================+========================+
> | IOMMU Init          | ~25ms                 | /                      |
> +---------------------+-----------------------+------------------------+
> | Dev Attach / Mapping| ~220ms                | ~200ms                 |
> +---------------------+-----------------------+------------------------+
>
> For devices configured with dynamic DMA mappings, DMA
> allocate/map/unmap operations performance is also impacted on stage-1
> enabled guests.
> Dynamic DMA mapping operation issues emulated IOMMU functions like
> mmio write/read and TLB invalidations.
> As a reference, following table shows performance results for runtime
> dma operations for virtio-gpu device.
>
> +---------------+-------------------------+----------------------------+
> | DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
> +===============+=========================+============================+
> | dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
> +---------------+-------------------------+----------------------------+
> | dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
> +---------------+-------------------------+----------------------------+
> | dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
> +---------------+-------------------------+----------------------------+
> | dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
> +---------------+-------------------------+----------------------------+
>
> Testing
> =======
>
> - QEMU-based ARM system tests for Stage-1 translation and nested
>   virtualization.
> - Actual hardware validation on platforms such as Renesas to ensure
>   compatibility with real SMMUv3 implementations.
> - Unit/Functional tests validating correct translations (not implemented).
>
> Migration and Compatibility
> ===========================
>
> This optional feature defaults to disabled (`viommu=""`) for backward
> compatibility.
>

-- 
WBR, Volodymyr


* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-08-29 16:27             ` Volodymyr Babchuk
@ 2025-09-01 11:06               ` Milan Djokic
  2025-11-03 13:16                 ` Milan Djokic
  0 siblings, 1 reply; 40+ messages in thread
From: Milan Djokic @ 2025-09-01 11:06 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Julien Grall, Julien Grall, xen-devel@lists.xenproject.org,
	Bertrand Marquis, Rahul Singh, Stefano Stabellini, Michal Orzel,
	Jan Beulich, Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

Hi Volodymyr,

On 8/29/25 18:27, Volodymyr Babchuk wrote:
> Hi Milan,
> 
> Thanks, "Security Considerations" sections looks really good. But I have
> more questions.
> 
> Milan Djokic <milan_djokic@epam.com> writes:
> 
>> Hello Julien, Volodymyr
>>
>> On 8/27/25 01:28, Volodymyr Babchuk wrote:
>>> Hi Milan,
>>> Milan Djokic <milan_djokic@epam.com> writes:
>>>
>>>> Hello Julien,
>>>>
>>>> On 8/13/25 14:11, Julien Grall wrote:
>>>>> On 13/08/2025 11:04, Milan Djokic wrote:
>>>>>> Hello Julien,
>>>>> Hi Milan,
>>>>>
>>>>>>
>>>>>> We have prepared a design document and it will be part of the updated
>>>>>> patch series (added in docs/design). I'll also extend cover letter with
>>>>>> details on implementation structure to make review easier.
>>>>> I would suggest to just iterate on the design document for now.
>>>>>
>>>>>> Following is the design document content which will be provided in
>>>>>> updated patch series:
>>>>>>
>>>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>>>>> ==========================================================
>>>>>>
>>>>>> Author: Milan Djokic <milan_djokic@epam.com>
>>>>>> Date:   2025-08-07
>>>>>> Status: Draft
>>>>>>
>>>>>> Introduction
>>>>>> ------------
>>>>>>
>>>>>> The SMMUv3 supports two stages of translation. Each stage of translation
>>>>>> can be independently enabled. An incoming address is logically
>>>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>>>>>> which translates the IPA to the output PA. Stage 1 translation support
>>>>>> is required to provide isolation between different devices within the OS.
>>>>>>
>>>>>> Xen already supports Stage 2 translation but there is no support for
>>>>>> Stage 1 translation. This design proposal outlines the introduction of
>>>>>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>>>>>
>>>>>> Motivation
>>>>>> ----------
>>>>>>
>>>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>>>>>> ensure correct and secure DMA behavior inside guests.
>>>>> Can you clarify what you mean by "correct"? DMA would still work
>>>>> without
>>>>> stage-1.
>>>>
>>>> Correct in terms of working with guest managed I/O space. I'll
>>>> rephrase this statement, it seems ambiguous.
>>>>
>>>>>>
>>>>>> This feature enables:
>>>>>> - Stage-1 translation in guest domain
>>>>>> - Safe device passthrough under secure memory translation
>>>>>>
>>>>>> Design Overview
>>>>>> ---------------
>>>>>>
>>>>>> These changes provide emulated SMMUv3 support:
>>>>>>
>>>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>>>>>       SMMUv3 driver
>>>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>>>>> So what are you planning to expose to a guest? Is it one vIOMMU per
>>>>> pIOMMU? Or a single one?
>>>>
>>>> Single vIOMMU model is used in this design.
>>>>
>>>>> Have you considered the pros/cons for both?
>>>>>> - Register/Command Emulation: SMMUv3 register emulation and command
>>>>>>       queue handling
>>>>>
>>>>
>>>> That's a point for consideration.
>>>> single vIOMMU prevails in terms of less complex implementation and a
>>>> simple guest IOMMU model - single vIOMMU node, one interrupt path,
>>>> event queue, single set of trap handlers for emulation, etc.
>>>> Cons for a single vIOMMU model could be less accurate hw
>>>> representation and a potential bottleneck with one emulated queue and
>>>> interrupt path.
>>>> On the other hand, vIOMMU per pIOMMU provides more accurate hw
>>>> modeling and offers better scalability in case of many IOMMUs in the
>>>> system, but this comes with more complex emulation logic and device
>>>> tree, also handling multiple vIOMMUs on guest side.
>>>> IMO, single vIOMMU model seems like a better option mostly because
>>>> it's less complex, easier to maintain and debug. Of course, this
>>>> decision can and should be discussed.
>>>>
>>> Well, I am not sure that this is possible, because of StreamID
>>> allocation. The biggest offender is of course PCI, as each Root PCI
>>> bridge will require own SMMU instance with own StreamID space. But even
>>> without PCI you'll need some mechanism to map vStreamID to
>>> <pSMMU, pStreamID>, because there will be overlaps in SID space.
>>> Actually, PCI/vPCI with vSMMU is its own can of worms...
>>>
>>>>> For each pSMMU, we have a single command queue that will receive command
>>>>> from all the guests. How do you plan to prevent a guest hogging the
>>>>> command queue?
>>>>> In addition to that, AFAIU, the size of the virtual command queue is
>>>>> fixed by the guest rather than Xen. If a guest is filling up the queue
>>>>> with commands before notifying Xen, how do you plan to ensure we don't
>>>>> spend too much time in Xen (which is not preemptible)?
>>>>>
>>>>
>>>> We'll have to do a detailed analysis on these scenarios, they are not
>>>> covered by the design (as well as some others which is clear after
>>>> your comments). I'll come back with an updated design.
>>> I think that can be handled akin to hypercall continuation, which is
>>> used in similar places, like P2M code
>>> [...]
>>>
>>
>> I have updated vIOMMU design document with additional security topics
>> covered and performance impact results. Also added some additional
>> explanations for vIOMMU components following your comments.
>> Updated document content:
>>
>> ==========================================================
>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>> ==========================================================
>>
>> :Author:     Milan Djokic <milan_djokic@epam.com>
>> :Date:       2025-08-07
>> :Status:     Draft
>>
>> Introduction
>> ============
>>
>> The SMMUv3 supports two stages of translation. Each stage of
>> translation can be independently enabled. An incoming address is
>> logically translated from VA to IPA in stage 1, then the IPA is input
>> to stage 2 which translates the IPA to the output PA. Stage 1
>> translation support is required to provide isolation between different
>> devices within the OS. Xen already supports Stage 2 translation but
>> there is no support for Stage 1 translation.
>> This design proposal outlines the introduction of Stage-1 SMMUv3
>> support in Xen for ARM guests.
>>
>> Motivation
>> ==========
>>
>> ARM systems utilizing SMMUv3 require stage-1 address translation to
>> ensure secure DMA and guest managed I/O memory mappings.
> 
> It is unclear for my what you mean by "guest manged IO memory mappings",
> could you please provide an example?
> 

Basically, enabling stage-1 translation means that the guest is 
responsible for managing IOVA to IPA mappings through its own IOMMU 
driver. The guest manages its own stage-1 page tables and TLB.
For example, when a guest driver wants to perform a DMA mapping (e.g. 
with dma_map_single()), it requests a mapping of its buffer's physical 
address to an IOVA through the guest IOMMU driver. The guest IOMMU driver 
then issues mapping commands, which are emulated by Xen and translated 
into stage-2 mappings.

>> This feature enables:
>>
>> - Stage-1 translation in guest domain
>> - Safe device passthrough under secure memory translation
>>
> 
> As I see it, ARM specs use "secure" mostly when referring to Secure mode
> (S-EL1, S-EL2, EL3) and associated secure counterparts of architectural
> devices, like secure GIC, secure Timer, etc. So I'd probably don't use
> this word here to reduce confusion
> 

Sure, secure in terms of isolation is the topic here. I'll rephrase this.

>> Design Overview
>> ===============
>>
>> These changes provide emulated SMMUv3 support:
>>
>> - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation
>>      support in SMMUv3 driver.
> 
> "Nested translation" as in "nested virtualization"? Or is this something else?
> 

No, this refers to the 2-stage translation IOVA->IPA->PA as nested 
translation. That said, this feature also enables nested virtualization, 
since the guest can emulate its own IOMMU, e.g. when KVM runs in the guest.
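
To illustrate the nesting, here is a minimal toy model in C. It is not 
Xen code: the flat, page-granular tables and all toy_* names are made up 
for the example. Stage 1 is the guest-owned IOVA->IPA step, stage 2 is 
the Xen-owned IPA->PA step, and an access only succeeds if both hit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)

/* One page-granular mapping; real hardware walks multi-level tables. */
struct toy_map {
    uint64_t in_pfn;   /* input page frame number */
    uint64_t out_pfn;  /* output page frame number */
};

/* Linear lookup of addr in a toy table; returns false on a miss. */
static bool toy_lookup(const struct toy_map *tbl, size_t n,
                       uint64_t addr, uint64_t *out)
{
    assert(out != NULL && (tbl != NULL || n == 0));

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].in_pfn == (addr >> TOY_PAGE_SHIFT)) {
            /* Swap the page frame, keep the offset within the page. */
            *out = (tbl[i].out_pfn << TOY_PAGE_SHIFT) |
                   (addr & TOY_PAGE_MASK);
            return true;
        }
    }
    return false;
}

/* Nested translation: stage 1 (IOVA->IPA), then stage 2 (IPA->PA). */
bool toy_nested_translate(const struct toy_map *s1, size_t n1,
                          const struct toy_map *s2, size_t n2,
                          uint64_t iova, uint64_t *pa)
{
    uint64_t ipa;

    if (!toy_lookup(s1, n1, iova, &ipa))
        return false;                   /* stage-1 miss */
    return toy_lookup(s2, n2, ipa, pa); /* stage-2 miss or final PA */
}
```

The point of the model is only that the guest can change s1 freely, while 
the reachable PAs stay bounded by what Xen put in s2.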


>> - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
>>      handling.
> 
> I think, this is the big topic. You see, apart from SMMU, there is
> at least Renesas IP-MMU, which uses completely different API. And
> probably there are other IO-MMU implementations possible. Right now
> vIOMMU framework handles only SMMU, which is okay, but probably we
> should design it in a such way, that other IO-MMUs will be supported as
> well. Maybe even IO-MMUs for other architectures (RISC V maybe?).
> 

I think it is already designed in such a manner. We have a generic 
vIOMMU framework and a backend implementation for the target IOMMU as 
separate components. The backend implements the supported 
commands/mechanisms which are specific to the target IOMMU type. At this 
point only SMMUv3 is supported, but support for other IOMMU types can be 
implemented under the same generic framework. AFAIK, RISC-V 
IOMMU stage-2 is still at an early development stage, but I do believe 
that it will also be compatible with the vIOMMU framework.
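
As a rough sketch of that split (hypothetical toy_* names and signatures, 
not the actual interface from the series): the generic layer only knows 
an ops table, and each IOMMU type registers its own backend.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical backend interface; names are illustrative only. */
struct toy_viommu_ops {
    const char *name;                        /* IOMMU type, e.g. "smmuv3" */
    int (*domain_init)(unsigned int domid);  /* set up per-domain state */
};

#define TOY_MAX_BACKENDS 4
static const struct toy_viommu_ops *toy_backends[TOY_MAX_BACKENDS];
static size_t toy_nr_backends;

/* Generic layer: register a backend for one IOMMU type. */
int toy_viommu_register(const struct toy_viommu_ops *ops)
{
    if (ops == NULL || ops->name == NULL || ops->domain_init == NULL)
        return -1;
    if (toy_nr_backends == TOY_MAX_BACKENDS)
        return -1;
    toy_backends[toy_nr_backends++] = ops;
    return 0;
}

/* Generic layer: dispatch to the backend matching the requested type. */
int toy_viommu_domain_init(const char *type, unsigned int domid)
{
    assert(type != NULL);

    for (size_t i = 0; i < toy_nr_backends; i++)
        if (strcmp(toy_backends[i]->name, type) == 0)
            return toy_backends[i]->domain_init(domid);
    return -1; /* no backend registered for this IOMMU type */
}

/* Toy SMMUv3 backend: the only backend in this sketch. */
static int toy_vsmmuv3_domain_init(unsigned int domid)
{
    (void)domid; /* a real backend would allocate vSMMU state here */
    return 0;
}

const struct toy_viommu_ops toy_vsmmuv3_ops = {
    .name = "smmuv3",
    .domain_init = toy_vsmmuv3_domain_init,
};
```

A request for a type without a registered backend simply fails in the 
generic layer, which is where another IOMMU backend would plug in.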

>> - **Register/Command Emulation**: SMMUv3 register emulation and
>>      command queue handling.
> 
> Continuing previous paragraph: what about other IO-MMUs? For example, if
> platform provides only Renesas IO-MMU, will vIOMMU framework still
> emulate SMMUv3 registers and queue handling?
> 

Indeed, this is not supported in the current implementation. To support
an IOMMU other than SMMUv3, a stage-1 emulation backend needs to be
implemented for the target IOMMU, and the Xen driver for that IOMMU
probably has to be updated to handle stage-1 configuration. I will
elaborate on this in the design, to make it clear that we have a generic
vIOMMU framework, but only an SMMUv3 backend exists at the moment.

>> - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes
>>      to device trees for dom0 and dom0less scenarios.
>> - **Runtime Configuration**: Introduces a `viommu` boot parameter for
>>      dynamic enablement.
>>
>> vIOMMU is exposed to guest as a single device with predefined
>> capabilities and commands supported. Single vIOMMU model abstracts the
>> details of an actual IOMMU hardware, simplifying usage from the guest
>> point of view. Guest OS handles only a single IOMMU, even if multiple
>> IOMMU units are available on the host system.
> 
> In the previous email I asked how are you planning to handle potential
> SID overlaps, especially in PCI use case. I want to return to this
> topic. I am not saying that this is impossible, but I'd like to see this
> covered in the design document.
> 

Sorry, I missed this part in the previous mail. This is a valid point:
SID overlap would be an issue for the single vIOMMU model. To prevent
it, the design will have to be extended with SID namespace
virtualization, introducing a remapping layer which makes sure that
guest virtual SIDs are unique and maintains proper mappings of vSIDs to
pSIDs.
For the PCI case, we need extended remapping logic where the iommu-map
property is also patched in the guest device tree, since we need a range
of unique vSIDs for every RC assigned to the guest.
An alternative approach would be to switch to a vIOMMU-per-pIOMMU model.
Since both approaches require major updates, I'll have to do a detailed
analysis and come back with an updated design that addresses this
issue.
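
As a rough illustration of the remapping layer idea (a hypothetical
sketch under simplifying assumptions, not a proposed implementation):
each guest-visible vSID indexes a per-domain table entry that records
the physical SMMU and the pSID behind it. The names `vsid_table`,
`vsid_alloc` and `vsid_lookup`, and the fixed table size, are
assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define VSID_MAX 16  /* hypothetical per-domain limit */

struct sid_map_entry {
    int      valid;
    unsigned psmmu;  /* index of the physical SMMU instance */
    uint32_t psid;   /* physical StreamID on that SMMU */
};

struct vsid_table {
    struct sid_map_entry e[VSID_MAX];
};

/* Hand out the lowest free vSID for <psmmu, psid>; -1 if the table is full. */
int vsid_alloc(struct vsid_table *t, unsigned psmmu, uint32_t psid)
{
    for (int v = 0; v < VSID_MAX; v++) {
        if (!t->e[v].valid) {
            t->e[v].valid = 1;
            t->e[v].psmmu = psmmu;
            t->e[v].psid = psid;
            return v;
        }
    }
    return -1;
}

/* Resolve a guest vSID to its physical mapping; NULL if it is not assigned. */
const struct sid_map_entry *vsid_lookup(const struct vsid_table *t,
                                        uint32_t vsid)
{
    if (vsid >= VSID_MAX || !t->e[vsid].valid)
        return NULL;
    return &t->e[vsid];
}
```

Two devices with the same pSID behind different physical SMMUs get
distinct vSIDs here, which is the property the single vIOMMU model
needs.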


>>
>> Security Considerations
>> =======================
>>
>> **viommu security benefits:**
>>
>> - Stage-1 translation ensures guest devices cannot perform unauthorized DMA.
>> - Emulated IOMMU removes guest dependency on IOMMU hardware while
>>    maintaining domains isolation.
> 
> I am not sure that I got this paragraph.
> 

The first one refers to guest-controlled DMA access. Only IOVA->IPA
mappings created by the guest are usable by the device when stage-1 is
enabled. On the other hand, with only stage-2 enabled, the device could
access the complete IOVA->PA mapping created by Xen for the guest. Since
the guest has no control over device IOVA accesses, a malicious guest
kernel could potentially access memory regions it shouldn't be allowed
to, e.g. if stage-2 mappings are stale. With stage-1 enabled, the guest
device driver has to explicitly map IOVAs and this request is propagated
through the emulated IOMMU, making sure that IOVA mappings are valid at
all times.
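
A toy model of the nested IOVA->IPA->PA walk may help here. It is only
a sketch: real stage-1/stage-2 tables are multi-level page tables
walked by the SMMU hardware, whereas this example uses flat arrays, and
`page_map`, `lookup` and `nested_translate` are names invented for the
illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NO_MAPPING UINT64_MAX  /* stands in for a translation fault */

/* One page-granular mapping: input frame number -> output frame number. */
struct page_map {
    uint64_t in_pfn;
    uint64_t out_pfn;
};

/* Translate 'addr' through one stage; NO_MAPPING if no entry covers it. */
static uint64_t lookup(const struct page_map *m, size_t n, uint64_t addr)
{
    uint64_t offset = addr & ((1ULL << PAGE_SHIFT) - 1);

    for (size_t i = 0; i < n; i++)
        if (m[i].in_pfn == (addr >> PAGE_SHIFT))
            return (m[i].out_pfn << PAGE_SHIFT) | offset;
    return NO_MAPPING;
}

/* Stage 1 (guest-owned, IOVA->IPA) followed by stage 2 (Xen-owned, IPA->PA). */
uint64_t nested_translate(const struct page_map *s1, size_t n1,
                          const struct page_map *s2, size_t n2,
                          uint64_t iova)
{
    uint64_t ipa = lookup(s1, n1, iova);

    if (ipa == NO_MAPPING)
        return NO_MAPPING;  /* the guest never mapped this IOVA */
    return lookup(s2, n2, ipa);
}
```

An address that stage 2 alone would translate is still rejected when
the guest has not created a stage-1 mapping for it.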

The second claim means that with an emulated IOMMU, guests don't need
direct access to physical IOMMU hardware. The hypervisor emulates IOMMU
behavior for the guest, while still ensuring that memory access by
devices remains properly isolated between guests, just like it would
with real IOMMU hardware.

>>
>>
>> 1. Observation:
>> ---------------
>> Support for Stage-1 translation in SMMUv3 introduces new data
>> structures (`s1_cfg` alongside `s2_cfg`) and logic to write both
>> Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including
>> an `abort` field to handle partial configuration states.
>>
>> **Risk:**
>> Without proper handling, a partially applied Stage-1 configuration
>> might leave guest DMA mappings in an inconsistent state, potentially
>> enabling unauthorized access or causing cross-domain interference.
>>
>> **Mitigation:** *(Handled by design)*
>> This feature introduces logic that writes both `s1_cfg` and `s2_cfg`
>> to STE and manages the `abort` field, only considering Stage-1
>> configuration if fully attached. This ensures incomplete or invalid
>> guest configurations are safely ignored by the hypervisor.
>>
>> 2. Observation:
>> ---------------
>> Guests can now invalidate Stage-1 caches; invalidation needs
>> forwarding to SMMUv3 hardware to maintain coherence.
>>
>> **Risk:**
>> Failing to propagate cache invalidation could allow stale mappings,
>> enabling access to old mappings and possibly data leakage or
>> misrouting.
>>
>> **Mitigation:** *(Handled by design)*
>> This feature ensures that guest-initiated invalidations are correctly
>> forwarded to the hardware, preserving IOMMU coherency.
>>
>> 3. Observation:
>> ---------------
>> This design introduces substantial new functionality, including the
>> `vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command
>> queues, event queues, domain management, and Device Tree modifications
>> (e.g., `iommus` nodes and `libxl` integration).
>>
>> **Risk:**
>> Large feature expansions increase the attack surface—potential for
>> race conditions, unchecked command inputs, or Device Tree-based
>> misconfigurations.
>>
>> **Mitigation:**
>>
>> - Sanity checks and error-handling improvements have been introduced
>>    in this feature.
>> - Further audits have to be performed for this feature and its
>>    dependencies in this area. Currently, feature is marked as *Tech
>>    Preview* and is self-contained, reducing the risk to unrelated
>>   components.
>>
>> 4. Observation:
>> ---------------
>> The code includes transformations to handle nested translation versus
>> standard modes and uses guest-configured command queues (e.g.,
>> `CMD_CFGI_STE`) and event notifications.
>>
>> **Risk:**
>> Malicious or malformed queue commands from guests could bypass
>> validation, manipulate SMMUv3 state, or cause Dom0 instability.
> 
> Only Dom0?
> 

This is a mistake, the whole system could be affected. I'll fix this.

>>
>> **Mitigation:** *(Handled by design)*
>> Built-in validation of command queue entries and sanitization
>> mechanisms ensure only permitted configurations are applied. This is
>> supported via additions in `vsmmuv3` and `cmdqueue` handling code.
>>
>> 5. Observation:
>> ---------------
>> Device Tree modifications enable device assignment and
>> configuration—guest DT fragments (e.g., `iommus`) are added via
>> `libxl`.
>>
>> **Risk:**
>> Erroneous or malicious Device Tree injection could result in device
>> misbinding or guest access to unauthorized hardware.
>>
>> **Mitigation:**
>>
>> - `libxl` performs checks of the guest configuration and parses only
>>    predefined DT fragments and nodes, reducing risk.
>> - The system integrator must ensure correct resource mapping in the
>>    guest Device Tree (DT) fragments.
>>
>> 6. Observation:
>> ---------------
>> Introducing optional per-guest enabled features (`viommu` argument in
>> xl guest config) means some guests may opt-out.
>>
>> **Risk:**
>> Differences between guests with and without `viommu` may cause
>> unexpected behavior or privilege drift.
>>
>> **Mitigation:**
>> Verify that downgrade paths are safe and well-isolated; ensure missing
>> support doesn't cause security issues. Additional audits on emulation
>> paths and domains interference need to be performed in a multi-guest
>> environment.
>>
>> 7. Observation:
>> ---------------
>> Guests have the ability to issue Stage-1 IOMMU commands like cache
>> invalidation, stream table entries configuration, etc. An adversarial
>> guest may issue a high volume of commands in rapid succession.
>>
>> **Risk**
>> Excessive command requests can cause high hypervisor CPU consumption
>> and disrupt scheduling, leading to degraded system responsiveness and
>> potential denial-of-service scenarios.
>>
>> **Mitigation**
>>
>> - Xen credit scheduler limits guest vCPU execution time, securing
>>    basic guest rate-limiting.
> 
> I don't think that this feature is available only in credit schedulers,
> AFAIK, all schedulers except null scheduler will limit vCPU execution time.
> 

I was not aware of that. I'll rephrase this part.

>> - Batch multiple commands of same type to reduce overhead on the
>>    virtual SMMUv3 hardware emulation.
>> - Implement vIOMMU commands execution restart and continuation support
> 
> So, something like "hypercall continuation"?
> 

Yes
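
A minimal sketch of that idea, assuming a simplified queue model: the
emulation handles at most a fixed budget of guest commands per
invocation and reports whether work remains, so the caller can
reschedule the rest. `CMDQ_BUDGET`, `struct vcmdq` and `vcmdq_process`
are hypothetical names, and the real SMMUv3 PROD/CONS registers also
carry a wrap bit that is omitted here.

```c
#include <stdint.h>

#define CMDQ_BUDGET 4  /* hypothetical per-invocation command limit */

struct vcmdq {
    uint32_t prod;  /* producer index written by the guest */
    uint32_t cons;  /* consumer index advanced by the emulation */
    uint32_t size;  /* number of entries, must be a power of two */
};

/*
 * Consume at most CMDQ_BUDGET commands. Stores the number handled in
 * '*handled' and returns 1 if commands are still pending, 0 otherwise.
 */
int vcmdq_process(struct vcmdq *q, unsigned *handled)
{
    unsigned n = 0;

    while (q->cons != q->prod && n < CMDQ_BUDGET) {
        /* Emulation of the command at index q->cons would happen here. */
        q->cons = (q->cons + 1) & (q->size - 1);
        n++;
    }

    *handled = n;
    return q->cons != q->prod;
}
```

A non-zero return value would be the point where the remaining commands
are deferred instead of being processed in one long, non-preemptible
run.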

>>
>> 8. Observation:
>> ---------------
>> Some guest commands issued towards vIOMMU are propagated to pIOMMU
>> command queue (e.g. TLB invalidate). For each pIOMMU, only one command
>> queue is available for all domains.
>>
>> **Risk**
>> Excessive command requests from an abusive guest can flood the
>> physical IOMMU command queue, degrading pIOMMU responsiveness to
>> commands issued by other guests.
>>
>> **Mitigation**
>>
>> - Xen credit scheduler limits guest vCPU execution time, securing
>>    basic guest rate-limiting.
>> - Batch commands which should be propagated towards pIOMMU cmd queue
>>    and enable support for batch execution pause/continuation
>> - If possible, implement domain penalization by adding a per-domain
>>    cost counter for vIOMMU/pIOMMU usage.
>>
>> 9. Observation:
>> ---------------
>> vIOMMU feature includes event queue used for forwarding IOMMU events
>> to guest (e.g. translation faults, invalid stream IDs, permission
>> errors). A malicious guest can misconfigure its SMMU state or
>> intentionally trigger faults with high frequency.
>>
>> **Risk**
>> Occurrence of IOMMU events with high frequency can cause Xen to flood
>> the event queue and disrupt scheduling with high hypervisor CPU load
>> for event handling.
>>
>> **Mitigation**
>>
>> - Implement a fail-safe state by disabling event forwarding when faults
>>    occur with high frequency and are not processed by the guest.
>> - Batch multiple events of same type to reduce overhead on the virtual
>>    SMMUv3 hardware emulation.
>> - Consider disabling event queue for untrusted guests
>>
>> Performance Impact
>> ==================
>>
>> With the inclusion of IOMMU stage-1 and nested translation, performance
>> overhead is introduced compared to the existing stage-2-only usage in
>> Xen.
>> Once mappings are established, translations should not introduce
>> significant overhead.
>> Emulated paths may introduce moderate overhead, primarily affecting
>> device initialization and event handling.
>> Performance impact highly depends on target CPU capabilities. Testing
>> is performed on cortex-a53 based platform.
> 
> Which platform exactly? While QEMU emulates SMMU to some extent, we are
> observing somewhat different SMMU behavior on real HW platforms (mostly
> due to cache coherence problems). Also, according to MMU-600 errata, it
> can have lower than expected performance in some use-cases.
> 

Performance measurements were done on a QEMU-emulated Renesas platform.
I'll add some details on this.

>> Performance is mostly impacted by emulated vIOMMU operations, results
>> shown in the following table.
>>
>> +-------------------------------+---------------------------------+
>> | vIOMMU Operation              | Execution time in guest         |
>> +===============================+=================================+
>> | Reg read                      | median: 30μs, worst-case: 250μs |
>> +-------------------------------+---------------------------------+
>> | Reg write                     | median: 35μs, worst-case: 280μs |
>> +-------------------------------+---------------------------------+
>> | Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
>> +-------------------------------+---------------------------------+
>> | Invalidate STE                | median: 450μs, worst-case: 7ms+ |
>> +-------------------------------+---------------------------------+
>>
>> With vIOMMU exposed to guest, guest OS has to initialize IOMMU device
>> and configure stage-1 mappings for devices attached to it.
>> Following table shows initialization stages which impact stage-1
>> enabled guest boot time and compares it with stage-1 disabled guest.
>>
>> "NOTE: Device probe execution time varies significantly depending on
>> device complexity. virtio-gpu was selected as a test case due to its
>> extensive use of dynamic DMA allocations and IOMMU mappings, making it
>> a suitable candidate for benchmarking stage-1 vIOMMU behavior."
>>
>> +---------------------+-----------------------+------------------------+
>> | Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
>> +=====================+=======================+========================+
>> | IOMMU Init          | ~25ms                 | /                      |
>> +---------------------+-----------------------+------------------------+
>> | Dev Attach / Mapping| ~220ms                | ~200ms                 |
>> +---------------------+-----------------------+------------------------+
>>
>> For devices configured with dynamic DMA mappings, DMA
>> allocate/map/unmap operations performance is also impacted on stage-1
>> enabled guests.
>> Dynamic DMA mapping operation issues emulated IOMMU functions like
>> mmio write/read and TLB invalidations.
>> As a reference, following table shows performance results for runtime
>> dma operations for virtio-gpu device.
>>
>> +---------------+-------------------------+----------------------------+
>> | DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
>> +===============+=========================+============================+
>> | dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
>> +---------------+-------------------------+----------------------------+
>> | dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
>> +---------------+-------------------------+----------------------------+
>> | dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
>> +---------------+-------------------------+----------------------------+
>> | dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
>> +---------------+-------------------------+----------------------------+
>>
>> Testing
>> ============
>>
>> - QEMU-based ARM system tests for Stage-1 translation and nested
>>    virtualization.
>> - Actual hardware validation on platforms such as Renesas to ensure
>>    compatibility with real SMMUv3 implementations.
>> - Unit/Functional tests validating correct translations (not implemented).
>>
>> Migration and Compatibility
>> ===========================
>>
>> This optional feature defaults to disabled (`viommu=""`) for backward
>> compatibility.
>>
> 

BR,
Milan




* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-09-01 11:06               ` Milan Djokic
@ 2025-11-03 13:16                 ` Milan Djokic
  2025-11-27  2:31                   ` Volodymyr Babchuk
  2025-11-27 10:22                   ` Julien Grall
  0 siblings, 2 replies; 40+ messages in thread
From: Milan Djokic @ 2025-11-03 13:16 UTC (permalink / raw)
  To: Volodymyr Babchuk, Julien Grall
  Cc: Julien Grall, xen-devel@lists.xenproject.org, Bertrand Marquis,
	Rahul Singh, Stefano Stabellini, Michal Orzel, Jan Beulich,
	Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

On 9/1/25 13:06, Milan Djokic wrote:
> Hi Volodymyr,
> 
> On 8/29/25 18:27, Volodymyr Babchuk wrote:
>> Hi Milan,
>>
>> Thanks, "Security Considerations" sections looks really good. But I have
>> more questions.
>>
>> Milan Djokic <milan_djokic@epam.com> writes:
>>
>>> Hello Julien, Volodymyr
>>>
>>> On 8/27/25 01:28, Volodymyr Babchuk wrote:
>>>> Hi Milan,
>>>> Milan Djokic <milan_djokic@epam.com> writes:
>>>>
>>>>> Hello Julien,
>>>>>
>>>>> On 8/13/25 14:11, Julien Grall wrote:
>>>>>> On 13/08/2025 11:04, Milan Djokic wrote:
>>>>>>> Hello Julien,
>>>>>> Hi Milan,
>>>>>>
>>>>>>>
>>>>>>> We have prepared a design document and it will be part of the updated
>>>>>>> patch series (added in docs/design). I'll also extend cover letter with
>>>>>>> details on implementation structure to make review easier.
>>>>>> I would suggest to just iterate on the design document for now.
>>>>>>
>>>>>>> Following is the design document content which will be provided in
>>>>>>> updated patch series:
>>>>>>>
>>>>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>>>>>> ==========================================================
>>>>>>>
>>>>>>> Author: Milan Djokic <milan_djokic@epam.com>
>>>>>>> Date:   2025-08-07
>>>>>>> Status: Draft
>>>>>>>
>>>>>>> Introduction
>>>>>>> ------------
>>>>>>>
>>>>>>> The SMMUv3 supports two stages of translation. Each stage of translation
>>>>>>> can be independently enabled. An incoming address is logically
>>>>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>>>>>>> which translates the IPA to the output PA. Stage 1 translation support
>>>>>>> is required to provide isolation between different devices within the OS.
>>>>>>>
>>>>>>> Xen already supports Stage 2 translation but there is no support for
>>>>>>> Stage 1 translation. This design proposal outlines the introduction of
>>>>>>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>>>>>>
>>>>>>> Motivation
>>>>>>> ----------
>>>>>>>
>>>>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>>>>>>> ensure correct and secure DMA behavior inside guests.
>>>>>> Can you clarify what you mean by "correct"? DMA would still work
>>>>>> without
>>>>>> stage-1.
>>>>>
>>>>> Correct in terms of working with guest managed I/O space. I'll
>>>>> rephrase this statement, it seems ambiguous.
>>>>>
>>>>>>>
>>>>>>> This feature enables:
>>>>>>> - Stage-1 translation in guest domain
>>>>>>> - Safe device passthrough under secure memory translation
>>>>>>>
>>>>>>> Design Overview
>>>>>>> ---------------
>>>>>>>
>>>>>>> These changes provide emulated SMMUv3 support:
>>>>>>>
>>>>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>>>>>>        SMMUv3 driver
>>>>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>>>>>> So what are you planning to expose to a guest? Is it one vIOMMU per
>>>>>> pIOMMU? Or a single one?
>>>>>
>>>>> Single vIOMMU model is used in this design.
>>>>>
>>>>>> Have you considered the pros/cons for both?
>>>>>>> - Register/Command Emulation: SMMUv3 register emulation and command
>>>>>>>        queue handling
>>>>>>
>>>>>
>>>>> That's a point for consideration.
>>>>> single vIOMMU prevails in terms of less complex implementation and a
>>>>> simple guest iommu model - single vIOMMU node, one interrupt path,
>>>>> event queue, single set of trap handlers for emulation, etc.
>>>>> Cons for a single vIOMMU model could be less accurate hw
>>>>> representation and a potential bottleneck with one emulated queue and
>>>>> interrupt path.
>>>>> On the other hand, vIOMMU per pIOMMU provides more accurate hw
>>>>> modeling and offers better scalability in case of many IOMMUs in the
>>>>> system, but this comes with more complex emulation logic and device
>>>>> tree, also handling multiple vIOMMUs on guest side.
>>>>> IMO, single vIOMMU model seems like a better option mostly because
>>>>> it's less complex, easier to maintain and debug. Of course, this
>>>>> decision can and should be discussed.
>>>>>
>>>> Well, I am not sure that this is possible, because of StreamID
>>>> allocation. The biggest offender is of course PCI, as each Root PCI
>>>> bridge will require own SMMU instance with own StreamID space. But even
>>>> without PCI you'll need some mechanism to map vStreamID to
>>>> <pSMMU, pStreamID>, because there will be overlaps in SID space.
>>>> Actually, PCI/vPCI with vSMMU is its own can of worms...
>>>>
>>>>>> For each pSMMU, we have a single command queue that will receive command
>>>>>> from all the guests. How do you plan to prevent a guest hogging the
>>>>>> command queue?
>>>>>> In addition to that, AFAIU, the size of the virtual command queue is
>>>>>> fixed by the guest rather than Xen. If a guest is filling up the queue
>>>>>> with commands before notifying Xen, how do you plan to ensure we don't
>>>>>> spend too much time in Xen (which is not preemptible)?
>>>>>>
>>>>>
>>>>> We'll have to do a detailed analysis on these scenarios, they are not
>>>>> covered by the design (as well as some others which is clear after
>>>>> your comments). I'll come back with an updated design.
>>>> I think that can be handled akin to hypercall continuation, which is
>>>> used in similar places, like P2M code
>>>> [...]
>>>>
>>>
>>> I have updated vIOMMU design document with additional security topics
>>> covered and performance impact results. Also added some additional
>>> explanations for vIOMMU components following your comments.
>>> Updated document content:
>>>
>>> ==========================================================
>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>> ==========================================================
>>>
>>> :Author:     Milan Djokic <milan_djokic@epam.com>
>>> :Date:       2025-08-07
>>> :Status:     Draft
>>>
>>> Introduction
>>> ============
>>>
>>> The SMMUv3 supports two stages of translation. Each stage of
>>> translation can be independently enabled. An incoming address is
>>> logically translated from VA to IPA in stage 1, then the IPA is input
>>> to stage 2 which translates the IPA to the output PA. Stage 1
>>> translation support is required to provide isolation between different
>>> devices within OS. XEN already supports Stage 2 translation but there
>>> is no support for Stage 1 translation.
>>> This design proposal outlines the introduction of Stage-1 SMMUv3
>>> support in Xen for ARM guests.
>>>
>>> Motivation
>>> ==========
>>>
>>> ARM systems utilizing SMMUv3 require stage-1 address translation to
>>> ensure secure DMA and guest managed I/O memory mappings.
>>
>> It is unclear to me what you mean by "guest managed IO memory mappings",
>> could you please provide an example?
>>
> 
> Basically enabling stage-1 translation means that the guest is
> responsible for managing IOVA to IPA mappings through its own IOMMU
> driver. Guest manages its own stage-1 page tables and TLB.
> For example, when a guest driver wants to perform DMA mapping (e.g. with
> dma_map_single()), it will request mapping of its buffer physical
> address to IOVA through guest IOMMU driver. Guest IOMMU driver will
> further issue mapping commands emulated by Xen which translate it into
> stage-2 mappings.
> 
>>> This feature enables:
>>>
>>> - Stage-1 translation in guest domain
>>> - Safe device passthrough under secure memory translation
>>>
>>
>> As I see it, ARM specs use "secure" mostly when referring to Secure mode
>> (S-EL1, S-EL2, EL3) and associated secure counterparts of architectural
>> devices, like secure GIC, secure Timer, etc. So I'd probably not use
>> this word here, to reduce confusion.
>>
> 
> Sure, secure in terms of isolation is the topic here. I'll rephrase this
> 
>>> Design Overview
>>> ===============
>>>
>>> These changes provide emulated SMMUv3 support:
>>>
>>> - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation
>>>       support in SMMUv3 driver.
>>
>> "Nested translation" as in "nested virtualization"? Or is this something else?
>>
> 
> No, this refers to 2-stage translation IOVA->IPA->PA as a nested
> translation. Although with this feature, nested virtualization is also
> enabled since guest can emulate its own IOMMU e.g. when kvm is run in guest.
> 
> 
>>> - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
>>>       handling.
>>
>> I think, this is the big topic. You see, apart from SMMU, there is
>> at least Renesas IP-MMU, which uses completely different API. And
>> probably there are other IO-MMU implementations possible. Right now
>> vIOMMU framework handles only SMMU, which is okay, but probably we
>> should design it in such a way that other IO-MMUs will be supported as
>> well. Maybe even IO-MMUs for other architectures (RISC V maybe?).
>>
> 
> I think that it is already designed in such manner. We have a generic
> vIOMMU framework and a backend implementation for target IOMMU as
> separate components. And the backend implements supported
> commands/mechanisms which are specific for target IOMMU type. At this
> point, only SMMUv3 is supported, but it is possible to implement other
> IOMMU types support under the same generic framework. AFAIK, RISC-V
> IOMMU stage-2 is still in early development stage, but I do believe that
> it will be also compatible with vIOMMU framework.
> 
>>> - **Register/Command Emulation**: SMMUv3 register emulation and
>>>       command queue handling.
>>
>> Continuing previous paragraph: what about other IO-MMUs? For example, if
>> platform provides only Renesas IO-MMU, will vIOMMU framework still
>> emulate SMMUv3 registers and queue handling?
>>
> 
> Yes, this is not supported in current implementation. To support other
> IOMMU than SMMUv3, stage-1 emulation backend needs to be implemented for
> target IOMMU and probably Xen driver for target IOMMU has to be updated
> to handle stage-1 configuration. I will elaborate this part in the
> design, to make clear that we have a generic vIOMMU framework, but only
> SMMUv3 backend exists atm.
> 
>>> - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes
>>>       to device trees for dom0 and dom0less scenarios.
>>> - **Runtime Configuration**: Introduces a `viommu` boot parameter for
>>>       dynamic enablement.
>>>
>>> vIOMMU is exposed to guest as a single device with predefined
>>> capabilities and commands supported. Single vIOMMU model abstracts the
>>> details of an actual IOMMU hardware, simplifying usage from the guest
>>> point of view. Guest OS handles only a single IOMMU, even if multiple
>>> IOMMU units are available on the host system.
>>
>> In the previous email I asked how are you planning to handle potential
>> SID overlaps, especially in PCI use case. I want to return to this
>> topic. I am not saying that this is impossible, but I'd like to see this
>> covered in the design document.
>>
> 
> Sorry, I've missed this part in the previous mail. This is a valid point,
> SID overlapping would be an issue for a single vIOMMU model. To prevent
> it, design will have to be extended with SID namespace virtualization,
> introducing a remapping layer which will make sure that guest virtual
> SIDs are unique and maintain proper mappings of vSIDs to pSIDs.
> For PCI case, we need to have an extended remapping logic where
> iommu-map property will be also patched in the guest device tree since
> we need a range of unique vSIDs for every RC assigned to guest.
> Alternative approach would be to switch to vIOMMU per pIOMMU model.
> Since both approaches require major updates, I'll have to do a detailed
> analysis and come back with an updated design which would address this
> issue.
> 
> 
>>>
>>> Security Considerations
>>> =======================
>>>
>>> **viommu security benefits:**
>>>
>>> - Stage-1 translation ensures guest devices cannot perform unauthorized DMA.
>>> - Emulated IOMMU removes guest dependency on IOMMU hardware while
>>>     maintaining domains isolation.
>>
>> I am not sure that I got this paragraph.
>>
> 
> First one refers to guest controlled DMA access. Only IOVA->IPA mappings
> created by guest are usable by the device when stage-1 is enabled. On
> the other hand, with stage-2 only enabled, device could access to
> complete IOVA->PA mapping created by Xen for guest. Since the guest has
> no control over device IOVA accesses, a malicious guest kernel could
> potentially access memory regions it shouldn't be allowed to, e.g. if
> stage-2 mappings are stale. With stage-1 enabled, guest device driver
> has to explicitly map IOVAs and this request is propagated through
> emulated IOMMU, making sure that IOVA mappings are valid all the time.
> 
> Second claim means that with emulated IOMMU, guests don’t need direct
> access to physical IOMMU hardware. The hypervisor emulates IOMMU
> behavior for the guest, while still ensuring that memory access by
> devices remains properly isolated between guests, just like it would
> with real IOMMU hardware.
> 
>>>
>>>
>>> 1. Observation:
>>> ---------------
>>> Support for Stage-1 translation in SMMUv3 introduces new data
>>> structures (`s1_cfg` alongside `s2_cfg`) and logic to write both
>>> Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including
>>> an `abort` field to handle partial configuration states.
>>>
>>> **Risk:**
>>> Without proper handling, a partially applied Stage-1 configuration
>>> might leave guest DMA mappings in an inconsistent state, potentially
>>> enabling unauthorized access or causing cross-domain interference.
>>>
>>> **Mitigation:** *(Handled by design)*
>>> This feature introduces logic that writes both `s1_cfg` and `s2_cfg`
>>> to STE and manages the `abort` field, only considering Stage-1
>>> configuration if fully attached. This ensures incomplete or invalid
>>> guest configurations are safely ignored by the hypervisor.
>>>
>>> 2. Observation:
>>> ---------------
>>> Guests can now invalidate Stage-1 caches; invalidation needs
>>> forwarding to SMMUv3 hardware to maintain coherence.
>>>
>>> **Risk:**
>>> Failing to propagate cache invalidation could allow stale mappings,
>>> enabling access to old mappings and possibly data leakage or
>>> misrouting.
>>>
>>> **Mitigation:** *(Handled by design)*
>>> This feature ensures that guest-initiated invalidations are correctly
>>> forwarded to the hardware, preserving IOMMU coherency.
>>>
>>> 3. Observation:
>>> ---------------
>>> This design introduces substantial new functionality, including the
>>> `vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command
>>> queues, event queues, domain management, and Device Tree modifications
>>> (e.g., `iommus` nodes and `libxl` integration).
>>>
>>> **Risk:**
>>> Large feature expansions increase the attack surface—potential for
>>> race conditions, unchecked command inputs, or Device Tree-based
>>> misconfigurations.
>>>
>>> **Mitigation:**
>>>
>>> - Sanity checks and error-handling improvements have been introduced
>>>     in this feature.
>>> - Further audits have to be performed for this feature and its
>>>     dependencies in this area. Currently, feature is marked as *Tech
>>>     Preview* and is self-contained, reducing the risk to unrelated
>>>    components.
>>>
>>> 4. Observation:
>>> ---------------
>>> The code includes transformations to handle nested translation versus
>>> standard modes and uses guest-configured command queues (e.g.,
>>> `CMD_CFGI_STE`) and event notifications.
>>>
>>> **Risk:**
>>> Malicious or malformed queue commands from guests could bypass
>>> validation, manipulate SMMUv3 state, or cause Dom0 instability.
>>
>> Only Dom0?
>>
> 
> This is a mistake, the whole system could be affected. I'll fix this.
> 
>>>
>>> **Mitigation:** *(Handled by design)*
>>> Built-in validation of command queue entries and sanitization
>>> mechanisms ensure only permitted configurations are applied. This is
>>> supported via additions in `vsmmuv3` and `cmdqueue` handling code.
>>>
>>> 5. Observation:
>>> ---------------
>>> Device Tree modifications enable device assignment and
>>> configuration—guest DT fragments (e.g., `iommus`) are added via
>>> `libxl`.
>>>
>>> **Risk:**
>>> Erroneous or malicious Device Tree injection could result in device
>>> misbinding or guest access to unauthorized hardware.
>>>
>>> **Mitigation:**
>>>
>>> - `libxl` performs checks of the guest configuration and parses only
>>>     predefined DT fragments and nodes, reducing risk.
>>> - The system integrator must ensure correct resource mapping in the
>>>     guest Device Tree (DT) fragments.
>>>
>>> 6. Observation:
>>> ---------------
>>> Introducing optional per-guest enabled features (`viommu` argument in
>>> xl guest config) means some guests may opt-out.
>>>
>>> **Risk:**
>>> Differences between guests with and without `viommu` may cause
>>> unexpected behavior or privilege drift.
>>>
>>> **Mitigation:**
>>> Verify that downgrade paths are safe and well-isolated; ensure missing
>>> support doesn't cause security issues. Additional audits on emulation
>>> paths and domains interference need to be performed in a multi-guest
>>> environment.
>>>
>>> 7. Observation:
>>> ---------------
>>> Guests have the ability to issue Stage-1 IOMMU commands like cache
>>> invalidation, stream table entries configuration, etc. An adversarial
>>> guest may issue a high volume of commands in rapid succession.
>>>
>>> **Risk**
>>> Excessive commands requests can cause high hypervisor CPU consumption
>>> and disrupt scheduling, leading to degraded system responsiveness and
>>> potential denial-of-service scenarios.
>>>
>>> **Mitigation**
>>>
>>> - Xen credit scheduler limits guest vCPU execution time, securing
>>>     basic guest rate-limiting.
>>
>> I don't thing that this feature available only in credit schedulers,
>> AFAIK, all schedulers except null scheduler will limit vCPU execution time.
>>
> 
> I was not aware of that. I'll rephrase this part.
> 
>>> - Batch multiple commands of same type to reduce overhead on the
>>>     virtual SMMUv3 hardware emulation.
>>> - Implement vIOMMU commands execution restart and continuation support
>>
>> So, something like "hypercall continuation"?
>>
> 
> Yes
> 
>>>
>>> 8. Observation:
>>> ---------------
>>> Some guest commands issued towards vIOMMU are propagated to pIOMMU
>>> command queue (e.g. TLB invalidate). For each pIOMMU, only one command
>>> queue is
>>> available for all domains.
>>>
>>> **Risk**
>>> Excessive commands requests from abusive guest can cause flooding of
>>> physical IOMMU command queue, leading to degraded pIOMMU responsivness
>>> on commands issued from other guests.
>>>
>>> **Mitigation**
>>>
>>> - Xen credit scheduler limits guest vCPU execution time, securing
>>>     basic guest rate-limiting.
>>> - Batch commands which should be propagated towards pIOMMU cmd queue
>>>     and enable support for batch execution pause/continuation
>>> - If possible, implement domain penalization by adding a per-domain
>>>     cost counter for vIOMMU/pIOMMU usage.
>>>
>>> 9. Observation:
>>> ---------------
>>> vIOMMU feature includes event queue used for forwarding IOMMU events
>>> to guest (e.g. translation faults, invalid stream IDs, permission
>>> errors). A malicious guest can misconfigure its SMMU state or
>>> intentionally trigger faults with high frequency.
>>>
>>> **Risk**
>>> Occurance of IOMMU events with high frequency can cause Xen to flood
>>> the event queue and disrupt scheduling with high hypervisor CPU load
>>> for events handling.
>>>
>>> **Mitigation**
>>>
>>> - Implement fail-safe state by disabling events forwarding when faults
>>>     are occured with high frequency and not processed by guest.
>>> - Batch multiple events of same type to reduce overhead on the virtual
>>>     SMMUv3 hardware emulation.
>>> - Consider disabling event queue for untrusted guests
>>>
>>> Performance Impact
>>> ==================
>>>
>>> With iommu stage-1 and nested translation inclusion, performance
>>> overhead is introduced comparing to existing, stage-2 only usage in
>>> Xen.
>>> Once mappings are established, translations should not introduce
>>> significant overhead.
>>> Emulated paths may introduce moderate overhead, primarily affecting
>>> device initialization and event handling.
>>> Performance impact highly depends on target CPU capabilities. Testing
>>> is performed on cortex-a53 based platform.
>>
>> Which platform exactly? While QEMU emulates SMMU to some extent, we are
>> observing somewhat different SMMU behavior on real HW platforms (mostly
>> due to cache coherence problems). Also, according to MMU-600 errata, it
>> can have lower than expected performance in some use-cases.
>>
> 
> Performance measurement are done on QEMU emulated Renesas platform. I'll
> add some details for this.
> 
>>> Performance is mostly impacted by emulated vIOMMU operations, results
>>> shown in the following table.
>>>
>>> +-------------------------------+---------------------------------+
>>> | vIOMMU Operation              | Execution time in guest         |
>>> +===============================+=================================+
>>> | Reg read                      | median: 30μs, worst-case: 250μs |
>>> +-------------------------------+---------------------------------+
>>> | Reg write                     | median: 35μs, worst-case: 280μs |
>>> +-------------------------------+---------------------------------+
>>> | Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
>>> +-------------------------------+---------------------------------+
>>> | Invalidate STE                | median: 450μs worst_case: 7ms+  |
>>> +-------------------------------+---------------------------------+
>>>
>>> With vIOMMU exposed to guest, guest OS has to initialize IOMMU device
>>> and configure stage-1 mappings for devices attached to it.
>>> Following table shows initialization stages which impact stage-1
>>> enabled guest boot time and compares it with stage-1 disabled guest.
>>>
>>> "NOTE: Device probe execution time varies significantly depending on
>>> device complexity. virtio-gpu was selected as a test case due to its
>>> extensive use of dynamic DMA allocations and IOMMU mappings, making it
>>> a suitable candidate for benchmarking stage-1 vIOMMU behavior."
>>>
>>> +---------------------+-----------------------+------------------------+
>>> | Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
>>> +=====================+=======================+========================+
>>> | IOMMU Init          | ~25ms                 | /                      |
>>> +---------------------+-----------------------+------------------------+
>>> | Dev Attach / Mapping| ~220ms                | ~200ms                 |
>>> +---------------------+-----------------------+------------------------+
>>>
>>> For devices configured with dynamic DMA mappings, DMA
>>> allocate/map/unmap operations performance is also impacted on stage-1
>>> enabled guests.
>>> Dynamic DMA mapping operation issues emulated IOMMU functions like
>>> mmio write/read and TLB invalidations.
>>> As a reference, following table shows performance results for runtime
>>> dma operations for virtio-gpu device.
>>>
>>> +---------------+-------------------------+----------------------------+
>>> | DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
>>> +===============+=========================+============================+
>>> | dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
>>> +---------------+-------------------------+----------------------------+
>>> | dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
>>> +---------------+-------------------------+----------------------------+
>>> | dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
>>> +---------------+-------------------------+----------------------------+
>>> | dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
>>> +---------------+-------------------------+----------------------------+
>>>
>>> Testing
>>> ============
>>>
>>> - QEMU-based ARM system tests for Stage-1 translation and nested
>>>     virtualization.
>>> - Actual hardware validation on platforms such as Renesas to ensure
>>>     compatibility with real SMMUv3 implementations.
>>> - Unit/Functional tests validating correct translations (not implemented).
>>>
>>> Migration and Compatibility
>>> ===========================
>>>
>>> This optional feature defaults to disabled (`viommu=""`) for backward
>>> compatibility.
>>>
>>
> 
> BR,
> Milan
> 

Hello Volodymyr, Julien

Sorry for the delayed follow-up on this topic.
We have changed the vIOMMU design from a 1-N to an N-N mapping between
vIOMMU and pIOMMU. Considering the single-vIOMMU model limitation pointed
out by Volodymyr (SID overlaps), the vIOMMU-per-pIOMMU model turned out to
be the only proper solution.
The updated design document follows.
I have added details to the design and performance impact sections, and
also indicated future improvements. The security considerations section is
unchanged apart from some minor details changed according to review
comments.
Let me know what you think about the updated design. Once approved, I
will send the updated vIOMMU patch series.


==========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

:Author:     Milan Djokic <milan_djokic@epam.com>
:Date:       2025-11-03
:Status:     Draft

Introduction
============

The SMMUv3 supports two stages of translation, each of which can be
enabled independently. An incoming address is logically translated from VA
to IPA in stage 1; the IPA is then input to stage 2, which translates it to
the output PA. Stage-1 translation support is required to provide isolation
between different devices within a guest OS. Xen already supports stage-2
translation, but there is no support for stage-1 translation.
This design proposal outlines the introduction of stage-1 SMMUv3 support
in Xen for ARM guests.
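
As an illustration only, the two-stage walk described above can be modelled
with two toy lookup tables. This is a minimal sketch, not SMMUv3 or Xen
code: `toy_table`, `toy_translate` and `nested_translate` are hypothetical
names, and the single-level 4 KiB page layout is an assumption made for
brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((1ULL << TOY_PAGE_SHIFT) - 1)
#define TOY_NR_PAGES   8            /* toy address space: 8 pages per stage */
#define TOY_INVALID    UINT64_MAX   /* marks an unmapped page */

/* Toy single-level table: index is the input page, value the output page. */
struct toy_table {
    uint64_t out_page[TOY_NR_PAGES];
};

/* Translate one address through one stage; false models a fault. */
static bool toy_translate(const struct toy_table *t, uint64_t in,
                          uint64_t *out)
{
    uint64_t page = in >> TOY_PAGE_SHIFT;

    if (page >= TOY_NR_PAGES || t->out_page[page] == TOY_INVALID)
        return false;

    *out = (t->out_page[page] << TOY_PAGE_SHIFT) | (in & TOY_PAGE_MASK);
    return true;
}

/*
 * Nested translation: stage 1 (guest-owned) maps the device address to an
 * IPA, stage 2 (hypervisor-owned) maps that IPA to a PA.
 */
static bool nested_translate(const struct toy_table *s1,
                             const struct toy_table *s2,
                             uint64_t iova, uint64_t *pa)
{
    uint64_t ipa;

    assert(s1 != NULL && s2 != NULL && pa != NULL);

    if (!toy_translate(s1, iova, &ipa))
        return false;               /* stage-1 fault */

    return toy_translate(s2, ipa, pa);
}
```

The point of the model is that a guest can only change the first table; the
second one still bounds every output address.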

Motivation
==========

ARM systems utilizing SMMUv3 require stage-1 address translation to ensure
secure DMA and guest-managed I/O memory mappings.
With stage-1 enabled, the guest manages IOVA-to-IPA mappings through its
own IOMMU driver.

This feature enables:

- Stage-1 translation in guest domain
- Safe device passthrough with per-device address translation table

Design Overview
===============

These changes provide emulated SMMUv3 support:

- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support
  in SMMUv3 driver.
- **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
  handling.
- **Register/Command Emulation**: SMMUv3 register emulation and command
  queue handling.
- **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to
  device trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: Introduces a `viommu` boot parameter for
  dynamic enablement.

A separate vIOMMU device is exposed to the guest for every physical IOMMU
in the system.
The vIOMMU feature is split into a generic vIOMMU framework and a backend
implementation for the target IOMMU, as separate components.
The backend implementation contains the IOMMU-specific structures and
command handling (only SMMUv3 is currently supported).
This structure allows potential reuse of the stage-1 feature for other
IOMMU types.
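
One way to picture the framework/backend split is a small table of
per-backend callbacks that the generic layer dispatches through. The sketch
below is only an illustration under that assumption; every name in it
(`toy_viommu_ops`, `toy_viommu_domain_init`, ...) is hypothetical and does
not claim to match the structures used in the patch series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend identifiers. */
enum toy_viommu_type { TOY_VIOMMU_NONE, TOY_VIOMMU_SMMUV3 };

/* Minimal stand-in for per-domain state. */
struct toy_domain {
    int id;
    int nr_viommus;
};

/* Callbacks every backend provides to the generic framework. */
struct toy_viommu_ops {
    int  (*domain_init)(struct toy_domain *d, int nr_piommus);
    void (*domain_teardown)(struct toy_domain *d);
};

struct toy_viommu_backend {
    enum toy_viommu_type type;
    const struct toy_viommu_ops *ops;
};

/* Toy SMMUv3 backend: one vIOMMU instance per physical IOMMU. */
static int toy_vsmmuv3_init(struct toy_domain *d, int nr_piommus)
{
    d->nr_viommus = nr_piommus;
    return 0;
}

static void toy_vsmmuv3_teardown(struct toy_domain *d)
{
    d->nr_viommus = 0;
}

static const struct toy_viommu_ops toy_vsmmuv3_ops = {
    .domain_init     = toy_vsmmuv3_init,
    .domain_teardown = toy_vsmmuv3_teardown,
};

static const struct toy_viommu_backend toy_backends[] = {
    { TOY_VIOMMU_SMMUV3, &toy_vsmmuv3_ops },
};

/* Generic layer: find the backend for the requested type and delegate. */
static int toy_viommu_domain_init(struct toy_domain *d,
                                  enum toy_viommu_type type, int nr_piommus)
{
    assert(d != NULL);

    if (type == TOY_VIOMMU_NONE)
        return 0;                   /* feature disabled for this domain */

    for (size_t i = 0; i < sizeof(toy_backends) / sizeof(toy_backends[0]); i++)
        if (toy_backends[i].type == type)
            return toy_backends[i].ops->domain_init(d, nr_piommus);

    return -1;                      /* unsupported backend */
}
```

Adding another IOMMU type would then mean adding one entry to the backend
table, without touching the generic dispatch code.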

Security Considerations
=======================

**viommu security benefits:**

- Stage-1 translation ensures guest devices cannot perform unauthorized
  DMA (device I/O address mapping managed by guest).
- Emulated IOMMU removes guest direct dependency on IOMMU hardware,
  while maintaining domains isolation.


1. Observation:
---------------
Support for Stage-1 translation in SMMUv3 introduces new data structures
(`s1_cfg` alongside `s2_cfg`) and logic to write both Stage-1 and Stage-2
entries in the Stream Table Entry (STE), including an `abort` field to
handle partial configuration states.

**Risk:**
Without proper handling, a partially applied Stage-1 configuration might
leave guest DMA mappings in an inconsistent state, potentially enabling
unauthorized access or causing cross-domain interference.

**Mitigation:** *(Handled by design)*
This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to the
STE and manages the `abort` field, only considering the Stage-1
configuration if it is fully attached. This ensures incomplete or invalid
guest configurations are safely ignored by the hypervisor.
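
The selection logic described in this mitigation could look roughly like
the sketch below. The `Config` encodings are the ones defined by the SMMUv3
architecture; everything else is an assumption for illustration: the `toy_`
names are hypothetical, and treating "no live stage 2" as abort is a policy
chosen for the sketch, not necessarily what the driver does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* STE.Config encodings as defined by the SMMUv3 architecture. */
enum toy_ste_config {
    TOY_STE_CFG_ABORT  = 0x0,
    TOY_STE_CFG_BYPASS = 0x4,
    TOY_STE_CFG_S1     = 0x5,
    TOY_STE_CFG_S2     = 0x6,
    TOY_STE_CFG_NESTED = 0x7,
};

/* Hypothetical per-stage state; `set` means "fully attached". */
struct toy_s1_cfg { bool set; };
struct toy_s2_cfg { bool set; };

/*
 * Pick the STE configuration for a guest-owned stream.
 * - abort requested, or stage 2 not live: block all traffic;
 * - stage 1 fully attached: nested translation;
 * - otherwise: ignore the incomplete stage-1 state, use stage 2 only.
 */
static enum toy_ste_config toy_pick_ste_config(const struct toy_s1_cfg *s1,
                                               const struct toy_s2_cfg *s2,
                                               bool abort)
{
    assert(s1 != NULL && s2 != NULL);

    if (abort || !s2->set)
        return TOY_STE_CFG_ABORT;

    return s1->set ? TOY_STE_CFG_NESTED : TOY_STE_CFG_S2;
}
```

In this model a half-configured stage 1 can never widen access: the result
is either stage-2-only translation or abort.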

2. Observation:
---------------
Guests can now invalidate Stage-1 caches; invalidations need to be
forwarded to the SMMUv3 hardware to maintain coherence.

**Risk:**
Failing to propagate cache invalidations could leave stale mappings in
place, enabling access through old mappings and possibly data leakage or
misrouting.

**Mitigation:** *(Handled by design)*
This feature ensures that guest-initiated invalidations are correctly
forwarded to the hardware, preserving IOMMU coherency.

3. Observation:
---------------
This design introduces substantial new functionality, including the
`vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command queues,
event queues, domain management, and Device Tree modifications (e.g.,
`iommus` nodes and `libxl` integration).

**Risk:**
Large feature expansions increase the attack surface, with potential for
race conditions, unchecked command inputs, or Device Tree-based
misconfigurations.

**Mitigation:**

- Sanity checks and error-handling improvements have been introduced in
  this feature.
- Further audits have to be performed for this feature and its
  dependencies in this area.

4. Observation:
---------------
The code includes transformations to handle nested translation versus
standard modes and uses guest-configured command queues (e.g.,
`CMD_CFGI_STE`) and event notifications.

**Risk:**
Malicious or malformed queue commands from guests could bypass validation,
manipulate SMMUv3 state, or cause system instability.

**Mitigation:** *(Handled by design)*
Built-in validation of command queue entries and sanitization mechanisms
ensure only permitted configurations are applied. This is supported via
additions in `vsmmuv3` and `cmdqueue` handling code.
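
A minimal sketch of what such command validation might look like is shown
below. The opcode values and field positions (opcode in bits [7:0] and
StreamID in bits [63:32] of the first command dword) follow the SMMUv3
architecture; the allow-list itself, the ownership check and all `toy_`
names are assumptions made for illustration, not the checks actually
implemented in `vsmmuv3`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Opcodes from the SMMUv3 architecture (command dword 0, bits [7:0]). */
#define TOY_CMD_OP_CFGI_STE     0x03
#define TOY_CMD_OP_TLBI_NH_ASID 0x11
#define TOY_CMD_OP_TLBI_NH_VA   0x12
#define TOY_CMD_OP_CMD_SYNC     0x46

/* Hypothetical view of a guest: the stream IDs assigned to it. */
struct toy_guest {
    const uint32_t *sids;
    size_t nr_sids;
};

static bool toy_guest_owns_sid(const struct toy_guest *g, uint32_t sid)
{
    for (size_t i = 0; i < g->nr_sids; i++)
        if (g->sids[i] == sid)
            return true;
    return false;
}

/*
 * Accept a guest command only if its opcode is on the allow-list and,
 * for CFGI_STE, the targeted StreamID belongs to the issuing guest.
 */
static bool toy_cmd_is_permitted(const struct toy_guest *g,
                                 const uint64_t cmd[2])
{
    assert(g != NULL && cmd != NULL);

    switch (cmd[0] & 0xff) {
    case TOY_CMD_OP_CFGI_STE:
        return toy_guest_owns_sid(g, (uint32_t)(cmd[0] >> 32));
    case TOY_CMD_OP_TLBI_NH_ASID:
    case TOY_CMD_OP_TLBI_NH_VA:
    case TOY_CMD_OP_CMD_SYNC:
        return true;
    default:
        return false;               /* unknown or unsupported opcode */
    }
}
```

Rejecting by default keeps any opcode that the emulation does not
explicitly understand away from the physical command queue.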

5. Observation:
---------------
Device Tree modifications enable device assignment and configuration:
guest DT fragments (e.g., `iommus`) are added via `libxl`.

**Risk:**
Erroneous or malicious Device Tree injection could result in device
misbinding or guest access to unauthorized hardware.

**Mitigation:**

- `libxl` performs checks of the guest configuration and parses only
  predefined DT fragments and nodes, reducing risk.
- The system integrator must ensure correct resource mapping in the
  guest Device Tree (DT) fragments.

6. Observation:
---------------
Introducing optional per-guest enabled features (`viommu` argument in the
xl guest config) means some guests may opt out.

**Risk:**
Differences between guests with and without `viommu` may cause unexpected
behavior or privilege drift.

**Mitigation:**
Verify that downgrade paths are safe and well-isolated; ensure missing
support doesn't cause security issues.
Additional audits on emulation paths and domain interference need to be
performed in a multi-guest environment.

7. Observation:
---------------
Guests have the ability to issue Stage-1 IOMMU commands like cache
invalidation, stream table entry configuration, etc. An adversarial guest
may issue a high volume of commands in rapid succession.

**Risk:**
Excessive command requests can cause high hypervisor CPU consumption and
disrupt scheduling, leading to degraded system responsiveness and potential
denial-of-service scenarios.

**Mitigation:**

- The Xen scheduler limits guest vCPU execution time, providing basic
  guest rate-limiting.
- Batch multiple commands of the same type to reduce overhead on the
  virtual SMMUv3 hardware emulation.
- Implement vIOMMU command execution restart and continuation support.
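
The restart/continuation idea in the last bullet can be sketched as a
consumer that handles at most a fixed number of commands per invocation and
reports whether work remains. This is an illustration under assumptions:
the queue model (free-running producer/consumer counters), the budget and
all `toy_` names are hypothetical, and how the caller actually resumes
(for example by re-executing the trapped register write) is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical emulated command queue with free-running indices. */
struct toy_cmdq {
    uint32_t prod;      /* advanced by the guest */
    uint32_t cons;      /* advanced by the emulation */
    uint64_t handled;   /* total commands processed so far */
};

enum toy_cmdq_status {
    TOY_CMDQ_DONE,      /* queue drained */
    TOY_CMDQ_CONTINUE,  /* budget exhausted, caller must come back */
};

/*
 * Consume at most `budget` commands. Returning TOY_CMDQ_CONTINUE gives
 * the caller a point at which it can yield and resume later, so one
 * guest cannot keep the emulation busy for an unbounded time.
 */
static enum toy_cmdq_status toy_cmdq_consume(struct toy_cmdq *q,
                                             unsigned int budget)
{
    assert(q != NULL);

    while (q->cons != q->prod) {
        if (budget == 0)
            return TOY_CMDQ_CONTINUE;

        /* A real implementation would decode and emulate the command. */
        q->handled++;
        q->cons++;
        budget--;
    }

    return TOY_CMDQ_DONE;
}
```

With ten pending commands and a budget of four, the consumer in this sketch
needs three invocations to drain the queue.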

8. Observation:
---------------
Some guest commands issued towards the vIOMMU are propagated to the pIOMMU
command queue (e.g. TLB invalidate).

**Risk:**
Excessive command requests from an abusive guest can flood the physical
IOMMU command queue, leading to degraded pIOMMU responsiveness for commands
issued from other guests.

**Mitigation:**

- The Xen scheduler limits guest vCPU execution time, providing basic
  guest rate-limiting.
- Batch commands which should be propagated towards the pIOMMU cmd queue
  and enable support for batch execution pause/continuation.
- If possible, implement domain penalization by adding a per-domain cost
  counter for vIOMMU/pIOMMU usage.
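
A per-domain cost counter, as suggested in the last bullet, could be as
simple as a fixed-window budget that is charged before a command is
forwarded to the physical queue. The sketch below is one possible shape
under stated assumptions: the window length, the cost units and all `toy_`
names are hypothetical, and nothing like it exists in the current series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_WINDOW_NS 1000000ULL    /* assumed accounting window: 1 ms */

/* Hypothetical per-domain accounting state. */
struct toy_dom_cost {
    uint64_t window_start_ns;
    uint32_t used;      /* cost charged in the current window */
    uint32_t limit;     /* maximum cost allowed per window */
};

/*
 * Try to charge `cost` units to the domain at time `now_ns`.
 * Returns false when the domain is over budget, in which case the
 * caller would defer forwarding until the next window.
 */
static bool toy_dom_charge(struct toy_dom_cost *c, uint64_t now_ns,
                           uint32_t cost)
{
    assert(c != NULL);

    if (now_ns - c->window_start_ns >= TOY_WINDOW_NS) {
        c->window_start_ns = now_ns;    /* new window, reset the counter */
        c->used = 0;
    }

    if (cost > c->limit - c->used)
        return false;

    c->used += cost;
    return true;
}
```

Because the counter is per domain, a guest that exhausts its own budget
does not consume the share of other guests on the same pIOMMU.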

9. Observation:
---------------
The vIOMMU feature includes an event queue used for forwarding IOMMU events
to the guest (e.g. translation faults, invalid stream IDs, permission
errors).
A malicious guest can misconfigure its SMMU state or intentionally trigger
faults with high frequency.

**Risk:**
Occurrence of IOMMU events with high frequency can cause Xen to flood the
event queue and disrupt scheduling with high hypervisor CPU load for event
handling.

**Mitigation:**

- Implement a fail-safe state by disabling event forwarding when faults
  occur with high frequency and are not processed by the guest.
- Batch multiple events of the same type to reduce overhead on the virtual
  SMMUv3 hardware emulation.
- Consider disabling the event queue for untrusted guests.
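
The fail-safe state from the first bullet can be sketched as a virtual
event queue that stops forwarding once the guest has left it full for
several consecutive events. This is a toy model only: the threshold, the
queue representation and all `toy_` names are assumptions, and the policy
for re-enabling forwarding is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_OVERFLOW_STREAK 3   /* assumed fail-safe threshold */

/* Hypothetical virtual event queue with free-running indices. */
struct toy_evtq {
    uint32_t prod;              /* advanced by the emulation */
    uint32_t cons;              /* advanced by the guest */
    uint32_t nents;             /* queue capacity */
    uint32_t overflow_streak;   /* consecutive events dropped while full */
    bool forwarding;            /* false once the fail-safe has tripped */
};

/*
 * Try to forward one event to the guest. Events are dropped while the
 * queue is full; after too many consecutive drops, forwarding is turned
 * off so that a fault storm no longer costs emulation time.
 */
static bool toy_evtq_push(struct toy_evtq *q)
{
    assert(q != NULL);

    if (!q->forwarding)
        return false;

    if (q->prod - q->cons == q->nents) {
        if (++q->overflow_streak >= TOY_MAX_OVERFLOW_STREAK)
            q->forwarding = false;
        return false;
    }

    q->overflow_streak = 0;
    q->prod++;
    return true;
}
```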

Performance Impact
==================

With IOMMU stage-1 and nested translation included, performance overhead
is introduced compared to the existing, stage-2 only usage in Xen. Once
mappings are established, translations should not introduce significant
overhead.
Emulated paths may introduce moderate overhead, primarily affecting device
initialization and event handling.
Performance impact highly depends on target CPU capabilities.
Testing is performed on QEMU virt and Renesas R-Car (QEMU emulated)
platforms.
Performance is mostly impacted by emulated vIOMMU operations; results are
shown in the following table.

+-------------------------------+---------------------------------+
| vIOMMU Operation              | Execution time in guest         |
+===============================+=================================+
| Reg read                      | median: 30μs, worst-case: 250μs |
+-------------------------------+---------------------------------+
| Reg write                     | median: 35μs, worst-case: 280μs |
+-------------------------------+---------------------------------+
| Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
+-------------------------------+---------------------------------+
| Invalidate STE                | median: 450μs, worst-case: 7ms+ |
+-------------------------------+---------------------------------+

With the vIOMMU exposed to the guest, the guest OS has to initialize the
IOMMU device and configure stage-1 mappings for the devices attached to it.
The following table shows the initialization stages which impact stage-1
enabled guest boot time and compares them with a stage-1 disabled guest.

"NOTE: Device probe execution time varies significantly depending on
device complexity. virtio-gpu was selected as a test case due to its
extensive use of dynamic DMA allocations and IOMMU mappings, making it a
suitable candidate for benchmarking stage-1 vIOMMU behavior."

+---------------------+-----------------------+------------------------+
| Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+=====================+=======================+========================+
| IOMMU Init          | ~25ms                 | /                      |
+---------------------+-----------------------+------------------------+
| Dev Attach / Mapping| ~220ms                | ~200ms                 |
+---------------------+-----------------------+------------------------+

For devices configured with dynamic DMA mappings, the performance of DMA
allocate/map/unmap operations is also impacted on stage-1 enabled guests.
Dynamic DMA mapping operations trigger emulated IOMMU functions like MMIO
write/read and TLB invalidations.
As a reference, the following table shows performance results for runtime
DMA operations for the virtio-gpu device.

+---------------+-------------------------+----------------------------+
| DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
+===============+=========================+============================+
| dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
+---------------+-------------------------+----------------------------+
| dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
+---------------+-------------------------+----------------------------+
| dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
+---------------+-------------------------+----------------------------+
| dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
+---------------+-------------------------+----------------------------+

Testing
=======

- QEMU-based ARM system tests for Stage-1 translation.
- Actual hardware validation to ensure compatibility with real SMMUv3
  implementations.
- Unit/Functional tests validating correct translations (not implemented).

Migration and Compatibility
===========================

This optional feature defaults to disabled (`viommu=""`) for backward 
compatibility.

Future improvements
===================

- Implement the proposed mitigations to address security risks that are
  not covered by the current design (events batching, commands execution
  continuation)
- Support for other IOMMU HW (Renesas, RISC-V, etc.)
- Due to static definition of SPIs and MMIO regions for emulated devices,
  current implementation statically defines SPIs and MMIO regions for up to
  16 vIOMMUs per guest. Future improvements would include configurable
  number of IOMMUs or automatic runtime resolution for target platform.

References
==========

- Original feature implemented by Rahul Singh:
  https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
- SMMUv3 architecture documentation
- Existing vIOMMU code patterns


BR,
Milan




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-11-03 13:16                 ` Milan Djokic
@ 2025-11-27  2:31                   ` Volodymyr Babchuk
  2025-11-27 10:22                   ` Julien Grall
  1 sibling, 0 replies; 40+ messages in thread
From: Volodymyr Babchuk @ 2025-11-27  2:31 UTC (permalink / raw)
  To: Milan Djokic
  Cc: Julien Grall, Julien Grall, xen-devel@lists.xenproject.org,
	Bertrand Marquis, Rahul Singh, Stefano Stabellini, Michal Orzel,
	Jan Beulich, Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

Hi Milan,

Milan Djokic <milan_djokic@epam.com> writes:

> On 9/1/25 13:06, Milan Djokic wrote:

[...]
>
> Hello Volodymyr, Julien
>
> Sorry for the delayed follow-up on this topic.
> We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU
> and pIOMMU. Considering single vIOMMU model limitation pointed out by
> Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the
> only proper solution.
> Following is the updated design document.
> I have added additional details to the design and performance impact
> sections, and also indicated future improvements. Security
> considerations section is unchanged apart from some minor details
> according to review comments.
> Let me know what do you think about updated design. Once approved, I
> will send the updated vIOMMU patch series.

This looks fine for me. I can't see any immediate flaws here. So let's
get to patches :)


[...]

-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-11-03 13:16                 ` Milan Djokic
  2025-11-27  2:31                   ` Volodymyr Babchuk
@ 2025-11-27 10:22                   ` Julien Grall
  2025-12-02 22:08                     ` Milan Djokic
  1 sibling, 1 reply; 40+ messages in thread
From: Julien Grall @ 2025-11-27 10:22 UTC (permalink / raw)
  To: Milan Djokic, Volodymyr Babchuk
  Cc: Julien Grall, xen-devel@lists.xenproject.org, Bertrand Marquis,
	Rahul Singh, Stefano Stabellini, Michal Orzel, Jan Beulich,
	Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper



On 03/11/2025 13:16, Milan Djokic wrote:
> Hello Volodymyr, Julien

Hi Milan,

Thanks for the new update. For the future, can you trim your reply?

> Sorry for the delayed follow-up on this topic.
> We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and 
> pIOMMU. Considering single vIOMMU model limitation pointed out by 
> Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the 
> only proper solution.

I am not sure I fully understand. My assumption with the single vIOMMU 
is that you have a virtual SID that would be mapped to a (pIOMMU, physical 
SID) pair. Does this mean that in your solution you will end up with 
multiple vPCI as well and then map pBDF == vBDF? (This is because the SID 
has to be fixed at boot.)

> Following is the updated design document.
> I have added additional details to the design and performance impact 
> sections, and also indicated future improvements. Security 
> considerations section is unchanged apart from some minor details 
> according to review comments.
> Let me know what do you think about updated design. Once approved, I 
> will send the updated vIOMMU patch series.
> 
> 
> ==========================================================
> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
> ==========================================================
> 
> :Author:     Milan Djokic <milan_djokic@epam.com>
> :Date:       2025-11-03
> :Status:     Draft
> 
> Introduction
> ============
> 
> The SMMUv3 supports two stages of translation. Each stage of translation 
> can be
> independently enabled. An incoming address is logically translated from 
> VA to
> IPA in stage 1, then the IPA is input to stage 2 which translates the 
> IPA to
> the output PA. Stage 1 translation support is required to provide 
> isolation between different
> devices within OS. XEN already supports Stage 2 translation but there is no
> support for Stage 1 translation.
> This design proposal outlines the introduction of Stage-1 SMMUv3 support 
> in Xen for ARM guests.
> 
> Motivation
> ==========
> 
> ARM systems utilizing SMMUv3 require stage-1 address translation to 
> ensure secure DMA and
> guest managed I/O memory mappings.
> With stage-1 enabed, guest manages IOVA to IPA mappings through its own 
> IOMMU driver.
> 
> This feature enables:
> 
> - Stage-1 translation in guest domain
> - Safe device passthrough with per-device address translation table

I find this misleading. Even without this feature, device passthrough is 
still safe in the sense a device will be isolated (assuming all the DMA 
goes through the IOMMU) and will not be able to DMA outside of the guest 
memory. What the stage-1 is doing is providing an extra layer to control 
what each device can see. This is useful if you don't trust your devices 
or you want to assign a device to userspace (e.g. for DPDK).

> 
> Design Overview
> ===============
> 
> These changes provide emulated SMMUv3 support:

If my understanding is correct, there are also some implications for how 
we create the PCI topology. It would be good to spell them out.

> 
> - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support 
> in SMMUv3 driver.
> - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 
> handling.
> - **Register/Command Emulation**: SMMUv3 register emulation and command 
> queue handling.
> - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to 
> device trees for dom0 and dom0less scenarios.

What about ACPI?

> - **Runtime Configuration**: Introduces a `viommu` boot parameter for 
> dynamic enablement.
> 
> Separate vIOMMU device is exposed to guest for every physical IOMMU in 
> the system.
> vIOMMU feature is designed in a way to provide a generic vIOMMU 
> framework and a backend implementation
> for target IOMMU as separate components.
> Backend implementation contains specific IOMMU structure and commands 
> handling (only SMMUv3 currently supported).
> This structure allows potential reuse of stage-1 feature for other IOMMU 
> types.
> 
> Security Considerations
> =======================
> 
> **viommu security benefits:**
> 
> - Stage-1 translation ensures guest devices cannot perform unauthorized 
> DMA (device I/O address mapping managed by guest).
> - Emulated IOMMU removes guest direct dependency on IOMMU hardware, 
> while maintaining domains isolation.

Sorry, I don't follow this argument. Are you saying that it would be 
possible to emulate a SMMUv3 vIOMMU on top of the IPMMU?

> 1. Observation:
> ---------------
> Support for Stage-1 translation in SMMUv3 introduces new data structures 
> (`s1_cfg` alongside `s2_cfg`)
> and logic to write both Stage-1 and Stage-2 entries in the Stream Table 
> Entry (STE), including an `abort`
> field to handle partial configuration states.
> 
> **Risk:**
> Without proper handling, a partially applied Stage-1 configuration might 
> leave guest DMA mappings in an
> inconsistent state, potentially enabling unauthorized access or causing 
> cross-domain interference.

How so? Even if you misconfigure the S1, the S2 would still be properly 
configured (you just mention partially applied stage-1).

> 
> **Mitigation:** *(Handled by design)*
> This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to 
> STE and manages the `abort` field-only
> considering Stage-1 configuration if fully attached. This ensures 
> incomplete or invalid guest configurations
> are safely ignored by the hypervisor.

Can you clarify what you mean by invalid guest configurations?

> 
> 2. Observation:
> ---------------
> Guests can now invalidate Stage-1 caches; invalidation needs forwarding 
> to SMMUv3 hardware to maintain coherence.
> 
> **Risk:**
> Failing to propagate cache invalidation could allow stale mappings, 
> enabling access to old mappings and possibly
> data leakage or misrouting.

You are referring to data leakage/misrouting between two devices owned by 
the same guest, right? Xen would still be in charge of flushing when the 
stage-2 is updated.

> 
> **Mitigation:** *(Handled by design)*
> This feature ensures that guest-initiated invalidations are correctly 
> forwarded to the hardware,
> preserving IOMMU coherency.

How is this a mitigation? You have to properly handle commands. If you 
don't properly handle them, then yes it will break.

> 
> 4. Observation:
> ---------------
> The code includes transformations to handle nested translation versus 
> standard modes and uses guest-configured
> command queues (e.g., `CMD_CFGI_STE`) and event notifications.
> 
> **Risk:**
> Malicious or malformed queue commands from guests could bypass 
> validation, manipulate SMMUv3 state,
> or cause system instability.
> 
> **Mitigation:** *(Handled by design)*
> Built-in validation of command queue entries and sanitization mechanisms 
> ensure only permitted configurations
> are applied.

This is true as long as we didn't make a mistake in the configurations ;).


> This is supported via additions in `vsmmuv3` and `cmdqueue` 
> handling code.
> 
> 5. Observation:
> ---------------
> Device Tree modifications enable device assignment and configuration 
> through guest DT fragments (e.g., `iommus`)
> are added via `libxl`.
> 
> **Risk:**
> Erroneous or malicious Device Tree injection could result in device 
> misbinding or guest access to unauthorized
> hardware.

The DT fragments are not security supported and will never be, at least 
until we have a libfdt that is able to detect a malformed Device Tree 
(I haven't checked if this has changed recently).

> 
> **Mitigation:**
> 
> - `libxl` perform checks of guest configuration and parse only 
> predefined dt fragments and nodes, reducing risk.
> - The system integrator must ensure correct resource mapping in the 
> guest Device Tree (DT) fragments.
 > > 6. Observation:
> ---------------
> Introducing optional per-guest enabled features (`viommu` argument in xl 
> guest config) means some guests
> may opt-out.
> 
> **Risk:**
> Differences between guests with and without `viommu` may cause 
> unexpected behavior or privilege drift.

I don't understand this risk. Can you clarify?

> 
> **Mitigation:**
> Verify that downgrade paths are safe and well-isolated; ensure missing 
> support doesn't cause security issues.
> Additional audits on emulation paths and domains interference need to be 
> performed in a multi-guest environment.
> 
> 7. Observation:
> ---------------

This observation and observations 8 and 9 are the most important ones, 
but they seem to be missing some details on how this will be implemented. 
I will try to provide some questions that should help fill in the gaps.

> Guests have the ability to issue Stage-1 IOMMU commands like cache 
> invalidation, stream table entries
> configuration, etc. An adversarial guest may issue a high volume of 
> commands in rapid succession.
> 
> **Risk:**
> Excessive commands requests can cause high hypervisor CPU consumption 
> and disrupt scheduling,
> leading to degraded system responsiveness and potential denial-of- 
> service scenarios.
> 
> **Mitigation:**
> 
> - Xen scheduler limits guest vCPU execution time, securing basic guest 
> rate-limiting.

This really depends on your scheduler. Some schedulers (e.g. NULL) will 
not do any scheduling at all. Furthermore, the scheduler only preempts 
EL1/EL0. It doesn't preempt EL2, so any long-running operation needs 
manual preemption. Therefore, I wouldn't consider this a mitigation.

> - Batch multiple commands of same type to reduce overhead on the virtual 
> SMMUv3 hardware emulation.

The guest can send commands in any order. So can you expand how this 
would work? Maybe with some example.

> - Implement vIOMMU commands execution restart and continuation support

This needs a bit more detail. How will you decide whether to restart, 
and what would be the action? (I guess it will be re-executing the 
instruction to write to the CWRITER).
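
To make the question concrete, one possible shape is a bounded loop that 
handles a fixed number of commands per trap and reports whether work is 
left, so the caller can arrange a continuation rather than stay in EL2. 
The sketch below is only an illustration under assumptions: the 
structure, the names and the budget value are hypothetical and not taken 
from the series, and the queue indices are simplified (no wrap bit).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VCMDQ_ENTRIES 8u   /* hypothetical queue size */
#define VCMDQ_BUDGET  4u   /* hypothetical number of commands handled per trap */

struct vcmdq {
    uint32_t prod;         /* producer index, written by the guest */
    uint32_t cons;         /* consumer index, advanced by the emulation */
    uint64_t handled;      /* stand-in for the real per-command handling */
};

/*
 * Consume at most VCMDQ_BUDGET commands. Returns true when commands are
 * still pending, i.e. the caller has to schedule a continuation.
 */
static bool vcmdq_process(struct vcmdq *q)
{
    unsigned int done = 0;

    assert(q->prod < VCMDQ_ENTRIES && q->cons < VCMDQ_ENTRIES);

    while (q->cons != q->prod && done < VCMDQ_BUDGET) {
        q->handled++;                          /* emulate one command */
        q->cons = (q->cons + 1) % VCMDQ_ENTRIES;
        done++;
    }

    return q->cons != q->prod;
}
```

With such a split, the open design points are how the continuation is 
triggered and what the guest observes in the consumer index in between.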

> 
> 8. Observation:
> ---------------
> Some guest commands issued towards vIOMMU are propagated to pIOMMU 
> command queue (e.g. TLB invalidate).
> 
> **Risk:**
> Excessive commands requests from abusive guest can cause flooding of 
> physical IOMMU command queue,
> leading to degraded pIOMMU responsivness on commands issued from other 
> guests.
> 
> **Mitigation:**
> 
> - Xen credit scheduler limits guest vCPU execution time, securing basic 
> guest rate-limiting.

Same as above. This mitigation cannot be used.


> - Batch commands which should be propagated towards pIOMMU cmd queue and 
> enable support for batch
>    execution pause/continuation

Can this be expanded?

> - If possible, implement domain penalization by adding a per-domain cost 
> counter for vIOMMU/pIOMMU usage.

Can this be expanded?

> 
> 9. Observation:
> ---------------
> vIOMMU feature includes event queue used for forwarding IOMMU events to 
> guest
> (e.g. translation faults, invalid stream IDs, permission errors).
> A malicious guest can misconfigure its SMMU state or intentionally 
> trigger faults with high frequency.
> 
> **Risk:**
> Occurance of IOMMU events with high frequency can cause Xen to flood the 

s/occurance/occurrence/

> event queue and disrupt scheduling with
> high hypervisor CPU load for events handling.
> 
> **Mitigation:**
> 
> - Implement fail-safe state by disabling events forwarding when faults 
> are occured with high frequency and
>    not processed by guest.

I am not sure I understand how this would work. Can you expand?

> - Batch multiple events of same type to reduce overhead on the virtual 
> SMMUv3 hardware emulation.

Ditto.

> - Consider disabling event queue for untrusted guests

My understanding is there is only a single physical event queue. Xen 
would be responsible for handling the events in the queue and forwarding 
them to the respective guests. If so, it is not clear what you mean by 
"disable event queue".
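
As a rough sketch of the model I have in mind (all names and sizes below 
are hypothetical, and the SID lookup is a placeholder for whatever 
SID->device structure is used): Xen always consumes the physical event, 
and only the forwarding step depends on the per-guest virtual queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_DOMAINS  2u     /* hypothetical number of guests */
#define VEVTQ_DEPTH 4u     /* hypothetical virtual event queue depth */

struct event {
    uint32_t sid;          /* stream ID that raised the event */
    uint32_t code;         /* event type */
};

struct vevtq {
    bool enabled;          /* guest enabled its virtual event queue */
    unsigned int count;    /* events currently queued for the guest */
    unsigned int dropped;  /* events Xen consumed but did not forward */
    struct event ring[VEVTQ_DEPTH];
};

static struct vevtq vevtqs[NR_DOMAINS];

/* Placeholder lookup; a real one would go through the SID->device structure */
static unsigned int sid_to_domain(uint32_t sid)
{
    return sid % NR_DOMAINS;
}

/* Called for every entry read from the single physical event queue */
static void handle_physical_event(const struct event *ev)
{
    struct vevtq *q = &vevtqs[sid_to_domain(ev->sid)];

    assert(q->count <= VEVTQ_DEPTH);

    /* The physical entry is consumed either way; only forwarding is optional */
    if (!q->enabled || q->count == VEVTQ_DEPTH) {
        q->dropped++;
        return;
    }

    q->ring[q->count++] = *ev;
}
```

Note that the cost of reading the physical queue is paid by Xen whether 
or not the event is forwarded.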

> 
> Performance Impact
> ==================
> 
> With iommu stage-1 and nested translation inclusion, performance 
> overhead is introduced comparing to existing,
> stage-2 only usage in Xen. Once mappings are established, translations 
> should not introduce significant overhead.
> Emulated paths may introduce moderate overhead, primarily affecting 
> device initialization and event handling.
> Performance impact highly depends on target CPU capabilities.
> Testing is performed on QEMU virt and Renesas R-Car (QEMU emulated) 
> platforms.

I am afraid QEMU is not a reliable platform to do performance testing. 
Don't you have real HW with vIOMMU support?

[...]

> References
> ==========
> 
> - Original feature implemented by Rahul Singh:
> 
> https://patchwork.kernel.org/project/xen-devel/cover/ 
> cover.1669888522.git.rahul.singh@arm.com/
> - SMMUv3 architecture documentation
> - Existing vIOMMU code patterns

I am not sure what this is referring to?

Cheers,

-- 
Julien Grall




* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-11-27 10:22                   ` Julien Grall
@ 2025-12-02 22:08                     ` Milan Djokic
  2025-12-03 10:32                       ` Julien Grall
  0 siblings, 1 reply; 40+ messages in thread
From: Milan Djokic @ 2025-12-02 22:08 UTC (permalink / raw)
  To: Julien Grall, Volodymyr Babchuk
  Cc: Julien Grall, xen-devel@lists.xenproject.org, Bertrand Marquis,
	Rahul Singh, Stefano Stabellini, Michal Orzel, Jan Beulich,
	Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

Hi Julien,

On 11/27/25 11:22, Julien Grall wrote:
>> We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and
>> pIOMMU. Considering single vIOMMU model limitation pointed out by
>> Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the
>> only proper solution.
> 
> I am not sure to fully understand. My assumption with the single vIOMMU
> is you have a virtual SID that would be mapped to a (pIOMMU, physical
> SID). 

In the original single vIOMMU implementation, vSID was also equal to 
pSID; we didn't have a SW mapping layer between them. Once the SID 
overlap issue was discovered with this model, I switched to the 
vIOMMU-per-pIOMMU model. The alternative was to introduce a SW mapping 
layer and stick with a single vIOMMU model. Imo, a vSID->pSID mapping 
layer would overcomplicate the design, especially for PCI RC streamID 
handling.
On the other hand, if even a multi-vIOMMU model introduces problems that 
I am not aware of yet, adding a complex mapping layer would be the only 
viable solution.
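
For reference, the mapping-layer alternative could look roughly like the 
sketch below. It is only an illustration under assumptions: the 
structure, the function name and the example SIDs are hypothetical and 
not part of the series. Each guest-visible vSID is looked up in a 
per-domain table to obtain a (pIOMMU index, pSID) pair, which is what 
would allow two physical SMMUs to use the same pSID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-domain entry: one guest-visible SID -> one physical pair */
struct vsid_map_entry {
    uint32_t vsid;         /* SID as seen by the guest */
    unsigned int piommu;   /* index of the physical SMMU */
    uint32_t psid;         /* SID on that physical SMMU */
};

/* Linear lookup, adequate for a sketch; returns NULL for an unknown vSID */
static const struct vsid_map_entry *
vsid_lookup(const struct vsid_map_entry *tbl, size_t nr, uint32_t vsid)
{
    assert(tbl != NULL || nr == 0);

    for (size_t i = 0; i < nr; i++)
        if (tbl[i].vsid == vsid)
            return &tbl[i];

    return NULL;
}
```

Every command carrying a SID would need such a translation before being 
propagated to the hardware, which is the extra complexity I would like to 
avoid.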

 > Does this means in your solution you will end up with multiple
 > vPCI as well and then map pBDF == vBDF? (this because the SID have to be
 > fixed at boot)
 >

The important thing which I haven't mentioned here is that our focus is 
on non-PCI devices for this feature atm. If I'm not mistaken, arm PCI 
passthrough is still work in progress, so our plan was to implement full 
vIOMMU PCI support in the future, once PCI passthrough support is 
complete for arm. Of course, we need to make sure that vIOMMU design 
provides a suitable infrastructure for PCI.
To answer your question, yes we will have multiple vPCI nodes with this 
model, establishing 1-1 vSID-pSID mapping (same iommu-map range between 
pPCI-vPCI).
For pBDF to vBDF 1-1 mapping, I'm not sure if this is necessary. My 
understanding is that vBDF->pBDF mapping does not affect vSID->pSID 
mapping. Am I wrong here?


>> ==========================================================
>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>> ==========================================================
>>
>> :Author:     Milan Djokic <milan_djokic@epam.com>
>> :Date:       2025-11-03
>> :Status:     Draft
>>
>> Introduction
>> ============
>>
>> The SMMUv3 supports two stages of translation. Each stage of translation
>> can be
>> independently enabled. An incoming address is logically translated from
>> VA to
>> IPA in stage 1, then the IPA is input to stage 2 which translates the
>> IPA to
>> the output PA. Stage 1 translation support is required to provide
>> isolation between different
>> devices within OS. XEN already supports Stage 2 translation but there is no
>> support for Stage 1 translation.
>> This design proposal outlines the introduction of Stage-1 SMMUv3 support
>> in Xen for ARM guests.
>>
>> Motivation
>> ==========
>>
>> ARM systems utilizing SMMUv3 require stage-1 address translation to
>> ensure secure DMA and
>> guest managed I/O memory mappings.
>> With stage-1 enabed, guest manages IOVA to IPA mappings through its own
>> IOMMU driver.
>>
>> This feature enables:
>>
>> - Stage-1 translation in guest domain
>> - Safe device passthrough with per-device address translation table
> 
> I find this misleading. Even without this feature, device passthrough is
> still safe in the sense a device will be isolated (assuming all the DMA
> goes through the IOMMU) and will not be able to DMA outside of the guest
> memory. What the stage-1 is doing is providing an extra layer to control
> what each device can see. This is useful if you don't trust your devices
> or you want to assign a device to userspace (e.g. for DPDK).
> 

I'll rephrase this.

>>
>> Design Overview
>> ===============
>>
>> These changes provide emulated SMMUv3 support:
> 
> If my understanding is correct, there are all some implications in how
> we create the PCI topology. It would be good to spell them out.
> 

Sure, I will outline them.

>>
>> - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support
>> in SMMUv3 driver.
>> - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
>> handling.
>> - **Register/Command Emulation**: SMMUv3 register emulation and command
>> queue handling.
>> - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to
>> device trees for dom0 and dom0less scenarios.
> 
> What about ACPI?
> 

ACPI support is not part of this feature atm. This will be a topic for 
future updates.

>> - **Runtime Configuration**: Introduces a `viommu` boot parameter for
>> dynamic enablement.
>>
>> Separate vIOMMU device is exposed to guest for every physical IOMMU in
>> the system.
>> vIOMMU feature is designed in a way to provide a generic vIOMMU
>> framework and a backend implementation
>> for target IOMMU as separate components.
>> Backend implementation contains specific IOMMU structure and commands
>> handling (only SMMUv3 currently supported).
>> This structure allows potential reuse of stage-1 feature for other IOMMU
>> types.
>>
>> Security Considerations
>> =======================
>>
>> **viommu security benefits:**
>>
>> - Stage-1 translation ensures guest devices cannot perform unauthorized
>> DMA (device I/O address mapping managed by guest).
>> - Emulated IOMMU removes guest direct dependency on IOMMU hardware,
>> while maintaining domains isolation.
> 
> Sorry, I don't follow this argument. Are you saying that it would be
> possible to emulate a SMMUv3 vIOMMU on top of the IPMMU?
> 

No, this would not work. Emulated IOMMU has to match with the pIOMMU type.
The argument only points out that we are emulating IOMMU, so the guest 
does not need direct HW interface for IOMMU functions.

>> 1. Observation:
>> ---------------
>> Support for Stage-1 translation in SMMUv3 introduces new data structures
>> (`s1_cfg` alongside `s2_cfg`)
>> and logic to write both Stage-1 and Stage-2 entries in the Stream Table
>> Entry (STE), including an `abort`
>> field to handle partial configuration states.
>>
>> **Risk:**
>> Without proper handling, a partially applied Stage-1 configuration might
>> leave guest DMA mappings in an
>> inconsistent state, potentially enabling unauthorized access or causing
>> cross-domain interference.
> 
> How so? Even if you misconfigure the S1, the S2 would still be properly
> configured (you just mention partially applied stage-1).
> 

This could be the case when only stage-1 is in use. But yes, this is an 
improbable case for Xen; stage-2 should be mentioned as well. I will fix 
this.

>>
>> **Mitigation:** *(Handled by design)*
>> This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to
>> STE and manages the `abort` field-only
>> considering Stage-1 configuration if fully attached. This ensures
>> incomplete or invalid guest configurations
>> are safely ignored by the hypervisor.
> 
> Can you clarify what you mean by invalid guest configurations?
> 

s1 and s2 config will be considered only if configured for the guest 
device. E.g. if only stage-2 is attached for the guest device, stage-1 
configuration will be invalid, but safely ignored. I'll change this to 
"device configuration" instead of ambiguous "guest configuration".

>>
>> 2. Observation:
>> ---------------
>> Guests can now invalidate Stage-1 caches; invalidation needs forwarding
>> to SMMUv3 hardware to maintain coherence.
>>
>> **Risk:**
>> Failing to propagate cache invalidation could allow stale mappings,
>> enabling access to old mappings and possibly
>> data leakage or misrouting.
> 
> You are referring to data leakage/misrouting between two devices own by
> the same guest, right? Xen would still be in charge of flush when the
> stage-2 is updated.
> 

Yes, this risk could affect only guests, not xen.

>>
>> **Mitigation:** *(Handled by design)*
>> This feature ensures that guest-initiated invalidations are correctly
>> forwarded to the hardware,
>> preserving IOMMU coherency.
> 
> How is this a mitigation? You have to properly handle commands. If you
> don't properly handle them, then yes it will break.
> 

Not really a mitigation, will remove it. Guest is responsible for the 
regular initiation of invalidation requests to mitigate this risk.

>>
>> 4. Observation:
>> ---------------
>> The code includes transformations to handle nested translation versus
>> standard modes and uses guest-configured
>> command queues (e.g., `CMD_CFGI_STE`) and event notifications.
>>
>> **Risk:**
>> Malicious or malformed queue commands from guests could bypass
>> validation, manipulate SMMUv3 state,
>> or cause system instability.
>>
>> **Mitigation:** *(Handled by design)*
>> Built-in validation of command queue entries and sanitization mechanisms
>> ensure only permitted configurations
>> are applied.
> 
> This is true as long as we didn't make an mistake in the configurations ;).
> 

Yes, but I don’t see anything we can do to prevent configuration mistakes.

> 
>> This is supported via additions in `vsmmuv3` and `cmdqueue`
>> handling code.
>>
>> 5. Observation:
>> ---------------
>> Device Tree modifications enable device assignment and configuration
>> through guest DT fragments (e.g., `iommus`)
>> are added via `libxl`.
>>
>> **Risk:**
>> Erroneous or malicious Device Tree injection could result in device
>> misbinding or guest access to unauthorized
>> hardware.
> 
> The DT fragment are not security support and will never be at least
> until you have can a libfdt that is able to detect malformed Device-Tree
> (I haven't checked if this has changed recently).
> 

But this should still be considered a risk? Similar to the previous 
observation, system integrator should ensure that DT fragments are correct.

>>
>> **Mitigation:**
>>
>> - `libxl` perform checks of guest configuration and parse only
>> predefined dt fragments and nodes, reducing risk.
>> - The system integrator must ensure correct resource mapping in the
>> guest Device Tree (DT) fragments.
>>
>> 6. Observation:
>> ---------------
>> Introducing optional per-guest enabled features (`viommu` argument in xl
>> guest config) means some guests
>> may opt-out.
>>
>> **Risk:**
>> Differences between guests with and without `viommu` may cause
>> unexpected behavior or privilege drift.
> 
> I don't understand this risk. Can you clarify?
> 

This risk is similar to the topics discussed in Observations 8 and 9, 
but in the context of vIOMMU-disabled guests potentially hogging the 
command and event queues due to faster processing of iommu requests. I 
will expand this.

>>
>> **Mitigation:**
>> Verify that downgrade paths are safe and well-isolated; ensure missing
>> support doesn't cause security issues.
>> Additional audits on emulation paths and domains interference need to be
>> performed in a multi-guest environment.
>>
>> 7. Observation:
>> ---------------
> 
> This observation with 7, 8 and 9 are the most important observations but
> it seems to be missing some details on how this will be implemented. I
> will try to provide some questions that should help filling the gaps.
> 

Thanks, I will expand these observations according to comments.

>> Guests have the ability to issue Stage-1 IOMMU commands like cache
>> invalidation, stream table entries
>> configuration, etc. An adversarial guest may issue a high volume of
>> commands in rapid succession.
>>
>> **Risk:**
>> Excessive commands requests can cause high hypervisor CPU consumption
>> and disrupt scheduling,
>> leading to degraded system responsiveness and potential denial-of-
>> service scenarios.
>>
>> **Mitigation:**
>>
>> - Xen scheduler limits guest vCPU execution time, securing basic guest
>> rate-limiting.
> 
> This really depends on your scheduler. Some scheduler (e.g. NULL) will
> not do any scheduling at all. Furthermore, the scheduler only preempt
> EL1/EL0. It doesn't preempt EL2, so any long running operation need
> manual preemption. Therefore, I wouldn't consider this as a mitigation.
> 
>> - Batch multiple commands of same type to reduce overhead on the virtual
>> SMMUv3 hardware emulation.
> 
> The guest can send commands in any order. So can you expand how this
> would work? Maybe with some example.
> 
>> - Implement vIOMMU commands execution restart and continuation support
> 
> This needs a bit more details. How will you decide whether to restart
> and what would be the action? (I guess it will be re-executing the
> instruction to write to the CWRITER).
> 
>>
>> 8. Observation:
>> ---------------
>> Some guest commands issued towards vIOMMU are propagated to pIOMMU
>> command queue (e.g. TLB invalidate).
>>
>> **Risk:**
>> Excessive commands requests from abusive guest can cause flooding of
>> physical IOMMU command queue,
>> leading to degraded pIOMMU responsivness on commands issued from other
>> guests.
>>
>> **Mitigation:**
>>
>> - Xen credit scheduler limits guest vCPU execution time, securing basic
>> guest rate-limiting.
> 
> Same as above. This mitigation cannot be used.
> 
> 
>> - Batch commands which should be propagated towards pIOMMU cmd queue and
>> enable support for batch
>>     execution pause/continuation
> 
> Can this be expanded?
> 
>> - If possible, implement domain penalization by adding a per-domain cost
>> counter for vIOMMU/pIOMMU usage.
> 
> Can this be expanded?
> 
>>
>> 9. Observation:
>> ---------------
>> vIOMMU feature includes event queue used for forwarding IOMMU events to
>> guest
>> (e.g. translation faults, invalid stream IDs, permission errors).
>> A malicious guest can misconfigure its SMMU state or intentionally
>> trigger faults with high frequency.
>>
>> **Risk:**
>> Occurance of IOMMU events with high frequency can cause Xen to flood the
> 
> s/occurance/occurrence/
> 
>> event queue and disrupt scheduling with
>> high hypervisor CPU load for events handling.
>>
>> **Mitigation:**
>>
>> - Implement fail-safe state by disabling events forwarding when faults
>> are occured with high frequency and
>>     not processed by guest.
> 
> I am not sure to understand how this would work. Can you expand?
> 
>> - Batch multiple events of same type to reduce overhead on the virtual
>> SMMUv3 hardware emulation.
> 
> Ditto.
> 
>> - Consider disabling event queue for untrusted guests
> 
> My understanding is there is only a single physical event queue. Xen
> would be responsible to handle the events in the queue and forward to
> the respective guests. If so, it is not clear what you mean by "disable
> event queue".
> 

I was referring to emulated IOMMU event queue. The idea is to make it 
optional for guests. When disabled, events won't be propagated to the guest.

>>
>> Performance Impact
>> ==================
>>
>> With iommu stage-1 and nested translation inclusion, performance
>> overhead is introduced comparing to existing,
>> stage-2 only usage in Xen. Once mappings are established, translations
>> should not introduce significant overhead.
>> Emulated paths may introduce moderate overhead, primarily affecting
>> device initialization and event handling.
>> Performance impact highly depends on target CPU capabilities.
>> Testing is performed on QEMU virt and Renesas R-Car (QEMU emulated)
>> platforms.
> 
> I am afraid QEMU is not a reliable platform to do performance testing.
> Don't you have a real HW with vIOMMU support?
> 

Yes, I will provide performance measurements for Renesas HW as well.

> [...]
> 
>> References
>> ==========
>>
>> - Original feature implemented by Rahul Singh:
>>
>> https://patchwork.kernel.org/project/xen-devel/cover/
>> cover.1669888522.git.rahul.singh@arm.com/
>> - SMMUv3 architecture documentation
>> - Existing vIOMMU code patterns
> 
> I am not sure what this is referring to?
> 

QEMU and KVM IOMMU emulation patterns were used as a reference.

BR,
Milan




* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-12-02 22:08                     ` Milan Djokic
@ 2025-12-03 10:32                       ` Julien Grall
  2025-12-03 15:47                         ` Milan Djokic
  0 siblings, 1 reply; 40+ messages in thread
From: Julien Grall @ 2025-12-03 10:32 UTC (permalink / raw)
  To: Milan Djokic, Volodymyr Babchuk
  Cc: Julien Grall, xen-devel@lists.xenproject.org, Bertrand Marquis,
	Rahul Singh, Stefano Stabellini, Michal Orzel, Jan Beulich,
	Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

Hi,

On 02/12/2025 22:08, Milan Djokic wrote:
> Hi Julien,
> 
> On 11/27/25 11:22, Julien Grall wrote:
>>> We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and
>>> pIOMMU. Considering single vIOMMU model limitation pointed out by
>>> Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the
>>> only proper solution.
>>
>> I am not sure to fully understand. My assumption with the single vIOMMU
>> is you have a virtual SID that would be mapped to a (pIOMMU, physical
>> SID). 
> 
> In the original single vIOMMU implementation, vSID was also equal to 
> pSID, we didn't have SW mapping layer between them. Once SID overlap 
> issue was discovered with this model, I have switched to vIOMMU-per- 
> pIOMMU model. Alternative was to introduce a SW mapping layer and stick 
> with a single vIOMMU model. Imo, vSID->pSID mapping layer would 
> overcomplicate the design, especially for PCI RC streamIDs handling.
> On the other hand, if even a multi-vIOMMU model introduces problems that 
> I am not aware of yet, adding a complex mapping layer would be the only 
> viable solution.
> 
>  > Does this means in your solution you will end up with multiple
>  > vPCI as well and then map pBDF == vBDF? (this because the SID have to be
>  > fixed at boot)
>  >
> 
> The important thing which I haven't mentioned here is that our focus is 
> on non-PCI devices for this feature atm. If I'm not mistaken, arm PCI 
> passthrough is still work in progress, so our plan was to implement full 
> vIOMMU PCI support in the future, once PCI passthrough support is 
> complete for arm. Of course, we need to make sure that vIOMMU design 
> provides a suitable infrastructure for PCI.
> To answer your question, yes we will have multiple vPCI nodes with this 
> model, establishing 1-1 vSID-pSID mapping (same iommu-map range between 
> pPCI-vPCI).
> For pBDF to vBDF 1-1 mapping, I'm not sure if this is necessary. My 
> understanding is that vBDF->pBDF mapping does not affect vSID->pSID 
> mapping. Am I wrong here?

From my understanding, the mapping between a vBDF and vSID is set up at 
domain creation (as this is described in ACPI/Device-Tree). As PCI 
devices can be hotplugged, if you want to enforce vSID == pSID, then you 
indirectly need to enforce vBDF == pBDF.
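
To illustrate with the Device-Tree case: an `iommu-map` range makes the 
SID a fixed function of the PCI requester ID (the BDF), so two devices at 
different BDFs under the same range get different SIDs. The sketch below 
is a minimal model of that translation; the structure and function names 
are made up for the example and the IOMMU phandle is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One (rid-base, iommu-base, length) range of an iommu-map property */
struct iommu_map_entry {
    uint32_t rid_base;
    uint32_t sid_base;
    uint32_t length;
};

/* PCI requester ID layout: bus[15:8], device[7:3], function[2:0] */
static uint32_t bdf_to_rid(unsigned int bus, unsigned int dev, unsigned int fn)
{
    assert(bus < 256 && dev < 32 && fn < 8);
    return (bus << 8) | (dev << 3) | fn;
}

/* Translate a requester ID to a stream ID; false if no range covers it */
static bool rid_to_sid(const struct iommu_map_entry *map, size_t nr,
                       uint32_t rid, uint32_t *sid)
{
    for (size_t i = 0; i < nr; i++) {
        if (rid >= map[i].rid_base &&
            rid - map[i].rid_base < map[i].length) {
            *sid = rid - map[i].rid_base + map[i].sid_base;
            return true;
        }
    }

    return false;
}
```

With the same range on the physical and the virtual side, the vSID only 
equals the pSID when the device sits at the same BDF in both.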

[...]

>>> - **Runtime Configuration**: Introduces a `viommu` boot parameter for
>>> dynamic enablement.
>>>
>>> Separate vIOMMU device is exposed to guest for every physical IOMMU in
>>> the system.
>>> vIOMMU feature is designed in a way to provide a generic vIOMMU
>>> framework and a backend implementation
>>> for target IOMMU as separate components.
>>> Backend implementation contains specific IOMMU structure and commands
>>> handling (only SMMUv3 currently supported).
>>> This structure allows potential reuse of stage-1 feature for other IOMMU
>>> types.
>>>
>>> Security Considerations
>>> =======================
>>>
>>> **viommu security benefits:**
>>>
>>> - Stage-1 translation ensures guest devices cannot perform unauthorized
>>> DMA (device I/O address mapping managed by guest).
>>> - Emulated IOMMU removes guest direct dependency on IOMMU hardware,
>>> while maintaining domains isolation.
>>
>> Sorry, I don't follow this argument. Are you saying that it would be
>> possible to emulate a SMMUv3 vIOMMU on top of the IPMMU?
>>
> 
> No, this would not work. Emulated IOMMU has to match with the pIOMMU type.
> The argument only points out that we are emulating IOMMU, so the guest 
> does not need direct HW interface for IOMMU functions.

Sorry, but I am still missing how this is a security benefit.

[...]


>>>
>>> 2. Observation:
>>> ---------------
>>> Guests can now invalidate Stage-1 caches; invalidation needs forwarding
>>> to SMMUv3 hardware to maintain coherence.
>>>
>>> **Risk:**
>>> Failing to propagate cache invalidation could allow stale mappings,
>>> enabling access to old mappings and possibly
>>> data leakage or misrouting.
>>
>> You are referring to data leakage/misrouting between two devices own by
>> the same guest, right? Xen would still be in charge of flush when the
>> stage-2 is updated.
>>
> 
> Yes, this risk could affect only guests, not xen.

But it would affect a single guest, right? IOW, it is not possible for 
guest A to leak data to guest B even if we don't properly invalidate 
stage-1. Correct?

> 
>>>
>>> **Mitigation:** *(Handled by design)*
>>> This feature ensures that guest-initiated invalidations are correctly
>>> forwarded to the hardware,
>>> preserving IOMMU coherency.
>>
>> How is this a mitigation? You have to properly handle commands. If you
>> don't properly handle them, then yes it will break.
>>
> 
> Not really a mitigation, will remove it. Guest is responsible for the 
> regular initiation of invalidation requests to mitigate this risk.
> 
>>>
>>> 4. Observation:
>>> ---------------
>>> The code includes transformations to handle nested translation versus
>>> standard modes and uses guest-configured
>>> command queues (e.g., `CMD_CFGI_STE`) and event notifications.
>>>
>>> **Risk:**
>>> Malicious or malformed queue commands from guests could bypass
>>> validation, manipulate SMMUv3 state,
>>> or cause system instability.
>>>
>>> **Mitigation:** *(Handled by design)*
>>> Built-in validation of command queue entries and sanitization mechanisms
>>> ensure only permitted configurations
>>> are applied.
>>
>> This is true as long as we didn't make an mistake in the 
>> configurations ;).
>>
> 
> Yes, but I don’t see anything we can do to prevent configuration mistakes.

There is nothing really preventing it. Same for ...
> 
>>
>>> This is supported via additions in `vsmmuv3` and `cmdqueue`
>>> handling code.
>>>
>>> 5. Observation:
>>> ---------------
>>> Device Tree modifications enable device assignment and configuration
>>> through guest DT fragments (e.g., `iommus`)
>>> are added via `libxl`.
>>>
>>> **Risk:**
>>> Erroneous or malicious Device Tree injection could result in device
>>> misbinding or guest access to unauthorized
>>> hardware.
>>
>> The DT fragment are not security support and will never be at least
>> until you have can a libfdt that is able to detect malformed Device-Tree
>> (I haven't checked if this has changed recently).
>>
> 
> But this should still be considered a risk? Similar to the previous 
> observation, system integrator should ensure that DT fragments are correct.

... this one. I agree they are risks, but they don't provide much input 
in the design of the vIOMMU.

I am a lot more concerned about the scheduling part because the resources 
are shared.

>> My understanding is there is only a single physical event queue. Xen
>> would be responsible to handle the events in the queue and forward to
>> the respective guests. If so, it is not clear what you mean by "disable
>> event queue".
>>
> 
> I was referring to emulated IOMMU event queue. The idea is to make it 
> optional for guests. When disabled, events won't be propagated to the 
> guest.

But Xen will still receive the events, correct? If so, how does it make 
it better?

> 
>>>
>>> Performance Impact
>>> ==================
>>>
>>> With iommu stage-1 and nested translation inclusion, performance
>>> overhead is introduced comparing to existing,
>>> stage-2 only usage in Xen. Once mappings are established, translations
>>> should not introduce significant overhead.
>>> Emulated paths may introduce moderate overhead, primarily affecting
>>> device initialization and event handling.
>>> Performance impact highly depends on target CPU capabilities.
>>> Testing is performed on QEMU virt and Renesas R-Car (QEMU emulated)
>>> platforms.
>>
>> I am afraid QEMU is not a reliable platform to do performance testing.
>> Don't you have a real HW with vIOMMU support?
>>
> 
> Yes, I will provide performance measurement for Renesas HW also.

FWIW, I don't need to know the performance right now. I am mostly 
pointing out that if you want to provide performance numbers, then they 
should really come from real HW rather than QEMU.

Cheers,

-- 
Julien Grall




* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-12-03 10:32                       ` Julien Grall
@ 2025-12-03 15:47                         ` Milan Djokic
  2026-02-13  3:18                           ` Milan Djokic
  0 siblings, 1 reply; 40+ messages in thread
From: Milan Djokic @ 2025-12-03 15:47 UTC (permalink / raw)
  To: Julien Grall, Volodymyr Babchuk
  Cc: Julien Grall, xen-devel@lists.xenproject.org, Bertrand Marquis,
	Rahul Singh, Stefano Stabellini, Michal Orzel, Jan Beulich,
	Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper

Hi Julien,
On 12/3/25 11:32, Julien Grall wrote:
> Hi,
> 
> On 02/12/2025 22:08, Milan Djokic wrote:
>> Hi Julien,
>>
>> On 11/27/25 11:22, Julien Grall wrote:
>>>> We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and
>>>> pIOMMU. Considering single vIOMMU model limitation pointed out by
>>>> Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the
>>>> only proper solution.
>>
>>   > Does this means in your solution you will end up with multiple
>>   > vPCI as well and then map pBDF == vBDF? (this because the SID have to be
>>   > fixed at boot)
>>   >
>>
>> To answer your question, yes we will have multiple vPCI nodes with this
>> model, establishing 1-1 vSID-pSID mapping (same iommu-map range between
>> pPCI-vPCI).
>> For pBDF to vBDF 1-1 mapping, I'm not sure if this is necessary. My
>> understanding is that vBDF->pBDF mapping does not affect vSID->pSID
>> mapping. Am I wrong here?
> 
>   From my understanding, the mapping between a vBDF and vSID is setup at
> domain creation (as this is described in ACPI/Device-Tree). As PCI
> devices can be hotplug, if you want to enforce vSID == pSID, then you
> indirectly need to enforce vBDF == pBDF.
> 

I was not aware of that. I will have to do a detailed analysis on this 
and come back with a solution. Right now I'm not sure whether and how 
enumeration will work with the multi-vIOMMU/vPCI model. If that's not 
possible, we will have to introduce a mapping layer for vSID->pSID and 
go back to the single vPCI/vIOMMU model.

> [...]
> 
>>>> - **Runtime Configuration**: Introduces a `viommu` boot parameter for
>>>> dynamic enablement.
>>>>
>>>> Separate vIOMMU device is exposed to guest for every physical IOMMU in
>>>> the system.
>>>> vIOMMU feature is designed in a way to provide a generic vIOMMU
>>>> framework and a backend implementation
>>>> for target IOMMU as separate components.
>>>> Backend implementation contains specific IOMMU structure and commands
>>>> handling (only SMMUv3 currently supported).
>>>> This structure allows potential reuse of stage-1 feature for other IOMMU
>>>> types.
>>>>
>>>> Security Considerations
>>>> =======================
>>>>
>>>> **viommu security benefits:**
>>>>
>>>> - Stage-1 translation ensures guest devices cannot perform unauthorized
>>>> DMA (device I/O address mapping managed by guest).
>>>> - Emulated IOMMU removes guest direct dependency on IOMMU hardware,
>>>> while maintaining domains isolation.
>>>
>>> Sorry, I don't follow this argument. Are you saying that it would be
>>> possible to emulate a SMMUv3 vIOMMU on top of the IPMMU?
>>>
>>
>> No, this would not work. Emulated IOMMU has to match with the pIOMMU type.
>> The argument only points out that we are emulating IOMMU, so the guest
>> does not need direct HW interface for IOMMU functions.
> 
> Sorry, but I am still missing how this is a security benefits.
> 

Yes, this is a mistake. This should be in the design section.

> [...]
> 
> 
>>>>
>>>> 2. Observation:
>>>> ---------------
>>>> Guests can now invalidate Stage-1 caches; invalidation needs forwarding
>>>> to SMMUv3 hardware to maintain coherence.
>>>>
>>>> **Risk:**
>>>> Failing to propagate cache invalidation could allow stale mappings,
>>>> enabling access to old mappings and possibly
>>>> data leakage or misrouting.
>>>
>>> You are referring to data leakage/misrouting between two devices own by
>>> the same guest, right? Xen would still be in charge of flush when the
>>> stage-2 is updated.
>>>
>>
>> Yes, this risk could affect only guests, not xen.
> 
> But it would affect a single guest right? IOW, it is not possible for
> guest A to leak data to guest B even if we don't properly invalidate
> stage-1. Correct?
> 

Correct. I don't see any possible scenario for data leakage between 
different guests, just between 2 devices assigned to the same guest.
I will elaborate on this risk to make it clearer.

>>>>
>>>> 4. Observation:
>>>> ---------------
>>>> The code includes transformations to handle nested translation versus
>>>> standard modes and uses guest-configured
>>>> command queues (e.g., `CMD_CFGI_STE`) and event notifications.
>>>>
>>>> **Risk:**
>>>> Malicious or malformed queue commands from guests could bypass
>>>> validation, manipulate SMMUv3 state,
>>>> or cause system instability.
>>>>
>>>> **Mitigation:** *(Handled by design)*
>>>> Built-in validation of command queue entries and sanitization mechanisms
>>>> ensure only permitted configurations
>>>> are applied.
>>>
>>> This is true as long as we didn't make an mistake in the
>>> configurations ;).
>>>
>>
>> Yes, but I don’t see anything we can do to prevent configuration mistakes.
> 
> There is nothing really preventing it. Same for ...
>>
>>>
>>>> This is supported via additions in `vsmmuv3` and `cmdqueue`
>>>> handling code.
>>>>
>>>> 5. Observation:
>>>> ---------------
>>>> Device Tree modifications enable device assignment and configuration
>>>> through guest DT fragments (e.g., `iommus`)
>>>> are added via `libxl`.
>>>>
>>>> **Risk:**
>>>> Erroneous or malicious Device Tree injection could result in device
>>>> misbinding or guest access to unauthorized
>>>> hardware.
>>>
>>> The DT fragment are not security support and will never be at least
>>> until you have can a libfdt that is able to detect malformed Device-Tree
>>> (I haven't checked if this has changed recently).
>>>
>>
>> But this should still be considered a risk? Similar to the previous
>> observation, system integrator should ensure that DT fragments are correct.
> 
> ... this one. I agree they are risks, but they don't provide much input
> in the design of the vIOMMU.
> 

I get your point. I can remove them if considered to be overhead in this 
context.

> I am a lot more concerned for the scheduling part because the resources
> are shared.
> 
>>> My understanding is there is only a single physical event queue. Xen
>>> would be responsible to handle the events in the queue and forward to
>>> the respective guests. If so, it is not clear what you mean by "disable
>>> event queue".
>>>
>>
>> I was referring to emulated IOMMU event queue. The idea is to make it
>> optional for guests. When disabled, events won't be propagated to the
>> guest.
> 
> But Xen will still receive the events, correct? If so, how does it make
> it better?
> 

You are correct, Xen will still receive events and handle them in pIOMMU 
driver. This is only a mitigation for the part introduced by vIOMMU 
design (events emulation), not the complete solution. This risk has more 
general context and could be related to stage-2 only guests also (e.g. 
guests that perform DMA to an address they are not allowed to access, 
causing translation faults).
But IMO, mitigation for physical event queue flooding should be part
of the pIOMMU driver design.

Best regards,
Milan




* Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
  2025-12-03 15:47                         ` Milan Djokic
@ 2026-02-13  3:18                           ` Milan Djokic
  0 siblings, 0 replies; 40+ messages in thread
From: Milan Djokic @ 2026-02-13  3:18 UTC (permalink / raw)
  To: Julien Grall
  Cc: Julien Grall, xen-devel@lists.xenproject.org, Bertrand Marquis,
	Rahul Singh, Stefano Stabellini, Michal Orzel, Jan Beulich,
	Roger Pau Monné, Anthony PERARD, Nick Rosbrook,
	George Dunlap, Juergen Gross, Andrew Cooper, Volodymyr Babchuk

Hi Julien,

On 12/3/25 16:47, Milan Djokic wrote:
> Hi Julien,
> On 12/3/25 11:32, Julien Grall wrote:
>> Hi,
>>
>> On 02/12/2025 22:08, Milan Djokic wrote:
>>> Hi Julien,
>>>
>>> On 11/27/25 11:22, Julien Grall wrote:
>>>>> We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and
>>>>> pIOMMU. Considering single vIOMMU model limitation pointed out by
>>>>> Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the
>>>>> only proper solution.
>>>
>>>    > Does this means in your solution you will end up with multiple
>>>    > vPCI as well and then map pBDF == vBDF? (this because the SID have to be
>>>    > fixed at boot)
>>>    >
>>>
>>> To answer your question, yes we will have multiple vPCI nodes with this
>>> model, establishing 1-1 vSID-pSID mapping (same iommu-map range between
>>> pPCI-vPCI).
>>> For pBDF to vBDF 1-1 mapping, I'm not sure if this is necessary. My
>>> understanding is that vBDF->pBDF mapping does not affect vSID->pSID
>>> mapping. Am I wrong here?
>>
>>    From my understanding, the mapping between a vBDF and vSID is setup at
>> domain creation (as this is described in ACPI/Device-Tree). As PCI
>> devices can be hotplug, if you want to enforce vSID == pSID, then you
>> indirectly need to enforce vBDF == pBDF.
>>
> 
> I was not aware of that. I will have to do a detailed analysis on this
> and come back with a solution. Right now I'm not sure how and if
> enumeration will work with multi vIOMMU/vPCI model. If that's not
> possible, we will have to introduce a mapping layer for vSID->pSID and
> go back to single vPCI/vIOMMU model.
> 
>> [...]
>>

I have updated the vIOMMU design following our previous discussion on
this topic and some additional use cases which we encountered in the
meantime. I have changed the implementation, which now provides a single
vIOMMU to the guest, with a Xen mapping layer which translates it into the
physical IOMMU layout. This design supports multiple physical IOMMUs and
also aligns with the ongoing vPCI/PCI passthrough work.
The new vIOMMU design is provided below; could you please review it?
Changes compared to the previous design version:
- Switched from N-N to 1-N vIOMMU-pIOMMU model, with the addition of a
vSID->pSID mapping layer
- Added some details related to vIOMMU emulation flow (commands, events) 
and the new vSID->pSID mapping layer
- Assumptions and constraints for vPCI compatibility. PCI support is not 
yet complete, planned to be implemented in alignment with the ongoing 
PCI passthrough work
- Removed security considerations which are not directly related to 
vIOMMU (xl, libfdt)
- Expanded mitigations for scheduling-related risks
- Added initial performance measurements for the Renesas R-Car platform 
(to be extended with future PCI support work)

==========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

:Author:     Milan Djokic <milan_djokic@epam.com>
:Date:       2026-02-13
:Status:     Draft

Introduction
============

The SMMUv3 supports two stages of translation, each of which can be
enabled independently. An incoming address is logically translated from VA
to IPA in stage 1, then the IPA is input to stage 2, which translates it to
the output PA. Stage-1 translation support is required to provide isolation
between different devices within an OS. Xen already supports stage-2
translation, but there is no support for stage-1 translation.
This design proposal outlines the introduction of stage-1 SMMUv3 support
in Xen for ARM guests.
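
As a minimal illustration of the two-stage composition described above, the
following toy sketch chains a stage-1 lookup (VA to IPA) into a stage-2 lookup
(IPA to PA). The single-level tables, `struct toy_table` and the function
names are assumptions made for this example only; they do not reflect the
SMMUv3 page table format or any Xen code.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_NR_PAGES   4
#define TOY_INVALID    UINT64_MAX

/* Toy single-level table: index = input page, value = output page. */
struct toy_table {
    uint64_t out_page[TOY_NR_PAGES];
};

/* Translate one address through one stage; false means a fault. */
static bool toy_translate(const struct toy_table *t, uint64_t in,
                          uint64_t *out)
{
    uint64_t page = in >> TOY_PAGE_SHIFT;
    uint64_t offset = in & ((1ULL << TOY_PAGE_SHIFT) - 1);

    if (page >= TOY_NR_PAGES || t->out_page[page] == TOY_INVALID)
        return false;

    *out = (t->out_page[page] << TOY_PAGE_SHIFT) | offset;
    return true;
}

/* Nested translation: stage 1 (guest-owned), then stage 2 (Xen-owned). */
static bool nested_translate(const struct toy_table *s1,
                             const struct toy_table *s2,
                             uint64_t va, uint64_t *pa)
{
    uint64_t ipa;

    return toy_translate(s1, va, &ipa) && toy_translate(s2, ipa, pa);
}
```

A fault in either stage stops the translation, which is why the guest can
restrict a device further through stage 1 but can never reach memory that
stage 2 does not map.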

Motivation
==========

ARM systems utilizing SMMUv3 require stage-1 address translation to
ensure secure DMA and guest-managed I/O memory mappings.
With stage-1 enabled, the guest manages IOVA to IPA mappings through its own
IOMMU driver.

This feature enables:

- Stage-1 translation for the guest domain
- Device passthrough with per-device I/O address space

Design Overview
===============

These changes provide emulated SMMUv3 support:

- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support
   in SMMUv3 driver.
- **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
   handling.
- **Register/Command Emulation**: SMMUv3 register emulation and command
   queue handling.
- **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to
   device trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: Introduces a `viommu` boot parameter for
   dynamic enablement.

A single vIOMMU device is exposed to the guest and mapped to one or more
physical IOMMUs through a Xen-managed translation layer.
The vIOMMU feature provides a generic framework together with a backend
implementation specific to the target IOMMU type. The backend is responsible
for implementing the hardware-specific data structures and command handling
logic (currently only SMMUv3 is supported).

This modular design allows the stage-1 support to be reused
for other IOMMU architectures in the future.

vIOMMU architecture
===================

Responsibilities:

Guest:
  - Configures stage-1 via vIOMMU commands.
  - Handles stage-1 faults received from Xen.

Xen:
  - Emulates the IOMMU interface (registers, commands, events).
  - Provides vSID->pSID mappings.
  - Programs stage-1/stage-2 configuration in the physical IOMMU.
  - Propagates stage-1 faults to the guest.

vIOMMU commands and faults are transmitted between guest and Xen via
command and event queues (one command/event queue created per guest).

vIOMMU command Flow:

::

     Guest:
         smmu_cmd(vSID, IOVA -> IPA)

     Xen:
         trap MMIO read/write
         translate vSID->pSID
         store stage-1 state
         program pIOMMU for (pSID, IPA -> PA)

All hardware programming of the physical IOMMU is performed exclusively 
by Xen.

vIOMMU Stage-1 fault handling flow:

::

     Xen:
         receives stage-1 fault
         triggers vIOMMU callback
         injects virtual fault

     Guest:
         receives and handles fault

vSID Mapping Layer
------------------

Each guest-visible Stream ID (vSID) is mapped by Xen to a physical Stream ID
(pSID). The mapping is maintained per-domain. The allocation policy
guarantees vSID uniqueness within a domain while allowing the same pSID
value to be used on different pIOMMUs.

* Platform devices receive individually allocated vSIDs.
* PCI devices receive a contiguous vSID range derived from RID space.
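
A minimal sketch of such a per-domain mapping for platform devices is shown
below. It assumes a small fixed-size table indexed by vSID; `struct vsid_map`,
`vsid_alloc()`, `vsid_lookup()` and the capacity constant are hypothetical
names chosen for illustration and are not taken from the patch series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VSID_MAP_MAX 64 /* hypothetical per-domain capacity */

/* One vSID -> (pIOMMU, pSID) entry. */
struct vsid_entry {
    bool     used;
    uint32_t piommu_id; /* physical IOMMU the device sits behind */
    uint32_t psid;      /* Stream ID on that physical IOMMU */
};

/* Per-domain table indexed by vSID, so a vSID is unique within a domain. */
struct vsid_map {
    struct vsid_entry e[VSID_MAP_MAX];
};

/* Allocate the lowest free vSID for a platform device; -1 if full. */
static int vsid_alloc(struct vsid_map *m, uint32_t piommu_id, uint32_t psid)
{
    for (int v = 0; v < VSID_MAP_MAX; v++) {
        if (!m->e[v].used) {
            m->e[v].used = true;
            m->e[v].piommu_id = piommu_id;
            m->e[v].psid = psid;
            return v;
        }
    }
    return -1;
}

/* Translate a guest vSID; NULL for unmapped or out-of-range IDs. */
static const struct vsid_entry *vsid_lookup(const struct vsid_map *m,
                                            uint32_t vsid)
{
    if (vsid >= VSID_MAP_MAX || !m->e[vsid].used)
        return NULL;
    return &m->e[vsid];
}
```

Because the key is the vSID and the value carries the pIOMMU identity, two
devices with the same pSID behind different pIOMMUs simply receive two
different vSIDs.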

Supported Device Model
======================

Currently, the vIOMMU framework supports only devices described via the
Device Tree (DT) model. This includes platform devices and basic support
for PCI devices instantiated through the vPCI DT node. ACPI-described
devices are not supported.

Guest-assigned platform devices are mapped via the `iommus` property:

::

     <&pIOMMU pSID> -> <&vIOMMU vSID>

PCI devices use RID-based mapping via the root complex `iommu-map`:

::

     <RID-base &viommu vSID-base length>

PCI Topology Assumptions and Constraints:

- RID space must be contiguous
- Pre-defined contiguous pSID space (0-0x1000)
- No runtime PCI reconfiguration
- Single root complex assumed
- Mapping is fixed at guest DT construction

Constraints for PCI devices will be addressed as part of the future work on
this feature.
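
The RID-based `iommu-map` entry shown in this section can be read as a linear
offset: under the usual Device Tree `iommu-map` semantics, a RID inside
`[RID-base, RID-base + length)` maps to `vSID-base + (RID - RID-base)`. The
sketch below illustrates that arithmetic for a single entry; the struct and
function names are hypothetical and not part of the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* One iommu-map entry: <RID-base &viommu vSID-base length>. */
struct iommu_map_entry {
    uint32_t rid_base;
    uint32_t vsid_base;
    uint32_t length;
};

/* Build a PCI requester ID from bus/device/function numbers. */
static uint16_t pci_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

/* Map a RID to a vSID; false if the RID is outside the entry's range. */
static bool rid_to_vsid(const struct iommu_map_entry *e, uint32_t rid,
                        uint32_t *vsid)
{
    if (rid < e->rid_base || rid - e->rid_base >= e->length)
        return false;

    *vsid = e->vsid_base + (rid - e->rid_base);
    return true;
}
```

This is also why the constraints above require a contiguous RID space: a
single base/length pair can only describe one unbroken range.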

Security Considerations
=======================

Stage-1 translation provides isolation between guest devices by
enforcing a per-device I/O address space, preventing unauthorized DMA.
With the introduction of emulated IOMMU, additional protection
mechanisms are required to minimize security risks.

1. Observation:
---------------
Support for Stage-1 translation in SMMUv3 introduces new data structures
(`s1_cfg` alongside `s2_cfg`) and logic to write both Stage-1 and Stage-2
entries in the Stream Table Entry (STE), including an `abort` field to
handle partial configuration states.

**Risk:**
Without proper handling, a partially applied configuration
might leave guest DMA mappings in an inconsistent state, potentially
enabling unauthorized access or causing cross-domain interference.

**Mitigation:** *(Handled by design)*
This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to the
STE and manages the `abort` field, applying a configuration only if it is
fully attached. This ensures that incomplete or invalid
device configurations are safely ignored by the hypervisor.
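
The intent of this mitigation can be sketched as a small decision function
that falls back to abort whenever the configuration is incomplete. This is a
simplified model under stated assumptions: the `struct dev_cfg` fields, the
`enum ste_cfg` values and `ste_select()` are hypothetical and do not mirror
the actual STE encoding or the driver's code.

```c
#include <stdbool.h>

/* Simplified outcome of an STE write (illustrative only). */
enum ste_cfg {
    STE_ABORT,   /* device transactions are terminated */
    STE_S2_ONLY, /* Xen-owned stage-2 translation only */
    STE_NESTED   /* guest stage-1 nested on Xen stage-2 */
};

/* Hypothetical per-device state consulted when writing the STE. */
struct dev_cfg {
    bool attached; /* device is fully attached to its domain */
    bool s2_valid; /* stage-2 configuration is ready */
    bool s1_set;   /* guest supplied a stage-1 configuration */
    bool s1_abort; /* guest asked for this stream to be aborted */
};

/* Pick the STE configuration; anything incomplete resolves to abort. */
static enum ste_cfg ste_select(const struct dev_cfg *c)
{
    if (!c->attached || !c->s2_valid || c->s1_abort)
        return STE_ABORT;

    return c->s1_set ? STE_NESTED : STE_S2_ONLY;
}
```

Checking the abort conditions first makes the safe state the default, so a
half-written configuration cannot result in a translating STE.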

2. Observation:
---------------
Guests can now invalidate Stage-1 caches; invalidation needs forwarding
to SMMUv3 hardware to maintain coherence.

**Risk:**
Failing to propagate cache invalidation could leave stale mappings in place,
enabling access through old mappings and possibly data leakage or
misrouting between devices assigned to the same guest.

**Mitigation:**
The guest must issue appropriate invalidation commands whenever
its stage-1 I/O mappings are modified to ensure that translation caches
remain coherent.

3. Observation:
---------------
Introducing optional per-guest enabled features (`viommu` argument in xl
guest config) means some guests may opt out.

**Risk:**
Guests without vIOMMU enabled (stage-2 only) could potentially dominate
access to the physical command and event queues, since they bypass the
emulation layer and are processed faster than vIOMMU-enabled guests.

**Mitigation:**
Audit the impact of emulation overhead on IOMMU processing fairness
in a multi-guest environment.
Consider enabling/disabling stage-1 on a system level, instead of
per-domain.

4. Observation:
---------------
Guests have the ability to issue Stage-1 IOMMU commands such as cache
invalidation, stream table entry configuration, etc. An adversarial guest
may issue a high volume of commands in rapid succession.

**Risk:**
Excessive command requests can cause high hypervisor CPU consumption
and disrupt scheduling, leading to degraded system responsiveness and
potential denial-of-service scenarios.

**Mitigation:**

- Implement vIOMMU command execution restart and continuation support:

   - Introduce a processing budget with only a limited number of commands
     handled per invocation.
   - If additional commands remain pending after the budget is exhausted,
     defer further processing and resume it asynchronously, e.g. via a
     per-domain tasklet.

- Batch multiple commands of the same type to reduce emulation overhead:

   - Inspect the command queue and group commands that can be processed
     together (e.g. multiple successive invalidation requests or STE
     updates for the same SID).
   - Execute the entire batch in one go, reducing repeated accesses to
     guest memory and emulation overhead per command.
   - This reduces CPU time spent in the vIOMMU command processing loop.
     The optimization is applicable only when consecutive commands of the
     same type operate on the same SID/context.
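
The budget and continuation idea from the first mitigation can be sketched
as follows. This is a minimal model, assuming a queue reduced to free-running
producer/consumer counters (the real SMMUv3 queue layout with wrap bits is
ignored); `CMDQ_BUDGET`, `struct vcmdq` and both functions are hypothetical.
In Xen, a `true` return would be the point at which further processing is
deferred, e.g. to a per-domain tasklet.

```c
#include <stdbool.h>
#include <stdint.h>

#define CMDQ_BUDGET 4 /* hypothetical: commands handled per invocation */

/* Minimal view of a guest command queue. */
struct vcmdq {
    uint32_t prod;    /* advanced by the guest */
    uint32_t cons;    /* advanced by the emulation */
    uint32_t handled; /* total commands emulated (illustration only) */
};

/* Stand-in for emulating a single command. */
static void vcmdq_handle_one(struct vcmdq *q)
{
    q->handled++;
}

/*
 * Process at most CMDQ_BUDGET pending commands.
 * Returns true when commands remain and processing must be resumed later.
 */
static bool vcmdq_process(struct vcmdq *q)
{
    uint32_t budget;

    for (budget = CMDQ_BUDGET; budget > 0 && q->cons != q->prod; budget--) {
        vcmdq_handle_one(q);
        q->cons++;
    }

    return q->cons != q->prod;
}
```

With ten commands queued and a budget of four, three invocations are needed,
and the hypervisor regains control between each of them.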

5. Observation:
---------------
Some guest commands issued towards the vIOMMU are propagated to the pIOMMU
command queue (e.g. TLB invalidate).

**Risk:**
Excessive command requests from an abusive guest can flood the
physical IOMMU command queue, degrading pIOMMU responsiveness to commands
issued from other guests.

**Mitigation:**

- Batch commands that are propagated to the pIOMMU command queue and
   implement batch execution pause/continuation.
   Rely on the same mechanisms as in the previous observation
   (command continuation and batching of pIOMMU-related commands of the
   same type and context).
- If possible, implement domain penalization by adding a per-domain budget
   for vIOMMU/pIOMMU usage:

   - Apply per-domain dynamic budgeting of allowed IOMMU commands to
     execute per invocation, reducing the budget for guests with
     excessive command requests over a longer period of time
   - Combine with command continuation mechanism
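
One possible shape for the per-domain dynamic budget is sketched below. The
policy (halve the budget after a period with excessive commands, double it
back otherwise, both clamped) and every constant are assumptions chosen for
illustration; the design above does not prescribe a specific policy.

```c
#include <stdint.h>

#define BUDGET_MAX      64   /* hypothetical upper bound per invocation */
#define BUDGET_MIN      4    /* hypothetical floor, keeps guests progressing */
#define ABUSE_THRESHOLD 1000 /* hypothetical commands per accounting period */

/* Per-domain accounting state. */
struct dom_budget {
    uint32_t budget;         /* commands allowed per invocation */
    uint32_t cmds_in_period; /* commands seen in the current period */
};

/* Called at the end of each accounting period to adjust the budget. */
static void dom_budget_update(struct dom_budget *b)
{
    if (b->cmds_in_period > ABUSE_THRESHOLD) {
        /* Penalize: halve the budget, but never below the floor. */
        b->budget /= 2;
        if (b->budget < BUDGET_MIN)
            b->budget = BUDGET_MIN;
    } else {
        /* Recover: double the budget, but never above the cap. */
        b->budget *= 2;
        if (b->budget > BUDGET_MAX)
            b->budget = BUDGET_MAX;
    }

    b->cmds_in_period = 0;
}
```

The resulting `budget` value would feed the per-invocation limit used by the
command continuation mechanism, so a penalized domain yields more often
without being blocked entirely.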

6. Observation:
---------------
The vIOMMU feature includes an event queue used to forward IOMMU events
to the guest (e.g. translation faults, invalid Stream IDs, permission
errors).
A malicious guest may misconfigure its IOMMU state or intentionally trigger
faults at a high rate.

**Risk:**
A high frequency of IOMMU events can cause Xen to flood the event queue
and disrupt scheduling with high hypervisor CPU load for event handling.

**Mitigation:**

- Implement a fail-safe state by disabling event forwarding when faults
   occur at a high frequency and are not processed by the guest:

   - Introduce a per-domain pending event counter.
   - Stop forwarding events to the guest once the number of unprocessed
     events reaches a predefined threshold.

- Consider disabling the emulated event queue for untrusted guests.
- Note that this risk is more general and may also apply to stage-2-only
   guests. This section addresses mitigations in the emulated IOMMU layer
   only. Mitigation of physical event queue flooding should also be
   considered in the target pIOMMU driver.
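
The pending event counter from the first mitigation can be sketched like
this. It is a minimal model: `EVTQ_PENDING_MAX`, `struct vevtq` and the two
functions are hypothetical, and the `dropped` counter is an addition made
here only so the effect is observable.

```c
#include <stdbool.h>
#include <stdint.h>

#define EVTQ_PENDING_MAX 8 /* hypothetical forwarding threshold */

/* Per-domain event forwarding state. */
struct vevtq {
    uint32_t pending; /* events forwarded but not yet consumed by guest */
    uint32_t dropped; /* events not forwarded because of the threshold */
};

/* Try to forward one event; false when the threshold has been reached. */
static bool vevtq_forward(struct vevtq *q)
{
    if (q->pending >= EVTQ_PENDING_MAX) {
        q->dropped++;
        return false;
    }

    q->pending++;
    return true;
}

/* Account for n events the guest has consumed from its event queue. */
static void vevtq_guest_consumed(struct vevtq *q, uint32_t n)
{
    q->pending = (n > q->pending) ? 0 : q->pending - n;
}
```

Forwarding resumes on its own once the guest drains its queue, so a
well-behaved guest is unaffected while a guest that never processes events
stops generating emulation work.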

Performance Impact
==================

With the inclusion of IOMMU stage-1 and nested translation, performance
overhead is introduced compared to the existing stage-2-only usage in Xen.
Once mappings are established, translations should not introduce
significant overhead. Emulated paths may introduce moderate overhead,
primarily affecting device initialization and event/command handling.
Testing was performed on a Renesas R-Car platform.
Performance is mostly impacted by emulated vIOMMU operations; results are
shown in the following table.

+-------------------------------+---------------------------------+
| vIOMMU Operation              | Execution time in guest         |
+===============================+=================================+
| Reg read                      | median: 645ns, worst-case: 2mμs |
+-------------------------------+---------------------------------+
| Reg write                     | median: 630ns, worst-case: 1μs  |
+-------------------------------+---------------------------------+
| Invalidate TLB                | median: 2μs, worst-case: 10μs   |
+-------------------------------+---------------------------------+
| Invalidate STE                | median: 5μs, worst-case: 100μs  |
+-------------------------------+---------------------------------+

With a vIOMMU exposed to the guest, the guest OS has to initialize the IOMMU
device and configure stage-1 mappings for the devices attached to it.
The following table shows the initialization stages which impact the boot
time of a stage-1 enabled guest and compares them with a stage-1 disabled
guest.

NOTE: Device probe execution time varies depending on device complexity.
A USB host controller was selected as the test case due to its extensive
use of dynamic DMA allocations and IOMMU mappings, making it a
representative workload for evaluating stage-1 vIOMMU behavior.

+---------------------+-----------------------+------------------------+
| Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+=====================+=======================+========================+
| IOMMU Init          | ~10ms                 | /                      |
+---------------------+-----------------------+------------------------+
| Dev Attach / Mapping| ~100ms                | ~90ms                  |
+---------------------+-----------------------+------------------------+

For devices configured with dynamic DMA mappings, the performance of DMA
allocate/map/unmap operations is also impacted on stage-1 enabled guests.
Dynamic DMA mapping operations trigger emulated IOMMU functions such as MMIO
write/read and TLB invalidation.

+---------------+---------------------------+--------------------------+
| DMA Op        | Stage-1 Enabled Guest     | Stage-1 Disabled Guest   |
+===============+===========================+==========================+
| dma_alloc     | median: 20µs, worst: 80µs | median: 8µs, worst: 60µs |
+---------------+---------------------------+--------------------------+
| dma_free      | median: 15µs, worst: 60µs | median: 6µs, worst: 30µs |
+---------------+---------------------------+--------------------------+
| dma_map       | median: 12µs, worst: 60µs | median: 3µs, worst: 20µs |
+---------------+---------------------------+--------------------------+
| dma_unmap     | median: 15µs, worst: 70µs | median: 3µs, worst: 20µs |
+---------------+---------------------------+--------------------------+

Testing
=======

- QEMU-based ARM system tests for Stage-1 translation.
- Actual hardware validation to ensure compatibility with real SMMUv3
  implementations.
- Unit/Functional tests validating correct translations (not implemented).

Migration and Compatibility
===========================

This optional feature defaults to disabled (`viommu=""`) for backward
compatibility.

Future improvements
===================

- Implement the proposed mitigations to address security risks that are
   not covered by the current design
   (events batching, commands execution continuation)
- PCI support
- Support for other IOMMU HW (Renesas, RISC-V, etc.)

References
==========

- Original feature implemented by Rahul Singh:

https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/ 

- SMMUv3 architecture documentation
- Existing vIOMMU code patterns (KVM, QEMU)

Best regards,
Milan
