linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* [RFC] ARM vGIC-ITS tables serialization when running protected VMs
@ 2025-04-14 11:12 Ilias Stamatis
  2025-04-14 11:12 ` [RFC PATCH 1/1] KVM: arm64: vgic-its: Add flag for saving ITTs in userspace buffer Ilias Stamatis
  2025-04-15  8:35 ` [RFC] ARM vGIC-ITS tables serialization when running protected VMs Marc Zyngier
  0 siblings, 2 replies; 8+ messages in thread
From: Ilias Stamatis @ 2025-04-14 11:12 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, maz, oliver.upton
  Cc: joey.gouly, suzuki.poulose, yuzenghui, eric.auger, andre.przywara,
	will, dwmw, ilstam, jgrall, ugurus, nh-open-source

# The problem

KVM's ARM Virtual Interrupt Translation Service (ITS) interface supports the
KVM_DEV_ARM_ITS_SAVE_TABLES and KVM_DEV_ARM_ITS_RESTORE_TABLES operations.
These operations save and restore a set of tables (Device Tables, Interrupt
Translation Tables, Collection Table) to and from guest memory.

This can be a problem when running a protected VM on top of pKVM or another
lowvisor since the host kernel (running at EL1) cannot access guest memory.

# Page declassification and why ITTs are special

The Collection and Device tables are page aligned and their sizes must be a
multiple of page size. If the lowvisor knows where these tables live, it is
possible to "declassify" the corresponding pages and configure the MMU such as
that the EL1 host can write to guest memory directly.

The ITTs (Interrupt Translation Tables) are different. They are NOT page
aligned, they are 256 byte aligned and their size is variable. That means that
the lowvisor can't declassify pages containing ITTs and configure the MMU
giving the host direct access as above since those pages may contain unrelated
data.

If the lowvisor knows where the ITTs live in guest memory it could instead
perform the guest memory accesses on behalf of the host. I.e. the EL1 host
would attempt to save the ITTs to guest memory like it does today, that would
generate a data abort, and then the EL2 lowvisor could perform the copy after
validating that the faulty address belongs to an ITT in guest memory.

One issue with the above is that the ITS save/restore happens at hypervisor
live update which is a time sensitive operation and the extra traps (one per
interrupt mapping?) can introduce significant additional overhead there.

Another issue is that it's actually hard for the lowvisor to know where these
tables live without trusting the EL1 host which virtualizes the ITS. It is
especially hard knowing the locations of the ITTs (compared to
Collection/Device tables) because that probably means having to parse the ITS
command queue from EL2 which is complex and undesirable.

# An alternative: Serializing ITTs into a userspace buffer

Rather than writing the ITTs to guest memory, the EL1 host can serialize them
into a buffer provided by userspace.

The struct kvm_device_attr passed to KVM_DEV_ARM_ITS_{SAVE,RESTORE}_TABLES has
a currently unused 'addr' field that can be used for the buffer address. The
upper 32-bits of 'attr' from the struct could be used for the buffer size (even
though that feels hacky). Also a flag in the 'flags' field can be used to
determine whether the userspace buffer must be used or not.

I'm attaching a not-so-pretty RFC patch that does just this. The format of the
blob stored in the buffer is the following. There is a 64-bit ITT start marker
which embeds the device ID owning the ITT. The start marker is followed by
64-bit ITEs stored using the existing ITS Table ABI REV0 with the 'next' field
being replaced by an 'event_id' field which stores the event ID rather than an
offset. An end marker indicates the end of the ITT and is followed by the start
marker for the ITT of the next device.

Note that I haven't actually documented this new ABI in the patch yet, because
it's unlikely to stay as it is. There are many different ways to achieve the
same result.

Also note that this patch treats the ITTs specially for the reasons mentioned
in the previous section (harder to declassify), but there could be flags for
choosing between storing all tables or just the ITTs in the userspace buffer.

Finally, the patch is based on an older kernel tree at the moment but from a
quick look it won't be too difficult to rebase on top of 6.15. However, I
wanted some initial feedback to see whether this approach is something that
maintainers would even consider before rebasing, adding tests and making this
as robust as possible.

# How can userspace calculate the size of the buffer?

This is another big problem. Today the guest is responsible for allocating
memory for the ITTs (and other tables) before passing it to the (v)GIC.
However, if userspace is to provide the host kernel with a buffer it has to
know how much memory to allocate for it.

Userspace should probably account for the worst case where ITTs are fully
populated. The theoritical maximum depends on the number of Device ID bits and
Event ID bits that the ITS advertises. Today these values are fixed in KVM's
vITS and set to 16 bits each. Therefore the theoritical maximum is a very large
number (2^16 * 2^16 * 8 bytes per ITT entry = 32GiB per vITS) which is
unrealistic.

However, it should be possible to make the Device ID and Event ID bits
configurable by userspace. That way by using smaller numbers there, the
theoritical maximum for the ITTs can be reduced to more realistic sizes.

# Please suggest something better

I'm fully aware that this is not a great patch and feels hacky and potentially
non-robust. However, I'm struggling to come up with anything better and this
RFC is meant to start a discussion around this problem. Is storing the ITS
tables in host memory something that makes sense or is there a better/simpler
approach that would work for Confidential Computing?

Thanks

Ilias Stamatis (1):
  KVM: arm64: vgic-its: Add flag for saving ITTs in userspace buffer

 arch/arm64/include/uapi/asm/kvm.h |   5 +
 arch/arm64/kvm/vgic/vgic-its.c    | 213 +++++++++++++++++++++++++++++-
 arch/arm64/kvm/vgic/vgic.h        |   4 +
 include/kvm/arm_vgic.h            |  11 ++
 4 files changed, 227 insertions(+), 6 deletions(-)

-- 
2.47.1



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC PATCH 1/1] KVM: arm64: vgic-its: Add flag for saving ITTs in userspace buffer
  2025-04-14 11:12 [RFC] ARM vGIC-ITS tables serialization when running protected VMs Ilias Stamatis
@ 2025-04-14 11:12 ` Ilias Stamatis
  2025-04-15  8:35 ` [RFC] ARM vGIC-ITS tables serialization when running protected VMs Marc Zyngier
  1 sibling, 0 replies; 8+ messages in thread
From: Ilias Stamatis @ 2025-04-14 11:12 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, maz, oliver.upton
  Cc: joey.gouly, suzuki.poulose, yuzenghui, eric.auger, andre.przywara,
	will, dwmw, ilstam, jgrall, ugurus, nh-open-source

When running a protected VM on top of pKVM or another lowvisor the EL1
host kernel cannot access guest memory in order to save/restore the ITT
tables for the KVM_DEV_ARM_ITS_SAVE_TABLES and
KVM_DEV_ARM_ITS_RESTORE_TABLES operations.

Introduce a new KVM_DEV_ARM_ITS_ITT_UBUF flag that when set instructs
the vITS to serialize the ITTs into a buffer provided by userspace or
restore them from it. The struct kvm_device_attr passed to
KVM_DEV_ARM_ITS_{SAVE,RESTORE}_TABLES has a currently unused 'addr'
field. Use that field to pass the buffer address. Also use the upper
32-bits of 'attr' from the same struct for the buffer size.

The format of the blob stored in the buffer is the following. There is a
64-bit ITT start marker which embeds the device ID owning the ITT. The
start marker is followed by 64-bit ITEs stored using the existing ITS
Table ABI REV0 with the 'next' field being replaced by an 'event_id'
field which stores the event ID rather than an offset. An end marker
indicates the end of the ITT and is followed by the start marker for the
ITT of the next device.

This is an RFC patch, the ABI is not documented yet.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
---
 arch/arm64/include/uapi/asm/kvm.h |   5 +
 arch/arm64/kvm/vgic/vgic-its.c    | 213 +++++++++++++++++++++++++++++-
 arch/arm64/kvm/vgic/vgic.h        |   4 +
 include/kvm/arm_vgic.h            |  11 ++
 4 files changed, 227 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index b57f28c9d60f..45edb34ec595 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -367,6 +367,11 @@ struct kvm_arm_counter_offset {
 #define   KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES	3
 #define   KVM_DEV_ARM_ITS_CTRL_RESET		4
 
+/*
+ * Flags for KVM_DEV_ARM_ITS_{SAVE,RESTORE}_TABLES
+ */
+#define   KVM_DEV_ARM_ITS_ITT_UBUF           (1ULL << 0)
+
 #define KVM_DEV_ARM_VGIC_NR_IRQS_SHIFT 12
 #define KVM_DEV_ARM_VGIC_NR_IRQS_MASK                                          \
 	((1 << KVM_DEV_ARM_VGIC_NR_IRQS_SHIFT) - 1)
diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c
index b8a78a678aa1..142e663eb75d 100644
--- a/arch/arm64/kvm/vgic/vgic-its.c
+++ b/arch/arm64/kvm/vgic/vgic-its.c
@@ -2225,6 +2225,62 @@ static int scan_its_table(struct vgic_its *its, gpa_t base, int size, u32 esz,
 	return 1;
 }
 
+static int vgic_its_ubuf_append_entry(struct vgic_its *its, u64 entry)
+{
+	entry = cpu_to_le64(entry);
+
+	if (its->itt_ubuf.slot_next > its->itt_ubuf.slot_max)
+		return -ENOMEM;
+
+	if (copy_to_user(&its->itt_ubuf.ubuf[its->itt_ubuf.slot_next], &entry, sizeof(entry)))
+		return -EFAULT;
+
+	its->itt_ubuf.slot_next++;
+	return 0;
+}
+
+static int vgic_its_save_itt_ubuf(struct vgic_its *its, struct its_device *device) {
+	int ret = 0;
+	u64 val;
+	struct its_ite *ite;
+
+	/*
+	 * Write the start marker. Here we abuse the ITS Table ABI REV0. A
+	 * valid physical LPI has an ID of 8192. We can use numbers lower than
+	 * that for different types of entries such as ITT start/end markers.
+	 * The high 16-bits of the entry contain the device ID.
+	 */
+	val = ((u64)device->device_id << KVM_ITS_ITE_NEXT_SHIFT) |
+	      ((u64)KVM_ITS_ITT_START_MARKER << KVM_ITS_ITE_PINTID_SHIFT);
+	if ((ret = vgic_its_ubuf_append_entry(its, val)))
+		return ret;
+
+	list_for_each_entry(ite, &device->itt_head, ite_list) {
+		/*
+		 * If an LPI carries the HW bit, this means that this
+		 * interrupt is controlled by GICv4, and we do not
+		 * have direct access to that state without GICv4.1.
+		 * Let's simply fail the save operation...
+		 */
+		if (ite->irq->hw && !kvm_vgic_global_state.has_gicv4_1)
+			return -EACCES;
+
+		val = ((u64)ite->event_id << KVM_ITS_ITE_NEXT_SHIFT) |
+		       ((u64)ite->irq->intid << KVM_ITS_ITE_PINTID_SHIFT) |
+			ite->collection->collection_id;
+		if ((ret = vgic_its_ubuf_append_entry(its, val)))
+			return ret;
+	}
+
+	/* Write the end marker */
+	val = ((u64)device->device_id << KVM_ITS_ITE_NEXT_SHIFT) |
+	      ((u64)KVM_ITS_ITT_END_MARKER << KVM_ITS_ITE_PINTID_SHIFT);
+	if ((ret = vgic_its_ubuf_append_entry(its, val)))
+		return ret;
+
+	return ret;
+}
+
 /**
  * vgic_its_save_ite - Save an interrupt translation entry at @gpa
  */
@@ -2327,6 +2383,9 @@ static int vgic_its_save_itt(struct vgic_its *its, struct its_device *device)
 
 	list_sort(NULL, &device->itt_head, vgic_its_ite_cmp);
 
+	if (its->itt_ubuf.must_be_used)
+		return vgic_its_save_itt_ubuf(its, device);
+
 	list_for_each_entry(ite, &device->itt_head, ite_list) {
 		gpa_t gpa = base + ite->event_id * ite_esz;
 
@@ -2494,10 +2553,12 @@ static int vgic_its_restore_dte(struct vgic_its *its, u32 id,
 	if (IS_ERR(dev))
 		return PTR_ERR(dev);
 
-	ret = vgic_its_restore_itt(its, dev);
-	if (ret) {
-		vgic_its_free_device(its->dev->kvm, its, dev, false);
-		return ret;
+	if (!its->itt_ubuf.must_be_used) {
+		ret = vgic_its_restore_itt(its, dev);
+		if (ret) {
+			vgic_its_free_device(its->dev->kvm, its, dev, false);
+			return ret;
+		}
 	}
 
 	return offset;
@@ -2776,6 +2837,112 @@ static int vgic_its_save_tables_v0(struct vgic_its *its)
 	return vgic_its_save_collection_table(its);
 }
 
+static int vgic_its_ubuf_pop_entry(struct vgic_its *its, u64 *entry)
+{
+	if (!entry)
+		return -EINVAL;
+
+	if (its->itt_ubuf.slot_next > its->itt_ubuf.slot_max)
+		return -ENOMEM;
+
+	if (copy_from_user(entry, &its->itt_ubuf.ubuf[its->itt_ubuf.slot_next], sizeof(*entry)))
+		return -EFAULT;
+
+	its->itt_ubuf.slot_next++;
+
+	*entry = le64_to_cpu(*entry);
+
+	return 0;
+}
+
+static int vgic_its_restore_itt_ubuf(struct vgic_its *its, struct its_device *device)
+{
+	u64 entry, device_id, type, event_id;
+	bool found_end = false;
+	int ret;
+
+	/* Confirm there is a start marker matching the device ID */
+	ret = vgic_its_ubuf_pop_entry(its, &entry);
+	if (ret)
+		return ret;
+
+	/*
+	 * See the comment in vgic_its_save_itt_ubuf() explaining how the ITS
+	 * Table ABI REV0 is abused.
+	 */
+	device_id = entry >> KVM_ITS_ITE_NEXT_SHIFT;
+	type = (entry & KVM_ITS_ITE_PINTID_MASK) >> KVM_ITS_ITE_PINTID_SHIFT;
+
+	if (type != KVM_ITS_ITT_START_MARKER) {
+		printk(KERN_WARNING "Failed to restore vGIC interrupt translation entry: did not find start marker (device_id=%u)",
+		       device->device_id);
+		return -EBADF;
+	}
+	if (device_id != device->device_id) {
+		printk(KERN_WARNING "Failed to restore vGIC interrupt translation entry: found start marker for device_id=%llu instead of device_id=%u",
+		       device_id, device->device_id);
+		return -ENODEV;
+	}
+
+	while (its->itt_ubuf.slot_next <= its->itt_ubuf.slot_max) {
+		ret = vgic_its_ubuf_pop_entry(its, &entry);
+		if (ret)
+			return ret;
+
+		/*
+		 * Is this an ITE or is it an end marker?
+		 */
+		type = (entry & KVM_ITS_ITE_PINTID_MASK) >> KVM_ITS_ITE_PINTID_SHIFT;
+		if (type == KVM_ITS_ITT_END_MARKER) {
+			found_end = true;
+			device_id = entry >> KVM_ITS_ITE_NEXT_SHIFT;
+			if (device_id != device->device_id)
+				return -ENODEV;
+			break;
+		}
+
+		event_id = entry >> KVM_ITS_ITE_NEXT_SHIFT;
+		/*
+		 * Set the 'next' field of the entry to 0 which is a valid
+		 * value for vgic_its_restore_ite().
+		 */
+		entry &= ~KVM_ITS_ITE_NEXT_MASK;
+		ret = vgic_its_restore_ite(its, event_id, &entry, device);
+		if (ret < 0)
+			return ret;
+	}
+
+	if (!found_end)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int vgic_its_restore_itt_all_ubuf(struct vgic_its *its)
+{
+	int ret;
+	struct its_device *dev;
+
+	/*
+	 * The list is sorted in vgic_its_save_device_tables() before
+	 * serialization, therefore we expect the ITTs to be sorted in the blob.
+	 */
+	list_sort(NULL, &its->device_list, vgic_its_device_cmp);
+
+	list_for_each_entry(dev, &its->device_list, dev_list) {
+		ret = vgic_its_restore_itt_ubuf(its, dev);
+		if (ret)
+			break;
+	}
+
+	if (ret) {
+		vgic_its_free_device(its->dev->kvm, its, dev, false);
+		return ret;
+	}
+
+	return 0;
+}
+
 /**
  * vgic_its_restore_tables_v0 - Restore the ITS tables from guest RAM
  * to internal data structs according to V0 ABI
@@ -2789,7 +2956,14 @@ static int vgic_its_restore_tables_v0(struct vgic_its *its)
 	if (ret)
 		return ret;
 
-	return vgic_its_restore_device_tables(its);
+	ret = vgic_its_restore_device_tables(its);
+	if (ret)
+		return ret;
+
+	if (its->itt_ubuf.must_be_used)
+		return vgic_its_restore_itt_all_ubuf(its);
+
+	return 0;
 }
 
 static int vgic_its_commit_v0(struct vgic_its *its)
@@ -2860,7 +3034,11 @@ static int vgic_its_ctlr(struct kvm_device *dev,
 	struct kvm *kvm = dev->kvm;
 	const struct vgic_its_abi *abi = vgic_its_get_abi(its);
 	int ret = 0;
-	u64 attrval = attr->attr;
+	/*
+	 * The low 32 bits are used for the attribute, whereas the high 32 bits
+	 * have a special meaning for KVM_DEV_ARM_ITS_{SAVE,RESTORE}_TABLES
+	 */
+	u64 attrval = attr->attr & 0xffffffff;
 	bool need_itslock = true;
 
 	switch (attrval) {
@@ -2886,6 +3064,29 @@ static int vgic_its_ctlr(struct kvm_device *dev,
 		return -EBUSY;
 	}
 
+	if (attrval == KVM_DEV_ARM_ITS_SAVE_TABLES || attrval == KVM_DEV_ARM_ITS_RESTORE_TABLES) {
+		if (attr->flags & KVM_DEV_ARM_ITS_ITT_UBUF) {
+			u32 buf_size = attr->attr >> 32;
+			u32 num_slots = buf_size / sizeof(u64);
+			if (num_slots == 0)
+				return -ENOSPC;
+
+			its->itt_ubuf.must_be_used = true;
+			its->itt_ubuf.ubuf = (u64 *)attr->addr;
+			its->itt_ubuf.slot_next = 0;
+			its->itt_ubuf.slot_max = num_slots - 1;
+
+			if (attrval == KVM_DEV_ARM_ITS_SAVE_TABLES) {
+				/* Zero out the first entry */
+				u64 invalid = 0;
+				if (copy_to_user(&its->itt_ubuf.ubuf[0], &invalid, sizeof(invalid)))
+					return -EFAULT;
+			}
+		} else {
+			its->itt_ubuf.must_be_used = false;
+		}
+	}
+
 	switch (attrval) {
 	case KVM_DEV_ARM_ITS_CTRL_RESET:
 		vgic_its_reset(kvm, its);
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index 1d0e8bfdc676..f5dac01a9a27 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -76,6 +76,7 @@ static inline bool vgic_irq_is_lpi(u32 const intid)
 #define KVM_ITS_CTE_RDBASE_SHIFT	16
 #define KVM_ITS_CTE_ICID_MASK		GENMASK_ULL(15, 0)
 #define KVM_ITS_ITE_NEXT_SHIFT		48
+#define KVM_ITS_ITE_NEXT_MASK		GENMASK_ULL(63, 48)
 #define KVM_ITS_ITE_PINTID_SHIFT	16
 #define KVM_ITS_ITE_PINTID_MASK		GENMASK_ULL(47, 16)
 #define KVM_ITS_ITE_ICID_MASK		GENMASK_ULL(15, 0)
@@ -90,6 +91,9 @@ static inline bool vgic_irq_is_lpi(u32 const intid)
 /* we only support 64 kB translation table page size */
 #define KVM_ITS_L1E_ADDR_MASK		GENMASK_ULL(51, 16)
 
+#define KVM_ITS_ITT_START_MARKER        1
+#define KVM_ITS_ITT_END_MARKER          2
+
 #define KVM_VGIC_V3_RDIST_INDEX_MASK	GENMASK_ULL(11, 0)
 #define KVM_VGIC_V3_RDIST_FLAGS_MASK	GENMASK_ULL(15, 12)
 #define KVM_VGIC_V3_RDIST_FLAGS_SHIFT	12
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 8cef0b7767e8..43e1f47a4c7a 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -190,6 +190,17 @@ struct vgic_its {
 	struct list_head	device_list;
 	struct list_head	collection_list;
 	struct list_head	inval_dte_list;
+
+	/*
+	 * Userspace buffer to be used by KVM_DEV_ARM_ITS_{SAVE,RESTORE}_TABLES
+	 * optionally for saving/restoring the ITTs of all device tables.
+	 */
+	struct {
+		bool must_be_used;
+		u64 __user *ubuf;
+		size_t slot_max;
+		size_t slot_next;
+	} itt_ubuf;
 };
 
 struct vgic_state_iter;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC] ARM vGIC-ITS tables serialization when running protected VMs
  2025-04-14 11:12 [RFC] ARM vGIC-ITS tables serialization when running protected VMs Ilias Stamatis
  2025-04-14 11:12 ` [RFC PATCH 1/1] KVM: arm64: vgic-its: Add flag for saving ITTs in userspace buffer Ilias Stamatis
@ 2025-04-15  8:35 ` Marc Zyngier
  2025-04-15  9:44   ` David Woodhouse
  1 sibling, 1 reply; 8+ messages in thread
From: Marc Zyngier @ 2025-04-15  8:35 UTC (permalink / raw)
  To: Ilias Stamatis
  Cc: kvmarm, linux-arm-kernel, oliver.upton, joey.gouly,
	suzuki.poulose, yuzenghui, eric.auger, andre.przywara, will, dwmw,
	jgrall, ugurus, nh-open-source

On Mon, 14 Apr 2025 12:12:43 +0100,
Ilias Stamatis <ilstam@amazon.com> wrote:
> 
> # The problem
> 
> KVM's ARM Virtual Interrupt Translation Service (ITS) interface supports the
> KVM_DEV_ARM_ITS_SAVE_TABLES and KVM_DEV_ARM_ITS_RESTORE_TABLES operations.
> These operations save and restore a set of tables (Device Tables, Interrupt
> Translation Tables, Collection Table) to and from guest memory.
> 
> This can be a problem when running a protected VM on top of pKVM or another
> lowvisor since the host kernel (running at EL1) cannot access guest memory.
>

pKVM doesn't allow a guest to be saved/restored, full stop.

> # Page declassification and why ITTs are special
> 
> The Collection and Device tables are page aligned and their sizes must be a
> multiple of page size. If the lowvisor knows where these tables live, it is
> possible to "declassify" the corresponding pages and configure the MMU such as
> that the EL1 host can write to guest memory directly.
> 
> The ITTs (Interrupt Translation Tables) are different. They are NOT page
> aligned, they are 256 byte aligned and their size is variable. That means that
> the lowvisor can't declassify pages containing ITTs and configure the MMU
> giving the host direct access as above since those pages may contain unrelated
> data.

And it is the responsibility of the guest to make these page aligned
if it intend to let the hypervisor use them. To sum it up, the ITT
isn't special at all.

> 
> If the lowvisor knows where the ITTs live in guest memory it could instead
> perform the guest memory accesses on behalf of the host. I.e. the EL1 host
> would attempt to save the ITTs to guest memory like it does today, that would
> generate a data abort, and then the EL2 lowvisor could perform the copy after
> validating that the faulty address belongs to an ITT in guest memory.
> 
> One issue with the above is that the ITS save/restore happens at hypervisor
> live update which is a time sensitive operation and the extra traps (one per
> interrupt mapping?) can introduce significant additional overhead
> there.

I don't believe this for a second.

> 
> Another issue is that it's actually hard for the lowvisor to know where these
> tables live without trusting the EL1 host which virtualizes the ITS. It is
> especially hard knowing the locations of the ITTs (compared to
> Collection/Device tables) because that probably means having to parse the ITS
> command queue from EL2 which is complex and undesirable.
> 
> # An alternative: Serializing ITTs into a userspace buffer

NAK.

Share the page-aligned memory with the rest of the hypervisor, and use
the existing API.

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] ARM vGIC-ITS tables serialization when running protected VMs
  2025-04-15  8:35 ` [RFC] ARM vGIC-ITS tables serialization when running protected VMs Marc Zyngier
@ 2025-04-15  9:44   ` David Woodhouse
  2025-04-22 10:47     ` David Woodhouse
  2025-04-27 16:37     ` Marc Zyngier
  0 siblings, 2 replies; 8+ messages in thread
From: David Woodhouse @ 2025-04-15  9:44 UTC (permalink / raw)
  To: Marc Zyngier, Ilias Stamatis
  Cc: kvmarm, linux-arm-kernel, oliver.upton, joey.gouly,
	suzuki.poulose, yuzenghui, eric.auger, andre.przywara, will,
	jgrall, ugurus, nh-open-source

[-- Attachment #1: Type: text/plain, Size: 4332 bytes --]

On Tue, 2025-04-15 at 09:35 +0100, Marc Zyngier wrote:
> On Mon, 14 Apr 2025 12:12:43 +0100,
> Ilias Stamatis <ilstam@amazon.com> wrote:
> > 
> > # The problem
> > 
> > KVM's ARM Virtual Interrupt Translation Service (ITS) interface supports the
> > KVM_DEV_ARM_ITS_SAVE_TABLES and KVM_DEV_ARM_ITS_RESTORE_TABLES operations.
> > These operations save and restore a set of tables (Device Tables, Interrupt
> > Translation Tables, Collection Table) to and from guest memory.
> > 
> > This can be a problem when running a protected VM on top of pKVM or another
> > lowvisor since the host kernel (running at EL1) cannot access guest memory.
> > 
> 
> pKVM doesn't allow a guest to be saved/restored, full stop.

Yet. Either it's going to need to learn to support live update, or
it'll remain a toy solution.

> > # Page declassification and why ITTs are special
> > 
> > The Collection and Device tables are page aligned and their sizes must be a
> > multiple of page size. If the lowvisor knows where these tables live, it is
> > possible to "declassify" the corresponding pages and configure the MMU such as
> > that the EL1 host can write to guest memory directly.
> > 
> > The ITTs (Interrupt Translation Tables) are different. They are NOT page
> > aligned, they are 256 byte aligned and their size is variable. That means that
> > the lowvisor can't declassify pages containing ITTs and configure the MMU
> > giving the host direct access as above since those pages may contain unrelated
> > data.
> 
> And it is the responsibility of the guest to make these page aligned
> if it intend to let the hypervisor use them. To sum it up, the ITT
> isn't special at all.

The ITT has nothing to do with virtualization, does it? And despite
this being logically "DMA", I don't believe it's possible to advertise
it as being behind the SMMU, which would have allowed for access
control (and would indeed have meant that the guest would be expected
to grant access to full pages).

What exactly are you suggesting? That the GIC specification should be
changed to require page alignment, or to document that in a
confidential compute setup, the remainder of any page which contains
ITTs will be implicitly made non-confidential and shared with the
hypervisor?

And then the lowvisor would also have to snoop the ITS command queues
to even find out which pages to implicitly allow access to?

> > 
> > If the lowvisor knows where the ITTs live in guest memory it could instead
> > perform the guest memory accesses on behalf of the host. I.e. the EL1 host
> > would attempt to save the ITTs to guest memory like it does today, that would
> > generate a data abort, and then the EL2 lowvisor could perform the copy after
> > validating that the faulty address belongs to an ITT in guest memory.
> > 
> > One issue with the above is that the ITS save/restore happens at hypervisor
> > live update which is a time sensitive operation and the extra traps (one per
> > interrupt mapping?) can introduce significant additional overhead
> > there.
> 
> I don't believe this for a second.

You don't believe that every millisecond of live update downtime,
perceived by the guest as unwanted steal time of a hypervisor that's
generally trying to be as quiescent as possible, is an issue?

> > 
> > Another issue is that it's actually hard for the lowvisor to know where these
> > tables live without trusting the EL1 host which virtualizes the ITS. It is
> > especially hard knowing the locations of the ITTs (compared to
> > Collection/Device tables) because that probably means having to parse the ITS
> > command queue from EL2 which is complex and undesirable.
> > 
> > # An alternative: Serializing ITTs into a userspace buffer
> 
> NAK.
> 
> Share the page-aligned memory with the rest of the hypervisor, and use
> the existing API.

That seems like a bad choice. All this is just using guest memory to
store KVM's state.

Yes, the guest provides a buffer which the virtual hardware *may* use
if it wants, but with no IOMMU or access control defined in the
specification.

It seems like it would be much cleaner just to let KVM pass its state
up to userspace for serialization like we do for all *other* KVM state,
which is what Ilias is proposing.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] ARM vGIC-ITS tables serialization when running protected VMs
  2025-04-15  9:44   ` David Woodhouse
@ 2025-04-22 10:47     ` David Woodhouse
  2025-04-27 16:43       ` Marc Zyngier
  2025-06-26  6:22       ` David Woodhouse
  2025-04-27 16:37     ` Marc Zyngier
  1 sibling, 2 replies; 8+ messages in thread
From: David Woodhouse @ 2025-04-22 10:47 UTC (permalink / raw)
  To: Marc Zyngier, Ilias Stamatis
  Cc: kvmarm, linux-arm-kernel, oliver.upton, joey.gouly,
	suzuki.poulose, yuzenghui, eric.auger, andre.przywara, will,
	jgrall, ugurus, nh-open-source

[-- Attachment #1: Type: text/plain, Size: 1994 bytes --]

On Tue, 2025-04-15 at 10:44 +0100, David Woodhouse wrote:
>  
> 
> > > Another issue is that it's actually hard for the lowvisor to know where these
> > > tables live without trusting the EL1 host which virtualizes the ITS. It is
> > > especially hard knowing the locations of the ITTs (compared to
> > > Collection/Device tables) because that probably means having to parse the ITS
> > > command queue from EL2 which is complex and undesirable.
> > > 
> > > # An alternative: Serializing ITTs into a userspace buffer
> > 
> > NAK.
> > 
> > Share the page-aligned memory with the rest of the hypervisor, and use
> > the existing API.
> 
> That seems like a bad choice. All this is just using guest memory to
> store KVM's state.
> 
> Yes, the guest provides a buffer which the virtual hardware *may* use
> if it wants, but with no IOMMU or access control defined in the
> specification.
> 
> It seems like it would be much cleaner just to let KVM pass its state
> up to userspace for serialization like we do for all *other* KVM state,
> which is what Ilias is proposing.

Ping?

Redefining the GIC specification to implicitly share whole pages with
the hypervisor in a protected guest, when they happen to have an ITT
somewhere inside the page, seems like a very bad idea. Did you have
some proposed wording for the specification update though, if that's
the approach you're proposing?

And *implementing* it by making the lowvisor snoop on the ITS command
queue is also awful.

Putting the GIC behind an IOMMU so that page-granular access control
can be done in a sane way might have made sense, as this is
conceptually DMA. The GIC specification made the same mistake that
early virtio made, and which we *should* have learned from. But like
virtio, I think it's too late to fix that without a time machine.

I think it's much better just to let KVM pass the state to userspace
for migration, just like KVM does for almost all *other* state.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] ARM vGIC-ITS tables serialization when running protected VMs
  2025-04-15  9:44   ` David Woodhouse
  2025-04-22 10:47     ` David Woodhouse
@ 2025-04-27 16:37     ` Marc Zyngier
  1 sibling, 0 replies; 8+ messages in thread
From: Marc Zyngier @ 2025-04-27 16:37 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Ilias Stamatis, kvmarm, linux-arm-kernel, oliver.upton,
	joey.gouly, suzuki.poulose, yuzenghui, eric.auger, andre.przywara,
	will, jgrall, ugurus, nh-open-source

On Tue, 15 Apr 2025 10:44:39 +0100,
David Woodhouse <dwmw2@infradead.org> wrote:
> 
> [1  <text/plain; UTF-8 (quoted-printable)>]
> On Tue, 2025-04-15 at 09:35 +0100, Marc Zyngier wrote:
> > On Mon, 14 Apr 2025 12:12:43 +0100,
> > Ilias Stamatis <ilstam@amazon.com> wrote:
> > > 
> > > # The problem
> > > 
> > > KVM's ARM Virtual Interrupt Translation Service (ITS) interface supports the
> > > KVM_DEV_ARM_ITS_SAVE_TABLES and KVM_DEV_ARM_ITS_RESTORE_TABLES operations.
> > > These operations save and restore a set of tables (Device Tables, Interrupt
> > > Translation Tables, Collection Table) to and from guest memory.
> > > 
> > > This can be a problem when running a protected VM on top of pKVM or another
> > > lowvisor since the host kernel (running at EL1) cannot access guest memory.
> > > 
> > 
> > pKVM doesn't allow a guest to be saved/restored, full stop.
> 
> Yet. Either it's going to need to learn to support live update, or
> it'll remain a toy solution.

Toy solution to what problem?

> 
> > > # Page declassification and why ITTs are special
> > > 
> > > The Collection and Device tables are page aligned and their sizes must be a
> > > multiple of page size. If the lowvisor knows where these tables live, it is
> > > possible to "declassify" the corresponding pages and configure the MMU such as
> > > that the EL1 host can write to guest memory directly.
> > > 
> > > The ITTs (Interrupt Translation Tables) are different. They are NOT page
> > > aligned, they are 256 byte aligned and their size is variable. That means that
> > > the lowvisor can't declassify pages containing ITTs and configure the MMU
> > > giving the host direct access as above since those pages may contain unrelated
> > > data.
> > 
> > And it is the responsibility of the guest to make these page aligned
> > if it intend to let the hypervisor use them. To sum it up, the ITT
> > isn't special at all.
> 
> The ITT has nothing to do with virtualization, does it? And despite
> this being logically "DMA", I don't believe it's possible to advertise
> it as being behind the SMMU, which would have allowed for access
> control (and would indeed have meant that the guest would be expected
> to grant access to full pages).
> 
> What exactly are you suggesting? That the GIC specification should be
> changed to require page alignment, or to document that in a
> confidential compute setup, the remainder of any page which contains
> ITTs will be implicitly made non-confidential and shared with the
> hypervisor?

The GIC architecture is of course absolutely perfect, and there is
nothing to change there.

I'm suggesting you do what any OS designed to run under a confidential
infrastructure do. Which is to expose page-sized data to the
non-trusted infrastructure. Linux has done that for a while (as part
of both CCA and pKVM enablement), and I don't see why your toy guests
can't do the same. It's not like using a page pool for ITT allocations
is rocket science, is it?

> And then the lowvisor would also have to snoop the ITS command queues
> to even find out which pages to implicitly allow access to?

Why should it? as long as you only expose pages that only contain
GIC-related data, you should be safe.

However, if your hypervisor doesn't fully validate the interaction of
the *host* with the HW, then you're dead in the water.

> > > If the lowvisor knows where the ITTs live in guest memory it could instead
> > > perform the guest memory accesses on behalf of the host. I.e. the EL1 host
> > > would attempt to save the ITTs to guest memory like it does today, that would
> > > generate a data abort, and then the EL2 lowvisor could perform the copy after
> > > validating that the faulty address belongs to an ITT in guest memory.
> > > 
> > > One issue with the above is that the ITS save/restore happens at hypervisor
> > > live update which is a time sensitive operation and the extra traps (one per
> > > interrupt mapping?) can introduce significant additional overhead
> > > there.
> > 
> > I don't believe this for a second.
> 
> You don't believe that every millisecond of live update downtime,
> perceived by the guest as unwanted steal time of a hypervisor that's
> generally trying to be as quiescent as possible, is an issue?

I absolutely don't. Certainly not for something that has no tangible
existence, with no performance numbers whatsoever, and based on shaky
premises.

> > > Another issue is that it's actually hard for the lowvisor to know where these
> > > tables live without trusting the EL1 host which virtualizes the ITS. It is
> > > especially hard knowing the locations of the ITTs (compared to
> > > Collection/Device tables) because that probably means having to parse the ITS
> > > command queue from EL2 which is complex and undesirable.
> > > 
> > > # An alternative: Serializing ITTs into a userspace buffer
> > 
> > NAK.
> > 
> > Share the page-aligned memory with the rest of the hypervisor, and use
> > the existing API.
> 
> That seems like a bad choice. All this is just using guest memory to
> store KVM's state.

The architecture *mandates* the memory allocation. KVM uses this
memory for the purpose described in the architecture. If you don't
like it, invent your own interrupt architecture. Trust me, it's real
fun!

> Yes, the guest provides a buffer which the virtual hardware *may* use
> if it wants, but with no IOMMU or access control defined in the
> specification.
> 
> It seems like it would be much cleaner just to let KVM pass its state
> up to userspace for serialization like we do for all *other* KVM state,
> which is what Ilias is proposing.

Sure. You could also decide that SMMU page tables should be extracted
separately, because that's the exact same rationale. You could also
build your own hypervisor instead of inventing new ways to make the
KVM API even more of a terrible mess.

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] ARM vGIC-ITS tables serialization when running protected VMs
  2025-04-22 10:47     ` David Woodhouse
@ 2025-04-27 16:43       ` Marc Zyngier
  2025-06-26  6:22       ` David Woodhouse
  1 sibling, 0 replies; 8+ messages in thread
From: Marc Zyngier @ 2025-04-27 16:43 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Ilias Stamatis, kvmarm, linux-arm-kernel, oliver.upton,
	joey.gouly, suzuki.poulose, yuzenghui, eric.auger, andre.przywara,
	will, jgrall, ugurus, nh-open-source

On Tue, 22 Apr 2025 11:47:46 +0100,
David Woodhouse <dwmw2@infradead.org> wrote:
> 
> [1  <text/plain; UTF-8 (quoted-printable)>]
> On Tue, 2025-04-15 at 10:44 +0100, David Woodhouse wrote:
> >  
> > 
> > > > Another issue is that it's actually hard for the lowvisor to know where these
> > > > tables live without trusting the EL1 host which virtualizes the ITS. It is
> > > > especially hard knowing the locations of the ITTs (compared to
> > > > Collection/Device tables) because that probably means having to parse the ITS
> > > > command queue from EL2 which is complex and undesirable.
> > > > 
> > > > # An alternative: Serializing ITTs into a userspace buffer
> > > 
> > > NAK.
> > > 
> > > Share the page-aligned memory with the rest of the hypervisor, and use
> > > the existing API.
> > 
> > That seems like a bad choice. All this is just using guest memory to
> > store KVM's state.
> > 
> > Yes, the guest provides a buffer which the virtual hardware *may* use
> > if it wants, but with no IOMMU or access control defined in the
> > specification.
> > 
> > It seems like it would be much cleaner just to let KVM pass its state
> > up to userspace for serialization like we do for all *other* KVM state,
> > which is what Ilias is proposing.
> 
> Ping?
> 
> Redefining the GIC specification to implicitly share whole pages with
> the hypervisor in a protected guest, when they happen to have an ITT
> somewhere inside the page, seems like a very bad idea. Did you have
> some proposed wording for the specification update though, if that's
> the approach you're proposing?

That's already a requirement for CCA when used with GICv3/GICv4.

> And *implementing* it by making the lowvisor snoop on the ITS command
> queue is also awful.

Not only the command queue. *ANY* RD and ITS access. If you don't, it
is rather easy for the host to use the GIC to repaint your privileged
code and confidential guest, one bit at a time.

But that has nothing to do with sharing the ITT memory, which is only
manipulated by the KVM emulation.

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] ARM vGIC-ITS tables serialization when running protected VMs
  2025-04-22 10:47     ` David Woodhouse
  2025-04-27 16:43       ` Marc Zyngier
@ 2025-06-26  6:22       ` David Woodhouse
  1 sibling, 0 replies; 8+ messages in thread
From: David Woodhouse @ 2025-06-26  6:22 UTC (permalink / raw)
  To: Marc Zyngier, Ilias Stamatis, kvm
  Cc: kvmarm, linux-arm-kernel, oliver.upton, joey.gouly,
	suzuki.poulose, yuzenghui, eric.auger, andre.przywara, will,
	jgrall, ugurus, nh-open-source

[-- Attachment #1: Type: text/plain, Size: 1702 bytes --]

On Tue, 2025-04-22 at 11:47 +0100, David Woodhouse wrote:
> 
> I think it's much better just to let KVM pass the state to userspace
> for migration, just like KVM does for almost all *other* state.

Even without confidential VMs, we've just found another reason to avoid
this abomination where KVM scribbles its state into guest memory
instead of passing it up to userspace for serialization like normal KVM
code does.

When a guest resumes from hibernation, it comes up into a 'boot kernel'
as normal, and that kernel then finds the hibernation signature in its
swap space and restores the original running kernel. That transition
isn't a kexec; it takes yet another slightly different path through the
PM suspend and resume code (in the boot and the resumed kernel, resp.).

We've found that this seems to leave the GIC in the 'wrong' state
somehow. A subsequent KVM serialization for LU/LM causes memory
corruption, because KVM scribbles its state to addresses in guest
memory that the resumed kernel does not expect, causing subsequent
guest crashes.

It's probably going to turn out to be something we can blame on the
guest kernel, just as we can blame the memory corruption addressed by
https://lore.kernel.org/all/20250623132714.965474-2-dwmw2@infradead.org/
on the kernel — but ultimately, it's all just working around the
insanity of the GIC's proclivity for IOMMU-unprotected DMA.

Just because the hardware specification was designed by badgers, that
doesn't mean that KVM has to follow its lead into the host userspace
APIs.

Let's have a way for userspace to serialize GIC state like a normal KVM
device, to a userspace buffer. As Ilias was proposing.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-06-26  6:25 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-14 11:12 [RFC] ARM vGIC-ITS tables serialization when running protected VMs Ilias Stamatis
2025-04-14 11:12 ` [RFC PATCH 1/1] KVM: arm64: vgic-its: Add flag for saving ITTs in userspace buffer Ilias Stamatis
2025-04-15  8:35 ` [RFC] ARM vGIC-ITS tables serialization when running protected VMs Marc Zyngier
2025-04-15  9:44   ` David Woodhouse
2025-04-22 10:47     ` David Woodhouse
2025-04-27 16:43       ` Marc Zyngier
2025-06-26  6:22       ` David Woodhouse
2025-04-27 16:37     ` Marc Zyngier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).