public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 00/14] KVM: ITS hardening for pKVM
@ 2026-03-10 12:49 Sebastian Ene
  2026-03-10 12:49 ` [PATCH 01/14] KVM: arm64: Donate MMIO to the hypervisor Sebastian Ene
                   ` (15 more replies)
  0 siblings, 16 replies; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

This series introduces the necessary machinery to perform trap & emulate
on device access in pKVM. Furthermore, it hardens the GIC/ITS controller to
prevent an attacker from tampering with the hypervisor protected memory
through this device. 

In pKVM, the host kernel is initially trusted to manage the boot process but
its permissions are revoked once KVM initializes. The GIC/ITS device is
configured before the kernel deprivileges itself. Once the hypervisor
becomes available, sanitize the accesses to the ITS controller by
trapping and emulating certain registers and by shadowing some memory
structures used by the ITS.

This is required because the ITS can issue transactions on the memory
bus *directly*, without having an SMMU in front of it, which makes it
an interesting target for crossing the hypervisor-established privilege
boundary.


Patch overview
==============

The first patch is re-used from Mostafa's series[1] which brings SMMU-v3
support to pKVM.

[1] https://lore.kernel.org/linux-iommu/20251117184815.1027271-1-smostafa@google.com/#r

Some of the infrastructure built in that series might intersect and we
agreed to converge on some changes. The patches [1 - 3] allow unmapping
devices from the host address space and installing a handler to trap
accesses from the host. While executing in the handler, enough context
has to be given from mem-abort to perform the emulation of the device
such as: the offset, the access size, direction of the write and private
related data specific to the device. 
The unmapping of the device from the host address space is performed
after the host deprivilege (during _kvm_host_prot_finalize call).

The 4th patch looks up the ITS node from the device tree and adds it to
an array of unmapped devices. It install a handler that forwards all the
MMIO request to mediate the host access inside the emulation layer and
to prevent breaking ITS functionality. 

The 5th patch changes the GIC/ITS driver to exposes two new methods
which will be called from the KVM layer to setup the shadow state and
to take the appropriate locks. This one is the most intrusive as it
changes the current GIC/ITS driver. I tried to avoid creating a
dependency with KVM to keep the GIC driver agnostic of the virtualization
layer but I am happy to explore other options as well. 
To avoid re-programming the ITS device with new shadow structures after
pKVM is ready, I exposed two functions to change the
pointers inside the driver for the following structures:
- the command queue points to a newly allocated queue
- the GITS_BASER<n> tables configured with an indirect layout have the
  first layer shadowed and they point to a new memory region

Patch 6 adds the entry point into the emulation setup and sets up the
shadow command queue. It adds some helper macros to define the offset
register and the associate action that we want to execute in the
emulation. It also unmaps the state passed from the host kernel
to prevent it from playing nasty games later on. The patch
traps accesses to CWRITER register and copies the commands from the
host command queue to the shadow command queue. 

Patch 7 prevents the host from directly accessing the first layer of the
indirect tables held in GITS_BASER<n>. It also prevents the host from
directly accesssing the last layer of the Device Table (since the entries
in this table hold the address of the ITT table) and of the vPE Table
(since the vPE table entries hold the address of the virtual LPI pending
table.

Patches [8-10] sanitize the commands sent to the ITS and their
arguments.

Patches [11-13] restrict the access of the host to certain registers
and prevent undefined behaviour. Prevent the host from re-programming
the tables held in the GITS_BASER register.

The last patch introduces an hvc to setup the ITS emulation and calls
into the ITS driver to setup the shadow state. 


Design
======


1. Command queue shadowing

The ITS hardware supports a command queue which is programmed by the driver
in the GITS_CBASER register. To inform the hardware that a new command
has been added, the driver updates an index into the GITS_CWRITER
register. The driver then reads the GITS_CREADR register to see if the
command was processed or if the queue is stalled.
 
To create a new command, the emulation layer mirrors the behavior
as following:
 (i) The host ITS driver creates a command in the shadow queue:
	its_allocate_entry() -> builder()
 (ii) Notifies the hardware that a new command is available:
	its_post_commands()
 (iii) Hypervisor traps the write to GITS_CWRITER:
	handle_host_mem_abort() -> handle_host_mmio_trap() ->
            pkvm_handle_gic_emulation()
 (iv) Hypervisor copies the command from the host command queue
      to the original queue which is not accessible to the host.
      It parses the command and updates the hardware write.

The driver allocates space for the original command queue and programs
the hardware (GITS_CWRITER). When pKVM becomes available, the driver
allocates a new (shadow) queue and replaces its original pointer to
the queue with this new one. This is to prevent a malicious host from
tampering with the commands sent to the ITS hardware.

The entry point of our emulation shares the memory of the newly
allocated queue with the hypervisor and donates the memory of the
original queue to make it inaccesible to the host.


2. Indirect tables first level shadowing

The ITS hardware supports indirection to minimize the space required to
accommodate large tables (eg. deviceId space used to index the Device Table
is quite sparse). This is a 2-level indirection, with entries from the
first table pointing to a second table.

An attacker in control of the host can insert an address that points to
the hypervisor protected memory in the first level table and then use
subsequent ITS commands to write to this memory (MAPD).

To shadow this tables, we rely on the driver to allocate space for it
and we copy the original content from the table into the copy. When
pKVM becomes available we switch the pointers that hold the orginal
tables to point to the copy.
To keep the tables from the hypervisor in sync with what the host
has, we update the tables when commands are sent to the ITS.


3. Hiding the last layer of the Device Table and vPE Table from the host

An attacker in control of the host kernel can alter the content of these
tables directly (the Arm IHI 0069H.b spec says that is undefined behavior
if entries are created by software). Normally these entries are created in
response of commands sent to the ITS.

A Device Table entry that has the following structure:

type DeviceTableEntry is (
	boolean Valid,
	Address ITT_base,
	bits(5) ITT_size
) 

This can be maliciously created by an attacker and the ITT_base can be
pointed to hypervisor protected memory. The MAPTI command can then be
used to write over the ITT_base with an ITE entry.

Similarly a vCPU Table entry has the following structure:

type VCPUTableEntry is (
	boolean Valid,
	bits(32) RDbase,
	Address VPT_base,
	bits(5) VPT_size
)

VPT_base can be pointed to hypervisor protected memory and then a
command can be used to raise interrupts and set the corresponding
bit. This would give a 1-bit write primitive so is not "as generous"
as the others.


Notes
=====


Performance impact is expected with this as the emulation dance is not
cost free.
I haven't implemented any ITS quirks in the emulation and I don't know
whether we will need it ? (some hardware needs explicit dcache flushing
ITS_FLAGS_CMDQ_NEEDS_FLUSHING). 

Please note that Redistributors trapping hasn't been addressed at all in
this series and the solution is not sufficient but this can be extended
afterwards. 
The current series has been tested with Qemu (-machine
virt,virtualization=true,gic-version=4) and with Pixel 10.


Thanks,
Sebastian E.

Mostafa Saleh (1):
  KVM: arm64: Donate MMIO to the hypervisor

Sebastian Ene (13):
  KVM: arm64: Track host-unmapped MMIO regions in a static array
  KVM: arm64: Support host MMIO trap handlers for unmapped devices
  KVM: arm64: Mediate host access to GIC/ITS MMIO via unmapping
  irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege
  KVM: arm64: Add infrastructure for ITS emulation setup
  KVM: arm64: Restrict host access to the ITS tables
  KVM: arm64: Trap & emulate the ITS MAPD command
  KVM: arm64: Trap & emulate the ITS VMAPP command
  KVM: arm64: Trap & emulate the ITS MAPC command
  KVM: arm64: Restrict host updates to GITS_CTLR
  KVM: arm64: Restrict host updates to GITS_CBASER
  KVM: arm64 Restrict host updates to GITS_BASER
  KVM: arm64: Implement HVC interface for ITS emulation setup

 arch/arm64/include/asm/kvm_arm.h              |   3 +
 arch/arm64/include/asm/kvm_asm.h              |   1 +
 arch/arm64/include/asm/kvm_pkvm.h             |  20 +
 arch/arm64/kvm/hyp/include/nvhe/its_emulate.h |  17 +
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |   2 +
 arch/arm64/kvm/hyp/nvhe/Makefile              |   3 +-
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  14 +
 arch/arm64/kvm/hyp/nvhe/its_emulate.c         | 653 ++++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 134 ++++
 arch/arm64/kvm/hyp/nvhe/setup.c               |  28 +
 arch/arm64/kvm/hyp/pgtable.c                  |   9 +-
 arch/arm64/kvm/pkvm.c                         |  60 ++
 drivers/irqchip/irq-gic-v3-its.c              | 177 ++++-
 include/linux/irqchip/arm-gic-v3.h            |  36 +
 14 files changed, 1126 insertions(+), 31 deletions(-)
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/its_emulate.c

-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 01/14] KVM: arm64: Donate MMIO to the hypervisor
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-12 17:57   ` Fuad Tabba
                     ` (2 more replies)
  2026-03-10 12:49 ` [PATCH 02/14] KVM: arm64: Track host-unmapped MMIO regions in a static array Sebastian Ene
                   ` (14 subsequent siblings)
  15 siblings, 3 replies; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

From: Mostafa Saleh <smostafa@google.com>

Add a function to donate MMIO to the hypervisor so IOMMU hypervisor
drivers can use that to protect the MMIO of IOMMU.
The initial attempt to implement this was to have a new flag to
"___pkvm_host_donate_hyp" to accept MMIO. However that had many problems,
it was quite intrusive for host/hyp to check/set page state to make it
aware of MMIO and to encode the state in the page table in that case.
Which is called in paths that can be sensitive to performance (FFA, VMs..)

As donating MMIO is very rare, and we don’t need to encode the full
state, it’s reasonable to have a separate function to do this.
It will init the host s2 page table with an invalid leaf with the owner ID
to prevent the host from mapping the page on faults.

Also, prevent kvm_pgtable_stage2_unmap() from removing owner ID from
stage-2 PTEs, as this can be triggered from recycle logic under memory
pressure. There is no code relying on this, as all ownership changes is
done via kvm_pgtable_stage2_set_owner()

For error path in IOMMU drivers, add a function to donate MMIO back
from hyp to host.

Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  2 +
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 90 +++++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c                  |  9 +-
 3 files changed, 94 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index 5f9d56754e39..8b617e6fc0e0 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -31,6 +31,8 @@ enum pkvm_component_id {
 };
 
 extern unsigned long hyp_nr_cpus;
+int __pkvm_host_donate_hyp_mmio(u64 pfn);
+int __pkvm_hyp_donate_host_mmio(u64 pfn);
 
 int __pkvm_prot_finalize(void);
 int __pkvm_host_share_hyp(u64 pfn);
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 38f66a56a766..0808367c52e5 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -784,6 +784,96 @@ int __pkvm_host_unshare_hyp(u64 pfn)
 	return ret;
 }
 
+int __pkvm_host_donate_hyp_mmio(u64 pfn)
+{
+	u64 phys = hyp_pfn_to_phys(pfn);
+	void *virt = __hyp_va(phys);
+	int ret;
+	kvm_pte_t pte;
+
+	if (addr_is_memory(phys))
+		return -EINVAL;
+
+	host_lock_component();
+	hyp_lock_component();
+
+	ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
+	if (ret)
+		goto unlock;
+
+	if (pte && !kvm_pte_valid(pte)) {
+		ret = -EPERM;
+		goto unlock;
+	}
+
+	ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL);
+	if (ret)
+		goto unlock;
+	if (pte) {
+		ret = -EBUSY;
+		goto unlock;
+	}
+
+	ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, PAGE_HYP_DEVICE);
+	if (ret)
+		goto unlock;
+	/*
+	 * We set HYP as the owner of the MMIO pages in the host stage-2, for:
+	 * - host aborts: host_stage2_adjust_range() would fail for invalid non zero PTEs.
+	 * - recycle under memory pressure: host_stage2_unmap_dev_all() would call
+	 *   kvm_pgtable_stage2_unmap() which will not clear non zero invalid ptes (counted).
+	 * - other MMIO donation: Would fail as we check that the PTE is valid or empty.
+	 */
+	WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
+				PAGE_SIZE, &host_s2_pool, PKVM_ID_HYP));
+unlock:
+	hyp_unlock_component();
+	host_unlock_component();
+
+	return ret;
+}
+
+int __pkvm_hyp_donate_host_mmio(u64 pfn)
+{
+	u64 phys = hyp_pfn_to_phys(pfn);
+	u64 virt = (u64)__hyp_va(phys);
+	size_t size = PAGE_SIZE;
+	int ret;
+	kvm_pte_t pte;
+
+	if (addr_is_memory(phys))
+		return -EINVAL;
+
+	host_lock_component();
+	hyp_lock_component();
+
+	ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL);
+	if (ret)
+		goto unlock;
+	if (!kvm_pte_valid(pte)) {
+		ret = -ENOENT;
+		goto unlock;
+	}
+
+	ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
+	if (ret)
+		goto unlock;
+
+	if (FIELD_GET(KVM_INVALID_PTE_OWNER_MASK, pte) != PKVM_ID_HYP) {
+		ret = -EPERM;
+		goto unlock;
+	}
+
+	WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, virt, size) != size);
+	WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
+				PAGE_SIZE, &host_s2_pool, PKVM_ID_HOST));
+unlock:
+	hyp_unlock_component();
+	host_unlock_component();
+
+	return ret;
+}
+
 int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
 {
 	u64 phys = hyp_pfn_to_phys(pfn);
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 9b480f947da2..d954058e63ff 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1152,13 +1152,8 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 	kvm_pte_t *childp = NULL;
 	bool need_flush = false;
 
-	if (!kvm_pte_valid(ctx->old)) {
-		if (stage2_pte_is_counted(ctx->old)) {
-			kvm_clear_pte(ctx->ptep);
-			mm_ops->put_page(ctx->ptep);
-		}
-		return 0;
-	}
+	if (!kvm_pte_valid(ctx->old))
+		return stage2_pte_is_counted(ctx->old) ? -EPERM : 0;
 
 	if (kvm_pte_table(ctx->old, ctx->level)) {
 		childp = kvm_pte_follow(ctx->old, mm_ops);
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 02/14] KVM: arm64: Track host-unmapped MMIO regions in a static array
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
  2026-03-10 12:49 ` [PATCH 01/14] KVM: arm64: Donate MMIO to the hypervisor Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-12 19:05   ` Fuad Tabba
  2026-03-24 10:46   ` Vincent Donnefort
  2026-03-10 12:49 ` [PATCH 03/14] KVM: arm64: Support host MMIO trap handlers for unmapped devices Sebastian Ene
                   ` (13 subsequent siblings)
  15 siblings, 2 replies; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Introduce a registry to track protected MMIO regions that are unmapped
from the host stage-2 page tables. These regions are stored in a
fixed-size array and their ownership is donated to the hypervisor during
initialization to ensure host-exclusion and persistent tracking.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/include/asm/kvm_pkvm.h     | 10 ++++++++++
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |  3 +++
 arch/arm64/kvm/hyp/nvhe/setup.c       | 25 +++++++++++++++++++++++++
 3 files changed, 38 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
index 757076ad4ec9..48ec7d519399 100644
--- a/arch/arm64/include/asm/kvm_pkvm.h
+++ b/arch/arm64/include/asm/kvm_pkvm.h
@@ -17,6 +17,16 @@
 
 #define HYP_MEMBLOCK_REGIONS 128
 
+#define PKVM_PROTECTED_REGS_NUM	8
+
+struct pkvm_protected_reg {
+	u64 start_pfn;
+	size_t num_pages;
+};
+
+extern struct pkvm_protected_reg kvm_nvhe_sym(pkvm_protected_regs)[];
+extern unsigned int kvm_nvhe_sym(num_protected_reg);
+
 int pkvm_init_host_vm(struct kvm *kvm);
 int pkvm_create_hyp_vm(struct kvm *kvm);
 bool pkvm_hyp_vm_is_created(struct kvm *kvm);
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 0808367c52e5..7c125836b533 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -23,6 +23,9 @@
 
 struct host_mmu host_mmu;
 
+struct pkvm_protected_reg pkvm_protected_regs[PKVM_PROTECTED_REGS_NUM];
+unsigned int num_protected_reg;
+
 static struct hyp_pool host_s2_pool;
 
 static DEFINE_PER_CPU(struct pkvm_hyp_vm *, __current_vm);
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 90bd014e952f..ad5b96085e1b 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -284,6 +284,27 @@ static int fix_hyp_pgtable_refcnt(void)
 				&walker);
 }
 
+static int unmap_protected_regions(void)
+{
+	struct pkvm_protected_reg *reg;
+	int i, ret, j = 0;
+
+	for (i = 0; i < num_protected_reg; i++) {
+		reg = &pkvm_protected_regs[i];
+		for (j = 0; j < reg->num_pages; j++) {
+			ret = __pkvm_host_donate_hyp_mmio(reg->start_pfn + j);
+			if (ret)
+				goto err_setup;
+		}
+	}
+
+	return 0;
+err_setup:
+	for (j = j - 1; j >= 0; j--)
+		__pkvm_hyp_donate_host_mmio(reg->start_pfn + j);
+	return ret;
+}
+
 void __noreturn __pkvm_init_finalise(void)
 {
 	struct kvm_cpu_context *host_ctxt = host_data_ptr(host_ctxt);
@@ -324,6 +345,10 @@ void __noreturn __pkvm_init_finalise(void)
 	if (ret)
 		goto out;
 
+	ret = unmap_protected_regions();
+	if (ret)
+		goto out;
+
 	ret = hyp_ffa_init(ffa_proxy_pages);
 	if (ret)
 		goto out;
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 03/14] KVM: arm64: Support host MMIO trap handlers for unmapped devices
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
  2026-03-10 12:49 ` [PATCH 01/14] KVM: arm64: Donate MMIO to the hypervisor Sebastian Ene
  2026-03-10 12:49 ` [PATCH 02/14] KVM: arm64: Track host-unmapped MMIO regions in a static array Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-13  9:31   ` Fuad Tabba
  2026-03-24 10:59   ` Vincent Donnefort
  2026-03-10 12:49 ` [PATCH 04/14] KVM: arm64: Mediate host access to GIC/ITS MMIO via unmapping Sebastian Ene
                   ` (12 subsequent siblings)
  15 siblings, 2 replies; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Introduce a mechanism to register callbacks for MMIO accesses to regions
unmapped from the host Stage-2 page tables.

This infrastructure allows the hypervisor to intercept host accesses to
protected or emulated devices. When a Stage-2 fault occurs on a
registered device region, the hypervisor will invoke the associated
callback to emulate the access.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/include/asm/kvm_arm.h      |  3 ++
 arch/arm64/include/asm/kvm_pkvm.h     |  6 ++++
 arch/arm64/kvm/hyp/nvhe/mem_protect.c | 41 +++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/setup.c       |  3 ++
 4 files changed, 53 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 3f9233b5a130..8fe1e80ab3f4 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -304,6 +304,9 @@
 
 /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
 #define HPFAR_MASK	(~UL(0xf))
+
+#define FAR_MASK	GENMASK_ULL(11, 0)
+
 /*
  * We have
  *	PAR	[PA_Shift - 1	: 12] = PA	[PA_Shift - 1 : 12]
diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
index 48ec7d519399..5321ced2f50a 100644
--- a/arch/arm64/include/asm/kvm_pkvm.h
+++ b/arch/arm64/include/asm/kvm_pkvm.h
@@ -19,9 +19,15 @@
 
 #define PKVM_PROTECTED_REGS_NUM	8
 
+struct pkvm_protected_reg;
+
+typedef void (pkvm_emulate_handler)(struct pkvm_protected_reg *region, u64 offset, bool write,
+				    u64 *reg, u8 reg_size);
+
 struct pkvm_protected_reg {
 	u64 start_pfn;
 	size_t num_pages;
+	pkvm_emulate_handler *cb;
 };
 
 extern struct pkvm_protected_reg kvm_nvhe_sym(pkvm_protected_regs)[];
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 7c125836b533..f405d2fbd88f 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -13,6 +13,7 @@
 #include <asm/stage2_pgtable.h>
 
 #include <hyp/fault.h>
+#include <hyp/adjust_pc.h>
 
 #include <nvhe/gfp.h>
 #include <nvhe/memory.h>
@@ -608,6 +609,41 @@ static int host_stage2_idmap(u64 addr)
 	return ret;
 }
 
+static bool handle_host_mmio_trap(struct kvm_cpu_context *host_ctxt, u64 esr, u64 addr)
+{
+	u64 offset, reg_value = 0, start, end;
+	u8 reg_size, reg_index;
+	bool write;
+	int i;
+
+	for (i = 0; i < num_protected_reg; i++) {
+		start = pkvm_protected_regs[i].start_pfn << PAGE_SHIFT;
+		end = start + (pkvm_protected_regs[i].num_pages << PAGE_SHIFT);
+
+		if (start > addr || addr > end)
+			continue;
+
+		reg_size = BIT((esr & ESR_ELx_SAS) >> ESR_ELx_SAS_SHIFT);
+		reg_index = (esr & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
+		write = (esr & ESR_ELx_WNR) == ESR_ELx_WNR;
+		offset = addr - start;
+
+		if (write)
+			reg_value = host_ctxt->regs.regs[reg_index];
+
+		pkvm_protected_regs[i].cb(&pkvm_protected_regs[i], offset, write,
+					  &reg_value, reg_size);
+
+		if (!write)
+			host_ctxt->regs.regs[reg_index] = reg_value;
+
+		kvm_skip_host_instr();
+		return true;
+	}
+
+	return false;
+}
+
 void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
 {
 	struct kvm_vcpu_fault_info fault;
@@ -630,6 +666,11 @@ void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
 	 */
 	BUG_ON(!(fault.hpfar_el2 & HPFAR_EL2_NS));
 	addr = FIELD_GET(HPFAR_EL2_FIPA, fault.hpfar_el2) << 12;
+	addr |= fault.far_el2 & FAR_MASK;
+
+	if (ESR_ELx_EC(esr) == ESR_ELx_EC_DABT_LOW && !addr_is_memory(addr) &&
+	    handle_host_mmio_trap(host_ctxt, esr, addr))
+		return;
 
 	ret = host_stage2_idmap(addr);
 	BUG_ON(ret && ret != -EAGAIN);
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index ad5b96085e1b..f91dfebe9980 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -296,6 +296,9 @@ static int unmap_protected_regions(void)
 			if (ret)
 				goto err_setup;
 		}
+
+		if (reg->cb)
+			reg->cb = kern_hyp_va(reg->cb);
 	}
 
 	return 0;
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 04/14] KVM: arm64: Mediate host access to GIC/ITS MMIO via unmapping
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (2 preceding siblings ...)
  2026-03-10 12:49 ` [PATCH 03/14] KVM: arm64: Support host MMIO trap handlers for unmapped devices Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-13  9:58   ` Fuad Tabba
  2026-03-10 12:49 ` [PATCH 05/14] irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege Sebastian Ene
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Unmap the ITS MMIO region from the host address space to enforce
hypervisor mediation.
Identify the ITS base address from the device tree and store it in a
protected region. A callback is registered to handle host accesses to
this region; currently, the handler simply forwards all MMIO requests
to the physical hardware. This provides the infrastructure for future
hardware state validation without changing current behavior.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/include/asm/kvm_pkvm.h     |  2 ++
 arch/arm64/kvm/hyp/nvhe/Makefile      |  3 ++-
 arch/arm64/kvm/hyp/nvhe/its_emulate.c | 23 ++++++++++++++++
 arch/arm64/kvm/pkvm.c                 | 38 +++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/kvm/hyp/nvhe/its_emulate.c

diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
index 5321ced2f50a..ef00c1bf7d00 100644
--- a/arch/arm64/include/asm/kvm_pkvm.h
+++ b/arch/arm64/include/asm/kvm_pkvm.h
@@ -32,6 +32,8 @@ struct pkvm_protected_reg {
 
 extern struct pkvm_protected_reg kvm_nvhe_sym(pkvm_protected_regs)[];
 extern unsigned int kvm_nvhe_sym(num_protected_reg);
+extern void kvm_nvhe_sym(pkvm_handle_forward_req)(struct pkvm_protected_reg *region, u64 offset,
+						  bool write, u64 *reg, u8 reg_size);
 
 int pkvm_init_host_vm(struct kvm *kvm);
 int pkvm_create_hyp_vm(struct kvm *kvm);
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index a244ec25f8c5..eb43269fbac2 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -24,7 +24,8 @@ CFLAGS_switch.nvhe.o += -Wno-override-init
 
 hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
 	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o page_alloc.o \
-	 cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o
+	 cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o \
+	 its_emulate.o
 hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
 	 ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
 hyp-obj-y += ../../../kernel/smccc-call.o
diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
new file mode 100644
index 000000000000..0eecbb011898
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <asm/kvm_pkvm.h>
+#include <nvhe/mem_protect.h>
+
+
+void pkvm_handle_forward_req(struct pkvm_protected_reg *region, u64 offset, bool write,
+			     u64 *reg, u8 reg_size)
+{
+	void __iomem *addr = __hyp_va((region->start_pfn << PAGE_SHIFT) + offset);
+
+	if (reg_size == sizeof(u32)) {
+		if (!write)
+			*reg = readl_relaxed(addr);
+		else
+			writel_relaxed(*reg, addr);
+	} else if (reg_size == sizeof(u64)) {
+		if (!write)
+			*reg = readq_relaxed(addr);
+		else
+			writeq_relaxed(*reg, addr);
+	}
+}
diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index d7a0f69a9982..a766be6de735 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -11,6 +11,9 @@
 #include <asm/kvm_mmu.h>
 #include <linux/memblock.h>
 #include <linux/mutex.h>
+#include <linux/of_address.h>
+#include <linux/of_reserved_mem.h>
+#include <linux/platform_device.h>
 
 #include <asm/kvm_pkvm.h>
 
@@ -18,6 +21,7 @@
 
 DEFINE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
 
+static struct pkvm_protected_reg *pkvm_protected_regs = kvm_nvhe_sym(pkvm_protected_regs);
 static struct memblock_region *hyp_memory = kvm_nvhe_sym(hyp_memory);
 static unsigned int *hyp_memblock_nr_ptr = &kvm_nvhe_sym(hyp_memblock_nr);
 
@@ -39,6 +43,34 @@ static int __init register_memblock_regions(void)
 	return 0;
 }
 
+static int __init register_protected_regions(void)
+{
+	int i = 0, ret;
+	struct device_node *np;
+	struct resource res;
+
+	for_each_compatible_node(np, NULL, "arm,gic-v3-its") {
+		ret = of_address_to_resource(np, i, &res);
+		if (ret)
+			return ret;
+
+		if (i >= PKVM_PROTECTED_REGS_NUM)
+			return -ENOMEM;
+
+		if (!PAGE_ALIGNED(res.start) || !PAGE_ALIGNED(resource_size(&res)))
+			return -EINVAL;
+
+		pkvm_protected_regs[i].start_pfn = res.start >> PAGE_SHIFT;
+		pkvm_protected_regs[i].num_pages = resource_size(&res) >> PAGE_SHIFT;
+		pkvm_protected_regs[i].cb = lm_alias(&kvm_nvhe_sym(pkvm_handle_forward_req));
+		i++;
+	}
+
+	kvm_nvhe_sym(num_protected_reg) = i;
+
+	return 0;
+}
+
 void __init kvm_hyp_reserve(void)
 {
 	u64 hyp_mem_pages = 0;
@@ -57,6 +89,12 @@ void __init kvm_hyp_reserve(void)
 		return;
 	}
 
+	ret = register_protected_regions();
+	if (ret) {
+		kvm_err("Failed to register protected reg: %d\n", ret);
+		return;
+	}
+
 	hyp_mem_pages += hyp_s1_pgtable_pages();
 	hyp_mem_pages += host_s2_pgtable_pages();
 	hyp_mem_pages += hyp_vm_table_pages();
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 05/14] irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (3 preceding siblings ...)
  2026-03-10 12:49 ` [PATCH 04/14] KVM: arm64: Mediate host access to GIC/ITS MMIO via unmapping Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-13 11:26   ` Fuad Tabba
  2026-03-10 12:49 ` [PATCH 06/14] KVM: arm64: Add infrastructure for ITS emulation setup Sebastian Ene
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Expose two helper functions to support emulated ITS in the hypervisor.
These allow the KVM layer to notify the driver when hypervisor
initialization is complete.
The caller is expected to use the functions as follows:
1. its_start_deprivilege(): Acquire the ITS locks.
2. on_each_cpu(_kvm_host_prot_finalize, ...): Finalizes pKVM init
3. its_end_deprivilege(): Shadow the ITS structures, invoke the KVM
   callback, and release locks.
Specifically, this shadows the ITS command queue and the 1st level
indirect tables. These shadow buffers will be used by the driver after
host deprivilege, while the hypervisor unmaps and takes ownership of the
original structures.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 drivers/irqchip/irq-gic-v3-its.c   | 165 +++++++++++++++++++++++++++--
 include/linux/irqchip/arm-gic-v3.h |  24 +++++
 2 files changed, 178 insertions(+), 11 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 291d7668cc8d..278dbc56f962 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -78,17 +78,6 @@ struct its_collection {
 	u16			col_id;
 };
 
-/*
- * The ITS_BASER structure - contains memory information, cached
- * value of BASER register configuration and ITS page size.
- */
-struct its_baser {
-	void		*base;
-	u64		val;
-	u32		order;
-	u32		psz;
-};
-
 struct its_device;
 
 /*
@@ -5232,6 +5221,160 @@ static int __init its_compute_its_list_map(struct its_node *its)
 	return its_number;
 }
 
+static void its_free_shadow_tables(struct its_shadow_tables *shadow)
+{
+	int i;
+
+	if (shadow->cmd_shadow)
+		its_free_pages(shadow->cmd_shadow, get_order(ITS_CMD_QUEUE_SZ));
+
+	for (i = 0; i < GITS_BASER_NR_REGS; i++) {
+		if (!shadow->tables[i].shadow)
+			continue;
+
+		its_free_pages(shadow->tables[i].shadow, 0);
+	}
+
+	its_free_pages(shadow, 0);
+}
+
+static struct its_shadow_tables *its_get_shadow_tables(struct its_node *its)
+{
+	void *page;
+	struct its_shadow_tables *shadow;
+	int i;
+
+	page = its_alloc_pages_node(its->numa_node, GFP_KERNEL | __GFP_ZERO, 0);
+	if (!page)
+		return NULL;
+
+	shadow = (void *)page_address(page);
+	page = its_alloc_pages_node(its->numa_node,
+				    GFP_KERNEL | __GFP_ZERO,
+				    get_order(ITS_CMD_QUEUE_SZ));
+	if (!page)
+		goto err_alloc_shadow;
+
+	shadow->cmd_shadow = page_address(page);
+	shadow->cmdq_len = ITS_CMD_QUEUE_SZ;
+	shadow->cmd_original = its->cmd_base;
+
+	memcpy(shadow->tables, its->tables, sizeof(struct its_baser) * GITS_BASER_NR_REGS);
+
+	for (i = 0; i < GITS_BASER_NR_REGS; i++) {
+		if (!(shadow->tables[i].val & GITS_BASER_VALID))
+			continue;
+
+		if (!(shadow->tables[i].val & GITS_BASER_INDIRECT))
+			continue;
+
+		page = its_alloc_pages_node(its->numa_node,
+					    GFP_KERNEL | __GFP_ZERO,
+					    shadow->tables[i].order);
+		if (!page)
+			goto err_alloc_shadow;
+
+		shadow->tables[i].shadow = page_address(page);
+
+		memcpy(shadow->tables[i].shadow, shadow->tables[i].base,
+		       PAGE_ORDER_TO_SIZE(shadow->tables[i].order));
+	}
+
+	return shadow;
+
+err_alloc_shadow:
+	its_free_shadow_tables(shadow);
+	return NULL;
+}
+
+void *its_start_depriviledge(void)
+{
+	struct its_node *its;
+	int num_nodes = 0, i = 0;
+	unsigned long *flags;
+
+	raw_spin_lock(&its_lock);
+	list_for_each_entry(its, &its_nodes, entry) {
+		num_nodes++;
+	}
+
+	flags = kzalloc(num_nodes * sizeof(unsigned long), GFP_KERNEL_ACCOUNT);
+	if (!flags) {
+		raw_spin_unlock(&its_lock);
+		return NULL;
+	}
+
+	list_for_each_entry(its, &its_nodes, entry) {
+		raw_spin_lock_irqsave(&its->lock, flags[i++]);
+	}
+
+	return flags;
+}
+EXPORT_SYMBOL_GPL(its_start_depriviledge);
+
+static int its_switch_to_shadow_locked(struct its_node *its, its_init_emulate init_emulate_cb)
+{
+	struct its_shadow_tables *hyp_shadow, shadow;
+	int i, ret;
+	u64 baser, baser_phys;
+
+	hyp_shadow = its_get_shadow_tables(its);
+	if (!hyp_shadow)
+		return -ENOMEM;
+
+	memcpy(&shadow, hyp_shadow, sizeof(shadow));
+	ret = init_emulate_cb(its->phys_base, hyp_shadow);
+	if (ret) {
+		its_free_shadow_tables(hyp_shadow);
+		return ret;
+	}
+
+	/* Switch the driver command queue to use the shadow and save the original */
+	its->cmd_write = (its->cmd_write - its->cmd_base) +
+		(struct its_cmd_block *)shadow.cmd_shadow;
+	its->cmd_base = shadow.cmd_shadow;
+
+	/* Shadow the first level of the indirect tables */
+	for (i = 0; i < GITS_BASER_NR_REGS; i++) {
+		baser = shadow.tables[i].val;
+
+		if (!shadow.tables[i].shadow)
+			continue;
+
+		baser_phys = virt_to_phys(shadow.tables[i].shadow);
+		if (IS_ENABLED(CONFIG_ARM64_64K_PAGES) && (baser_phys >> 48))
+			baser_phys = GITS_BASER_PHYS_52_to_48(baser_phys);
+
+		its->tables[i].val &= ~GENMASK(47, 12);
+		its->tables[i].val |= baser_phys;
+		its->tables[i].base = shadow.tables[i].shadow;
+	}
+
+	return 0;
+}
+
+int its_end_depriviledge(int ret_pkvm_finalize, unsigned long *flags, its_init_emulate cb)
+{
+	struct its_node *its;
+	int i = 0, ret = 0;
+
+	if (!flags || !cb)
+		return -EINVAL;
+
+	list_for_each_entry(its, &its_nodes, entry) {
+		if (!ret_pkvm_finalize && !ret)
+			ret = its_switch_to_shadow_locked(its, cb);
+
+		raw_spin_unlock_irqrestore(&its->lock, flags[i++]);
+	}
+
+	kfree(flags);
+	raw_spin_unlock(&its_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(its_end_depriviledge);
+
 static int __init its_probe_one(struct its_node *its)
 {
 	u64 baser, tmp;
diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
index 0225121f3013..40457a4375d4 100644
--- a/include/linux/irqchip/arm-gic-v3.h
+++ b/include/linux/irqchip/arm-gic-v3.h
@@ -657,6 +657,30 @@ static inline bool gic_enable_sre(void)
 	return !!(val & ICC_SRE_EL1_SRE);
 }
 
+/*
+ * The ITS_BASER structure - contains memory information, cached
+ * value of BASER register configuration and ITS page size.
+ */
+struct its_baser {
+	void		*base;
+	void		*shadow;
+	u64		val;
+	u32		order;
+	u32		psz;
+};
+
+struct its_shadow_tables {
+	struct its_baser	tables[GITS_BASER_NR_REGS];
+	void			*cmd_shadow;
+	void			*cmd_original;
+	size_t			cmdq_len;
+};
+
+typedef int (*its_init_emulate)(phys_addr_t its_phys_base, struct its_shadow_tables *shadow);
+
+void *its_start_depriviledge(void);
+int its_end_depriviledge(int ret, unsigned long *flags, its_init_emulate cb);
+
 #endif
 
 #endif
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 06/14] KVM: arm64: Add infrastructure for ITS emulation setup
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (4 preceding siblings ...)
  2026-03-10 12:49 ` [PATCH 05/14] irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-16 10:46   ` Fuad Tabba
  2026-03-10 12:49 ` [PATCH 07/14] KVM: arm64: Restrict host access to the ITS tables Sebastian Ene
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Share the host command queue with the hypervisor. Donate
the original command queue memory to the hypervisor to ensure
host exclusion and trap accesses on GITS_CWRITE register.
On a CWRITER write, the hypervisor copies commands from the
host's queue to the protected queue before updating the
hardware register.
This ensures the hypervisor mediates all commands sent to
the physical ITS.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/include/asm/kvm_pkvm.h             |   1 +
 arch/arm64/kvm/hyp/include/nvhe/its_emulate.h |  17 ++
 arch/arm64/kvm/hyp/nvhe/its_emulate.c         | 203 ++++++++++++++++++
 3 files changed, 221 insertions(+)
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/its_emulate.h

diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
index ef00c1bf7d00..dc5ef2f9ac49 100644
--- a/arch/arm64/include/asm/kvm_pkvm.h
+++ b/arch/arm64/include/asm/kvm_pkvm.h
@@ -28,6 +28,7 @@ struct pkvm_protected_reg {
 	u64 start_pfn;
 	size_t num_pages;
 	pkvm_emulate_handler *cb;
+	void *priv;
 };
 
 extern struct pkvm_protected_reg kvm_nvhe_sym(pkvm_protected_regs)[];
diff --git a/arch/arm64/kvm/hyp/include/nvhe/its_emulate.h b/arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
new file mode 100644
index 000000000000..6be24c723658
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef __NVHE_ITS_EMULATE_H
+#define __NVHE_ITS_EMULATE_H
+
+
+#include <asm/kvm_pkvm.h>
+
+
+struct its_shadow_tables;
+
+int pkvm_init_gic_its_emulation(phys_addr_t dev_addr, void *priv_state,
+				struct its_shadow_tables *shadow);
+
+void pkvm_handle_gic_emulation(struct pkvm_protected_reg *region, u64 offset, bool write,
+			       u64 *reg, u8 reg_size);
+#endif /* __NVHE_ITS_EMULATE_H */
diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
index 0eecbb011898..4a3ccc90a1a9 100644
--- a/arch/arm64/kvm/hyp/nvhe/its_emulate.c
+++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
@@ -1,8 +1,75 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
 #include <asm/kvm_pkvm.h>
+#include <linux/irqchip/arm-gic-v3.h>
+#include <nvhe/its_emulate.h>
 #include <nvhe/mem_protect.h>
 
+struct its_priv_state {
+	void *base;
+	void *cmd_hyp_base;
+	void *cmd_host_base;
+	void *cmd_host_cwriter;
+	struct its_shadow_tables *shadow;
+	hyp_spinlock_t its_lock;
+};
+
+struct its_handler {
+	u64 offset;
+	u8 access_size;
+	void (*write)(struct its_priv_state *its, u64 offset, u64 value);
+	void (*read)(struct its_priv_state *its, u64 offset, u64 *read);
+};
+
+DEFINE_HYP_SPINLOCK(its_setup_lock);
+
+static void cwriter_write(struct its_priv_state *its, u64 offset, u64 value)
+{
+	u64 cwriter_offset = value & GENMASK(19, 5);
+	int cmd_len, cmd_offset;
+	size_t cmdq_sz = its->shadow->cmdq_len;
+
+	if (cwriter_offset > cmdq_sz)
+		return;
+
+	cmd_offset = its->cmd_host_cwriter - its->cmd_host_base;
+	cmd_len = cwriter_offset - cmd_offset;
+	if (cmd_len < 0)
+		cmd_len = cmdq_sz - cmd_offset;
+
+	if (cmd_offset + cmd_len > cmdq_sz)
+		return;
+
+	memcpy(its->cmd_hyp_base + cmd_offset, its->cmd_host_cwriter, cmd_len);
+
+	its->cmd_host_cwriter = its->cmd_host_base +
+		(cmd_offset + cmd_len) % cmdq_sz;
+	if (its->cmd_host_cwriter == its->cmd_host_base) {
+		memcpy(its->cmd_hyp_base, its->cmd_host_base, cwriter_offset);
+
+		its->cmd_host_cwriter = its->cmd_host_base + cwriter_offset;
+	}
+
+	writeq_relaxed(value, its->base + GITS_CWRITER);
+}
+
+static void cwriter_read(struct its_priv_state *its, u64 offset, u64 *read)
+{
+	*read = readq_relaxed(its->base + GITS_CWRITER);
+}
+
+#define ITS_HANDLER(off, sz, write_cb, read_cb)	\
+{							\
+	.offset = (off),				\
+	.access_size = (sz),				\
+	.write = (write_cb),				\
+	.read = (read_cb),				\
+}
+
+static struct its_handler its_handlers[] = {
+	ITS_HANDLER(GITS_CWRITER, sizeof(u64), cwriter_write, cwriter_read),
+	{},
+};
 
 void pkvm_handle_forward_req(struct pkvm_protected_reg *region, u64 offset, bool write,
 			     u64 *reg, u8 reg_size)
@@ -21,3 +88,139 @@ void pkvm_handle_forward_req(struct pkvm_protected_reg *region, u64 offset, bool
 			writeq_relaxed(*reg, addr);
 	}
 }
+
+void pkvm_handle_gic_emulation(struct pkvm_protected_reg *region, u64 offset, bool write,
+			       u64 *reg, u8 reg_size)
+{
+	struct its_priv_state *its_priv = region->priv;
+	void __iomem *addr;
+	struct its_handler *reg_handler;
+
+	if (!its_priv)
+		return;
+
+	addr = its_priv->base + offset;
+	for (reg_handler = its_handlers; reg_handler->access_size; reg_handler++) {
+		if (reg_handler->offset > offset ||
+		    reg_handler->offset + reg_handler->access_size <= offset)
+			continue;
+
+		if (reg_handler->access_size & (reg_size - 1))
+			continue;
+
+		if (write && reg_handler->write) {
+			hyp_spin_lock(&its_priv->its_lock);
+			reg_handler->write(its_priv, offset, *reg);
+			hyp_spin_unlock(&its_priv->its_lock);
+			return;
+		}
+
+		if (!write && reg_handler->read) {
+			hyp_spin_lock(&its_priv->its_lock);
+			reg_handler->read(its_priv, offset, reg);
+			hyp_spin_unlock(&its_priv->its_lock);
+			return;
+		}
+
+		return;
+	}
+
+	pkvm_handle_forward_req(region, offset, write, reg, reg_size);
+}
+
+static struct pkvm_protected_reg *get_region(phys_addr_t dev_addr)
+{
+	int i;
+	u64 dev_pfn = dev_addr >> PAGE_SHIFT;
+
+	for (i = 0; i < PKVM_PROTECTED_REGS_NUM; i++) {
+		if (pkvm_protected_regs[i].start_pfn == dev_pfn)
+			return &pkvm_protected_regs[i];
+	}
+
+	return NULL;
+}
+
+static int pkvm_setup_its_shadow_cmdq(struct its_shadow_tables *shadow)
+{
+	int ret, i, num_pages;
+	u64 shadow_start_pfn, original_start_pfn;
+	void *cmd_shadow_va = kern_hyp_va(shadow->cmd_shadow);
+
+	shadow_start_pfn = hyp_virt_to_pfn(cmd_shadow_va);
+	original_start_pfn = hyp_virt_to_pfn(kern_hyp_va(shadow->cmd_original));
+	num_pages = shadow->cmdq_len >> PAGE_SHIFT;
+
+	for (i = 0; i < num_pages; i++) {
+		ret = __pkvm_host_share_hyp(shadow_start_pfn + i);
+		if (ret)
+			goto unshare_shadow;
+	}
+
+	ret = hyp_pin_shared_mem(cmd_shadow_va, cmd_shadow_va + shadow->cmdq_len);
+	if (ret)
+		goto unshare_shadow;
+
+	ret = __pkvm_host_donate_hyp(original_start_pfn, num_pages);
+	if (ret)
+		goto unpin_shadow;
+
+	return ret;
+
+unpin_shadow:
+	hyp_unpin_shared_mem(cmd_shadow_va, cmd_shadow_va + shadow->cmdq_len);
+
+unshare_shadow:
+	for (i = i - 1; i >= 0; i--)
+		__pkvm_host_unshare_hyp(shadow_start_pfn + i);
+
+	return ret;
+}
+
+int pkvm_init_gic_its_emulation(phys_addr_t dev_addr, void *host_priv_state,
+				struct its_shadow_tables *host_shadow)
+{
+	int ret;
+	struct its_priv_state *priv_state = kern_hyp_va(host_priv_state);
+	struct its_shadow_tables *shadow = kern_hyp_va(host_shadow);
+	struct pkvm_protected_reg *its_reg;
+
+	hyp_spin_lock(&its_setup_lock);
+	its_reg = get_region(dev_addr);
+	if (!its_reg)
+		return -ENODEV;
+
+	if (its_reg->priv)
+		return -EOPNOTSUPP;
+
+	ret = __pkvm_host_donate_hyp(hyp_virt_to_pfn(priv_state), 1);
+	if (ret)
+		return ret;
+
+	ret = __pkvm_host_donate_hyp(hyp_virt_to_pfn(shadow), 1);
+	if (ret)
+		goto err_with_state;
+
+	ret = pkvm_setup_its_shadow_cmdq(shadow);
+	if (ret)
+		goto err_with_shadow;
+
+	its_reg->priv = priv_state;
+
+	hyp_spin_lock_init(&priv_state->its_lock);
+	priv_state->shadow = shadow;
+	priv_state->base = __hyp_va(dev_addr);
+
+	priv_state->cmd_hyp_base = kern_hyp_va(shadow->cmd_original);
+	priv_state->cmd_host_base = kern_hyp_va(shadow->cmd_shadow);
+	priv_state->cmd_host_cwriter = priv_state->cmd_host_base;
+
+	hyp_spin_unlock(&its_setup_lock);
+
+	return 0;
+err_with_shadow:
+	__pkvm_hyp_donate_host(hyp_virt_to_pfn(shadow), 1);
+err_with_state:
+	__pkvm_hyp_donate_host(hyp_virt_to_pfn(priv_state), 1);
+	return ret;
+}
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 07/14] KVM: arm64: Restrict host access to the ITS tables
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (5 preceding siblings ...)
  2026-03-10 12:49 ` [PATCH 06/14] KVM: arm64: Add infrastructure for ITS emulation setup Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-16 16:13   ` Fuad Tabba
  2026-03-10 12:49 ` [PATCH 08/14] KVM: arm64: Trap & emulate the ITS MAPD command Sebastian Ene
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Setup shadow structures for ITS indirect tables held in
the GITS_BASER<n> registers.
Make the last level of the Device Table and vPE Table
inacessible to the host.
In a direct layout configuration, donate the table to
the hypervisor since the software is not expected to
program them directly.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/kvm/hyp/nvhe/its_emulate.c | 143 ++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)

diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
index 4a3ccc90a1a9..865a5d6353ed 100644
--- a/arch/arm64/kvm/hyp/nvhe/its_emulate.c
+++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
@@ -141,6 +141,145 @@ static struct pkvm_protected_reg *get_region(phys_addr_t dev_addr)
 	return NULL;
 }
 
+static int pkvm_host_unmap_last_level(void *shadow, size_t num_pages, u32 psz)
+{
+	u64 *table = shadow;
+	int ret, i, end = (num_pages << PAGE_SHIFT) / sizeof(table);
+	phys_addr_t table_addr;
+
+	for (i = 0; i < end; i++) {
+		if (!(table[i] & GITS_BASER_VALID))
+			continue;
+
+		table_addr = table[i] & PHYS_MASK;
+		ret = __pkvm_host_donate_hyp(hyp_phys_to_pfn(table_addr), psz >> PAGE_SHIFT);
+		if (ret)
+			goto err_donate;
+	}
+
+	return 0;
+err_donate:
+	for (i = i - 1; i >= 0; i--) {
+		if (!(table[i] & GITS_BASER_VALID))
+			continue;
+
+		table_addr = table[i] & PHYS_MASK;
+		__pkvm_hyp_donate_host(hyp_phys_to_pfn(table_addr), psz >> PAGE_SHIFT);
+	}
+	return ret;
+}
+
+static int pkvm_share_shadow_table(void *shadow, u64 nr_pages)
+{
+	u64 i, ret, start_pfn = hyp_virt_to_pfn(shadow);
+
+	for (i = 0; i < nr_pages; i++) {
+		ret = __pkvm_host_share_hyp(start_pfn + i);
+		if (ret)
+			goto unshare;
+	}
+
+	ret = hyp_pin_shared_mem(shadow, shadow + (nr_pages << PAGE_SHIFT));
+	if (ret)
+		goto unshare;
+
+	return ret;
+unshare:
+	for (i = i - 1; i >= 0; i--)
+		__pkvm_host_unshare_hyp(start_pfn + i);
+	return ret;
+}
+
+static void pkvm_unshare_shadow_table(void *shadow, u64 nr_pages)
+{
+	u64 i, start_pfn = hyp_virt_to_pfn(shadow);
+
+	hyp_unpin_shared_mem(shadow, shadow + (nr_pages << PAGE_SHIFT));
+
+	for (i = 0; i < nr_pages; i++)
+		WARN_ON(__pkvm_host_unshare_hyp(start_pfn + i));
+}
+
+static void pkvm_host_map_last_level(void *shadow, size_t num_pages, u32 psz)
+{
+	u64 *table;
+	int i, end = (num_pages << PAGE_SHIFT) / sizeof(table);
+	phys_addr_t table_addr;
+
+	for (i = 0; i < end; i++) {
+		if (!(table[i] & GITS_BASER_VALID))
+			continue;
+
+		table_addr = table[i] & ~GITS_BASER_VALID;
+		WARN_ON(__pkvm_hyp_donate_host(hyp_phys_to_pfn(table_addr), psz >> PAGE_SHIFT));
+	}
+}
+
+static int pkvm_setup_its_shadow_baser(struct its_shadow_tables *shadow)
+{
+	int i, ret;
+	u64 baser_val, num_pages, type;
+	void *base, *host_base;
+
+	for (i = 0; i < GITS_BASER_NR_REGS; i++) {
+		baser_val = shadow->tables[i].val;
+		if (!(baser_val & GITS_BASER_VALID))
+			continue;
+
+		base = kern_hyp_va(shadow->tables[i].base);
+		num_pages = (1 << shadow->tables[i].order);
+
+		ret = __pkvm_host_donate_hyp(hyp_virt_to_pfn(base), num_pages);
+		if (ret)
+			goto err_donate;
+
+		if (baser_val & GITS_BASER_INDIRECT) {
+			host_base = kern_hyp_va(shadow->tables[i].shadow);
+			ret = pkvm_share_shadow_table(host_base, num_pages);
+			if (ret)
+				goto err_with_donation;
+
+			type = GITS_BASER_TYPE(baser_val);
+			if (type == GITS_BASER_TYPE_COLLECTION)
+				continue;
+
+			ret = pkvm_host_unmap_last_level(base, num_pages,
+							 shadow->tables[i].psz);
+			if (ret)
+				goto err_with_share;
+		}
+	}
+
+	return 0;
+err_with_share:
+	pkvm_unshare_shadow_table(host_base, num_pages);
+err_with_donation:
+	__pkvm_hyp_donate_host(hyp_virt_to_pfn(base), num_pages);
+err_donate:
+	for (i = i - 1; i >= 0; i--) {
+		baser_val = shadow->tables[i].val;
+		if (!(baser_val & GITS_BASER_VALID))
+			continue;
+
+		base = kern_hyp_va(shadow->tables[i].base);
+		num_pages = (1 << shadow->tables[i].order);
+
+		WARN_ON(__pkvm_hyp_donate_host(hyp_virt_to_pfn(base), num_pages));
+		if (baser_val & GITS_BASER_INDIRECT) {
+			host_base = kern_hyp_va(shadow->tables[i].shadow);
+			pkvm_unshare_shadow_table(host_base, num_pages);
+
+			type = GITS_BASER_TYPE(baser_val);
+			if (type == GITS_BASER_TYPE_COLLECTION)
+				continue;
+
+			pkvm_host_map_last_level(base, num_pages, shadow->tables[i].psz);
+		}
+	}
+
+	return ret;
+}
+
 static int pkvm_setup_its_shadow_cmdq(struct its_shadow_tables *shadow)
 {
 	int ret, i, num_pages;
@@ -205,6 +344,10 @@ int pkvm_init_gic_its_emulation(phys_addr_t dev_addr, void *host_priv_state,
 	if (ret)
 		goto err_with_shadow;
 
+	ret = pkvm_setup_its_shadow_baser(shadow);
+	if (ret)
+		goto err_with_shadow;
+
 	its_reg->priv = priv_state;
 
 	hyp_spin_lock_init(&priv_state->its_lock);
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 08/14] KVM: arm64: Trap & emulate the ITS MAPD command
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (6 preceding siblings ...)
  2026-03-10 12:49 ` [PATCH 07/14] KVM: arm64: Restrict host access to the ITS tables Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-17 10:20   ` Fuad Tabba
  2026-03-10 12:49 ` [PATCH 09/14] KVM: arm64: Trap & emulate the ITS VMAPP command Sebastian Ene
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Parse the MAPD command and extract the ITT address to
sanitize it. When the command has the valid bit set,
share the memory that holds the ITT table
with the hypervisor to prevent it from being used
by someone else and track the pages in an array.
When the valid bit is cleared, check if the pages
are tracked and then remove the sharing with the
hypervisor.
Check if we need to do any shadow table updates
in case the device table is configured with an
indirect layout.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/kvm/hyp/nvhe/its_emulate.c | 182 ++++++++++++++++++++++++++
 drivers/irqchip/irq-gic-v3-its.c      |  12 --
 include/linux/irqchip/arm-gic-v3.h    |  12 ++
 3 files changed, 194 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
index 865a5d6353ed..722fe80dc2e5 100644
--- a/arch/arm64/kvm/hyp/nvhe/its_emulate.c
+++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
@@ -12,8 +12,13 @@ struct its_priv_state {
 	void *cmd_host_cwriter;
 	struct its_shadow_tables *shadow;
 	hyp_spinlock_t its_lock;
+	u16 empty_idx;
+	u64 tracked_pfns[];
 };
 
+#define MAX_TRACKED_PFNS	((PAGE_SIZE - offsetof(struct its_priv_state, \
+				  tracked_pfns)) / sizeof(u64))
+
 struct its_handler {
 	u64 offset;
 	u8 access_size;
@@ -23,6 +28,178 @@ struct its_handler {
 
 DEFINE_HYP_SPINLOCK(its_setup_lock);
 
+static int track_pfn_add(struct its_priv_state *its, u64 pfn)
+{
+	int ret, i;
+
+	if (its->empty_idx + 1 >= MAX_TRACKED_PFNS)
+		return -ENOSPC;
+
+	ret = __pkvm_host_share_hyp(pfn);
+	if (ret)
+		return ret;
+
+	its->tracked_pfns[its->empty_idx] = pfn;
+	for (i = 0; i < MAX_TRACKED_PFNS; i++) {
+		if (!its->tracked_pfns[i])
+			break;
+	}
+
+	its->empty_idx = i;
+	return 0;
+}
+
+static int track_pfn_remove(struct its_priv_state *its, u64 pfn)
+{
+	int i, ret;
+
+	for (i = 0; i < MAX_TRACKED_PFNS; i++) {
+		if (its->tracked_pfns[i] != pfn)
+			continue;
+
+		ret = __pkvm_host_unshare_hyp(pfn);
+		if (ret)
+			return ret;
+
+		its->tracked_pfns[i] = 0;
+		its->empty_idx = i;
+	}
+
+	return 0;
+}
+
+static int get_num_itt_pages(struct its_priv_state *its, u8 num_bits)
+{
+	int nr_ites = 1 << (num_bits + 1);
+	u64 size, gits_typer = readq_relaxed(its->base + GITS_TYPER);
+
+	size = nr_ites * (FIELD_GET(GITS_TYPER_ITT_ENTRY_SIZE, gits_typer) + 1);
+	size = max(size, ITS_ITT_ALIGN) + ITS_ITT_ALIGN - 1;
+
+	return PAGE_ALIGN(size) >> PAGE_SHIFT;
+}
+
+static int track_pfn(struct its_priv_state *its, u64 start_pfn, int num_pages, bool remove)
+{
+	int i, ret;
+
+	for (i = 0; i < num_pages; i++) {
+		if (remove)
+			ret = track_pfn_remove(its, start_pfn + i);
+		else
+			ret = track_pfn_add(its, start_pfn + i);
+
+		if (ret)
+			goto err_track;
+	}
+
+	return 0;
+err_track:
+	for (i = i - 1; i >= 0; i--) {
+		if (remove)
+			track_pfn_add(its, start_pfn + i);
+		else
+			track_pfn_remove(its, start_pfn + i);
+	}
+
+	return ret;
+}
+
+static struct its_baser *get_table(struct its_priv_state *its, u64 type)
+{
+	int i;
+	struct its_shadow_tables *shadow = its->shadow;
+
+	for (i = 0; i < GITS_BASER_NR_REGS; i++) {
+		if (GITS_BASER_TYPE(shadow->tables[i].val) == type)
+			return &shadow->tables[i];
+	}
+
+	return NULL;
+}
+
+static int check_table_update(struct its_priv_state *its, u32 id, u64 type)
+{
+	u32 lvl1_idx;
+	u64 esz, *host_table, *hyp_table, new_entry, update;
+	struct its_baser *table = get_table(its, type);
+	int ret;
+	phys_addr_t new_lvl2_table, lvl2_table;
+
+	if (!table)
+		return -EINVAL;
+
+	if (!(table->val & GITS_BASER_INDIRECT))
+		return 0;
+
+	esz = GITS_BASER_ENTRY_SIZE(table->val);
+	lvl1_idx = id / (table->psz / esz);
+
+	host_table = kern_hyp_va(table->shadow);
+	hyp_table = kern_hyp_va(table->base);
+
+	new_entry = host_table[id];
+	update = new_entry ^ hyp_table[id];
+	if (!update || !(update & GITS_BASER_VALID))
+		return 0;
+
+	new_lvl2_table = hyp_phys_to_pfn(new_entry & PHYS_MASK_SHIFT);
+	lvl2_table = hyp_phys_to_pfn(hyp_table[id] & PHYS_MASK_SHIFT);
+	if (new_entry & GITS_BASER_VALID)
+		ret = __pkvm_host_donate_hyp(new_lvl2_table, table->psz >> PAGE_SHIFT);
+	else
+		ret = __pkvm_hyp_donate_host(lvl2_table, table->psz >> PAGE_SHIFT);
+	if (ret)
+		return ret;
+
+	hyp_table[id] = new_entry;
+	return 0;
+}
+
+static int process_its_mapd(struct its_priv_state *its, struct its_cmd_block *cmd)
+{
+	phys_addr_t itt_addr = cmd->raw_cmd[2] & GENMASK(51, 8);
+	u8 size = cmd->raw_cmd[1] & GENMASK(4, 0);
+	bool remove = !(cmd->raw_cmd[2] & BIT(63));
+	u32 device_id = cmd->raw_cmd[0] >> 32;
+	int num_pages, ret;
+	u64 base_pfn;
+
+	if (PAGE_ALIGNED(itt_addr))
+		return -EINVAL;
+
+	base_pfn = hyp_phys_to_pfn(itt_addr);
+	num_pages = get_num_itt_pages(its, size);
+
+	ret = check_table_update(its, device_id, GITS_BASER_TYPE_DEVICE);
+	if (ret)
+		return ret;
+
+	return track_pfn(its, base_pfn, num_pages, remove);
+}
+
+static int parse_its_cmdq(struct its_priv_state *its, int offset, ssize_t len)
+{
+	struct its_cmd_block *cmd = its->cmd_hyp_base + offset;
+	u8 req_type;
+	int ret = 0;
+
+	while (len > 0 && !ret) {
+		req_type = cmd->raw_cmd[0] & GENMASK(7, 0);
+
+		switch (req_type) {
+		case GITS_CMD_MAPD:
+			ret = process_its_mapd(its, cmd);
+			break;
+		}
+
+		cmd++;
+		len -= sizeof(struct its_cmd_block);
+	}
+
+	return ret;
+}
+
 static void cwriter_write(struct its_priv_state *its, u64 offset, u64 value)
 {
 	u64 cwriter_offset = value & GENMASK(19, 5);
@@ -41,11 +218,15 @@ static void cwriter_write(struct its_priv_state *its, u64 offset, u64 value)
 		return;
 
 	memcpy(its->cmd_hyp_base + cmd_offset, its->cmd_host_cwriter, cmd_len);
+	if (parse_its_cmdq(its, cmd_offset, cmd_len))
+		return;
 
 	its->cmd_host_cwriter = its->cmd_host_base +
 		(cmd_offset + cmd_len) % cmdq_sz;
 	if (its->cmd_host_cwriter == its->cmd_host_base) {
 		memcpy(its->cmd_hyp_base, its->cmd_host_base, cwriter_offset);
+		if (parse_its_cmdq(its, cmd_offset, cmd_len))
+			return;
 
 		its->cmd_host_cwriter = its->cmd_host_base + cwriter_offset;
 	}
@@ -357,6 +538,7 @@ int pkvm_init_gic_its_emulation(phys_addr_t dev_addr, void *host_priv_state,
 	priv_state->cmd_hyp_base = kern_hyp_va(shadow->cmd_original);
 	priv_state->cmd_host_base = kern_hyp_va(shadow->cmd_shadow);
 	priv_state->cmd_host_cwriter = priv_state->cmd_host_base;
+	priv_state->empty_idx = 0;
 
 	hyp_spin_unlock(&its_setup_lock);
 
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 278dbc56f962..be78f7dccb9f 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -121,8 +121,6 @@ static DEFINE_PER_CPU(struct its_node *, local_4_1_its);
 #define is_v4_1(its)		(!!((its)->typer & GITS_TYPER_VMAPP))
 #define device_ids(its)		(FIELD_GET(GITS_TYPER_DEVBITS, (its)->typer) + 1)
 
-#define ITS_ITT_ALIGN		SZ_256
-
 /* The maximum number of VPEID bits supported by VLPI commands */
 #define ITS_MAX_VPEID_BITS						\
 	({								\
@@ -515,16 +513,6 @@ struct its_cmd_desc {
 	};
 };
 
-/*
- * The ITS command block, which is what the ITS actually parses.
- */
-struct its_cmd_block {
-	union {
-		u64	raw_cmd[4];
-		__le64	raw_cmd_le[4];
-	};
-};
-
 #define ITS_CMD_QUEUE_SZ		SZ_64K
 #define ITS_CMD_QUEUE_NR_ENTRIES	(ITS_CMD_QUEUE_SZ / sizeof(struct its_cmd_block))
 
diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
index 40457a4375d4..4f7d47f3d970 100644
--- a/include/linux/irqchip/arm-gic-v3.h
+++ b/include/linux/irqchip/arm-gic-v3.h
@@ -612,6 +612,8 @@
  */
 #define GIC_IRQ_TYPE_LPI		0xa110c8ed
 
+#define ITS_ITT_ALIGN			SZ_256
+
 struct rdists {
 	struct {
 		raw_spinlock_t	rd_lock;
@@ -634,6 +636,16 @@ struct rdists {
 	bool			has_vpend_valid_dirty;
 };
 
+/*
+ * The ITS command block, which is what the ITS actually parses.
+ */
+struct its_cmd_block {
+	union {
+		u64	raw_cmd[4];
+		__le64	raw_cmd_le[4];
+	};
+};
+
 struct irq_domain;
 struct fwnode_handle;
 int __init its_lpi_memreserve_init(void);
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 09/14] KVM: arm64: Trap & emulate the ITS VMAPP command
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (7 preceding siblings ...)
  2026-03-10 12:49 ` [PATCH 08/14] KVM: arm64: Trap & emulate the ITS MAPD command Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-10 12:49 ` [PATCH 10/14] KVM: arm64: Trap & emulate the ITS MAPC command Sebastian Ene
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Parse the VMAPP command and extract the virtual pending
table address and the size of the table. When the command
has the valid bit set, share the memory that holds this
table with the hypervisor and track it in an array.
Unshare this from the hypervisor when the valid bit
is cleared.
Check if we need to do any shadow table updates
in case the vPE table is configured with an
indirect layout.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/kvm/hyp/nvhe/its_emulate.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
index 722fe80dc2e5..7049d307a236 100644
--- a/arch/arm64/kvm/hyp/nvhe/its_emulate.c
+++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
@@ -178,6 +178,26 @@ static int process_its_mapd(struct its_priv_state *its, struct its_cmd_block *cm
 	return track_pfn(its, base_pfn, num_pages, remove);
 }
 
+static int process_its_vmapp(struct its_priv_state *its, struct its_cmd_block *cmd)
+{
+	bool remove = !(cmd->raw_cmd[2] & BIT(63));
+	phys_addr_t vpt_addr = cmd->raw_cmd[3] & GENMASK(51, 16);
+	u8 vpt_size = cmd->raw_cmd[3] & GENMASK(4, 0);
+	u32 vpe_id = (cmd->raw_cmd[1] & GENMASK(47, 32)) >> 32;
+	int num_pages;
+	u64 base_pfn;
+	int ret;
+
+	base_pfn = hyp_phys_to_pfn(vpt_addr);
+	num_pages = ALIGN(BIT((vpt_size + 1) >> 3), SZ_64K);
+
+	ret = check_table_update(its, vpe_id, GITS_BASER_TYPE_VCPU);
+	if (ret)
+		return ret;
+
+	return track_pfn(its, base_pfn, num_pages, remove);
+}
+
 static int parse_its_cmdq(struct its_priv_state *its, int offset, ssize_t len)
 {
 	struct its_cmd_block *cmd = its->cmd_hyp_base + offset;
@@ -191,6 +211,10 @@ static int parse_its_cmdq(struct its_priv_state *its, int offset, ssize_t len)
 		case GITS_CMD_MAPD:
 			ret = process_its_mapd(its, cmd);
 			break;
+
+		case GITS_CMD_VMAPP:
+			ret = process_its_vmapp(its, cmd);
+			break;
 		}
 
 		cmd++;
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 10/14] KVM: arm64: Trap & emulate the ITS MAPC command
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (8 preceding siblings ...)
  2026-03-10 12:49 ` [PATCH 09/14] KVM: arm64: Trap & emulate the ITS VMAPP command Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-10 12:49 ` [PATCH 11/14] KVM: arm64: Restrict host updates to GITS_CTLR Sebastian Ene
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Parse the MAPC command and verify if we need to do any
updates to the shadow if the collection table is
configured with an indirect layout.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/kvm/hyp/nvhe/its_emulate.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
index 7049d307a236..4782a9a24caa 100644
--- a/arch/arm64/kvm/hyp/nvhe/its_emulate.c
+++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
@@ -198,6 +198,13 @@ static int process_its_vmapp(struct its_priv_state *its, struct its_cmd_block *c
 	return track_pfn(its, base_pfn, num_pages, remove);
 }
 
+static int process_its_mapc(struct its_priv_state *its, struct its_cmd_block *cmd)
+{
+	u32 icid = cmd->raw_cmd[2] & GENMASK(15, 0);
+
+	return check_table_update(its, icid, GITS_BASER_TYPE_COLLECTION);
+}
+
 static int parse_its_cmdq(struct its_priv_state *its, int offset, ssize_t len)
 {
 	struct its_cmd_block *cmd = its->cmd_hyp_base + offset;
@@ -215,6 +222,10 @@ static int parse_its_cmdq(struct its_priv_state *its, int offset, ssize_t len)
 		case GITS_CMD_VMAPP:
 			ret = process_its_vmapp(its, cmd);
 			break;
+
+		case GITS_CMD_MAPC:
+			ret = process_its_mapc(its, cmd);
+			break;
 		}
 
 		cmd++;
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 11/14] KVM: arm64: Restrict host updates to GITS_CTLR
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (9 preceding siblings ...)
  2026-03-10 12:49 ` [PATCH 10/14] KVM: arm64: Trap & emulate the ITS MAPC command Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-10 12:49 ` [PATCH 12/14] KVM: arm64: Restrict host updates to GITS_CBASER Sebastian Ene
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Prevent unpredictable hardware behavior when the
host tries to enable the ITS while it is not in
quiescent state.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/kvm/hyp/nvhe/its_emulate.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
index 4782a9a24caa..539d2ee3b58e 100644
--- a/arch/arm64/kvm/hyp/nvhe/its_emulate.c
+++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
@@ -274,6 +274,23 @@ static void cwriter_read(struct its_priv_state *its, u64 offset, u64 *read)
 	*read = readq_relaxed(its->base + GITS_CWRITER);
 }
 
+static void ctlr_read(struct its_priv_state *its, u64 offset, u64 *read)
+{
+	*read = readq_relaxed(its->base + GITS_CTLR);
+}
+
+static void ctlr_write(struct its_priv_state *its, u64 offset, u64 value)
+{
+	u64 ctlr = readq_relaxed(its->base + GITS_CTLR);
+	bool is_quiescent = !!(ctlr & GITS_CTLR_QUIESCENT);
+	bool is_enabled = !!(ctlr & GITS_CTLR_ENABLE);
+
+	if (!is_enabled && (value & GITS_CTLR_ENABLE) && !is_quiescent)
+		return;
+
+	writeq_relaxed(value, its->base + GITS_CTLR);
+}
+
 #define ITS_HANDLER(off, sz, write_cb, read_cb)	\
 {							\
 	.offset = (off),				\
@@ -284,6 +301,7 @@ static void cwriter_read(struct its_priv_state *its, u64 offset, u64 *read)
 
 static struct its_handler its_handlers[] = {
 	ITS_HANDLER(GITS_CWRITER, sizeof(u64), cwriter_write, cwriter_read),
+	ITS_HANDLER(GITS_CTLR, sizeof(u64), ctlr_write, ctlr_read),
 	{},
 };
 
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 12/14] KVM: arm64: Restrict host updates to GITS_CBASER
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (10 preceding siblings ...)
  2026-03-10 12:49 ` [PATCH 11/14] KVM: arm64: Restrict host updates to GITS_CTLR Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-10 12:49 ` [PATCH 13/14] KVM: arm64: Restrict host updates to GITS_BASER Sebastian Ene
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Prevent the host from updating the ITS command queue base address
(GITS_CBASER) while the ITS is enabled or not in a quiescent state.
This enforcement prevents unpredictable hardware behavior and ensures
the host cannot update the hardware with a new queue address behind the
hypervisor's back, which would bypass the command queue shadowing
mechanism.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/kvm/hyp/nvhe/its_emulate.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
index 539d2ee3b58e..9715f15cd432 100644
--- a/arch/arm64/kvm/hyp/nvhe/its_emulate.c
+++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
@@ -291,6 +291,30 @@ static void ctlr_write(struct its_priv_state *its, u64 offset, u64 value)
 	writeq_relaxed(value, its->base + GITS_CTLR);
 }
 
+static void cbaser_write(struct its_priv_state *its, u64 offset, u64 value)
+{
+	u64 ctlr = readq_relaxed(its->base + GITS_CTLR);
+	int num_pages;
+
+	if ((ctlr & GITS_CTLR_ENABLE) ||
+	    !(ctlr & GITS_CTLR_QUIESCENT))
+		return;
+
+	num_pages = its->shadow->cmdq_len / SZ_4K;
+	value &= ~GENMASK(7, 0) | ~GENMASK_ULL(51, 12);
+
+	value |= (num_pages - 1) & GENMASK(7, 0);
+	value |= __hyp_pa(its->cmd_hyp_base) & GENMASK_ULL(51, 12);
+
+	its->cmd_host_cwriter = its->cmd_host_base;
+	writeq_relaxed(value, its->base + GITS_CBASER);
+}
+
+static void cbaser_read(struct its_priv_state *its, u64 offset, u64 *read)
+{
+	*read = readq_relaxed(its->base + GITS_CBASER);
+}
+
 #define ITS_HANDLER(off, sz, write_cb, read_cb)	\
 {							\
 	.offset = (off),				\
@@ -302,6 +326,7 @@ static void ctlr_write(struct its_priv_state *its, u64 offset, u64 value)
 static struct its_handler its_handlers[] = {
 	ITS_HANDLER(GITS_CWRITER, sizeof(u64), cwriter_write, cwriter_read),
 	ITS_HANDLER(GITS_CTLR, sizeof(u64), ctlr_write, ctlr_read),
+	ITS_HANDLER(GITS_CBASER, sizeof(u64), cbaser_write, cbaser_read),
 	{},
 };
 
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 13/14] KVM: arm64: Restrict host updates to GITS_BASER
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (11 preceding siblings ...)
  2026-03-10 12:49 ` [PATCH 12/14] KVM: arm64: Restrict host updates to GITS_CBASER Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-10 12:49 ` [PATCH 14/14] KVM: arm64: Implement HVC interface for ITS emulation setup Sebastian Ene
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Prevent the host from updating the ITS tables while
the ITS is enabled and the tables are already set.
This enforcement prevents unpredictable hardware behavior
and ensures the host cannot update the hardware with
an unverified table address or size or change its layout.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/kvm/hyp/nvhe/its_emulate.c | 45 +++++++++++++++++++++++----
 1 file changed, 39 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
index 9715f15cd432..e4136a4a2ecb 100644
--- a/arch/arm64/kvm/hyp/nvhe/its_emulate.c
+++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
@@ -22,6 +22,7 @@ struct its_priv_state {
 struct its_handler {
 	u64 offset;
 	u8 access_size;
+	u8 num_registers;
 	void (*write)(struct its_priv_state *its, u64 offset, u64 value);
 	void (*read)(struct its_priv_state *its, u64 offset, u64 *read);
 };
@@ -315,18 +316,48 @@ static void cbaser_read(struct its_priv_state *its, u64 offset, u64 *read)
 	*read = readq_relaxed(its->base + GITS_CBASER);
 }
 
-#define ITS_HANDLER(off, sz, write_cb, read_cb)	\
+static void baser_write(struct its_priv_state *its, u64 offset, u64 value)
+{
+	u64 baser, ctlr = readq_relaxed(its->base + GITS_CTLR);
+	int baser_idx;
+
+	if ((ctlr & GITS_CTLR_ENABLE) ||
+	    !(ctlr & GITS_CTLR_QUIESCENT))
+		return;
+
+	baser_idx = (offset - GITS_BASER) >> 3;
+	baser = its->shadow->tables[baser_idx].val;
+	if ((value & GITS_BASER_INDIRECT) != (baser & GITS_BASER_INDIRECT))
+		return;
+
+	value &= ~GENMASK(47, 12) | ~GENMASK(9, 0);
+	value |= (baser & GENMASK(47, 12)) | (baser & GENMASK(9, 0));
+
+	writeq_relaxed(value, its->base + offset);
+}
+
+static void baser_read(struct its_priv_state *its, u64 offset, u64 *read)
+{
+	*read = readq_relaxed(its->base + offset);
+}
+
+#define ITS_HANDLER(off, sz, num, write_cb, read_cb)	\
 {							\
 	.offset = (off),				\
 	.access_size = (sz),				\
+	.num_registers = (num),				\
 	.write = (write_cb),				\
 	.read = (read_cb),				\
 }
 
+#define ITS_REG(off, sz, write_cb, read_cb)	\
+	ITS_HANDLER(off, sz, 1, write_cb, read_cb)
+
 static struct its_handler its_handlers[] = {
-	ITS_HANDLER(GITS_CWRITER, sizeof(u64), cwriter_write, cwriter_read),
-	ITS_HANDLER(GITS_CTLR, sizeof(u64), ctlr_write, ctlr_read),
-	ITS_HANDLER(GITS_CBASER, sizeof(u64), cbaser_write, cbaser_read),
+	ITS_REG(GITS_CWRITER, sizeof(u64), cwriter_write, cwriter_read),
+	ITS_REG(GITS_CTLR, sizeof(u64), ctlr_write, ctlr_read),
+	ITS_REG(GITS_CBASER, sizeof(u64), cbaser_write, cbaser_read),
+	ITS_HANDLER(GITS_BASER, sizeof(u64), 8, baser_write, baser_read),
 	{},
 };
 
@@ -354,14 +385,16 @@ void pkvm_handle_gic_emulation(struct pkvm_protected_reg *region, u64 offset, bo
 	struct its_priv_state *its_priv = region->priv;
 	void __iomem *addr;
 	struct its_handler *reg_handler;
+	u64 end;
 
 	if (!its_priv)
 		return;
 
 	addr = its_priv->base + offset;
 	for (reg_handler = its_handlers; reg_handler->access_size; reg_handler++) {
-		if (reg_handler->offset > offset ||
-		    reg_handler->offset + reg_handler->access_size <= offset)
+		end = reg_handler->offset + reg_handler->access_size * reg_handler->num_registers;
+
+		if (reg_handler->offset > offset || end <= offset)
 			continue;
 
 		if (reg_handler->access_size & (reg_size - 1))
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 14/14] KVM: arm64: Implement HVC interface for ITS emulation setup
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (12 preceding siblings ...)
  2026-03-10 12:49 ` [PATCH 13/14] KVM: arm64: Restrict host updates to GITS_BASER Sebastian Ene
@ 2026-03-10 12:49 ` Sebastian Ene
  2026-03-12 17:56 ` [RFC PATCH 00/14] KVM: ITS hardening for pKVM Fuad Tabba
  2026-03-13 15:18 ` Mostafa Saleh
  15 siblings, 0 replies; 36+ messages in thread
From: Sebastian Ene @ 2026-03-10 12:49 UTC (permalink / raw)
  To: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, sebastianene, smostafa,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Introduce a new HVC to allow the host to trigger the ITS emulation
setup.

This interface notifies the ITS driver that hypervisor initialization is
complete. Upon invocation, the hypervisor replaces the initial
"trap-and-forward" MMIO handler with a full-featured emulation handler.
This transition enables mediated access to the ITS hardware, enforcing
the verifications required for a protected hypervisor environment.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/include/asm/kvm_asm.h   |  1 +
 arch/arm64/include/asm/kvm_pkvm.h  |  3 ++-
 arch/arm64/kvm/hyp/nvhe/hyp-main.c | 14 ++++++++++++++
 arch/arm64/kvm/pkvm.c              | 24 +++++++++++++++++++++++-
 4 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index a1ad12c72ebf..550dafee88ef 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -89,6 +89,7 @@ enum __kvm_host_smccc_func {
 	__KVM_HOST_SMCCC_FUNC___pkvm_vcpu_load,
 	__KVM_HOST_SMCCC_FUNC___pkvm_vcpu_put,
 	__KVM_HOST_SMCCC_FUNC___pkvm_tlb_flush_vmid,
+	__KVM_HOST_SMCCC_FUNC___pkvm_init_its_emulation,
 };
 
 #define DECLARE_KVM_VHE_SYM(sym)	extern char sym[]
diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
index dc5ef2f9ac49..20fb2678a9b9 100644
--- a/arch/arm64/include/asm/kvm_pkvm.h
+++ b/arch/arm64/include/asm/kvm_pkvm.h
@@ -35,7 +35,8 @@ extern struct pkvm_protected_reg kvm_nvhe_sym(pkvm_protected_regs)[];
 extern unsigned int kvm_nvhe_sym(num_protected_reg);
 extern void kvm_nvhe_sym(pkvm_handle_forward_req)(struct pkvm_protected_reg *region, u64 offset,
 						  bool write, u64 *reg, u8 reg_size);
-
+extern void kvm_nvhe_sym(pkvm_handle_gic_emulation)(struct pkvm_protected_reg *region, u64 offset,
+						    bool write, u64 *reg, u8 reg_size);
 int pkvm_init_host_vm(struct kvm *kvm);
 int pkvm_create_hyp_vm(struct kvm *kvm);
 bool pkvm_hyp_vm_is_created(struct kvm *kvm);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index e7790097db93..4e58e24a1eed 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -14,6 +14,7 @@
 #include <asm/kvm_hyp.h>
 #include <asm/kvm_mmu.h>
 
+#include <nvhe/its_emulate.h>
 #include <nvhe/ffa.h>
 #include <nvhe/mem_protect.h>
 #include <nvhe/mm.h>
@@ -421,6 +422,18 @@ static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
 	__kvm_tlb_flush_vmid(kern_hyp_va(mmu));
 }
 
+static void handle___pkvm_init_its_emulation(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(phys_addr_t, dev_addr, host_ctxt, 1);
+	DECLARE_REG(void *, its_state, host_ctxt, 2);
+	DECLARE_REG(struct its_shadow_tables *, shadow, host_ctxt, 3);
+
+	if (!is_protected_kvm_enabled())
+		return;
+
+	cpu_reg(host_ctxt, 1) = pkvm_init_gic_its_emulation(dev_addr, its_state, shadow);
+}
+
 static void handle___pkvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1);
@@ -630,6 +643,7 @@ static const hcall_t host_hcall[] = {
 	HANDLE_FUNC(__pkvm_vcpu_load),
 	HANDLE_FUNC(__pkvm_vcpu_put),
 	HANDLE_FUNC(__pkvm_tlb_flush_vmid),
+	HANDLE_FUNC(__pkvm_init_its_emulation),
 };
 
 static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index a766be6de735..5399998d5235 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -6,6 +6,7 @@
 
 #include <linux/init.h>
 #include <linux/interval_tree_generic.h>
+#include <linux/irqchip/arm-gic-v3.h>
 #include <linux/kmemleak.h>
 #include <linux/kvm_host.h>
 #include <asm/kvm_mmu.h>
@@ -62,7 +63,7 @@ static int __init register_protected_regions(void)
 
 		pkvm_protected_regs[i].start_pfn = res.start >> PAGE_SHIFT;
 		pkvm_protected_regs[i].num_pages = resource_size(&res) >> PAGE_SHIFT;
-		pkvm_protected_regs[i].cb = lm_alias(&kvm_nvhe_sym(pkvm_handle_forward_req));
+		pkvm_protected_regs[i].cb = lm_alias(&kvm_nvhe_sym(pkvm_handle_gic_emulation));
 		i++;
 	}
 
@@ -286,16 +287,37 @@ static void __init _kvm_host_prot_finalize(void *arg)
 		WRITE_ONCE(*err, -EINVAL);
 }
 
+static int pkvm_init_its_emulation(phys_addr_t dev_addr, struct its_shadow_tables *shadow)
+{
+	void *its_state;
+	int ret;
+
+	its_state = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
+	if (!its_state)
+		return -ENOMEM;
+
+	ret = kvm_call_hyp_nvhe(__pkvm_init_its_emulation, dev_addr, its_state, shadow);
+	if (ret)
+		free_page((unsigned long)its_state);
+
+	return ret;
+}
+
 static int __init pkvm_drop_host_privileges(void)
 {
 	int ret = 0;
+	void *flags;
 
 	/*
 	 * Flip the static key upfront as that may no longer be possible
 	 * once the host stage 2 is installed.
 	 */
 	static_branch_enable(&kvm_protected_mode_initialized);
+
+	flags = its_start_depriviledge();
 	on_each_cpu(_kvm_host_prot_finalize, &ret, 1);
+	its_end_depriviledge(ret, flags, &pkvm_init_its_emulation);
+
 	return ret;
 }
 
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 00/14] KVM: ITS hardening for pKVM
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (13 preceding siblings ...)
  2026-03-10 12:49 ` [PATCH 14/14] KVM: arm64: Implement HVC interface for ITS emulation setup Sebastian Ene
@ 2026-03-12 17:56 ` Fuad Tabba
  2026-03-20 14:42   ` Sebastian Ene
  2026-03-13 15:18 ` Mostafa Saleh
  15 siblings, 1 reply; 36+ messages in thread
From: Fuad Tabba @ 2026-03-12 17:56 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

Hi Sebastian,

On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
>
> This series introduces the necessary machinery to perform trap & emulate
> on device access in pKVM. Furthermore, it hardens the GIC/ITS controller to
> prevent an attacker from tampering with the hypervisor protected memory
> through this device.
>
> In pKVM, the host kernel is initially trusted to manage the boot process but
> its permissions are revoked once KVM initializes. The GIC/ITS device is
> configured before the kernel deprivileges itself. Once the hypervisor
> becomes available, sanitize the accesses to the ITS controller by
> trapping and emulating certain registers and by shadowing some memory
> structures used by the ITS.
>
> This is required because the ITS can issue transactions on the memory
> bus *directly*, without having an SMMU in front of it, which makes it
> an interesting target for crossing the hypervisor-established privilege
> boundary.
>
>
> Patch overview
> ==============
>
> The first patch is re-used from Mostafa's series[1] which brings SMMU-v3
> support to pKVM.
>
> [1] https://lore.kernel.org/linux-iommu/20251117184815.1027271-1-smostafa@google.com/#r
>
> Some of the infrastructure built in that series might intersect and we
> agreed to converge on some changes. The patches [1 - 3] allow unmapping
> devices from the host address space and installing a handler to trap
> accesses from the host. While executing in the handler, enough context
> has to be given from mem-abort to perform the emulation of the device
> such as: the offset, the access size, direction of the write and private
> related data specific to the device.
> The unmapping of the device from the host address space is performed
> after the host deprivilege (during _kvm_host_prot_finalize call).
>
> The 4th patch looks up the ITS node from the device tree and adds it to
> an array of unmapped devices. It install a handler that forwards all the
> MMIO request to mediate the host access inside the emulation layer and
> to prevent breaking ITS functionality.
>
> The 5th patch changes the GIC/ITS driver to exposes two new methods
> which will be called from the KVM layer to setup the shadow state and
> to take the appropriate locks. This one is the most intrusive as it
> changes the current GIC/ITS driver. I tried to avoid creating a
> dependency with KVM to keep the GIC driver agnostic of the virtualization
> layer but I am happy to explore other options as well.
> To avoid re-programming the ITS device with new shadow structures after
> pKVM is ready, I exposed two functions to change the
> pointers inside the driver for the following structures:
> - the command queue points to a newly allocated queue
> - the GITS_BASER<n> tables configured with an indirect layout have the
>   first layer shadowed and they point to a new memory region

We used the term shadow for the hyp version of structs in an early
pKVM patch series, but after a bit of discussion, we refer to it as
the hypervisor state [1]. So please use this terminology instead of
shadow.

[1] https://lore.kernel.org/all/YthwzIS18mutjGhN@google.com/

> Patch 6 adds the entry point into the emulation setup and sets up the
> shadow command queue. It adds some helper macros to define the offset
> register and the associate action that we want to execute in the
> emulation. It also unmaps the state passed from the host kernel
> to prevent it from playing nasty games later on. The patch
> traps accesses to CWRITER register and copies the commands from the
> host command queue to the shadow command queue.
>
> Patch 7 prevents the host from directly accessing the first layer of the
> indirect tables held in GITS_BASER<n>. It also prevents the host from
> directly accesssing the last layer of the Device Table (since the entries
> in this table hold the address of the ITT table) and of the vPE Table
> (since the vPE table entries hold the address of the virtual LPI pending
> table.
>
> Patches [8-10] sanitize the commands sent to the ITS and their
> arguments.
>
> Patches [11-13] restrict the access of the host to certain registers
> and prevent undefined behaviour. Prevent the host from re-programming
> the tables held in the GITS_BASER register.
>
> The last patch introduces an hvc to setup the ITS emulation and calls
> into the ITS driver to setup the shadow state.
>
>
> Design
> ======
>
>
> 1. Command queue shadowing
>
> The ITS hardware supports a command queue which is programmed by the driver
> in the GITS_CBASER register. To inform the hardware that a new command
> has been added, the driver updates an index into the GITS_CWRITER

It updates a base address offset, but that's probably what you meant.

> register. The driver then reads the GITS_CREADR register to see if the
> command was processed or if the queue is stalled.
>
> To create a new command, the emulation layer mirrors the behavior
> as following:
>  (i) The host ITS driver creates a command in the shadow queue:
>         its_allocate_entry() -> builder()
>  (ii) Notifies the hardware that a new command is available:
>         its_post_commands()
>  (iii) Hypervisor traps the write to GITS_CWRITER:
>         handle_host_mem_abort() -> handle_host_mmio_trap() ->
>             pkvm_handle_gic_emulation()
>  (iv) Hypervisor copies the command from the host command queue
>       to the original queue which is not accessible to the host.
>       It parses the command and updates the hardware write.
>
> The driver allocates space for the original command queue and programs
> the hardware (GITS_CWRITER). When pKVM becomes available, the driver

You mean GITS_CBASER, right?

> allocates a new (shadow) queue and replaces its original pointer to
> the queue with this new one. This is to prevent a malicious host from
> tampering with the commands sent to the ITS hardware.
>
> The entry point of our emulation shares the memory of the newly
> allocated queue with the hypervisor and donates the memory of the
> original queue to make it inaccesible to the host.
>
>
> 2. Indirect tables first level shadowing
>
> The ITS hardware supports indirection to minimize the space required to
> accommodate large tables (eg. deviceId space used to index the Device Table
> is quite sparse). This is a 2-level indirection, with entries from the
> first table pointing to a second table.
>
> An attacker in control of the host can insert an address that points to
> the hypervisor protected memory in the first level table and then use
> subsequent ITS commands to write to this memory (MAPD).
>
> To shadow this tables, we rely on the driver to allocate space for it
> and we copy the original content from the table into the copy. When
> pKVM becomes available we switch the pointers that hold the orginal
> tables to point to the copy.
> To keep the tables from the hypervisor in sync with what the host
> has, we update the tables when commands are sent to the ITS.
>
>
> 3. Hiding the last layer of the Device Table and vPE Table from the host
>
> An attacker in control of the host kernel can alter the content of these
> tables directly (the Arm IHI 0069H.b spec says that is undefined behavior
> if entries are created by software). Normally these entries are created in
> response of commands sent to the ITS.

nit: unpredictable behavior. Undefined usually refers to instructions.

>
> A Device Table entry that has the following structure:
>
> type DeviceTableEntry is (
>         boolean Valid,
>         Address ITT_base,
>         bits(5) ITT_size
> )

Be careful, this might be true for a specific GIC implementation,
e.g., Arm CoreLink GIC-600, but according to the spec (5.2) the
formats of the tables in system memory are IMPLEMENTATION DEFINED. If
the format is relevant to us, then we verify specific GIC
implementation via GITS_IIDR. If the series depends on this, then we
must decide what to do in case the specific implementation does not
match what we expect.

> This can be maliciously created by an attacker and the ITT_base can be
> pointed to hypervisor protected memory. The MAPTI command can then be
> used to write over the ITT_base with an ITE entry.

You mean it writes to the memory addressed by ITT_base, rather than
writes over the ITT_base itself.

> Similarly a vCPU Table entry has the following structure:
>
> type VCPUTableEntry is (
>         boolean Valid,
>         bits(32) RDbase,
>         Address VPT_base,
>         bits(5) VPT_size
> )
>
> VPT_base can be pointed to hypervisor protected memory and then a
> command can be used to raise interrupts and set the corresponding
> bit. This would give a 1-bit write primitive so is not "as generous"
> as the others.
>
>
> Notes
> =====
>
>
> Performance impact is expected with this as the emulation dance is not
> cost free.
> I haven't implemented any ITS quirks in the emulation and I don't know
> whether we will need it ? (some hardware needs explicit dcache flushing
> ITS_FLAGS_CMDQ_NEEDS_FLUSHING).

It's not a quirk. We should handle this in the next respin, because
cache maintenance of the command queue is an explicit architectural
requirement depending on how the hardware is integrated and
configured.

According to the spec, the cacheability attributes of the ITS command
queue are strictly governed by the InnerCache and OuterCache fields of
the GITS_CBASER register. These fields can be configured for various
memory types, including Device-nGnRnE or Normal Non-cacheable.

Because pKVM now takes responsibility for physically writing the
command packets into the true hardware queue, the hypervisor must obey
the cacheability attributes programmed into the physical GITS_CBASER.

If the software provisions GITS_CBASER as Non-cacheable, the
hypervisor must perform explicit data cache maintenance (such as DC
CVAU or DC CVAC) after copying the commands to the shadow queue. If
you don't implement this, the physical ITS hardware (acting as a
non-coherent bus master) will read stale memory, which will inevitably
lead to queue stalls or the ITS executing garbage commands.

Since we are shielding the physical queue from the host, we inherit
the host's responsibility to manage its cache coherency based on the
GITS_CBASER configuration.

> Please note that Redistributors trapping hasn't been addressed at all in
> this series and the solution is not sufficient but this can be extended
> afterwards.
> The current series has been tested with Qemu (-machine
> virt,virtualization=true,gic-version=4) and with Pixel 10.

It would be helpful to mention that this is based on Linux 7.0-rc3
(applied cleanly, and confirmed with you offline).

Also, it would be helpful if you could share how to tested this
series, and how we could reproduce your tests.

Thanks,
/fuad


>
>
> Thanks,
> Sebastian E.
>
> Mostafa Saleh (1):
>   KVM: arm64: Donate MMIO to the hypervisor
>
> Sebastian Ene (13):
>   KVM: arm64: Track host-unmapped MMIO regions in a static array
>   KVM: arm64: Support host MMIO trap handlers for unmapped devices
>   KVM: arm64: Mediate host access to GIC/ITS MMIO via unmapping
>   irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege
>   KVM: arm64: Add infrastructure for ITS emulation setup
>   KVM: arm64: Restrict host access to the ITS tables
>   KVM: arm64: Trap & emulate the ITS MAPD command
>   KVM: arm64: Trap & emulate the ITS VMAPP command
>   KVM: arm64: Trap & emulate the ITS MAPC command
>   KVM: arm64: Restrict host updates to GITS_CTLR
>   KVM: arm64: Restrict host updates to GITS_CBASER
>   KVM: arm64 Restrict host updates to GITS_BASER
>   KVM: arm64: Implement HVC interface for ITS emulation setup
>
>  arch/arm64/include/asm/kvm_arm.h              |   3 +
>  arch/arm64/include/asm/kvm_asm.h              |   1 +
>  arch/arm64/include/asm/kvm_pkvm.h             |  20 +
>  arch/arm64/kvm/hyp/include/nvhe/its_emulate.h |  17 +
>  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |   2 +
>  arch/arm64/kvm/hyp/nvhe/Makefile              |   3 +-
>  arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  14 +
>  arch/arm64/kvm/hyp/nvhe/its_emulate.c         | 653 ++++++++++++++++++
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 134 ++++
>  arch/arm64/kvm/hyp/nvhe/setup.c               |  28 +
>  arch/arm64/kvm/hyp/pgtable.c                  |   9 +-
>  arch/arm64/kvm/pkvm.c                         |  60 ++
>  drivers/irqchip/irq-gic-v3-its.c              | 177 ++++-
>  include/linux/irqchip/arm-gic-v3.h            |  36 +
>  14 files changed, 1126 insertions(+), 31 deletions(-)
>  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
>  create mode 100644 arch/arm64/kvm/hyp/nvhe/its_emulate.c
>
> --
> 2.53.0.473.g4a7958ca14-goog
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 01/14] KVM: arm64: Donate MMIO to the hypervisor
  2026-03-10 12:49 ` [PATCH 01/14] KVM: arm64: Donate MMIO to the hypervisor Sebastian Ene
@ 2026-03-12 17:57   ` Fuad Tabba
  2026-03-13 10:40   ` Suzuki K Poulose
  2026-03-24 10:39   ` Vincent Donnefort
  2 siblings, 0 replies; 36+ messages in thread
From: Fuad Tabba @ 2026-03-12 17:57 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

Hi Sebastian (and Mostafa),

On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
>
> From: Mostafa Saleh <smostafa@google.com>
>
> Add a function to donate MMIO to the hypervisor so IOMMU hypervisor
> drivers can use that to protect the MMIO of IOMMU.
> The initial attempt to implement this was to have a new flag to
> "___pkvm_host_donate_hyp" to accept MMIO. However that had many problems,
> it was quite intrusive for host/hyp to check/set page state to make it
> aware of MMIO and to encode the state in the page table in that case.
> Which is called in paths that can be sensitive to performance (FFA, VMs..)
>
> As donating MMIO is very rare, and we don’t need to encode the full
> state, it’s reasonable to have a separate function to do this.
> It will init the host s2 page table with an invalid leaf with the owner ID
> to prevent the host from mapping the page on faults.
>
> Also, prevent kvm_pgtable_stage2_unmap() from removing owner ID from
> stage-2 PTEs, as this can be triggered from recycle logic under memory
> pressure. There is no code relying on this, as all ownership changes is
> done via kvm_pgtable_stage2_set_owner()
>
> For error path in IOMMU drivers, add a function to donate MMIO back
> from hyp to host.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
>  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  2 +
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 90 +++++++++++++++++++
>  arch/arm64/kvm/hyp/pgtable.c                  |  9 +-
>  3 files changed, 94 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> index 5f9d56754e39..8b617e6fc0e0 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> @@ -31,6 +31,8 @@ enum pkvm_component_id {
>  };
>
>  extern unsigned long hyp_nr_cpus;
> +int __pkvm_host_donate_hyp_mmio(u64 pfn);
> +int __pkvm_hyp_donate_host_mmio(u64 pfn);
>
>  int __pkvm_prot_finalize(void);
>  int __pkvm_host_share_hyp(u64 pfn);
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 38f66a56a766..0808367c52e5 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -784,6 +784,96 @@ int __pkvm_host_unshare_hyp(u64 pfn)
>         return ret;
>  }
>
> +int __pkvm_host_donate_hyp_mmio(u64 pfn)
> +{
> +       u64 phys = hyp_pfn_to_phys(pfn);
> +       void *virt = __hyp_va(phys);
> +       int ret;
> +       kvm_pte_t pte;

nit: prefer reverse-christmas tree ordering for declarations.


> +
> +       if (addr_is_memory(phys))
> +               return -EINVAL;
> +
> +       host_lock_component();
> +       hyp_lock_component();
> +
> +       ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
> +       if (ret)
> +               goto unlock;
> +
> +       if (pte && !kvm_pte_valid(pte)) {
> +               ret = -EPERM;
> +               goto unlock;
> +       }
> +
> +       ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL);
> +       if (ret)
> +               goto unlock;
> +       if (pte) {
> +               ret = -EBUSY;
> +               goto unlock;
> +       }
> +
> +       ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, PAGE_HYP_DEVICE);
> +       if (ret)
> +               goto unlock;
> +       /*
> +        * We set HYP as the owner of the MMIO pages in the host stage-2, for:
> +        * - host aborts: host_stage2_adjust_range() would fail for invalid non zero PTEs.
> +        * - recycle under memory pressure: host_stage2_unmap_dev_all() would call
> +        *   kvm_pgtable_stage2_unmap() which will not clear non zero invalid ptes (counted).
> +        * - other MMIO donation: Would fail as we check that the PTE is valid or empty.
> +        */

Rephrasing this to see if I got this right. Because MMIO pages are not
tracked in the hyp_vmemmap, the host's stage-2 PTE serves as the sole
record of ownership. We encode PKVM_ID_HYP into an invalid PTE (a
"counted" PTE) to ensure:
- Fault Handling: host_stage2_adjust_range() will explicitly fail if
the host attempts to legitimately map or adjust the protected range.
- Reclaim Protection: Under memory pressure,
host_stage2_unmap_dev_all() will not erase this ownership marker,
because you changed stage2_unmap_walker() to preserve counted PTEs.
- Exclusive Ownership: Subsequent donation attempts (to HYP or a
guest) will safely fail, as donation requires the host PTE to be
either valid (host-owned) or completely empty.

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad









> +       WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> +                               PAGE_SIZE, &host_s2_pool, PKVM_ID_HYP));
> +unlock:
> +       hyp_unlock_component();
> +       host_unlock_component();
> +
> +       return ret;
> +}
> +
> +int __pkvm_hyp_donate_host_mmio(u64 pfn)
> +{
> +       u64 phys = hyp_pfn_to_phys(pfn);
> +       u64 virt = (u64)__hyp_va(phys);
> +       size_t size = PAGE_SIZE;
> +       int ret;
> +       kvm_pte_t pte;
> +
> +       if (addr_is_memory(phys))
> +               return -EINVAL;
> +
> +       host_lock_component();
> +       hyp_lock_component();
> +
> +       ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL);
> +       if (ret)
> +               goto unlock;
> +       if (!kvm_pte_valid(pte)) {
> +               ret = -ENOENT;
> +               goto unlock;
> +       }
> +
> +       ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
> +       if (ret)
> +               goto unlock;
> +
> +       if (FIELD_GET(KVM_INVALID_PTE_OWNER_MASK, pte) != PKVM_ID_HYP) {
> +               ret = -EPERM;
> +               goto unlock;
> +       }
> +
> +       WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, virt, size) != size);
> +       WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> +                               PAGE_SIZE, &host_s2_pool, PKVM_ID_HOST));
> +unlock:
> +       hyp_unlock_component();
> +       host_unlock_component();
> +
> +       return ret;
> +}
> +
>  int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
>  {
>         u64 phys = hyp_pfn_to_phys(pfn);
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 9b480f947da2..d954058e63ff 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1152,13 +1152,8 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
>         kvm_pte_t *childp = NULL;
>         bool need_flush = false;
>
> -       if (!kvm_pte_valid(ctx->old)) {
> -               if (stage2_pte_is_counted(ctx->old)) {
> -                       kvm_clear_pte(ctx->ptep);
> -                       mm_ops->put_page(ctx->ptep);
> -               }
> -               return 0;
> -       }
> +       if (!kvm_pte_valid(ctx->old))
> +               return stage2_pte_is_counted(ctx->old) ? -EPERM : 0;
>
>         if (kvm_pte_table(ctx->old, ctx->level)) {
>                 childp = kvm_pte_follow(ctx->old, mm_ops);
> --
> 2.53.0.473.g4a7958ca14-goog
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/14] KVM: arm64: Track host-unmapped MMIO regions in a static array
  2026-03-10 12:49 ` [PATCH 02/14] KVM: arm64: Track host-unmapped MMIO regions in a static array Sebastian Ene
@ 2026-03-12 19:05   ` Fuad Tabba
  2026-03-24 10:46   ` Vincent Donnefort
  1 sibling, 0 replies; 36+ messages in thread
From: Fuad Tabba @ 2026-03-12 19:05 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

Hi Sebastian,

On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
>
> Introduce a registry to track protected MMIO regions that are unmapped
> from the host stage-2 page tables. These regions are stored in a
> fixed-size array and their ownership is donated to the hypervisor during
> initialization to ensure host-exclusion and persistent tracking.
>
> Signed-off-by: Sebastian Ene <sebastianene@google.com>
> ---
>  arch/arm64/include/asm/kvm_pkvm.h     | 10 ++++++++++
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |  3 +++
>  arch/arm64/kvm/hyp/nvhe/setup.c       | 25 +++++++++++++++++++++++++
>  3 files changed, 38 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
> index 757076ad4ec9..48ec7d519399 100644
> --- a/arch/arm64/include/asm/kvm_pkvm.h
> +++ b/arch/arm64/include/asm/kvm_pkvm.h
> @@ -17,6 +17,16 @@
>
>  #define HYP_MEMBLOCK_REGIONS 128
>
> +#define PKVM_PROTECTED_REGS_NUM        8
> +
> +struct pkvm_protected_reg {
> +       u64 start_pfn;
> +       size_t num_pages;
> +};
> +
> +extern struct pkvm_protected_reg kvm_nvhe_sym(pkvm_protected_regs)[];
> +extern unsigned int kvm_nvhe_sym(num_protected_reg);
> +
>  int pkvm_init_host_vm(struct kvm *kvm);
>  int pkvm_create_hyp_vm(struct kvm *kvm);
>  bool pkvm_hyp_vm_is_created(struct kvm *kvm);
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 0808367c52e5..7c125836b533 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -23,6 +23,9 @@
>
>  struct host_mmu host_mmu;
>
> +struct pkvm_protected_reg pkvm_protected_regs[PKVM_PROTECTED_REGS_NUM];
> +unsigned int num_protected_reg;
> +
>  static struct hyp_pool host_s2_pool;
>
>  static DEFINE_PER_CPU(struct pkvm_hyp_vm *, __current_vm);
> diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
> index 90bd014e952f..ad5b96085e1b 100644
> --- a/arch/arm64/kvm/hyp/nvhe/setup.c
> +++ b/arch/arm64/kvm/hyp/nvhe/setup.c
> @@ -284,6 +284,27 @@ static int fix_hyp_pgtable_refcnt(void)
>                                 &walker);
>  }
>
> +static int unmap_protected_regions(void)

I think the name of this function is confusing. It's not really
unmapping, it's donating the regions to hyp. Maybe
donate_protected_regions() or claim_protected_mmio_regions to reflect
that we are doing an ownership transfer?

> +{
> +       struct pkvm_protected_reg *reg;
> +       int i, ret, j = 0;

Please don't interleave the iterators with ret. Moreover, you can
define ret inside the loop below. Also, you don't need to initialze j.

> +
> +       for (i = 0; i < num_protected_reg; i++) {
> +               reg = &pkvm_protected_regs[i];
> +               for (j = 0; j < reg->num_pages; j++) {
> +                       ret = __pkvm_host_donate_hyp_mmio(reg->start_pfn + j);
> +                       if (ret)
> +                               goto err_setup;
> +               }
> +       }
> +
> +       return 0;
> +err_setup:
> +       for (j = j - 1; j >= 0; j--)
> +               __pkvm_hyp_donate_host_mmio(reg->start_pfn + j);

This rolls back the regions only for the latest `i` iteration, not all
the `i` iterations.

How about (not tested):

+    err_setup:
+       while (j--)
+           __pkvm_hyp_donate_host_mmio(reg->start_pfn + j);
+
+       while (i--) {
+           reg = &pkvm_protected_regs[i];
+           for (j = reg->num_pages - 1; j >= 0; j--)
+               __pkvm_hyp_donate_host_mmio(reg->start_pfn + j);
+       }

Thanks,
/fuad


> +       return ret;
> +}
> +
>  void __noreturn __pkvm_init_finalise(void)
>  {
>         struct kvm_cpu_context *host_ctxt = host_data_ptr(host_ctxt);
> @@ -324,6 +345,10 @@ void __noreturn __pkvm_init_finalise(void)
>         if (ret)
>                 goto out;
>
> +       ret = unmap_protected_regions();
> +       if (ret)
> +               goto out;
> +
>         ret = hyp_ffa_init(ffa_proxy_pages);
>         if (ret)
>                 goto out;
> --
> 2.53.0.473.g4a7958ca14-goog
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/14] KVM: arm64: Support host MMIO trap handlers for unmapped devices
  2026-03-10 12:49 ` [PATCH 03/14] KVM: arm64: Support host MMIO trap handlers for unmapped devices Sebastian Ene
@ 2026-03-13  9:31   ` Fuad Tabba
  2026-03-24 10:59   ` Vincent Donnefort
  1 sibling, 0 replies; 36+ messages in thread
From: Fuad Tabba @ 2026-03-13  9:31 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

Hi Sebastian,

On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
>
> Introduce a mechanism to register callbacks for MMIO accesses to regions
> unmapped from the host Stage-2 page tables.
>
> This infrastructure allows the hypervisor to intercept host accesses to
> protected or emulated devices. When a Stage-2 fault occurs on a
> registered device region, the hypervisor will invoke the associated
> callback to emulate the access.
>
> Signed-off-by: Sebastian Ene <sebastianene@google.com>
> ---
>  arch/arm64/include/asm/kvm_arm.h      |  3 ++
>  arch/arm64/include/asm/kvm_pkvm.h     |  6 ++++
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c | 41 +++++++++++++++++++++++++++
>  arch/arm64/kvm/hyp/nvhe/setup.c       |  3 ++
>  4 files changed, 53 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 3f9233b5a130..8fe1e80ab3f4 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -304,6 +304,9 @@
>
>  /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
>  #define HPFAR_MASK     (~UL(0xf))
> +
> +#define FAR_MASK       GENMASK_ULL(11, 0)
> +
>  /*
>   * We have
>   *     PAR     [PA_Shift - 1   : 12] = PA      [PA_Shift - 1 : 12]
> diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
> index 48ec7d519399..5321ced2f50a 100644
> --- a/arch/arm64/include/asm/kvm_pkvm.h
> +++ b/arch/arm64/include/asm/kvm_pkvm.h
> @@ -19,9 +19,15 @@
>
>  #define PKVM_PROTECTED_REGS_NUM        8
>
> +struct pkvm_protected_reg;
> +
> +typedef void (pkvm_emulate_handler)(struct pkvm_protected_reg *region, u64 offset, bool write,
> +                                   u64 *reg, u8 reg_size);
> +
>  struct pkvm_protected_reg {
>         u64 start_pfn;
>         size_t num_pages;
> +       pkvm_emulate_handler *cb;
>  };
>
>  extern struct pkvm_protected_reg kvm_nvhe_sym(pkvm_protected_regs)[];
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 7c125836b533..f405d2fbd88f 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -13,6 +13,7 @@
>  #include <asm/stage2_pgtable.h>
>
>  #include <hyp/fault.h>
> +#include <hyp/adjust_pc.h>

Please sort includes alphabetically.

>
>  #include <nvhe/gfp.h>
>  #include <nvhe/memory.h>
> @@ -608,6 +609,41 @@ static int host_stage2_idmap(u64 addr)
>         return ret;
>  }
>
> +static bool handle_host_mmio_trap(struct kvm_cpu_context *host_ctxt, u64 esr, u64 addr)
> +{
> +       u64 offset, reg_value = 0, start, end;
> +       u8 reg_size, reg_index;
> +       bool write;
> +       int i;

What do you plan to do if there is no valid syndrome, i.e.,
ESR_EL2.ISV == 0? I am still reviewing, so maybe this is solved in a
future patch, or maybe you know that, in practice, all instructions
would have a valid syndrome. Regardless of which it is, you should
definitely add the following check to _this_ patch (or reconsider the
approach if it is possible to get legit accesses with ESR_EL2.ISV ==
0):

+      if (!(esr & ESR_ELx_ISV))
+              return false;

> +
> +       for (i = 0; i < num_protected_reg; i++) {
> +               start = pkvm_protected_regs[i].start_pfn << PAGE_SHIFT;
> +               end = start + (pkvm_protected_regs[i].num_pages << PAGE_SHIFT);
> +
> +               if (start > addr || addr > end)

Because end is calculated by adding the size, it represents the first
byte after the region, so this should be:
+               if (start > addr || addr >= end)
> +                       continue;

You also need to make sure that the entire access fits within the
protected region, to avoid a malicious or misaligned cross-boundary
access, i.e.:

+                if (addr + reg_size > end)
+                        return false;


> +               reg_size = BIT((esr & ESR_ELx_SAS) >> ESR_ELx_SAS_SHIFT);
> +               reg_index = (esr & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
> +               write = (esr & ESR_ELx_WNR) == ESR_ELx_WNR;
> +               offset = addr - start;
> +
> +               if (write)
> +                       reg_value = host_ctxt->regs.regs[reg_index];

You need to handle the zero register (index 31) for writes, e.g.:
+                       reg_value = (reg_index == 31) ? 0 :
host_ctxt->regs.regs[reg_index];

> +
> +               pkvm_protected_regs[i].cb(&pkvm_protected_regs[i], offset, write,
> +                                         &reg_value, reg_size);
> +
> +               if (!write)
> +                       host_ctxt->regs.regs[reg_index] = reg_value;

and for reads:
+               if (!write & reg_index != 31)

Cheers,
/fuad

> +
> +               kvm_skip_host_instr();
> +               return true;
> +       }
> +
> +       return false;
> +}
> +
>  void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
>  {
>         struct kvm_vcpu_fault_info fault;
> @@ -630,6 +666,11 @@ void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
>          */
>         BUG_ON(!(fault.hpfar_el2 & HPFAR_EL2_NS));
>         addr = FIELD_GET(HPFAR_EL2_FIPA, fault.hpfar_el2) << 12;
> +       addr |= fault.far_el2 & FAR_MASK;
> +
> +       if (ESR_ELx_EC(esr) == ESR_ELx_EC_DABT_LOW && !addr_is_memory(addr) &&
> +           handle_host_mmio_trap(host_ctxt, esr, addr))
> +               return;
>
>         ret = host_stage2_idmap(addr);
>         BUG_ON(ret && ret != -EAGAIN);
> diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
> index ad5b96085e1b..f91dfebe9980 100644
> --- a/arch/arm64/kvm/hyp/nvhe/setup.c
> +++ b/arch/arm64/kvm/hyp/nvhe/setup.c
> @@ -296,6 +296,9 @@ static int unmap_protected_regions(void)
>                         if (ret)
>                                 goto err_setup;
>                 }
> +
> +               if (reg->cb)
> +                       reg->cb = kern_hyp_va(reg->cb);
>         }
>
>         return 0;
> --
> 2.53.0.473.g4a7958ca14-goog
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/14] KVM: arm64: Mediate host access to GIC/ITS MMIO via unmapping
  2026-03-10 12:49 ` [PATCH 04/14] KVM: arm64: Mediate host access to GIC/ITS MMIO via unmapping Sebastian Ene
@ 2026-03-13  9:58   ` Fuad Tabba
  0 siblings, 0 replies; 36+ messages in thread
From: Fuad Tabba @ 2026-03-13  9:58 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

Hi Sebastian,

On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
>
> Unmap the ITS MMIO region from the host address space to enforce
> hypervisor mediation.
> Identify the ITS base address from the device tree and store it in a
> protected region. A callback is registered to handle host accesses to
> this region; currently, the handler simply forwards all MMIO requests
> to the physical hardware. This provides the infrastructure for future
> hardware state validation without changing current behavior.
>
> Signed-off-by: Sebastian Ene <sebastianene@google.com>
> ---
>  arch/arm64/include/asm/kvm_pkvm.h     |  2 ++
>  arch/arm64/kvm/hyp/nvhe/Makefile      |  3 ++-
>  arch/arm64/kvm/hyp/nvhe/its_emulate.c | 23 ++++++++++++++++
>  arch/arm64/kvm/pkvm.c                 | 38 +++++++++++++++++++++++++++
>  4 files changed, 65 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/kvm/hyp/nvhe/its_emulate.c
>
> diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
> index 5321ced2f50a..ef00c1bf7d00 100644
> --- a/arch/arm64/include/asm/kvm_pkvm.h
> +++ b/arch/arm64/include/asm/kvm_pkvm.h
> @@ -32,6 +32,8 @@ struct pkvm_protected_reg {
>
>  extern struct pkvm_protected_reg kvm_nvhe_sym(pkvm_protected_regs)[];
>  extern unsigned int kvm_nvhe_sym(num_protected_reg);
> +extern void kvm_nvhe_sym(pkvm_handle_forward_req)(struct pkvm_protected_reg *region, u64 offset,
> +                                                 bool write, u64 *reg, u8 reg_size);
>
>  int pkvm_init_host_vm(struct kvm *kvm);
>  int pkvm_create_hyp_vm(struct kvm *kvm);
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> index a244ec25f8c5..eb43269fbac2 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -24,7 +24,8 @@ CFLAGS_switch.nvhe.o += -Wno-override-init
>
>  hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
>          hyp-main.o hyp-smp.o psci-relay.o early_alloc.o page_alloc.o \
> -        cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o
> +        cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o \
> +        its_emulate.o
>  hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
>          ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
>  hyp-obj-y += ../../../kernel/smccc-call.o
> diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> new file mode 100644
> index 000000000000..0eecbb011898
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> @@ -0,0 +1,23 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <asm/kvm_pkvm.h>
> +#include <nvhe/mem_protect.h>
> +
> +
> +void pkvm_handle_forward_req(struct pkvm_protected_reg *region, u64 offset, bool write,
> +                            u64 *reg, u8 reg_size)
> +{
> +       void __iomem *addr = __hyp_va((region->start_pfn << PAGE_SHIFT) + offset);

Better use the macros to do this:
+       void __iomem *addr = __hyp_va(PFN_PHYS(region->start_pfn) + offset);

Also, considering all the potential issues with hyp_va, it might be
better to store the virtual address in pkvm_protected_regs (void
__iomem *base_va) assigned to the MMIO region at the time it was
successfully mapped into EL2, rather than recalculating it on the fly.

> +
> +       if (reg_size == sizeof(u32)) {
> +               if (!write)
> +                       *reg = readl_relaxed(addr);
> +               else
> +                       writel_relaxed(*reg, addr);
> +       } else if (reg_size == sizeof(u64)) {
> +               if (!write)
> +                       *reg = readq_relaxed(addr);
> +               else
> +                       writeq_relaxed(*reg, addr);
> +       }

The spec permits 8-bit, 16-bit, 32-bit, and 64-bit accesses depending
on the specific register. This only checks for sizeof(u32) and
sizeof(u64). You should either handle them, or you intentionally
wanted to block 8-bit/16-bit accesses, then you must explicitly inject
a synchronous abort back to the host.

Also, I think that using `readl_relaxed` and `writel_relaxed` is
technically correct for pure passthrough, as it assumes the host
driver issued the necessary barriers before triggering the trap.
However, with memory barriers I am always a bit wary, and I'd rather
we check with someone to confirm whether this is ok, i.e., Will
Deacon.

> +}
> diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
> index d7a0f69a9982..a766be6de735 100644
> --- a/arch/arm64/kvm/pkvm.c
> +++ b/arch/arm64/kvm/pkvm.c
> @@ -11,6 +11,9 @@
>  #include <asm/kvm_mmu.h>
>  #include <linux/memblock.h>
>  #include <linux/mutex.h>
> +#include <linux/of_address.h>
> +#include <linux/of_reserved_mem.h>
> +#include <linux/platform_device.h>
>
>  #include <asm/kvm_pkvm.h>
>
> @@ -18,6 +21,7 @@
>
>  DEFINE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
>
> +static struct pkvm_protected_reg *pkvm_protected_regs = kvm_nvhe_sym(pkvm_protected_regs);
>  static struct memblock_region *hyp_memory = kvm_nvhe_sym(hyp_memory);
>  static unsigned int *hyp_memblock_nr_ptr = &kvm_nvhe_sym(hyp_memblock_nr);
>
> @@ -39,6 +43,34 @@ static int __init register_memblock_regions(void)
>         return 0;
>  }
>
> +static int __init register_protected_regions(void)
> +{
> +       int i = 0, ret;
> +       struct device_node *np;
> +       struct resource res;

Prefer reverse Christmas tree declaration order.

> +
> +       for_each_compatible_node(np, NULL, "arm,gic-v3-its") {
> +               ret = of_address_to_resource(np, i, &res);
> +               if (ret)
> +                       return ret;
> +
> +               if (i >= PKVM_PROTECTED_REGS_NUM)
> +                       return -ENOMEM;
> +
> +               if (!PAGE_ALIGNED(res.start) || !PAGE_ALIGNED(resource_size(&res)))
> +                       return -EINVAL;
> +
> +               pkvm_protected_regs[i].start_pfn = res.start >> PAGE_SHIFT;

and this:
... PHYS_PFN(res.start);


> +               pkvm_protected_regs[i].num_pages = resource_size(&res) >> PAGE_SHIFT;

and this:
... PFN_DOWN(resource_size(&res));

Cheers,
/fuad

> +               pkvm_protected_regs[i].cb = lm_alias(&kvm_nvhe_sym(pkvm_handle_forward_req));
> +               i++;
> +       }
> +
> +       kvm_nvhe_sym(num_protected_reg) = i;
> +
> +       return 0;
> +}
> +
>  void __init kvm_hyp_reserve(void)
>  {
>         u64 hyp_mem_pages = 0;
> @@ -57,6 +89,12 @@ void __init kvm_hyp_reserve(void)
>                 return;
>         }
>
> +       ret = register_protected_regions();
> +       if (ret) {
> +               kvm_err("Failed to register protected reg: %d\n", ret);
> +               return;
> +       }
> +
>         hyp_mem_pages += hyp_s1_pgtable_pages();
>         hyp_mem_pages += host_s2_pgtable_pages();
>         hyp_mem_pages += hyp_vm_table_pages();
> --
> 2.53.0.473.g4a7958ca14-goog
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 01/14] KVM: arm64: Donate MMIO to the hypervisor
  2026-03-10 12:49 ` [PATCH 01/14] KVM: arm64: Donate MMIO to the hypervisor Sebastian Ene
  2026-03-12 17:57   ` Fuad Tabba
@ 2026-03-13 10:40   ` Suzuki K Poulose
  2026-03-24 10:39   ` Vincent Donnefort
  2 siblings, 0 replies; 36+ messages in thread
From: Suzuki K Poulose @ 2026-03-13 10:40 UTC (permalink / raw)
  To: Sebastian Ene, alexandru.elisei, kvmarm, linux-arm-kernel,
	linux-kernel, android-kvm
  Cc: catalin.marinas, dbrazdil, joey.gouly, kees, mark.rutland, maz,
	oupton, perlarsen, qperret, rananta, smostafa, tabba, tglx,
	vdonnefort, bgrzesik, will, yuzenghui

On 10/03/2026 12:49, Sebastian Ene wrote:
> From: Mostafa Saleh <smostafa@google.com>
> 
> Add a function to donate MMIO to the hypervisor so IOMMU hypervisor
> drivers can use that to protect the MMIO of IOMMU.
> The initial attempt to implement this was to have a new flag to
> "___pkvm_host_donate_hyp" to accept MMIO. However that had many problems,
> it was quite intrusive for host/hyp to check/set page state to make it
> aware of MMIO and to encode the state in the page table in that case.
> Which is called in paths that can be sensitive to performance (FFA, VMs..)
> 
> As donating MMIO is very rare, and we don’t need to encode the full
> state, it’s reasonable to have a separate function to do this.
> It will init the host s2 page table with an invalid leaf with the owner ID
> to prevent the host from mapping the page on faults.
> 
> Also, prevent kvm_pgtable_stage2_unmap() from removing owner ID from
> stage-2 PTEs, as this can be triggered from recycle logic under memory
> pressure. There is no code relying on this, as all ownership changes is
> done via kvm_pgtable_stage2_set_owner()
> 
> For error path in IOMMU drivers, add a function to donate MMIO back
> from hyp to host.
> 
> Signed-off-by: Mostafa Saleh <smostafa@google.com>

nit: Your Signed-off-by: is missing. Even if you haven't made any
changes, when you are sending something, you need your S-o-B.

Suzuki


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 05/14] irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege
  2026-03-10 12:49 ` [PATCH 05/14] irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege Sebastian Ene
@ 2026-03-13 11:26   ` Fuad Tabba
  2026-03-13 13:10     ` Fuad Tabba
  2026-03-20 15:11     ` Sebastian Ene
  0 siblings, 2 replies; 36+ messages in thread
From: Fuad Tabba @ 2026-03-13 11:26 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

Hi Sebastian,

On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
>
> Expose two helper functions to support emulated ITS in the hypervisor.
> These allow the KVM layer to notify the driver when hypervisor
> initialization is complete.
> The caller is expected to use the functions as follows:
> 1. its_start_deprivilege(): Acquire the ITS locks.
> 2. on_each_cpu(_kvm_host_prot_finalize, ...): Finalizes pKVM init
> 3. its_end_deprivilege(): Shadow the ITS structures, invoke the KVM
>    callback, and release locks.
> Specifically, this shadows the ITS command queue and the 1st level
> indirect tables. These shadow buffers will be used by the driver after
> host deprivilege, while the hypervisor unmaps and takes ownership of the
> original structures.

Just a note again on preferring not to use the "shadow" terminology. I
thought about it a bit more, since these are not at the host, perhaps
"proxy" is a better term, to convey that the host is writing to a
middle-man buffer.

Another term is "staging," which is common in DMA: the host "stages"
the commands here, and EL2 "commits" them to the hardware.

>
> Signed-off-by: Sebastian Ene <sebastianene@google.com>
> ---
>  drivers/irqchip/irq-gic-v3-its.c   | 165 +++++++++++++++++++++++++++--
>  include/linux/irqchip/arm-gic-v3.h |  24 +++++
>  2 files changed, 178 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index 291d7668cc8d..278dbc56f962 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -78,17 +78,6 @@ struct its_collection {
>         u16                     col_id;
>  };
>
> -/*
> - * The ITS_BASER structure - contains memory information, cached
> - * value of BASER register configuration and ITS page size.
> - */
> -struct its_baser {
> -       void            *base;
> -       u64             val;
> -       u32             order;
> -       u32             psz;
> -};
> -
>  struct its_device;
>
>  /*
> @@ -5232,6 +5221,160 @@ static int __init its_compute_its_list_map(struct its_node *its)
>         return its_number;
>  }
>
> +static void its_free_shadow_tables(struct its_shadow_tables *shadow)
> +{
> +       int i;
> +
> +       if (shadow->cmd_shadow)
> +               its_free_pages(shadow->cmd_shadow, get_order(ITS_CMD_QUEUE_SZ));
> +
> +       for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> +               if (!shadow->tables[i].shadow)
> +                       continue;
> +
> +               its_free_pages(shadow->tables[i].shadow, 0);
> +       }
> +
> +       its_free_pages(shadow, 0);
> +}
> +
> +static struct its_shadow_tables *its_get_shadow_tables(struct its_node *its)
> +{
> +       void *page;
> +       struct its_shadow_tables *shadow;
> +       int i;

Prefer RCT declarations.

> +
> +       page = its_alloc_pages_node(its->numa_node, GFP_KERNEL | __GFP_ZERO, 0);

This is called with the raw_spin_lock_irqsave held, and GFP_KERNEL can
sleep. You have one of two options, either use GFP_ATOMIC, but that's
more likely to fail. The alternative is to move this to
its_start_deprivilege(), before any lock is held.

> +       if (!page)
> +               return NULL;
> +
> +       shadow = (void *)page_address(page);
> +       page = its_alloc_pages_node(its->numa_node,
> +                                   GFP_KERNEL | __GFP_ZERO,
> +                                   get_order(ITS_CMD_QUEUE_SZ));
> +       if (!page)
> +               goto err_alloc_shadow;
> +
> +       shadow->cmd_shadow = page_address(page);
> +       shadow->cmdq_len = ITS_CMD_QUEUE_SZ;
> +       shadow->cmd_original = its->cmd_base;
> +
> +       memcpy(shadow->tables, its->tables, sizeof(struct its_baser) * GITS_BASER_NR_REGS);
> +
> +       for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> +               if (!(shadow->tables[i].val & GITS_BASER_VALID))
> +                       continue;
> +
> +               if (!(shadow->tables[i].val & GITS_BASER_INDIRECT))
> +                       continue;
> +
> +               page = its_alloc_pages_node(its->numa_node,
> +                                           GFP_KERNEL | __GFP_ZERO,
> +                                           shadow->tables[i].order);
> +               if (!page)
> +                       goto err_alloc_shadow;
> +
> +               shadow->tables[i].shadow = page_address(page);
> +
> +               memcpy(shadow->tables[i].shadow, shadow->tables[i].base,
> +                      PAGE_ORDER_TO_SIZE(shadow->tables[i].order));
> +       }
> +
> +       return shadow;
> +
> +err_alloc_shadow:
> +       its_free_shadow_tables(shadow);
> +       return NULL;
> +}
> +
> +void *its_start_depriviledge(void)

Typo here and elsewhere in this patch:

s/depriviledge/deprivilege/g

This is particularly important because it also appears in exported
symbols as well (later in this patch).

> +{
> +       struct its_node *its;
> +       int num_nodes = 0, i = 0;
> +       unsigned long *flags;

RCT declaration order, and please untagle them, i.e., don't declare
the num_nodes and the iterator in the same line.

> +
> +       raw_spin_lock(&its_lock);
> +       list_for_each_entry(its, &its_nodes, entry) {
> +               num_nodes++;
> +       }
> +
> +       flags = kzalloc(num_nodes * sizeof(unsigned long), GFP_KERNEL_ACCOUNT);

Same as the other allocation. This can sleep. I think that for this as
well, it's better to move it before lock acquisition. Even if you use
a different allocator, it's still better to keep the critical section
short.

> +       if (!flags) {
> +               raw_spin_unlock(&its_lock);
> +               return NULL;
> +       }
> +
> +       list_for_each_entry(its, &its_nodes, entry) {
> +               raw_spin_lock_irqsave(&its->lock, flags[i++]);
> +       }
> +
> +       return flags;
> +}
> +EXPORT_SYMBOL_GPL(its_start_depriviledge);
> +
> +static int its_switch_to_shadow_locked(struct its_node *its, its_init_emulate init_emulate_cb)
> +{
> +       struct its_shadow_tables *hyp_shadow, shadow;
> +       int i, ret;
> +       u64 baser, baser_phys;
> +
> +       hyp_shadow = its_get_shadow_tables(its);
> +       if (!hyp_shadow)
> +               return -ENOMEM;
> +
> +       memcpy(&shadow, hyp_shadow, sizeof(shadow));
> +       ret = init_emulate_cb(its->phys_base, hyp_shadow);

You are performing this callback with the lock held and local
interrupts disabled. The hvc call is byitself expensive, especially
since it's going to do stage-2 manipulations.

You should decouple the synchronous pointer swapping (which must be
locked) from the hypervisor notification (which can be done outside
the lock). Instead of executing the callback inside the critical
section, its_end_deprivilege should:
- Lock everything.
- Perform the pointer swaps in the host driver structures.
- Save the hyp_shadow pointers to a temporary array.
- Unlock everything.
- Loop through the temporary array and call the KVM cb to notify EL2.

You should probably split this patch into two. The first patch would
implement the freeze/unfreeze locking mechanism, and the second would
swap the driver's internal memory pointers to the shadow structures,
and invoke the KVM callback to lock down the real hardware.

Cheers,
/fuad

> +       if (ret) {
> +               its_free_shadow_tables(hyp_shadow);
> +               return ret;
> +       }
> +
> +       /* Switch the driver command queue to use the shadow and save the original */
> +       its->cmd_write = (its->cmd_write - its->cmd_base) +
> +               (struct its_cmd_block *)shadow.cmd_shadow;
> +       its->cmd_base = shadow.cmd_shadow;
> +
> +       /* Shadow the first level of the indirect tables */
> +       for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> +               baser = shadow.tables[i].val;
> +
> +               if (!shadow.tables[i].shadow)
> +                       continue;
> +
> +               baser_phys = virt_to_phys(shadow.tables[i].shadow);
> +               if (IS_ENABLED(CONFIG_ARM64_64K_PAGES) && (baser_phys >> 48))
> +                       baser_phys = GITS_BASER_PHYS_52_to_48(baser_phys);
> +
> +               its->tables[i].val &= ~GENMASK(47, 12);
> +               its->tables[i].val |= baser_phys;
> +               its->tables[i].base = shadow.tables[i].shadow;
> +       }
> +
> +       return 0;
> +}
> +
> +int its_end_depriviledge(int ret_pkvm_finalize, unsigned long *flags, its_init_emulate cb)
> +{
> +       struct its_node *its;
> +       int i = 0, ret = 0;
> +
> +       if (!flags || !cb)
> +               return -EINVAL;
> +
> +       list_for_each_entry(its, &its_nodes, entry) {
> +               if (!ret_pkvm_finalize && !ret)
> +                       ret = its_switch_to_shadow_locked(its, cb);
> +
> +               raw_spin_unlock_irqrestore(&its->lock, flags[i++]);
> +       }
> +
> +       kfree(flags);
> +       raw_spin_unlock(&its_lock);
> +
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(its_end_depriviledge);
> +
>  static int __init its_probe_one(struct its_node *its)
>  {
>         u64 baser, tmp;
> diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
> index 0225121f3013..40457a4375d4 100644
> --- a/include/linux/irqchip/arm-gic-v3.h
> +++ b/include/linux/irqchip/arm-gic-v3.h
> @@ -657,6 +657,30 @@ static inline bool gic_enable_sre(void)
>         return !!(val & ICC_SRE_EL1_SRE);
>  }
>
> +/*
> + * The ITS_BASER structure - contains memory information, cached
> + * value of BASER register configuration and ITS page size.
> + */
> +struct its_baser {
> +       void            *base;
> +       void            *shadow;
> +       u64             val;
> +       u32             order;
> +       u32             psz;
> +};
> +
> +struct its_shadow_tables {
> +       struct its_baser        tables[GITS_BASER_NR_REGS];
> +       void                    *cmd_shadow;
> +       void                    *cmd_original;
> +       size_t                  cmdq_len;
> +};
> +
> +typedef int (*its_init_emulate)(phys_addr_t its_phys_base, struct its_shadow_tables *shadow);
> +
> +void *its_start_depriviledge(void);
> +int its_end_depriviledge(int ret, unsigned long *flags, its_init_emulate cb);
> +
>  #endif
>
>  #endif
> --
> 2.53.0.473.g4a7958ca14-goog
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 05/14] irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege
  2026-03-13 11:26   ` Fuad Tabba
@ 2026-03-13 13:10     ` Fuad Tabba
  2026-03-20 15:11     ` Sebastian Ene
  1 sibling, 0 replies; 36+ messages in thread
From: Fuad Tabba @ 2026-03-13 13:10 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

On Fri, 13 Mar 2026 at 11:26, Fuad Tabba <tabba@google.com> wrote:
>
> Hi Sebastian,
>
> On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
> >
> > Expose two helper functions to support emulated ITS in the hypervisor.
> > These allow the KVM layer to notify the driver when hypervisor
> > initialization is complete.
> > The caller is expected to use the functions as follows:
> > 1. its_start_deprivilege(): Acquire the ITS locks.
> > 2. on_each_cpu(_kvm_host_prot_finalize, ...): Finalizes pKVM init
> > 3. its_end_deprivilege(): Shadow the ITS structures, invoke the KVM
> >    callback, and release locks.
> > Specifically, this shadows the ITS command queue and the 1st level
> > indirect tables. These shadow buffers will be used by the driver after
> > host deprivilege, while the hypervisor unmaps and takes ownership of the
> > original structures.
>
> Just a note again on preferring not to use the "shadow" terminology. I
> thought about it a bit more, since these are not at the host, perhaps
> "proxy" is a better term, to convey that the host is writing to a
> middle-man buffer.

I meant to say that, since these are not at the "hypervisor", or since
they *are* allocated in host.....

>
> Another term is "staging," which is common in DMA: the host "stages"
> the commands here, and EL2 "commits" them to the hardware.
>
> >
> > Signed-off-by: Sebastian Ene <sebastianene@google.com>
> > ---
> >  drivers/irqchip/irq-gic-v3-its.c   | 165 +++++++++++++++++++++++++++--
> >  include/linux/irqchip/arm-gic-v3.h |  24 +++++
> >  2 files changed, 178 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> > index 291d7668cc8d..278dbc56f962 100644
> > --- a/drivers/irqchip/irq-gic-v3-its.c
> > +++ b/drivers/irqchip/irq-gic-v3-its.c
> > @@ -78,17 +78,6 @@ struct its_collection {
> >         u16                     col_id;
> >  };
> >
> > -/*
> > - * The ITS_BASER structure - contains memory information, cached
> > - * value of BASER register configuration and ITS page size.
> > - */
> > -struct its_baser {
> > -       void            *base;
> > -       u64             val;
> > -       u32             order;
> > -       u32             psz;
> > -};
> > -
> >  struct its_device;
> >
> >  /*
> > @@ -5232,6 +5221,160 @@ static int __init its_compute_its_list_map(struct its_node *its)
> >         return its_number;
> >  }
> >
> > +static void its_free_shadow_tables(struct its_shadow_tables *shadow)
> > +{
> > +       int i;
> > +
> > +       if (shadow->cmd_shadow)
> > +               its_free_pages(shadow->cmd_shadow, get_order(ITS_CMD_QUEUE_SZ));
> > +
> > +       for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> > +               if (!shadow->tables[i].shadow)
> > +                       continue;
> > +
> > +               its_free_pages(shadow->tables[i].shadow, 0);
> > +       }
> > +
> > +       its_free_pages(shadow, 0);
> > +}
> > +
> > +static struct its_shadow_tables *its_get_shadow_tables(struct its_node *its)
> > +{
> > +       void *page;
> > +       struct its_shadow_tables *shadow;
> > +       int i;
>
> Prefer RCT declarations.
>
> > +
> > +       page = its_alloc_pages_node(its->numa_node, GFP_KERNEL | __GFP_ZERO, 0);
>
> This is called with the raw_spin_lock_irqsave held, and GFP_KERNEL can
> sleep. You have one of two options, either use GFP_ATOMIC, but that's
> more likely to fail. The alternative is to move this to
> its_start_deprivilege(), before any lock is held.
>
> > +       if (!page)
> > +               return NULL;
> > +
> > +       shadow = (void *)page_address(page);
> > +       page = its_alloc_pages_node(its->numa_node,
> > +                                   GFP_KERNEL | __GFP_ZERO,
> > +                                   get_order(ITS_CMD_QUEUE_SZ));
> > +       if (!page)
> > +               goto err_alloc_shadow;
> > +
> > +       shadow->cmd_shadow = page_address(page);
> > +       shadow->cmdq_len = ITS_CMD_QUEUE_SZ;
> > +       shadow->cmd_original = its->cmd_base;
> > +
> > +       memcpy(shadow->tables, its->tables, sizeof(struct its_baser) * GITS_BASER_NR_REGS);
> > +
> > +       for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> > +               if (!(shadow->tables[i].val & GITS_BASER_VALID))
> > +                       continue;
> > +
> > +               if (!(shadow->tables[i].val & GITS_BASER_INDIRECT))
> > +                       continue;
> > +
> > +               page = its_alloc_pages_node(its->numa_node,
> > +                                           GFP_KERNEL | __GFP_ZERO,
> > +                                           shadow->tables[i].order);
> > +               if (!page)
> > +                       goto err_alloc_shadow;
> > +
> > +               shadow->tables[i].shadow = page_address(page);
> > +
> > +               memcpy(shadow->tables[i].shadow, shadow->tables[i].base,
> > +                      PAGE_ORDER_TO_SIZE(shadow->tables[i].order));
> > +       }
> > +
> > +       return shadow;
> > +
> > +err_alloc_shadow:
> > +       its_free_shadow_tables(shadow);
> > +       return NULL;
> > +}
> > +
> > +void *its_start_depriviledge(void)
>
> Typo here and elsewhere in this patch:
>
> s/depriviledge/deprivilege/g
>
> This is particularly important because it also appears in exported
> symbols as well (later in this patch).
>
> > +{
> > +       struct its_node *its;
> > +       int num_nodes = 0, i = 0;
> > +       unsigned long *flags;
>
> RCT declaration order, and please untagle them, i.e., don't declare
> the num_nodes and the iterator in the same line.
>
> > +
> > +       raw_spin_lock(&its_lock);
> > +       list_for_each_entry(its, &its_nodes, entry) {
> > +               num_nodes++;
> > +       }
> > +
> > +       flags = kzalloc(num_nodes * sizeof(unsigned long), GFP_KERNEL_ACCOUNT);
>
> Same as the other allocation. This can sleep. I think that for this as
> well, it's better to move it before lock acquisition. Even if you use
> a different allocator, it's still better to keep the critical section
> short.
>
> > +       if (!flags) {
> > +               raw_spin_unlock(&its_lock);
> > +               return NULL;
> > +       }
> > +
> > +       list_for_each_entry(its, &its_nodes, entry) {
> > +               raw_spin_lock_irqsave(&its->lock, flags[i++]);
> > +       }
> > +
> > +       return flags;
> > +}
> > +EXPORT_SYMBOL_GPL(its_start_depriviledge);
> > +
> > +static int its_switch_to_shadow_locked(struct its_node *its, its_init_emulate init_emulate_cb)
> > +{
> > +       struct its_shadow_tables *hyp_shadow, shadow;
> > +       int i, ret;
> > +       u64 baser, baser_phys;
> > +
> > +       hyp_shadow = its_get_shadow_tables(its);
> > +       if (!hyp_shadow)
> > +               return -ENOMEM;
> > +
> > +       memcpy(&shadow, hyp_shadow, sizeof(shadow));
> > +       ret = init_emulate_cb(its->phys_base, hyp_shadow);
>
> You are performing this callback with the lock held and local
> interrupts disabled. The hvc call is byitself expensive, especially
> since it's going to do stage-2 manipulations.
>
> You should decouple the synchronous pointer swapping (which must be
> locked) from the hypervisor notification (which can be done outside
> the lock). Instead of executing the callback inside the critical
> section, its_end_deprivilege should:
> - Lock everything.
> - Perform the pointer swaps in the host driver structures.
> - Save the hyp_shadow pointers to a temporary array.
> - Unlock everything.
> - Loop through the temporary array and call the KVM cb to notify EL2.
>
> You should probably split this patch into two. The first patch would
> implement the freeze/unfreeze locking mechanism, and the second would
> swap the driver's internal memory pointers to the shadow structures,
> and invoke the KVM callback to lock down the real hardware.
>
> Cheers,
> /fuad
>
> > +       if (ret) {
> > +               its_free_shadow_tables(hyp_shadow);
> > +               return ret;
> > +       }
> > +
> > +       /* Switch the driver command queue to use the shadow and save the original */
> > +       its->cmd_write = (its->cmd_write - its->cmd_base) +
> > +               (struct its_cmd_block *)shadow.cmd_shadow;
> > +       its->cmd_base = shadow.cmd_shadow;
> > +
> > +       /* Shadow the first level of the indirect tables */
> > +       for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> > +               baser = shadow.tables[i].val;
> > +
> > +               if (!shadow.tables[i].shadow)
> > +                       continue;
> > +
> > +               baser_phys = virt_to_phys(shadow.tables[i].shadow);
> > +               if (IS_ENABLED(CONFIG_ARM64_64K_PAGES) && (baser_phys >> 48))
> > +                       baser_phys = GITS_BASER_PHYS_52_to_48(baser_phys);
> > +
> > +               its->tables[i].val &= ~GENMASK(47, 12);
> > +               its->tables[i].val |= baser_phys;
> > +               its->tables[i].base = shadow.tables[i].shadow;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +int its_end_depriviledge(int ret_pkvm_finalize, unsigned long *flags, its_init_emulate cb)
> > +{
> > +       struct its_node *its;
> > +       int i = 0, ret = 0;
> > +
> > +       if (!flags || !cb)
> > +               return -EINVAL;
> > +
> > +       list_for_each_entry(its, &its_nodes, entry) {
> > +               if (!ret_pkvm_finalize && !ret)
> > +                       ret = its_switch_to_shadow_locked(its, cb);
> > +
> > +               raw_spin_unlock_irqrestore(&its->lock, flags[i++]);
> > +       }
> > +
> > +       kfree(flags);
> > +       raw_spin_unlock(&its_lock);
> > +
> > +       return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(its_end_depriviledge);
> > +
> >  static int __init its_probe_one(struct its_node *its)
> >  {
> >         u64 baser, tmp;
> > diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
> > index 0225121f3013..40457a4375d4 100644
> > --- a/include/linux/irqchip/arm-gic-v3.h
> > +++ b/include/linux/irqchip/arm-gic-v3.h
> > @@ -657,6 +657,30 @@ static inline bool gic_enable_sre(void)
> >         return !!(val & ICC_SRE_EL1_SRE);
> >  }
> >
> > +/*
> > + * The ITS_BASER structure - contains memory information, cached
> > + * value of BASER register configuration and ITS page size.
> > + */
> > +struct its_baser {
> > +       void            *base;
> > +       void            *shadow;
> > +       u64             val;
> > +       u32             order;
> > +       u32             psz;
> > +};
> > +
> > +struct its_shadow_tables {
> > +       struct its_baser        tables[GITS_BASER_NR_REGS];
> > +       void                    *cmd_shadow;
> > +       void                    *cmd_original;
> > +       size_t                  cmdq_len;
> > +};
> > +
> > +typedef int (*its_init_emulate)(phys_addr_t its_phys_base, struct its_shadow_tables *shadow);
> > +
> > +void *its_start_depriviledge(void);
> > +int its_end_depriviledge(int ret, unsigned long *flags, its_init_emulate cb);
> > +
> >  #endif
> >
> >  #endif
> > --
> > 2.53.0.473.g4a7958ca14-goog
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 00/14] KVM: ITS hardening for pKVM
  2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
                   ` (14 preceding siblings ...)
  2026-03-12 17:56 ` [RFC PATCH 00/14] KVM: ITS hardening for pKVM Fuad Tabba
@ 2026-03-13 15:18 ` Mostafa Saleh
  2026-03-15 13:24   ` Fuad Tabba
  2026-03-25 16:26   ` Sebastian Ene
  15 siblings, 2 replies; 36+ messages in thread
From: Mostafa Saleh @ 2026-03-13 15:18 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

Hi Seb,

On Tue, Mar 10, 2026 at 12:49:19PM +0000, Sebastian Ene wrote:
> This series introduces the necessary machinery to perform trap & emulate
> on device access in pKVM. Furthermore, it hardens the GIC/ITS controller to
> prevent an attacker from tampering with the hypervisor protected memory
> through this device. 
> 
> In pKVM, the host kernel is initially trusted to manage the boot process but
> its permissions are revoked once KVM initializes. The GIC/ITS device is
> configured before the kernel deprivileges itself. Once the hypervisor
> becomes available, sanitize the accesses to the ITS controller by
> trapping and emulating certain registers and by shadowing some memory
> structures used by the ITS.
> 
> This is required because the ITS can issue transactions on the memory
> bus *directly*, without having an SMMU in front of it, which makes it
> an interesting target for crossing the hypervisor-established privilege
> boundary.
> 
> 
> Patch overview
> ==============
> 
> The first patch is re-used from Mostafa's series[1] which brings SMMU-v3
> support to pKVM.
> 
> [1] https://lore.kernel.org/linux-iommu/20251117184815.1027271-1-smostafa@google.com/#r
> 
> Some of the infrastructure built in that series might intersect and we
> agreed to converge on some changes. The patches [1 - 3] allow unmapping
> devices from the host address space and installing a handler to trap
> accesses from the host. While executing in the handler, enough context
> has to be given from mem-abort to perform the emulation of the device
> such as: the offset, the access size, direction of the write and private
> related data specific to the device. 
> The unmapping of the device from the host address space is performed
> after the host deprivilege (during _kvm_host_prot_finalize call).
> 
> The 4th patch looks up the ITS node from the device tree and adds it to
> an array of unmapped devices. It install a handler that forwards all the
> MMIO request to mediate the host access inside the emulation layer and
> to prevent breaking ITS functionality. 
> 
> The 5th patch changes the GIC/ITS driver to exposes two new methods
> which will be called from the KVM layer to setup the shadow state and
> to take the appropriate locks. This one is the most intrusive as it
> changes the current GIC/ITS driver. I tried to avoid creating a
> dependency with KVM to keep the GIC driver agnostic of the virtualization
> layer but I am happy to explore other options as well. 
> To avoid re-programming the ITS device with new shadow structures after
> pKVM is ready, I exposed two functions to change the
> pointers inside the driver for the following structures:
> - the command queue points to a newly allocated queue
> - the GITS_BASER<n> tables configured with an indirect layout have the
>   first layer shadowed and they point to a new memory region
> 
> Patch 6 adds the entry point into the emulation setup and sets up the
> shadow command queue. It adds some helper macros to define the offset
> register and the associate action that we want to execute in the
> emulation. It also unmaps the state passed from the host kernel
> to prevent it from playing nasty games later on. The patch
> traps accesses to CWRITER register and copies the commands from the
> host command queue to the shadow command queue. 
> 
> Patch 7 prevents the host from directly accessing the first layer of the
> indirect tables held in GITS_BASER<n>. It also prevents the host from
> directly accesssing the last layer of the Device Table (since the entries
> in this table hold the address of the ITT table) and of the vPE Table
> (since the vPE table entries hold the address of the virtual LPI pending
> table.
> 
> Patches [8-10] sanitize the commands sent to the ITS and their
> arguments.
> 
> Patches [11-13] restrict the access of the host to certain registers
> and prevent undefined behaviour. Prevent the host from re-programming
> the tables held in the GITS_BASER register.
> 
> The last patch introduces an hvc to setup the ITS emulation and calls
> into the ITS driver to setup the shadow state. 
> 
> 
> Design
> ======
> 
> 
> 1. Command queue shadowing
> 
> The ITS hardware supports a command queue which is programmed by the driver
> in the GITS_CBASER register. To inform the hardware that a new command
> has been added, the driver updates an index into the GITS_CWRITER
> register. The driver then reads the GITS_CREADR register to see if the
> command was processed or if the queue is stalled.
>  
> To create a new command, the emulation layer mirrors the behavior
> as following:
>  (i) The host ITS driver creates a command in the shadow queue:
> 	its_allocate_entry() -> builder()
>  (ii) Notifies the hardware that a new command is available:
> 	its_post_commands()
>  (iii) Hypervisor traps the write to GITS_CWRITER:
> 	handle_host_mem_abort() -> handle_host_mmio_trap() ->
>             pkvm_handle_gic_emulation()
>  (iv) Hypervisor copies the command from the host command queue
>       to the original queue which is not accessible to the host.
>       It parses the command and updates the hardware write.
> 
> The driver allocates space for the original command queue and programs
> the hardware (GITS_CWRITER). When pKVM becomes available, the driver
> allocates a new (shadow) queue and replaces its original pointer to
> the queue with this new one. This is to prevent a malicious host from
> tampering with the commands sent to the ITS hardware.
> 
> The entry point of our emulation shares the memory of the newly
> allocated queue with the hypervisor and donates the memory of the
> original queue to make it inaccesible to the host.
> 
> 
> 2. Indirect tables first level shadowing
> 
> The ITS hardware supports indirection to minimize the space required to
> accommodate large tables (eg. deviceId space used to index the Device Table
> is quite sparse). This is a 2-level indirection, with entries from the
> first table pointing to a second table.
> 
> An attacker in control of the host can insert an address that points to
> the hypervisor protected memory in the first level table and then use
> subsequent ITS commands to write to this memory (MAPD).
> 
> To shadow this tables, we rely on the driver to allocate space for it
> and we copy the original content from the table into the copy. When
> pKVM becomes available we switch the pointers that hold the orginal
> tables to point to the copy.
> To keep the tables from the hypervisor in sync with what the host
> has, we update the tables when commands are sent to the ITS.
> 
> 
> 3. Hiding the last layer of the Device Table and vPE Table from the host
> 
> An attacker in control of the host kernel can alter the content of these
> tables directly (the Arm IHI 0069H.b spec says that is undefined behavior
> if entries are created by software). Normally these entries are created in
> response of commands sent to the ITS.
> 
> A Device Table entry that has the following structure:
> 
> type DeviceTableEntry is (
> 	boolean Valid,
> 	Address ITT_base,
> 	bits(5) ITT_size
> ) 
> 
> This can be maliciously created by an attacker and the ITT_base can be
> pointed to hypervisor protected memory. The MAPTI command can then be
> used to write over the ITT_base with an ITE entry.
> 
> Similarly a vCPU Table entry has the following structure:
> 
> type VCPUTableEntry is (
> 	boolean Valid,
> 	bits(32) RDbase,
> 	Address VPT_base,
> 	bits(5) VPT_size
> )
> 
> VPT_base can be pointed to hypervisor protected memory and then a
> command can be used to raise interrupts and set the corresponding
> bit. This would give a 1-bit write primitive so is not "as generous"
> as the others.
> 
> 
> Notes
> =====
> 
> 
> Performance impact is expected with this as the emulation dance is not
> cost free.
> I haven't implemented any ITS quirks in the emulation and I don't know
> whether we will need it ? (some hardware needs explicit dcache flushing
> ITS_FLAGS_CMDQ_NEEDS_FLUSHING). 
> 
> Please note that Redistributors trapping hasn't been addressed at all in
> this series and the solution is not sufficient but this can be extended
> afterwards. 
> The current series has been tested with Qemu (-machine
> virt,virtualization=true,gic-version=4) and with Pixel 10.
> 
> 
> Thanks,
> Sebastian E.
> 
> Mostafa Saleh (1):
>   KVM: arm64: Donate MMIO to the hypervisor
> 
> Sebastian Ene (13):
>   KVM: arm64: Track host-unmapped MMIO regions in a static array
>   KVM: arm64: Support host MMIO trap handlers for unmapped devices
>   KVM: arm64: Mediate host access to GIC/ITS MMIO via unmapping
>   irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege
>   KVM: arm64: Add infrastructure for ITS emulation setup
>   KVM: arm64: Restrict host access to the ITS tables
>   KVM: arm64: Trap & emulate the ITS MAPD command
>   KVM: arm64: Trap & emulate the ITS VMAPP command
>   KVM: arm64: Trap & emulate the ITS MAPC command
>   KVM: arm64: Restrict host updates to GITS_CTLR
>   KVM: arm64: Restrict host updates to GITS_CBASER
>   KVM: arm64 Restrict host updates to GITS_BASER
>   KVM: arm64: Implement HVC interface for ITS emulation setup

I tested the patches on Lenovo ideacenter Mini X Gen 10 Snapdragon,
and the kernel hangs at boot for me with messags the following log:

[    2.735838] ITS queue timeout (1056 1024)
[    2.739969] ITS cmd its_build_mapd_cmd failed
[    4.776344] ITS queue timeout (1120 1024)
[    4.780472] ITS cmd its_build_mapti_cmd failed
[    6.816677] ITS queue timeout (1184 1024)
[    6.820806] ITS cmd its_build_mapti_cmd failed
[    8.857009] ITS queue timeout (1248 1024)
[    8.861129] ITS cmd its_build_mapti_cmd failed

I am happy to do more debugging, let me know if I can try anything.

Thanks,
Mostafa

> 
>  arch/arm64/include/asm/kvm_arm.h              |   3 +
>  arch/arm64/include/asm/kvm_asm.h              |   1 +
>  arch/arm64/include/asm/kvm_pkvm.h             |  20 +
>  arch/arm64/kvm/hyp/include/nvhe/its_emulate.h |  17 +
>  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |   2 +
>  arch/arm64/kvm/hyp/nvhe/Makefile              |   3 +-
>  arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  14 +
>  arch/arm64/kvm/hyp/nvhe/its_emulate.c         | 653 ++++++++++++++++++
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 134 ++++
>  arch/arm64/kvm/hyp/nvhe/setup.c               |  28 +
>  arch/arm64/kvm/hyp/pgtable.c                  |   9 +-
>  arch/arm64/kvm/pkvm.c                         |  60 ++
>  drivers/irqchip/irq-gic-v3-its.c              | 177 ++++-
>  include/linux/irqchip/arm-gic-v3.h            |  36 +
>  14 files changed, 1126 insertions(+), 31 deletions(-)
>  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
>  create mode 100644 arch/arm64/kvm/hyp/nvhe/its_emulate.c
> 
> -- 
> 2.53.0.473.g4a7958ca14-goog
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 00/14] KVM: ITS hardening for pKVM
  2026-03-13 15:18 ` Mostafa Saleh
@ 2026-03-15 13:24   ` Fuad Tabba
  2026-03-25 16:26   ` Sebastian Ene
  1 sibling, 0 replies; 36+ messages in thread
From: Fuad Tabba @ 2026-03-15 13:24 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Sebastian Ene, alexandru.elisei, kvmarm, linux-arm-kernel,
	linux-kernel, android-kvm, catalin.marinas, dbrazdil, joey.gouly,
	kees, mark.rutland, maz, oupton, perlarsen, qperret, rananta,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

> I tested the patches on Lenovo ideacenter Mini X Gen 10 Snapdragon,
> and the kernel hangs at boot for me with messags the following log:
>
> [    2.735838] ITS queue timeout (1056 1024)
> [    2.739969] ITS cmd its_build_mapd_cmd failed
> [    4.776344] ITS queue timeout (1120 1024)
> [    4.780472] ITS cmd its_build_mapti_cmd failed
> [    6.816677] ITS queue timeout (1184 1024)
> [    6.820806] ITS cmd its_build_mapti_cmd failed
> [    8.857009] ITS queue timeout (1248 1024)
> [    8.861129] ITS cmd its_build_mapti_cmd failed

I get the same running it on QEMU.

Cheers,
/fuad

> I am happy to do more debugging, let me know if I can try anything.
>
> Thanks,
> Mostafa
>
> >
> >  arch/arm64/include/asm/kvm_arm.h              |   3 +
> >  arch/arm64/include/asm/kvm_asm.h              |   1 +
> >  arch/arm64/include/asm/kvm_pkvm.h             |  20 +
> >  arch/arm64/kvm/hyp/include/nvhe/its_emulate.h |  17 +
> >  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |   2 +
> >  arch/arm64/kvm/hyp/nvhe/Makefile              |   3 +-
> >  arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  14 +
> >  arch/arm64/kvm/hyp/nvhe/its_emulate.c         | 653 ++++++++++++++++++
> >  arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 134 ++++
> >  arch/arm64/kvm/hyp/nvhe/setup.c               |  28 +
> >  arch/arm64/kvm/hyp/pgtable.c                  |   9 +-
> >  arch/arm64/kvm/pkvm.c                         |  60 ++
> >  drivers/irqchip/irq-gic-v3-its.c              | 177 ++++-
> >  include/linux/irqchip/arm-gic-v3.h            |  36 +
> >  14 files changed, 1126 insertions(+), 31 deletions(-)
> >  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
> >  create mode 100644 arch/arm64/kvm/hyp/nvhe/its_emulate.c
> >
> > --
> > 2.53.0.473.g4a7958ca14-goog
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/14] KVM: arm64: Add infrastructure for ITS emulation setup
  2026-03-10 12:49 ` [PATCH 06/14] KVM: arm64: Add infrastructure for ITS emulation setup Sebastian Ene
@ 2026-03-16 10:46   ` Fuad Tabba
  2026-03-17  9:40     ` Fuad Tabba
  0 siblings, 1 reply; 36+ messages in thread
From: Fuad Tabba @ 2026-03-16 10:46 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

Hi Sebastian,

On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
>
> Share the host command queue with the hypervisor. Donate
> the original command queue memory to the hypervisor to ensure
> host exclusion and trap accesses on GITS_CWRITE register.

GITS_CWRITE -> GITS_CWRITER

> On a CWRITER write, the hypervisor copies commands from the
> host's queue to the protected queue before updating the
> hardware register.
> This ensures the hypervisor mediates all commands sent to
> the physical ITS.

This commit message demonstrates why we should be careful about
naming. This patch series is pretty complex, and having clear and
consistent naming would help. What I mean is, here you're referring to
protected vs host queue, earlier you were referring to shadow (which
now I am not clear if it meant the host version or the hyp version).
Maybe we could have a discussion about naming, but the most important
part is consistency.

>
> Signed-off-by: Sebastian Ene <sebastianene@google.com>
> ---
>  arch/arm64/include/asm/kvm_pkvm.h             |   1 +
>  arch/arm64/kvm/hyp/include/nvhe/its_emulate.h |  17 ++
>  arch/arm64/kvm/hyp/nvhe/its_emulate.c         | 203 ++++++++++++++++++
>  3 files changed, 221 insertions(+)
>  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
>
> diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
> index ef00c1bf7d00..dc5ef2f9ac49 100644
> --- a/arch/arm64/include/asm/kvm_pkvm.h
> +++ b/arch/arm64/include/asm/kvm_pkvm.h
> @@ -28,6 +28,7 @@ struct pkvm_protected_reg {
>         u64 start_pfn;
>         size_t num_pages;
>         pkvm_emulate_handler *cb;
> +       void *priv;
>  };
>
>  extern struct pkvm_protected_reg kvm_nvhe_sym(pkvm_protected_regs)[];
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/its_emulate.h b/arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
> new file mode 100644
> index 000000000000..6be24c723658
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#ifndef __NVHE_ITS_EMULATE_H
> +#define __NVHE_ITS_EMULATE_H
> +
> +
> +#include <asm/kvm_pkvm.h>
> +
> +
> +struct its_shadow_tables;
> +
> +int pkvm_init_gic_its_emulation(phys_addr_t dev_addr, void *priv_state,
> +                               struct its_shadow_tables *shadow);
> +
> +void pkvm_handle_gic_emulation(struct pkvm_protected_reg *region, u64 offset, bool write,
> +                              u64 *reg, u8 reg_size);
> +#endif /* __NVHE_ITS_EMULATE_H */
> diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> index 0eecbb011898..4a3ccc90a1a9 100644
> --- a/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> +++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> @@ -1,8 +1,75 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>
>  #include <asm/kvm_pkvm.h>
> +#include <linux/irqchip/arm-gic-v3.h>
> +#include <nvhe/its_emulate.h>
>  #include <nvhe/mem_protect.h>
>
> +struct its_priv_state {
> +       void *base;
> +       void *cmd_hyp_base;
> +       void *cmd_host_base;
> +       void *cmd_host_cwriter;
> +       struct its_shadow_tables *shadow;
> +       hyp_spinlock_t its_lock;
> +};
> +
> +struct its_handler {
> +       u64 offset;
> +       u8 access_size;
> +       void (*write)(struct its_priv_state *its, u64 offset, u64 value);
> +       void (*read)(struct its_priv_state *its, u64 offset, u64 *read);
> +};
> +
> +DEFINE_HYP_SPINLOCK(its_setup_lock);
> +
> +static void cwriter_write(struct its_priv_state *its, u64 offset, u64 value)
> +{
> +       u64 cwriter_offset = value & GENMASK(19, 5);
> +       int cmd_len, cmd_offset;
> +       size_t cmdq_sz = its->shadow->cmdq_len;

Prefer RCT declaration order.

> +
> +       if (cwriter_offset > cmdq_sz)
> +               return;

Off-by-one:
+       if (cwriter_offset >= cmdq_sz)

> +
> +       cmd_offset = its->cmd_host_cwriter - its->cmd_host_base;
> +       cmd_len = cwriter_offset - cmd_offset;
> +       if (cmd_len < 0)
> +               cmd_len = cmdq_sz - cmd_offset;

This relies on unsigned underflow. cwriter_offset is u64. If
cmd_offset > cwriter_offset, the result is a massive positive u64
(e.g., 0xFFFFFFF...). This works but relies on implementation-defined
behavior and is fragile. UBSAN might flag this, and if someone comes
along and "cleans up" the variables by changing cmd_len and cmd_offset
to size_t (unsigned 64-bit), which is the correct type for buffer
lengths, cmd_len < 0 will never be true. It's better to explicitly
handle the wrap-around using safe unsigned logic (which is the
idiomatic way of doing it):

+       size_t cwriter_offset = value & GENMASK(19, 5);
+       size_t cmd_len, cmd_offset;
...
+       if (cwriter_offset >= cmd_offset)
+              cmd_len = cwriter_offset - cmd_offset;
+       else
+              cmd_len = cmdq_sz - cmd_offset;

Also, cwriter_offset should be validated to ensure it is 32-byte
aligned, as the ITS commands are fixed-size packets.

> +
> +       if (cmd_offset + cmd_len > cmdq_sz)
> +               return;
> +
> +       memcpy(its->cmd_hyp_base + cmd_offset, its->cmd_host_cwriter, cmd_len);

I think we need a memory barrier (e.g., dsb(ishst)) after the memcpy
to ensure the writes are visible to the hardware before ringing the
GITS_CWRITER doorbell.

> +
> +       its->cmd_host_cwriter = its->cmd_host_base +
> +               (cmd_offset + cmd_len) % cmdq_sz;
> +       if (its->cmd_host_cwriter == its->cmd_host_base) {
> +               memcpy(its->cmd_hyp_base, its->cmd_host_base, cwriter_offset);
> +
> +               its->cmd_host_cwriter = its->cmd_host_base + cwriter_offset;
> +       }
> +
> +       writeq_relaxed(value, its->base + GITS_CWRITER);

You should sanitize this to

 +       writeq_relaxed(value & GENMASK(19, 5), its->base + GITS_CWRITER);

because value contains the raw 64-bit value written by the host. While
you extracted the offset correctly using GENMASK(19, 5), you are
forwarding the raw value (including any garbage in the RES0 bits) to
the physical hardware.

> +}
> +
> +static void cwriter_read(struct its_priv_state *its, u64 offset, u64 *read)
> +{
> +       *read = readq_relaxed(its->base + GITS_CWRITER);
> +}
> +
> +#define ITS_HANDLER(off, sz, write_cb, read_cb)        \
> +{                                                      \
> +       .offset = (off),                                \
> +       .access_size = (sz),                            \
> +       .write = (write_cb),                            \
> +       .read = (read_cb),                              \
> +}
> +
> +static struct its_handler its_handlers[] = {
> +       ITS_HANDLER(GITS_CWRITER, sizeof(u64), cwriter_write, cwriter_read),
> +       {},
> +};
>
>  void pkvm_handle_forward_req(struct pkvm_protected_reg *region, u64 offset, bool write,
>                              u64 *reg, u8 reg_size)
> @@ -21,3 +88,139 @@ void pkvm_handle_forward_req(struct pkvm_protected_reg *region, u64 offset, bool
>                         writeq_relaxed(*reg, addr);
>         }
>  }
> +
> +void pkvm_handle_gic_emulation(struct pkvm_protected_reg *region, u64 offset, bool write,
> +                              u64 *reg, u8 reg_size)
> +{
> +       struct its_priv_state *its_priv = region->priv;

Due to arm64's  memory model, you must use
smp_load_acquire(&region->priv) here to guarantee the struct fields
(base, shadow, etc.) are fully visible before dereferencing them.

(and a similar issue later on below... I'll point it out)

> +       void __iomem *addr;
> +       struct its_handler *reg_handler;

Prefer RCT declaration order.


> +
> +       if (!its_priv)
> +               return;

This breaks bisectability. In this patch, you redirect the MMIO trap
to pkvm_handle_gic_emulation, but the setup HVC
(pkvm_init_gic_its_emulation) isn't called until a later patch.
Therefore, its_priv is always NULL here, causing all host ITS accesses
to be silently dropped. The device appears dead.

 You must fall back to the passthrough handler if emulation is not yet
initialized:
+       if (!its_priv) {
+               pkvm_handle_forward_req(region, offset, write, reg, reg_size);
+               return;
+       }

at least until you setup the hvc...

> +
> +       addr = its_priv->base + offset;
> +       for (reg_handler = its_handlers; reg_handler->access_size; reg_handler++) {
> +               if (reg_handler->offset > offset ||
> +                   reg_handler->offset + reg_handler->access_size <= offset)
> +                       continue;
> +
> +               if (reg_handler->access_size & (reg_size - 1))
> +                       continue;

Is this deliberate? If access_size is 8 and reg_size is 4, 8 & 3 == 0,
so it continues and allows the 8-byte handler to process a 4-byte
write. Use explicit length matching (reg_handler->access_size !=
reg_size), but there's more later...

> +
> +               if (write && reg_handler->write) {
> +                       hyp_spin_lock(&its_priv->its_lock);
> +                       reg_handler->write(its_priv, offset, *reg);

The GIC specification requires support for 32-bit accesses to all
GITS_* registers. If the host executes a 32-bit write, *reg contains
only the 32-bit value. However, you do not pass reg_size to
cwriter_write. The handler assumes it is a full 64-bit value, which
will corrupt the other half of the emulated register with zeros. You
need to implement Read-Modify-Write logic for sub-register accesses.

> +                       hyp_spin_unlock(&its_priv->its_lock);
> +                       return;
> +               }
> +
> +               if (!write && reg_handler->read) {
> +                       hyp_spin_lock(&its_priv->its_lock);
> +                       reg_handler->read(its_priv, offset, reg);
> +                       hyp_spin_unlock(&its_priv->its_lock);
> +                       return;
> +               }
> +
> +               return;
> +       }
> +
> +       pkvm_handle_forward_req(region, offset, write, reg, reg_size);
> +}
> +
> +static struct pkvm_protected_reg *get_region(phys_addr_t dev_addr)
> +{
> +       int i;
> +       u64 dev_pfn = dev_addr >> PAGE_SHIFT;

Please prefer RCT.

> +
> +       for (i = 0; i < PKVM_PROTECTED_REGS_NUM; i++) {
> +               if (pkvm_protected_regs[i].start_pfn == dev_pfn)
> +                       return &pkvm_protected_regs[i];
> +       }
> +
> +       return NULL;
> +}
> +
> +static int pkvm_setup_its_shadow_cmdq(struct its_shadow_tables *shadow)
> +{
> +       int ret, i, num_pages;
> +       u64 shadow_start_pfn, original_start_pfn;
> +       void *cmd_shadow_va = kern_hyp_va(shadow->cmd_shadow);

Please prefer RCT. There are multiple variables declared on a single
line (int ret, i, num_pages;). Please untangle.

> +
> +       shadow_start_pfn = hyp_virt_to_pfn(cmd_shadow_va);
> +       original_start_pfn = hyp_virt_to_pfn(kern_hyp_va(shadow->cmd_original));
> +       num_pages = shadow->cmdq_len >> PAGE_SHIFT;
> +
> +       for (i = 0; i < num_pages; i++) {
> +               ret = __pkvm_host_share_hyp(shadow_start_pfn + i);
> +               if (ret)
> +                       goto unshare_shadow;
> +       }
> +
> +       ret = hyp_pin_shared_mem(cmd_shadow_va, cmd_shadow_va + shadow->cmdq_len);
> +       if (ret)
> +               goto unshare_shadow;
> +
> +       ret = __pkvm_host_donate_hyp(original_start_pfn, num_pages);
> +       if (ret)
> +               goto unpin_shadow;
> +
> +       return ret;
> +
> +unpin_shadow:
> +       hyp_unpin_shared_mem(cmd_shadow_va, cmd_shadow_va + shadow->cmdq_len);
> +
> +unshare_shadow:
> +       for (i = i - 1; i >= 0; i--)

Please use the standard while (i--) idiom for cleaner rollback loops.

> +               __pkvm_host_unshare_hyp(shadow_start_pfn + i);
> +
> +       return ret;
> +}
> +
> +int pkvm_init_gic_its_emulation(phys_addr_t dev_addr, void *host_priv_state,
> +                               struct its_shadow_tables *host_shadow)
> +{
> +       int ret;
> +       struct its_priv_state *priv_state = kern_hyp_va(host_priv_state);
> +       struct its_shadow_tables *shadow = kern_hyp_va(host_shadow);
> +       struct pkvm_protected_reg *its_reg;
> +
> +       hyp_spin_lock(&its_setup_lock);

You're not releasing this lock on any of the error paths. Please add
err_unlock and goto err_unlock to release the spinlock.

> +       its_reg = get_region(dev_addr);
> +       if (!its_reg)
> +               return -ENODEV;
> +
> +       if (its_reg->priv)
> +               return -EOPNOTSUPP;

I think you need to add a check:

+       if (!PAGE_ALIGNED(priv_state) || !PAGE_ALIGNED(shadow))
+              return -EINVAL;

since __pkvm_host_donate_hyp() expects page-aligned addresses.

> +       ret = __pkvm_host_donate_hyp(hyp_virt_to_pfn(priv_state), 1);
> +       if (ret)
> +               return ret;
> +
> +       ret = __pkvm_host_donate_hyp(hyp_virt_to_pfn(shadow), 1);
> +       if (ret)
> +               goto err_with_state;
> +
> +       ret = pkvm_setup_its_shadow_cmdq(shadow);
> +       if (ret)
> +               goto err_with_shadow;
> +
> +       its_reg->priv = priv_state;

Related to above, please use smp_store_release(&its_reg->priv,
priv_state); to prevent the compiler and CPU from reordering the
struct initialization below this pointer assignment, which would
expose uninitialized memory to the lockless reader in the trap
handler.

Cheers,
/fuad




> +
> +       hyp_spin_lock_init(&priv_state->its_lock);
> +       priv_state->shadow = shadow;
> +       priv_state->base = __hyp_va(dev_addr);
> +
> +       priv_state->cmd_hyp_base = kern_hyp_va(shadow->cmd_original);
> +       priv_state->cmd_host_base = kern_hyp_va(shadow->cmd_shadow);
> +       priv_state->cmd_host_cwriter = priv_state->cmd_host_base;
> +
> +       hyp_spin_unlock(&its_setup_lock);
> +
> +       return 0;
> +err_with_shadow:
> +       __pkvm_hyp_donate_host(hyp_virt_to_pfn(shadow), 1);
> +err_with_state:
> +       __pkvm_hyp_donate_host(hyp_virt_to_pfn(priv_state), 1);
> +       return ret;
> +}
> --
> 2.53.0.473.g4a7958ca14-goog
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/14] KVM: arm64: Restrict host access to the ITS tables
  2026-03-10 12:49 ` [PATCH 07/14] KVM: arm64: Restrict host access to the ITS tables Sebastian Ene
@ 2026-03-16 16:13   ` Fuad Tabba
  0 siblings, 0 replies; 36+ messages in thread
From: Fuad Tabba @ 2026-03-16 16:13 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

Hi Sebastian,

On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
>
> Setup shadow structures for ITS indirect tables held in
> the GITS_BASER<n> registers.
> Make the last level of the Device Table and vPE Table
> inacessible to the host.

inacessible  -> inaccessible
> In a direct layout configuration, donate the table to
> the hypervisor since the software is not expected to
> program them directly.

This commit message is too brief and doesn't fully explain the
problem, the impact, and the mechanism of the solution. It also
appears to contradict the actual code changes.

For example, could you elaborate why must the last level of indirect
tables be inaccessible?

Can you also please explain the mechanism? You are parsing
GITS_BASER_INDIRECT to determine if a shadow Level 1 table must be
shared with the host, while unconditionally donating the original
physical tables. You also explicitly exclude Collection tables. The
msg should briefly justify why Collection tables are safe to leave
accessible to the host.

There is also a contradiction in the message. You state "In a direct
layout configuration, donate the table...". However, your code donates
the original hardware table unconditionally on every iteration of the
loop, regardless of whether GITS_BASER_INDIRECT is set. Please ensure
the commit log accurately reflects the code implementation.

Maybe you could say that the problem is Host DMA attacks via ITS table
manipulation. Whereas the mechanism is to unconditionally donate
hardware tables to EL2. For indirect Device/vPE tables, share a L1
shadow table with the host and strictly donate the L2 pages to prevent
the host from writing malicious L2 pointers.

>
> Signed-off-by: Sebastian Ene <sebastianene@google.com>
> ---
>  arch/arm64/kvm/hyp/nvhe/its_emulate.c | 143 ++++++++++++++++++++++++++
>  1 file changed, 143 insertions(+)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> index 4a3ccc90a1a9..865a5d6353ed 100644
> --- a/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> +++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> @@ -141,6 +141,145 @@ static struct pkvm_protected_reg *get_region(phys_addr_t dev_addr)
>         return NULL;
>  }
>
> +static int pkvm_host_unmap_last_level(void *shadow, size_t num_pages, u32 psz)
> +{
> +       u64 *table = shadow;
> +       int ret, i, end = (num_pages << PAGE_SHIFT) / sizeof(table);
> +       phys_addr_t table_addr;

RCT, mixing initialized variables and uninitialized variables, plus
variables of conceptually different "types" in the same declaration.

Please use sizeof(*table): sizeof(table) evaluates to the size of the
pointer (8 bytes), NOT the size of the array element. In this case,
this happens to be the same, but it's still wrong.

Maybe the following is clearer:
+        int end = num_pages * (PAGE_SIZE / sizeof(*table));


> +
> +       for (i = 0; i < end; i++) {
> +               if (!(table[i] & GITS_BASER_VALID))
> +                       continue;
> +
> +               table_addr = table[i] & PHYS_MASK;
> +               ret = __pkvm_host_donate_hyp(hyp_phys_to_pfn(table_addr), psz >> PAGE_SHIFT);

The ITS-configured page size and the host page size could be
different, but the number of pages to donate for Level 2 tables is
calculated based on psz (the ITS).

If the ITS hardware is configured for 4KB pages, but the host kernel
is using (e.g.,) 64KB pages, psz >> PAGE_SHIFT evaluates to 0.

You need to account for mismatched page sizes, perhaps by using
DIV_ROUND_UP(psz, PAGE_SIZE) (or something similar) to ensure the
containing host page is donated.

> +               if (ret)
> +                       goto err_donate;
> +       }
> +
> +       return 0;
> +err_donate:
> +       for (i = i - 1; i >= 0; i--) {

Please use the while (i--) idiom for rollback loops.


> +               if (!(table[i] & GITS_BASER_VALID))
> +                       continue;
> +
> +               table_addr = table[i] & PHYS_MASK;
> +               __pkvm_hyp_donate_host(hyp_phys_to_pfn(table_addr), psz >> PAGE_SHIFT);

Please wrap this in WARN_ON(...). If donating back to the host fails
during a rollback, we have a fatal page leak that needs to be loudly
flagged, similar to how you handle it in pkvm_unshare_shadow_table.


> +       }
> +       return ret;
> +}
> +
> +static int pkvm_share_shadow_table(void *shadow, u64 nr_pages)
> +{
> +       u64 i, ret, start_pfn = hyp_virt_to_pfn(shadow);

Same comment as before with RCT and the mixing of declarations.


> +
> +       for (i = 0; i < nr_pages; i++) {
> +               ret = __pkvm_host_share_hyp(start_pfn + i);
> +               if (ret)
> +                       goto unshare;
> +       }
> +
> +       ret = hyp_pin_shared_mem(shadow, shadow + (nr_pages << PAGE_SHIFT));
> +       if (ret)
> +               goto unshare;
> +
> +       return ret;
> +unshare:

Please use the while (i--) idiom for rollback loops.

Also, please use consistent naming conventions for the labels. Here
you call it unshare, and earlier it was err_donate.


> +       for (i = i - 1; i >= 0; i--)
> +               __pkvm_host_unshare_hyp(start_pfn + i);
> +       return ret;
> +}
> +
> +static void pkvm_unshare_shadow_table(void *shadow, u64 nr_pages)
> +{
> +       u64 i, start_pfn = hyp_virt_to_pfn(shadow);
> +
> +       hyp_unpin_shared_mem(shadow, shadow + (nr_pages << PAGE_SHIFT));
> +
> +       for (i = 0; i < nr_pages; i++)
> +               WARN_ON(__pkvm_host_unshare_hyp(start_pfn + i));
> +}
> +
> +static void pkvm_host_map_last_level(void *shadow, size_t num_pages, u32 psz)
> +{
> +       u64 *table;

RCT, and you forgot to initialize table:
+       u64 *table = shadow;

> +       int i, end = (num_pages << PAGE_SHIFT) / sizeof(table);

Same sizeof(table) pointer-size bug as above.


> +       phys_addr_t table_addr;
> +
> +       for (i = 0; i < end; i++) {
> +               if (!(table[i] & GITS_BASER_VALID))
> +                       continue;
> +
> +               table_addr = table[i] & ~GITS_BASER_VALID;

Inconsistent masking logic, since in pkvm_host_unmap_last_level you
correctly used PHYS_MASK to extract the address, but here in the
rollback path you use ~GITS_BASER_VALID.

While both currently work because the upper bits and lower bits (below
the page size) are defined as RES0 in the GIC spec, ~GITS_BASER_VALID
is architecturally fragile. If a future hardware revision repurposes
the upper RES0 bits [62:52] for new attributes (e.g., memory
encryption flags), ~GITS_BASER_VALID will leak those attribute bits
into the physical address calculation.

Since PHYS_MASK correctly handles the address extraction across all
page sizes (relying on the lower bits being RES0) and safely masks off
future upper attribute bits, please standardize on using table_addr =
table[i] & PHYS_MASK; for both functions.


> +               WARN_ON(__pkvm_hyp_donate_host(hyp_phys_to_pfn(table_addr), psz >> PAGE_SHIFT));
> +       }
> +}
> +
> +static int pkvm_setup_its_shadow_baser(struct its_shadow_tables *shadow)
> +{
> +       int i, ret;
> +       u64 baser_val, num_pages, type;
> +       void *base, *host_base;
> +
> +       for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> +               baser_val = shadow->tables[i].val;
> +               if (!(baser_val & GITS_BASER_VALID))
> +                       continue;
> +
> +               base = kern_hyp_va(shadow->tables[i].base);
> +               num_pages = (1 << shadow->tables[i].order);
> +
> +               ret = __pkvm_host_donate_hyp(hyp_virt_to_pfn(base), num_pages);
> +               if (ret)
> +                       goto err_donate;
> +
> +               if (baser_val & GITS_BASER_INDIRECT) {
> +                       host_base = kern_hyp_va(shadow->tables[i].shadow);
> +                       ret = pkvm_share_shadow_table(host_base, num_pages);
> +                       if (ret)
> +                               goto err_with_donation;
> +
> +                       type = GITS_BASER_TYPE(baser_val);
> +                       if (type == GITS_BASER_TYPE_COLLECTION)
> +                               continue;
> +
> +                       ret = pkvm_host_unmap_last_level(base, num_pages,
> +                                                        shadow->tables[i].psz);
> +                       if (ret)
> +                               goto err_with_share;
> +               }
> +       }
> +
> +       return 0;
> +err_with_share:
> +       pkvm_unshare_shadow_table(host_base, num_pages);
> +err_with_donation:
> +       __pkvm_hyp_donate_host(hyp_virt_to_pfn(base), num_pages);
> +err_donate:
> +       for (i = i - 1; i >= 0; i--) {

Please use the while (i--) idiom for rollback loops.


> +               baser_val = shadow->tables[i].val;
> +               if (!(baser_val & GITS_BASER_VALID))
> +                       continue;
> +
> +               base = kern_hyp_va(shadow->tables[i].base);
> +               num_pages = (1 << shadow->tables[i].order);
> +
> +               WARN_ON(__pkvm_hyp_donate_host(hyp_virt_to_pfn(base), num_pages));

The sequence of rollback operations here creates a TOCTOU vulnerability.

- First, you donate base (the Level 1 indirect table) back to the host.
- Then, you pass base into pkvm_host_map_last_level().
- Finally, pkvm_host_map_last_level() reads table[i] out of base to
determine which Level 2 pages to donate back to the host.

Because the host regains ownership of base _first_, it can be running
concurrently on another CPU. A malicious host can overwrite the Level
1 table with pointers to arbitrary hypervisor-owned memory. The
hypervisor will then read those malicious pointers and dutifully grant
the host access to its own secure memory.

The order of operations needs to be reversed: you must read base to
roll back the L2 pages, unshare the shadow table, and *only then*
donate base back to the host.

Also, num_pages = (1 << shadow->tables[i].order); calculates a 32-bit
signed integer because the literal 1 is a signed 32-bit int. If order
is 31, this evaluates to a negative number. If order is 32 or higher,
this is undefined behavior. Because num_pages is declared as a u64,
you should use the standard kernel macro BIT_ULL().

Here's my suggested fix (not tested). Reorder the operations to safely
rollback L2 before donating L1, use the standard `while (i--)` loop,
and fix the page calculation:

+    while (i--) {
+        baser_val = shadow->tables[i].val;
+        if (!(baser_val & GITS_BASER_VALID))
+                continue;
+
+        base = kern_hyp_va(shadow->tables[i].base);
+        num_pages = BIT_ULL(shadow->tables[i].order);
+
+        if (baser_val & GITS_BASER_INDIRECT) {
+                host_base = kern_hyp_va(shadow->tables[i].shadow);
+
+            type = GITS_BASER_TYPE(baser_val);
+            if (type != GITS_BASER_TYPE_COLLECTION)
+                    pkvm_host_map_last_level(base, num_pages,
+                                 shadow->tables[i].psz);
+
+            pkvm_unshare_shadow_table(host_base, num_pages);
+        }
+
+        WARN_ON(__pkvm_hyp_donate_host(hyp_virt_to_pfn(base), num_pages));
+    }



> +               if (baser_val & GITS_BASER_INDIRECT) {
> +                       host_base = kern_hyp_va(shadow->tables[i].shadow);
> +                       pkvm_unshare_shadow_table(host_base, num_pages);
> +
> +                       type = GITS_BASER_TYPE(baser_val);
> +                       if (type == GITS_BASER_TYPE_COLLECTION)
> +                               continue;
> +
> +                       pkvm_host_map_last_level(base, num_pages, shadow->tables[i].psz);
> +               }
> +       }

You have duplicated the entire table decoding logic (calculating base,
num_pages, checking INDIRECT...) down here in the rollback path.
Consider abstracting "setup one table" and "teardown one table" into
helper functions to make pkvm_setup_its_shadow_baser more readable and
less prone to copy-pasta errors.

Cheers,
/fuad


> +
> +       return ret;
> +}
> +
>  static int pkvm_setup_its_shadow_cmdq(struct its_shadow_tables *shadow)
>  {
>         int ret, i, num_pages;
> @@ -205,6 +344,10 @@ int pkvm_init_gic_its_emulation(phys_addr_t dev_addr, void *host_priv_state,
>         if (ret)
>                 goto err_with_shadow;
>
> +       ret = pkvm_setup_its_shadow_baser(shadow);
> +       if (ret)
> +               goto err_with_shadow;
> +
>         its_reg->priv = priv_state;
>
>         hyp_spin_lock_init(&priv_state->its_lock);
> --
> 2.53.0.473.g4a7958ca14-goog
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/14] KVM: arm64: Add infrastructure for ITS emulation setup
  2026-03-16 10:46   ` Fuad Tabba
@ 2026-03-17  9:40     ` Fuad Tabba
  0 siblings, 0 replies; 36+ messages in thread
From: Fuad Tabba @ 2026-03-17  9:40 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

On Mon, 16 Mar 2026 at 10:46, Fuad Tabba <tabba@google.com> wrote:
>
> Hi Sebastian,
>
> On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
> >
> > Share the host command queue with the hypervisor. Donate
> > the original command queue memory to the hypervisor to ensure
> > host exclusion and trap accesses on GITS_CWRITE register.
>
> GITS_CWRITE -> GITS_CWRITER
>
> > On a CWRITER write, the hypervisor copies commands from the
> > host's queue to the protected queue before updating the
> > hardware register.
> > This ensures the hypervisor mediates all commands sent to
> > the physical ITS.
>
> This commit message demonstrates why we should be careful about
> naming. This patch series is pretty complex, and having clear and
> consistent naming would help. What I mean is, here you're referring to
> protected vs host queue, earlier you were referring to shadow (which
> now I am not clear if it meant the host version or the hyp version).
> Maybe we could have a discussion about naming, but the most important
> part is consistency.
>
> >
> > Signed-off-by: Sebastian Ene <sebastianene@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_pkvm.h             |   1 +
> >  arch/arm64/kvm/hyp/include/nvhe/its_emulate.h |  17 ++
> >  arch/arm64/kvm/hyp/nvhe/its_emulate.c         | 203 ++++++++++++++++++
> >  3 files changed, 221 insertions(+)
> >  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
> >
> > diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
> > index ef00c1bf7d00..dc5ef2f9ac49 100644
> > --- a/arch/arm64/include/asm/kvm_pkvm.h
> > +++ b/arch/arm64/include/asm/kvm_pkvm.h
> > @@ -28,6 +28,7 @@ struct pkvm_protected_reg {
> >         u64 start_pfn;
> >         size_t num_pages;
> >         pkvm_emulate_handler *cb;
> > +       void *priv;
> >  };
> >
> >  extern struct pkvm_protected_reg kvm_nvhe_sym(pkvm_protected_regs)[];
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/its_emulate.h b/arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
> > new file mode 100644
> > index 000000000000..6be24c723658
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
> > @@ -0,0 +1,17 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +
> > +#ifndef __NVHE_ITS_EMULATE_H
> > +#define __NVHE_ITS_EMULATE_H
> > +
> > +
> > +#include <asm/kvm_pkvm.h>
> > +
> > +
> > +struct its_shadow_tables;
> > +
> > +int pkvm_init_gic_its_emulation(phys_addr_t dev_addr, void *priv_state,
> > +                               struct its_shadow_tables *shadow);
> > +
> > +void pkvm_handle_gic_emulation(struct pkvm_protected_reg *region, u64 offset, bool write,
> > +                              u64 *reg, u8 reg_size);
> > +#endif /* __NVHE_ITS_EMULATE_H */
> > diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> > index 0eecbb011898..4a3ccc90a1a9 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> > @@ -1,8 +1,75 @@
> >  // SPDX-License-Identifier: GPL-2.0-only
> >
> >  #include <asm/kvm_pkvm.h>
> > +#include <linux/irqchip/arm-gic-v3.h>
> > +#include <nvhe/its_emulate.h>
> >  #include <nvhe/mem_protect.h>
> >
> > +struct its_priv_state {
> > +       void *base;
> > +       void *cmd_hyp_base;
> > +       void *cmd_host_base;
> > +       void *cmd_host_cwriter;
> > +       struct its_shadow_tables *shadow;
> > +       hyp_spinlock_t its_lock;
> > +};
> > +
> > +struct its_handler {
> > +       u64 offset;
> > +       u8 access_size;
> > +       void (*write)(struct its_priv_state *its, u64 offset, u64 value);
> > +       void (*read)(struct its_priv_state *its, u64 offset, u64 *read);
> > +};
> > +
> > +DEFINE_HYP_SPINLOCK(its_setup_lock);
> > +
> > +static void cwriter_write(struct its_priv_state *its, u64 offset, u64 value)
> > +{
> > +       u64 cwriter_offset = value & GENMASK(19, 5);
> > +       int cmd_len, cmd_offset;
> > +       size_t cmdq_sz = its->shadow->cmdq_len;
>
> Prefer RCT declaration order.
>
> > +
> > +       if (cwriter_offset > cmdq_sz)
> > +               return;
>
> Off-by-one:
> +       if (cwriter_offset >= cmdq_sz)
>
> > +
> > +       cmd_offset = its->cmd_host_cwriter - its->cmd_host_base;
> > +       cmd_len = cwriter_offset - cmd_offset;
> > +       if (cmd_len < 0)
> > +               cmd_len = cmdq_sz - cmd_offset;
>
> This relies on unsigned underflow. cwriter_offset is u64. If
> cmd_offset > cwriter_offset, the result is a massive positive u64
> (e.g., 0xFFFFFFF...). This works but relies on implementation-defined
> behavior and is fragile. UBSAN might flag this, and if someone comes
> along and "cleans up" the variables by changing cmd_len and cmd_offset
> to size_t (unsigned 64-bit), which is the correct type for buffer
> lengths, cmd_len < 0 will never be true. It's better to explicitly
> handle the wrap-around using safe unsigned logic (which is the
> idiomatic way of doing it):
>
> +       size_t cwriter_offset = value & GENMASK(19, 5);
> +       size_t cmd_len, cmd_offset;
> ...
> +       if (cwriter_offset >= cmd_offset)
> +              cmd_len = cwriter_offset - cmd_offset;
> +       else
> +              cmd_len = cmdq_sz - cmd_offset;

Will Deacon pointed out that I'm wrong about this (offline): the Linux
kernel relies on compiler guarantees for 2's complement architectures:
the value is truncated to the lower 32 bits. That said, the code I
suggested above is clearer and more idiomatic.

Cheers,
/fuad


>
> Also, cwriter_offset should be validated to ensure it is 32-byte
> aligned, as the ITS commands are fixed-size packets.
>
> > +
> > +       if (cmd_offset + cmd_len > cmdq_sz)
> > +               return;
> > +
> > +       memcpy(its->cmd_hyp_base + cmd_offset, its->cmd_host_cwriter, cmd_len);
>
> I think we need a memory barrier (e.g., dsb(ishst)) after the memcpy
> to ensure the writes are visible to the hardware before ringing the
> GITS_CWRITER doorbell.
>
> > +
> > +       its->cmd_host_cwriter = its->cmd_host_base +
> > +               (cmd_offset + cmd_len) % cmdq_sz;
> > +       if (its->cmd_host_cwriter == its->cmd_host_base) {
> > +               memcpy(its->cmd_hyp_base, its->cmd_host_base, cwriter_offset);
> > +
> > +               its->cmd_host_cwriter = its->cmd_host_base + cwriter_offset;
> > +       }
> > +
> > +       writeq_relaxed(value, its->base + GITS_CWRITER);
>
> You should sanitize this to
>
>  +       writeq_relaxed(value & GENMASK(19, 5), its->base + GITS_CWRITER);
>
> because value contains the raw 64-bit value written by the host. While
> you extracted the offset correctly using GENMASK(19, 5), you are
> forwarding the raw value (including any garbage in the RES0 bits) to
> the physical hardware.
>
> > +}
> > +
> > +static void cwriter_read(struct its_priv_state *its, u64 offset, u64 *read)
> > +{
> > +       *read = readq_relaxed(its->base + GITS_CWRITER);
> > +}
> > +
> > +#define ITS_HANDLER(off, sz, write_cb, read_cb)        \
> > +{                                                      \
> > +       .offset = (off),                                \
> > +       .access_size = (sz),                            \
> > +       .write = (write_cb),                            \
> > +       .read = (read_cb),                              \
> > +}
> > +
> > +static struct its_handler its_handlers[] = {
> > +       ITS_HANDLER(GITS_CWRITER, sizeof(u64), cwriter_write, cwriter_read),
> > +       {},
> > +};
> >
> >  void pkvm_handle_forward_req(struct pkvm_protected_reg *region, u64 offset, bool write,
> >                              u64 *reg, u8 reg_size)
> > @@ -21,3 +88,139 @@ void pkvm_handle_forward_req(struct pkvm_protected_reg *region, u64 offset, bool
> >                         writeq_relaxed(*reg, addr);
> >         }
> >  }
> > +
> > +void pkvm_handle_gic_emulation(struct pkvm_protected_reg *region, u64 offset, bool write,
> > +                              u64 *reg, u8 reg_size)
> > +{
> > +       struct its_priv_state *its_priv = region->priv;
>
> Due to arm64's  memory model, you must use
> smp_load_acquire(&region->priv) here to guarantee the struct fields
> (base, shadow, etc.) are fully visible before dereferencing them.
>
> (and a similar issue later on below... I'll point it out)
>
> > +       void __iomem *addr;
> > +       struct its_handler *reg_handler;
>
> Prefer RCT declaration order.
>
>
> > +
> > +       if (!its_priv)
> > +               return;
>
> This breaks bisectability. In this patch, you redirect the MMIO trap
> to pkvm_handle_gic_emulation, but the setup HVC
> (pkvm_init_gic_its_emulation) isn't called until a later patch.
> Therefore, its_priv is always NULL here, causing all host ITS accesses
> to be silently dropped. The device appears dead.
>
>  You must fall back to the passthrough handler if emulation is not yet
> initialized:
> +       if (!its_priv) {
> +               pkvm_handle_forward_req(region, offset, write, reg, reg_size);
> +               return;
> +       }
>
> at least until you setup the hvc...
>
> > +
> > +       addr = its_priv->base + offset;
> > +       for (reg_handler = its_handlers; reg_handler->access_size; reg_handler++) {
> > +               if (reg_handler->offset > offset ||
> > +                   reg_handler->offset + reg_handler->access_size <= offset)
> > +                       continue;
> > +
> > +               if (reg_handler->access_size & (reg_size - 1))
> > +                       continue;
>
> Is this deliberate? If access_size is 8 and reg_size is 4, 8 & 3 == 0,
> so it continues and allows the 8-byte handler to process a 4-byte
> write. Use explicit length matching (reg_handler->access_size !=
> reg_size), but there's more later...
>
> > +
> > +               if (write && reg_handler->write) {
> > +                       hyp_spin_lock(&its_priv->its_lock);
> > +                       reg_handler->write(its_priv, offset, *reg);
>
> The GIC specification requires support for 32-bit accesses to all
> GITS_* registers. If the host executes a 32-bit write, *reg contains
> only the 32-bit value. However, you do not pass reg_size to
> cwriter_write. The handler assumes it is a full 64-bit value, which
> will corrupt the other half of the emulated register with zeros. You
> need to implement Read-Modify-Write logic for sub-register accesses.
>
> > +                       hyp_spin_unlock(&its_priv->its_lock);
> > +                       return;
> > +               }
> > +
> > +               if (!write && reg_handler->read) {
> > +                       hyp_spin_lock(&its_priv->its_lock);
> > +                       reg_handler->read(its_priv, offset, reg);
> > +                       hyp_spin_unlock(&its_priv->its_lock);
> > +                       return;
> > +               }
> > +
> > +               return;
> > +       }
> > +
> > +       pkvm_handle_forward_req(region, offset, write, reg, reg_size);
> > +}
> > +
> > +static struct pkvm_protected_reg *get_region(phys_addr_t dev_addr)
> > +{
> > +       int i;
> > +       u64 dev_pfn = dev_addr >> PAGE_SHIFT;
>
> Please prefer RCT.
>
> > +
> > +       for (i = 0; i < PKVM_PROTECTED_REGS_NUM; i++) {
> > +               if (pkvm_protected_regs[i].start_pfn == dev_pfn)
> > +                       return &pkvm_protected_regs[i];
> > +       }
> > +
> > +       return NULL;
> > +}
> > +
> > +static int pkvm_setup_its_shadow_cmdq(struct its_shadow_tables *shadow)
> > +{
> > +       int ret, i, num_pages;
> > +       u64 shadow_start_pfn, original_start_pfn;
> > +       void *cmd_shadow_va = kern_hyp_va(shadow->cmd_shadow);
>
> Please prefer RCT. There are multiple variables declared on a single
> line (int ret, i, num_pages;). Please untangle.
>
> > +
> > +       shadow_start_pfn = hyp_virt_to_pfn(cmd_shadow_va);
> > +       original_start_pfn = hyp_virt_to_pfn(kern_hyp_va(shadow->cmd_original));
> > +       num_pages = shadow->cmdq_len >> PAGE_SHIFT;
> > +
> > +       for (i = 0; i < num_pages; i++) {
> > +               ret = __pkvm_host_share_hyp(shadow_start_pfn + i);
> > +               if (ret)
> > +                       goto unshare_shadow;
> > +       }
> > +
> > +       ret = hyp_pin_shared_mem(cmd_shadow_va, cmd_shadow_va + shadow->cmdq_len);
> > +       if (ret)
> > +               goto unshare_shadow;
> > +
> > +       ret = __pkvm_host_donate_hyp(original_start_pfn, num_pages);
> > +       if (ret)
> > +               goto unpin_shadow;
> > +
> > +       return ret;
> > +
> > +unpin_shadow:
> > +       hyp_unpin_shared_mem(cmd_shadow_va, cmd_shadow_va + shadow->cmdq_len);
> > +
> > +unshare_shadow:
> > +       for (i = i - 1; i >= 0; i--)
>
> Please use the standard while (i--) idiom for cleaner rollback loops.
>
> > +               __pkvm_host_unshare_hyp(shadow_start_pfn + i);
> > +
> > +       return ret;
> > +}
> > +
> > +int pkvm_init_gic_its_emulation(phys_addr_t dev_addr, void *host_priv_state,
> > +                               struct its_shadow_tables *host_shadow)
> > +{
> > +       int ret;
> > +       struct its_priv_state *priv_state = kern_hyp_va(host_priv_state);
> > +       struct its_shadow_tables *shadow = kern_hyp_va(host_shadow);
> > +       struct pkvm_protected_reg *its_reg;
> > +
> > +       hyp_spin_lock(&its_setup_lock);
>
> You're not releasing this lock on any of the error paths. Please add
> err_unlock and goto err_unlock to release the spinlock.
>
> > +       its_reg = get_region(dev_addr);
> > +       if (!its_reg)
> > +               return -ENODEV;
> > +
> > +       if (its_reg->priv)
> > +               return -EOPNOTSUPP;
>
> I think you need to add a check:
>
> +       if (!PAGE_ALIGNED(priv_state) || !PAGE_ALIGNED(shadow))
> +              return -EINVAL;
>
> since __pkvm_host_donate_hyp() expects page-aligned addresses.
>
> > +       ret = __pkvm_host_donate_hyp(hyp_virt_to_pfn(priv_state), 1);
> > +       if (ret)
> > +               return ret;
> > +
> > +       ret = __pkvm_host_donate_hyp(hyp_virt_to_pfn(shadow), 1);
> > +       if (ret)
> > +               goto err_with_state;
> > +
> > +       ret = pkvm_setup_its_shadow_cmdq(shadow);
> > +       if (ret)
> > +               goto err_with_shadow;
> > +
> > +       its_reg->priv = priv_state;
>
> Related to above, please use smp_store_release(&its_reg->priv,
> priv_state); to prevent the compiler and CPU from reordering the
> struct initialization below this pointer assignment, which would
> expose uninitialized memory to the lockless reader in the trap
> handler.
>
> Cheers,
> /fuad
>
>
>
>
> > +
> > +       hyp_spin_lock_init(&priv_state->its_lock);
> > +       priv_state->shadow = shadow;
> > +       priv_state->base = __hyp_va(dev_addr);
> > +
> > +       priv_state->cmd_hyp_base = kern_hyp_va(shadow->cmd_original);
> > +       priv_state->cmd_host_base = kern_hyp_va(shadow->cmd_shadow);
> > +       priv_state->cmd_host_cwriter = priv_state->cmd_host_base;
> > +
> > +       hyp_spin_unlock(&its_setup_lock);
> > +
> > +       return 0;
> > +err_with_shadow:
> > +       __pkvm_hyp_donate_host(hyp_virt_to_pfn(shadow), 1);
> > +err_with_state:
> > +       __pkvm_hyp_donate_host(hyp_virt_to_pfn(priv_state), 1);
> > +       return ret;
> > +}
> > --
> > 2.53.0.473.g4a7958ca14-goog
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 08/14] KVM: arm64: Trap & emulate the ITS MAPD command
  2026-03-10 12:49 ` [PATCH 08/14] KVM: arm64: Trap & emulate the ITS MAPD command Sebastian Ene
@ 2026-03-17 10:20   ` Fuad Tabba
  0 siblings, 0 replies; 36+ messages in thread
From: Fuad Tabba @ 2026-03-17 10:20 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

Hi Sebastian,

On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
>
> Parse the MAPD command and extract the ITT address to
> sanitize it. When the command has the valid bit set,
> share the memory that holds the ITT table
> with the hypervisor to prevent it from being used
> by someone else and track the pages in an array.
> When the valid bit is cleared, check if the pages
> are tracked and then remove the sharing with the
> hypervisor.
> Check if we need to do any shadow table updates
> in case the device table is configured with an
> indirect layout.

Same as the previous commit message, could you please clarify the
"why" rather than only the "how"?

For someone without deep context of the pKVM ITS isolation model, this
message does not explain the threat vector.

>
> Signed-off-by: Sebastian Ene <sebastianene@google.com>
> ---
>  arch/arm64/kvm/hyp/nvhe/its_emulate.c | 182 ++++++++++++++++++++++++++
>  drivers/irqchip/irq-gic-v3-its.c      |  12 --
>  include/linux/irqchip/arm-gic-v3.h    |  12 ++
>  3 files changed, 194 insertions(+), 12 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/its_emulate.c b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> index 865a5d6353ed..722fe80dc2e5 100644
> --- a/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> +++ b/arch/arm64/kvm/hyp/nvhe/its_emulate.c
> @@ -12,8 +12,13 @@ struct its_priv_state {
>         void *cmd_host_cwriter;
>         struct its_shadow_tables *shadow;
>         hyp_spinlock_t its_lock;
> +       u16 empty_idx;
> +       u64 tracked_pfns[];
>  };
>
> +#define MAX_TRACKED_PFNS       ((PAGE_SIZE - offsetof(struct its_priv_state, \
> +                                 tracked_pfns)) / sizeof(u64))
> +
>  struct its_handler {
>         u64 offset;
>         u8 access_size;
> @@ -23,6 +28,178 @@ struct its_handler {
>
>  DEFINE_HYP_SPINLOCK(its_setup_lock);
>
> +static int track_pfn_add(struct its_priv_state *its, u64 pfn)
> +{
> +       int ret, i;
> +
> +       if (its->empty_idx + 1 >= MAX_TRACKED_PFNS)
> +               return -ENOSPC;

Why +1? This wastes the final slot in the array. It should just be: if
(its->empty_idx >= MAX_TRACKED_PFNS).

More importantly, doing an O(N) linear array scan to manage empty_idx
inside track_pfn_add and track_pfn_remove while holding the raw
its_priv->its_lock needlessly inflates host IRQ latency. Please
replace this array with a  BITMAP.

> +
> +       ret = __pkvm_host_share_hyp(pfn);
> +       if (ret)
> +               return ret;
> +
> +       its->tracked_pfns[its->empty_idx] = pfn;
> +       for (i = 0; i < MAX_TRACKED_PFNS; i++) {
> +               if (!its->tracked_pfns[i])
> +                       break;
> +       }
> +
> +       its->empty_idx = i;
> +       return 0;
> +}
> +
> +static int track_pfn_remove(struct its_priv_state *its, u64 pfn)
> +{
> +       int i, ret;
> +
> +       for (i = 0; i < MAX_TRACKED_PFNS; i++) {
> +               if (its->tracked_pfns[i] != pfn)
> +                       continue;
> +
> +               ret = __pkvm_host_unshare_hyp(pfn);
> +               if (ret)
> +                       return ret;
> +
> +               its->tracked_pfns[i] = 0;
> +               its->empty_idx = i;
> +       }
> +
> +       return 0;
> +}

If the PFN isn't found in the array, this silently returns 0 (success).

> +
> +static int get_num_itt_pages(struct its_priv_state *its, u8 num_bits)
> +{
> +       int nr_ites = 1 << (num_bits + 1);
> +       u64 size, gits_typer = readq_relaxed(its->base + GITS_TYPER);
> +
> +       size = nr_ites * (FIELD_GET(GITS_TYPER_ITT_ENTRY_SIZE, gits_typer) + 1);
> +       size = max(size, ITS_ITT_ALIGN) + ITS_ITT_ALIGN - 1;
> +
> +       return PAGE_ALIGN(size) >> PAGE_SHIFT;
> +}
> +
> +static int track_pfn(struct its_priv_state *its, u64 start_pfn, int num_pages, bool remove)
> +{
> +       int i, ret;
> +
> +       for (i = 0; i < num_pages; i++) {
> +               if (remove)
> +                       ret = track_pfn_remove(its, start_pfn + i);
> +               else
> +                       ret = track_pfn_add(its, start_pfn + i);
> +
> +               if (ret)
> +                       goto err_track;
> +       }
> +
> +       return 0;
> +err_track:
> +       for (i = i - 1; i >= 0; i--) {
> +               if (remove)
> +                       track_pfn_add(its, start_pfn + i);
> +               else
> +                       track_pfn_remove(its, start_pfn + i);
> +       }
> +
> +       return ret;
> +}
> +
> +static struct its_baser *get_table(struct its_priv_state *its, u64 type)
> +{
> +       int i;
> +       struct its_shadow_tables *shadow = its->shadow;
> +
> +       for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> +               if (GITS_BASER_TYPE(shadow->tables[i].val) == type)
> +                       return &shadow->tables[i];
> +       }
> +
> +       return NULL;
> +}
> +
> +static int check_table_update(struct its_priv_state *its, u32 id, u64 type)
> +{
> +       u32 lvl1_idx;
> +       u64 esz, *host_table, *hyp_table, new_entry, update;
> +       struct its_baser *table = get_table(its, type);
> +       int ret;
> +       phys_addr_t new_lvl2_table, lvl2_table;
> +
> +       if (!table)
> +               return -EINVAL;
> +
> +       if (!(table->val & GITS_BASER_INDIRECT))
> +               return 0;
> +
> +       esz = GITS_BASER_ENTRY_SIZE(table->val);
> +       lvl1_idx = id / (table->psz / esz);
> +
> +       host_table = kern_hyp_va(table->shadow);
> +       hyp_table = kern_hyp_va(table->base);
> +
> +       new_entry = host_table[id];

This accesses the entry based on id, which isn't sanitized.

> +       update = new_entry ^ hyp_table[id];
> +       if (!update || !(update & GITS_BASER_VALID))
> +               return 0;

This assumes any meaningful update must toggle the Valid bit. If the
host issues a MAPD that changes the Level 2 table pointer but keeps
Valid=1, update & GITS_BASER_VALID is 0.

> +
> +       new_lvl2_table = hyp_phys_to_pfn(new_entry & PHYS_MASK_SHIFT);
> +       lvl2_table = hyp_phys_to_pfn(hyp_table[id] & PHYS_MASK_SHIFT);

Should this be PHYS_MASK?

> +       if (new_entry & GITS_BASER_VALID)
> +               ret = __pkvm_host_donate_hyp(new_lvl2_table, table->psz >> PAGE_SHIFT);
> +       else
> +               ret = __pkvm_hyp_donate_host(lvl2_table, table->psz >> PAGE_SHIFT);

Similar issue to the one I mentioned in a previous patch regarding ITS
page size vs host page size.


> +       if (ret)
> +               return ret;
> +
> +       hyp_table[id] = new_entry;
> +       return 0;
> +}
> +
> +static int process_its_mapd(struct its_priv_state *its, struct its_cmd_block *cmd)
> +{
> +       phys_addr_t itt_addr = cmd->raw_cmd[2] & GENMASK(51, 8);
> +       u8 size = cmd->raw_cmd[1] & GENMASK(4, 0);
> +       bool remove = !(cmd->raw_cmd[2] & BIT(63));
> +       u32 device_id = cmd->raw_cmd[0] >> 32;
> +       int num_pages, ret;
> +       u64 base_pfn;
> +
> +       if (PAGE_ALIGNED(itt_addr))
> +               return -EINVAL;

This is inverted, right?

Cheers,
/fuad

> +
> +       base_pfn = hyp_phys_to_pfn(itt_addr);
> +       num_pages = get_num_itt_pages(its, size);
> +
> +       ret = check_table_update(its, device_id, GITS_BASER_TYPE_DEVICE);
> +       if (ret)
> +               return ret;
> +
> +       return track_pfn(its, base_pfn, num_pages, remove);
> +}
> +
> +static int parse_its_cmdq(struct its_priv_state *its, int offset, ssize_t len)
> +{
> +       struct its_cmd_block *cmd = its->cmd_hyp_base + offset;
> +       u8 req_type;
> +       int ret = 0;
> +
> +       while (len > 0 && !ret) {
> +               req_type = cmd->raw_cmd[0] & GENMASK(7, 0);
> +
> +               switch (req_type) {
> +               case GITS_CMD_MAPD:
> +                       ret = process_its_mapd(its, cmd);
> +                       break;
> +               }
> +
> +               cmd++;
> +               len -= sizeof(struct its_cmd_block);
> +       }
> +
> +       return ret;
> +}
> +
>  static void cwriter_write(struct its_priv_state *its, u64 offset, u64 value)
>  {
>         u64 cwriter_offset = value & GENMASK(19, 5);
> @@ -41,11 +218,15 @@ static void cwriter_write(struct its_priv_state *its, u64 offset, u64 value)
>                 return;
>
>         memcpy(its->cmd_hyp_base + cmd_offset, its->cmd_host_cwriter, cmd_len);
> +       if (parse_its_cmdq(its, cmd_offset, cmd_len))
> +               return;
>
>         its->cmd_host_cwriter = its->cmd_host_base +
>                 (cmd_offset + cmd_len) % cmdq_sz;
>         if (its->cmd_host_cwriter == its->cmd_host_base) {
>                 memcpy(its->cmd_hyp_base, its->cmd_host_base, cwriter_offset);
> +               if (parse_its_cmdq(its, cmd_offset, cmd_len))
> +                       return;
>
>                 its->cmd_host_cwriter = its->cmd_host_base + cwriter_offset;
>         }
> @@ -357,6 +538,7 @@ int pkvm_init_gic_its_emulation(phys_addr_t dev_addr, void *host_priv_state,
>         priv_state->cmd_hyp_base = kern_hyp_va(shadow->cmd_original);
>         priv_state->cmd_host_base = kern_hyp_va(shadow->cmd_shadow);
>         priv_state->cmd_host_cwriter = priv_state->cmd_host_base;
> +       priv_state->empty_idx = 0;
>
>         hyp_spin_unlock(&its_setup_lock);
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index 278dbc56f962..be78f7dccb9f 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -121,8 +121,6 @@ static DEFINE_PER_CPU(struct its_node *, local_4_1_its);
>  #define is_v4_1(its)           (!!((its)->typer & GITS_TYPER_VMAPP))
>  #define device_ids(its)                (FIELD_GET(GITS_TYPER_DEVBITS, (its)->typer) + 1)
>
> -#define ITS_ITT_ALIGN          SZ_256
> -
>  /* The maximum number of VPEID bits supported by VLPI commands */
>  #define ITS_MAX_VPEID_BITS                                             \
>         ({                                                              \
> @@ -515,16 +513,6 @@ struct its_cmd_desc {
>         };
>  };
>
> -/*
> - * The ITS command block, which is what the ITS actually parses.
> - */
> -struct its_cmd_block {
> -       union {
> -               u64     raw_cmd[4];
> -               __le64  raw_cmd_le[4];
> -       };
> -};
> -
>  #define ITS_CMD_QUEUE_SZ               SZ_64K
>  #define ITS_CMD_QUEUE_NR_ENTRIES       (ITS_CMD_QUEUE_SZ / sizeof(struct its_cmd_block))
>
> diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
> index 40457a4375d4..4f7d47f3d970 100644
> --- a/include/linux/irqchip/arm-gic-v3.h
> +++ b/include/linux/irqchip/arm-gic-v3.h
> @@ -612,6 +612,8 @@
>   */
>  #define GIC_IRQ_TYPE_LPI               0xa110c8ed
>
> +#define ITS_ITT_ALIGN                  SZ_256
> +
>  struct rdists {
>         struct {
>                 raw_spinlock_t  rd_lock;
> @@ -634,6 +636,16 @@ struct rdists {
>         bool                    has_vpend_valid_dirty;
>  };
>
> +/*
> + * The ITS command block, which is what the ITS actually parses.
> + */
> +struct its_cmd_block {
> +       union {
> +               u64     raw_cmd[4];
> +               __le64  raw_cmd_le[4];
> +       };
> +};
> +
>  struct irq_domain;
>  struct fwnode_handle;
>  int __init its_lpi_memreserve_init(void);
> --
> 2.53.0.473.g4a7958ca14-goog
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 00/14] KVM: ITS hardening for pKVM
  2026-03-12 17:56 ` [RFC PATCH 00/14] KVM: ITS hardening for pKVM Fuad Tabba
@ 2026-03-20 14:42   ` Sebastian Ene
  0 siblings, 0 replies; 36+ messages in thread
From: Sebastian Ene @ 2026-03-20 14:42 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

On Thu, Mar 12, 2026 at 05:56:22PM +0000, Fuad Tabba wrote:

Hi,

> Hi Sebastian,
> 
> On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
> >
> > This series introduces the necessary machinery to perform trap & emulate
> > on device access in pKVM. Furthermore, it hardens the GIC/ITS controller to
> > prevent an attacker from tampering with the hypervisor protected memory
> > through this device.
> >
> > In pKVM, the host kernel is initially trusted to manage the boot process but
> > its permissions are revoked once KVM initializes. The GIC/ITS device is
> > configured before the kernel deprivileges itself. Once the hypervisor
> > becomes available, sanitize the accesses to the ITS controller by
> > trapping and emulating certain registers and by shadowing some memory
> > structures used by the ITS.
> >
> > This is required because the ITS can issue transactions on the memory
> > bus *directly*, without having an SMMU in front of it, which makes it
> > an interesting target for crossing the hypervisor-established privilege
> > boundary.
> >
> >
> > Patch overview
> > ==============
> >
> > The first patch is re-used from Mostafa's series[1] which brings SMMU-v3
> > support to pKVM.
> >
> > [1] https://lore.kernel.org/linux-iommu/20251117184815.1027271-1-smostafa@google.com/#r
> >
> > Some of the infrastructure built in that series might intersect and we
> > agreed to converge on some changes. The patches [1 - 3] allow unmapping
> > devices from the host address space and installing a handler to trap
> > accesses from the host. While executing in the handler, enough context
> > has to be given from mem-abort to perform the emulation of the device
> > such as: the offset, the access size, direction of the write and private
> > related data specific to the device.
> > The unmapping of the device from the host address space is performed
> > after the host deprivilege (during _kvm_host_prot_finalize call).
> >
> > The 4th patch looks up the ITS node from the device tree and adds it to
> > an array of unmapped devices. It install a handler that forwards all the
> > MMIO request to mediate the host access inside the emulation layer and
> > to prevent breaking ITS functionality.
> >
> > The 5th patch changes the GIC/ITS driver to exposes two new methods
> > which will be called from the KVM layer to setup the shadow state and
> > to take the appropriate locks. This one is the most intrusive as it
> > changes the current GIC/ITS driver. I tried to avoid creating a
> > dependency with KVM to keep the GIC driver agnostic of the virtualization
> > layer but I am happy to explore other options as well.
> > To avoid re-programming the ITS device with new shadow structures after
> > pKVM is ready, I exposed two functions to change the
> > pointers inside the driver for the following structures:
> > - the command queue points to a newly allocated queue
> > - the GITS_BASER<n> tables configured with an indirect layout have the
> >   first layer shadowed and they point to a new memory region
> 
> We used the term shadow for the hyp version of structs in an early
> pKVM patch series, but after a bit of discussion, we refer to it as
> the hypervisor state [1]. So please use this terminology instead of
> shadow.
> 
> [1] https://lore.kernel.org/all/YthwzIS18mutjGhN@google.com/

I think it makes sense to call it shadow in the context you pointed out.
In my context it doesn't make sense because the original structures are
manipulated by the hypervisor while the host only interacts with a copy
(eg. thus the shadow naming).

If you disagreee maybe we can call it a copy, I have no strong feelings
about this.

> 
> > Patch 6 adds the entry point into the emulation setup and sets up the
> > shadow command queue. It adds some helper macros to define the offset
> > register and the associate action that we want to execute in the
> > emulation. It also unmaps the state passed from the host kernel
> > to prevent it from playing nasty games later on. The patch
> > traps accesses to CWRITER register and copies the commands from the
> > host command queue to the shadow command queue.
> >
> > Patch 7 prevents the host from directly accessing the first layer of the
> > indirect tables held in GITS_BASER<n>. It also prevents the host from
> > directly accesssing the last layer of the Device Table (since the entries
> > in this table hold the address of the ITT table) and of the vPE Table
> > (since the vPE table entries hold the address of the virtual LPI pending
> > table.
> >
> > Patches [8-10] sanitize the commands sent to the ITS and their
> > arguments.
> >
> > Patches [11-13] restrict the access of the host to certain registers
> > and prevent undefined behaviour. Prevent the host from re-programming
> > the tables held in the GITS_BASER register.
> >
> > The last patch introduces an hvc to setup the ITS emulation and calls
> > into the ITS driver to setup the shadow state.
> >
> >
> > Design
> > ======
> >
> >
> > 1. Command queue shadowing
> >
> > The ITS hardware supports a command queue which is programmed by the driver
> > in the GITS_CBASER register. To inform the hardware that a new command
> > has been added, the driver updates an index into the GITS_CWRITER
> 
> It updates a base address offset, but that's probably what you meant.
> 
> > register. The driver then reads the GITS_CREADR register to see if the
> > command was processed or if the queue is stalled.
> >
> > To create a new command, the emulation layer mirrors the behavior
> > as following:
> >  (i) The host ITS driver creates a command in the shadow queue:
> >         its_allocate_entry() -> builder()
> >  (ii) Notifies the hardware that a new command is available:
> >         its_post_commands()
> >  (iii) Hypervisor traps the write to GITS_CWRITER:
> >         handle_host_mem_abort() -> handle_host_mmio_trap() ->
> >             pkvm_handle_gic_emulation()
> >  (iv) Hypervisor copies the command from the host command queue
> >       to the original queue which is not accessible to the host.
> >       It parses the command and updates the hardware write.
> >
> > The driver allocates space for the original command queue and programs
> > the hardware (GITS_CWRITER). When pKVM becomes available, the driver
> 
> You mean GITS_CBASER, right?
> 
> > allocates a new (shadow) queue and replaces its original pointer to
> > the queue with this new one. This is to prevent a malicious host from
> > tampering with the commands sent to the ITS hardware.
> >
> > The entry point of our emulation shares the memory of the newly
> > allocated queue with the hypervisor and donates the memory of the
> > original queue to make it inaccesible to the host.
> >
> >
> > 2. Indirect tables first level shadowing
> >
> > The ITS hardware supports indirection to minimize the space required to
> > accommodate large tables (eg. deviceId space used to index the Device Table
> > is quite sparse). This is a 2-level indirection, with entries from the
> > first table pointing to a second table.
> >
> > An attacker in control of the host can insert an address that points to
> > the hypervisor protected memory in the first level table and then use
> > subsequent ITS commands to write to this memory (MAPD).
> >
> > To shadow this tables, we rely on the driver to allocate space for it
> > and we copy the original content from the table into the copy. When
> > pKVM becomes available we switch the pointers that hold the orginal
> > tables to point to the copy.
> > To keep the tables from the hypervisor in sync with what the host
> > has, we update the tables when commands are sent to the ITS.
> >
> >
> > 3. Hiding the last layer of the Device Table and vPE Table from the host
> >
> > An attacker in control of the host kernel can alter the content of these
> > tables directly (the Arm IHI 0069H.b spec says that is undefined behavior
> > if entries are created by software). Normally these entries are created in
> > response of commands sent to the ITS.
> 
> nit: unpredictable behavior. Undefined usually refers to instructions.
>

Ack, will update.

> >
> > A Device Table entry that has the following structure:
> >
> > type DeviceTableEntry is (
> >         boolean Valid,
> >         Address ITT_base,
> >         bits(5) ITT_size
> > )
> 
> Be careful, this might be true for a specific GIC implementation,
> e.g., Arm CoreLink GIC-600, but according to the spec (5.2) the
> formats of the tables in system memory are IMPLEMENTATION DEFINED. If
> the format is relevant to us, then we verify specific GIC
> implementation via GITS_IIDR. If the series depends on this, then we
> must decide what to do in case the specific implementation does not
> match what we expect.
> 
> > This can be maliciously created by an attacker and the ITT_base can be
> > pointed to hypervisor protected memory. The MAPTI command can then be
> > used to write over the ITT_base with an ITE entry.
> 
> You mean it writes to the memory addressed by ITT_base, rather than
> writes over the ITT_base itself.
> 
> > Similarly a vCPU Table entry has the following structure:
> >
> > type VCPUTableEntry is (
> >         boolean Valid,
> >         bits(32) RDbase,
> >         Address VPT_base,
> >         bits(5) VPT_size
> > )
> >
> > VPT_base can be pointed to hypervisor protected memory and then a
> > command can be used to raise interrupts and set the corresponding
> > bit. This would give a 1-bit write primitive so is not "as generous"
> > as the others.
> >
> >
> > Notes
> > =====
> >
> >
> > Performance impact is expected with this as the emulation dance is not
> > cost free.
> > I haven't implemented any ITS quirks in the emulation and I don't know
> > whether we will need it ? (some hardware needs explicit dcache flushing
> > ITS_FLAGS_CMDQ_NEEDS_FLUSHING).
> 
> It's not a quirk. We should handle this in the next respin, because
> cache maintenance of the command queue is an explicit architectural
> requirement depending on how the hardware is integrated and
> configured.
> 
> According to the spec, the cacheability attributes of the ITS command
> queue are strictly governed by the InnerCache and OuterCache fields of
> the GITS_CBASER register. These fields can be configured for various
> memory types, including Device-nGnRnE or Normal Non-cacheable.
> 
> Because pKVM now takes responsibility for physically writing the
> command packets into the true hardware queue, the hypervisor must obey
> the cacheability attributes programmed into the physical GITS_CBASER.
> 
> If the software provisions GITS_CBASER as Non-cacheable, the
> hypervisor must perform explicit data cache maintenance (such as DC
> CVAU or DC CVAC) after copying the commands to the shadow queue. If
> you don't implement this, the physical ITS hardware (acting as a
> non-coherent bus master) will read stale memory, which will inevitably
> lead to queue stalls or the ITS executing garbage commands.
> 
> Since we are shielding the physical queue from the host, we inherit
> the host's responsibility to manage its cache coherency based on the
> GITS_CBASER configuration.
> 

Right, this will complicate the series a bit. I used the term 'quirk'
because this is how the driver refers to it.

> > Please note that Redistributors trapping hasn't been addressed at all in
> > this series and the solution is not sufficient but this can be extended
> > afterwards.
> > The current series has been tested with Qemu (-machine
> > virt,virtualization=true,gic-version=4) and with Pixel 10.
> 
> It would be helpful to mention that this is based on Linux 7.0-rc3
> (applied cleanly, and confirmed with you offline).
> 
> Also, it would be helpful if you could share how to tested this
> series, and how we could reproduce your tests.
>

I created a simple driver that registers for MSIs and probed it after
boot complete and observed all the commands sent to the ITS being
handled by the hypervisor.

[   60.196235] lpi_test: loading out-of-tree module taints kernel.
[   60.210751] lpi-test-driver test_node: >> probe lpi-test
[   60.212649] [ITS][CMD] >> 0x8
[   60.214780] lpi-test-driver test_node: lpi_test_probe linux lpi irq:
39
[   60.215810] lpi-test-driver test_node: lpi_test_probe linux lpi irq:
40
[   60.216870] [ITS][CMD] >> 0xa
[   60.217297] [ITS][CMD] >> 0x5
[   60.217889] lpi-test-driver test_node: >> msi address_high 0x0,
address_lo 0x8090040, address 0x8090040, data 0x0
[   60.219168] [ITS][CMD] >> 0xc
[   60.219576] [ITS][CMD] >> 0x5
[   60.220874] [ITS][CMD] >> 0xa
[   60.221292] [ITS][CMD] >> 0x5
[   60.221771] lpi-test-driver test_node: >> msi address_high 0x0,
address_lo 0x8090040, address 0x8090040, data 0x1
[   60.223091] [ITS][CMD] >> 0xc
[   60.223405] [ITS][CMD] >> 0x5
[   60.224482] lpi-test-driver test_node: >> probe complete

> Thanks,
> /fuad
> 
> 

Thanks,
Sebastian

> >
> >
> > Thanks,
> > Sebastian E.
> >
> > Mostafa Saleh (1):
> >   KVM: arm64: Donate MMIO to the hypervisor
> >
> > Sebastian Ene (13):
> >   KVM: arm64: Track host-unmapped MMIO regions in a static array
> >   KVM: arm64: Support host MMIO trap handlers for unmapped devices
> >   KVM: arm64: Mediate host access to GIC/ITS MMIO via unmapping
> >   irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege
> >   KVM: arm64: Add infrastructure for ITS emulation setup
> >   KVM: arm64: Restrict host access to the ITS tables
> >   KVM: arm64: Trap & emulate the ITS MAPD command
> >   KVM: arm64: Trap & emulate the ITS VMAPP command
> >   KVM: arm64: Trap & emulate the ITS MAPC command
> >   KVM: arm64: Restrict host updates to GITS_CTLR
> >   KVM: arm64: Restrict host updates to GITS_CBASER
> >   KVM: arm64 Restrict host updates to GITS_BASER
> >   KVM: arm64: Implement HVC interface for ITS emulation setup
> >
> >  arch/arm64/include/asm/kvm_arm.h              |   3 +
> >  arch/arm64/include/asm/kvm_asm.h              |   1 +
> >  arch/arm64/include/asm/kvm_pkvm.h             |  20 +
> >  arch/arm64/kvm/hyp/include/nvhe/its_emulate.h |  17 +
> >  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |   2 +
> >  arch/arm64/kvm/hyp/nvhe/Makefile              |   3 +-
> >  arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  14 +
> >  arch/arm64/kvm/hyp/nvhe/its_emulate.c         | 653 ++++++++++++++++++
> >  arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 134 ++++
> >  arch/arm64/kvm/hyp/nvhe/setup.c               |  28 +
> >  arch/arm64/kvm/hyp/pgtable.c                  |   9 +-
> >  arch/arm64/kvm/pkvm.c                         |  60 ++
> >  drivers/irqchip/irq-gic-v3-its.c              | 177 ++++-
> >  include/linux/irqchip/arm-gic-v3.h            |  36 +
> >  14 files changed, 1126 insertions(+), 31 deletions(-)
> >  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
> >  create mode 100644 arch/arm64/kvm/hyp/nvhe/its_emulate.c
> >
> > --
> > 2.53.0.473.g4a7958ca14-goog
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 05/14] irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege
  2026-03-13 11:26   ` Fuad Tabba
  2026-03-13 13:10     ` Fuad Tabba
@ 2026-03-20 15:11     ` Sebastian Ene
  2026-03-24 14:36       ` Fuad Tabba
  1 sibling, 1 reply; 36+ messages in thread
From: Sebastian Ene @ 2026-03-20 15:11 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

On Fri, Mar 13, 2026 at 11:26:04AM +0000, Fuad Tabba wrote:

Hi Fuad,

> Hi Sebastian,
> 
> On Tue, 10 Mar 2026 at 12:49, Sebastian Ene <sebastianene@google.com> wrote:
> >
> > Expose two helper functions to support emulated ITS in the hypervisor.
> > These allow the KVM layer to notify the driver when hypervisor
> > initialization is complete.
> > The caller is expected to use the functions as follows:
> > 1. its_start_deprivilege(): Acquire the ITS locks.
> > 2. on_each_cpu(_kvm_host_prot_finalize, ...): Finalizes pKVM init
> > 3. its_end_deprivilege(): Shadow the ITS structures, invoke the KVM
> >    callback, and release locks.
> > Specifically, this shadows the ITS command queue and the 1st level
> > indirect tables. These shadow buffers will be used by the driver after
> > host deprivilege, while the hypervisor unmaps and takes ownership of the
> > original structures.
> 
> Just a note again on preferring not to use the "shadow" terminology. I
> thought about it a bit more, since these are not at the host, perhaps
> "proxy" is a better term, to convey that the host is writing to a
> middle-man buffer.
> 
> Another term is "staging," which is common in DMA: the host "stages"
> the commands here, and EL2 "commits" them to the hardware.

Sure, happy to use one of the two indicated ones.

> 
> >
> > Signed-off-by: Sebastian Ene <sebastianene@google.com>
> > ---
> >  drivers/irqchip/irq-gic-v3-its.c   | 165 +++++++++++++++++++++++++++--
> >  include/linux/irqchip/arm-gic-v3.h |  24 +++++
> >  2 files changed, 178 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> > index 291d7668cc8d..278dbc56f962 100644
> > --- a/drivers/irqchip/irq-gic-v3-its.c
> > +++ b/drivers/irqchip/irq-gic-v3-its.c
> > @@ -78,17 +78,6 @@ struct its_collection {
> >         u16                     col_id;
> >  };
> >
> > -/*
> > - * The ITS_BASER structure - contains memory information, cached
> > - * value of BASER register configuration and ITS page size.
> > - */
> > -struct its_baser {
> > -       void            *base;
> > -       u64             val;
> > -       u32             order;
> > -       u32             psz;
> > -};
> > -
> >  struct its_device;
> >
> >  /*
> > @@ -5232,6 +5221,160 @@ static int __init its_compute_its_list_map(struct its_node *its)
> >         return its_number;
> >  }
> >
> > +static void its_free_shadow_tables(struct its_shadow_tables *shadow)
> > +{
> > +       int i;
> > +
> > +       if (shadow->cmd_shadow)
> > +               its_free_pages(shadow->cmd_shadow, get_order(ITS_CMD_QUEUE_SZ));
> > +
> > +       for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> > +               if (!shadow->tables[i].shadow)
> > +                       continue;
> > +
> > +               its_free_pages(shadow->tables[i].shadow, 0);
> > +       }
> > +
> > +       its_free_pages(shadow, 0);
> > +}
> > +
> > +static struct its_shadow_tables *its_get_shadow_tables(struct its_node *its)
> > +{
> > +       void *page;
> > +       struct its_shadow_tables *shadow;
> > +       int i;
> 
> Prefer RCT declarations.
> 
> > +
> > +       page = its_alloc_pages_node(its->numa_node, GFP_KERNEL | __GFP_ZERO, 0);
> 
> This is called with the raw_spin_lock_irqsave held, and GFP_KERNEL can
> sleep. You have one of two options, either use GFP_ATOMIC, but that's
> more likely to fail. The alternative is to move this to
> its_start_deprivilege(), before any lock is held.
> 

Thanks, I will try to move the allocation before the lock.

> > +       if (!page)
> > +               return NULL;
> > +
> > +       shadow = (void *)page_address(page);
> > +       page = its_alloc_pages_node(its->numa_node,
> > +                                   GFP_KERNEL | __GFP_ZERO,
> > +                                   get_order(ITS_CMD_QUEUE_SZ));
> > +       if (!page)
> > +               goto err_alloc_shadow;
> > +
> > +       shadow->cmd_shadow = page_address(page);
> > +       shadow->cmdq_len = ITS_CMD_QUEUE_SZ;
> > +       shadow->cmd_original = its->cmd_base;
> > +
> > +       memcpy(shadow->tables, its->tables, sizeof(struct its_baser) * GITS_BASER_NR_REGS);
> > +
> > +       for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> > +               if (!(shadow->tables[i].val & GITS_BASER_VALID))
> > +                       continue;
> > +
> > +               if (!(shadow->tables[i].val & GITS_BASER_INDIRECT))
> > +                       continue;
> > +
> > +               page = its_alloc_pages_node(its->numa_node,
> > +                                           GFP_KERNEL | __GFP_ZERO,
> > +                                           shadow->tables[i].order);
> > +               if (!page)
> > +                       goto err_alloc_shadow;
> > +
> > +               shadow->tables[i].shadow = page_address(page);
> > +
> > +               memcpy(shadow->tables[i].shadow, shadow->tables[i].base,
> > +                      PAGE_ORDER_TO_SIZE(shadow->tables[i].order));
> > +       }
> > +
> > +       return shadow;
> > +
> > +err_alloc_shadow:
> > +       its_free_shadow_tables(shadow);
> > +       return NULL;
> > +}
> > +
> > +void *its_start_depriviledge(void)
> 
> Typo here and elsewhere in this patch:
> 
> s/depriviledge/deprivilege/g
> 
> This is particularly important because it also appears in exported
> symbols as well (later in this patch).
>

Ack, will fix this.

> > +{
> > +       struct its_node *its;
> > +       int num_nodes = 0, i = 0;
> > +       unsigned long *flags;
> 
> RCT declaration order, and please untagle them, i.e., don't declare
> the num_nodes and the iterator in the same line.
>

Ack,

> > +
> > +       raw_spin_lock(&its_lock);
> > +       list_for_each_entry(its, &its_nodes, entry) {
> > +               num_nodes++;
> > +       }
> > +
> > +       flags = kzalloc(num_nodes * sizeof(unsigned long), GFP_KERNEL_ACCOUNT);
> 
> Same as the other allocation. This can sleep. I think that for this as
> well, it's better to move it before lock acquisition. Even if you use
> a different allocator, it's still better to keep the critical section
> short.
> 
> > +       if (!flags) {
> > +               raw_spin_unlock(&its_lock);
> > +               return NULL;
> > +       }
> > +
> > +       list_for_each_entry(its, &its_nodes, entry) {
> > +               raw_spin_lock_irqsave(&its->lock, flags[i++]);
> > +       }
> > +
> > +       return flags;
> > +}
> > +EXPORT_SYMBOL_GPL(its_start_depriviledge);
> > +
> > +static int its_switch_to_shadow_locked(struct its_node *its, its_init_emulate init_emulate_cb)
> > +{
> > +       struct its_shadow_tables *hyp_shadow, shadow;
> > +       int i, ret;
> > +       u64 baser, baser_phys;
> > +
> > +       hyp_shadow = its_get_shadow_tables(its);
> > +       if (!hyp_shadow)
> > +               return -ENOMEM;
> > +
> > +       memcpy(&shadow, hyp_shadow, sizeof(shadow));
> > +       ret = init_emulate_cb(its->phys_base, hyp_shadow);
> 
> You are performing this callback with the lock held and local
> interrupts disabled. The hvc call is byitself expensive, especially
> since it's going to do stage-2 manipulations.
> 
> You should decouple the synchronous pointer swapping (which must be
> locked) from the hypervisor notification (which can be done outside
> the lock). Instead of executing the callback inside the critical
> section, its_end_deprivilege should:
> - Lock everything.
> - Perform the pointer swaps in the host driver structures.
> - Save the hyp_shadow pointers to a temporary array.
> - Unlock everything.

I am afraid you can't do that because you can have dropped commands &
timeouts between these two steps. The driver might put commands in the
swapped queue and they will timeout.

> - Loop through the temporary array and call the KVM cb to notify EL2.
> 
> You should probably split this patch into two. The first patch would
> implement the freeze/unfreeze locking mechanism, and the second would
> swap the driver's internal memory pointers to the shadow structures,
> and invoke the KVM callback to lock down the real hardware.
> 
> Cheers,
> /fuad
> 

Thanks,
Sebastian

> > +       if (ret) {
> > +               its_free_shadow_tables(hyp_shadow);
> > +               return ret;
> > +       }
> > +
> > +       /* Switch the driver command queue to use the shadow and save the original */
> > +       its->cmd_write = (its->cmd_write - its->cmd_base) +
> > +               (struct its_cmd_block *)shadow.cmd_shadow;
> > +       its->cmd_base = shadow.cmd_shadow;
> > +
> > +       /* Shadow the first level of the indirect tables */
> > +       for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> > +               baser = shadow.tables[i].val;
> > +
> > +               if (!shadow.tables[i].shadow)
> > +                       continue;
> > +
> > +               baser_phys = virt_to_phys(shadow.tables[i].shadow);
> > +               if (IS_ENABLED(CONFIG_ARM64_64K_PAGES) && (baser_phys >> 48))
> > +                       baser_phys = GITS_BASER_PHYS_52_to_48(baser_phys);
> > +
> > +               its->tables[i].val &= ~GENMASK(47, 12);
> > +               its->tables[i].val |= baser_phys;
> > +               its->tables[i].base = shadow.tables[i].shadow;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +int its_end_depriviledge(int ret_pkvm_finalize, unsigned long *flags, its_init_emulate cb)
> > +{
> > +       struct its_node *its;
> > +       int i = 0, ret = 0;
> > +
> > +       if (!flags || !cb)
> > +               return -EINVAL;
> > +
> > +       list_for_each_entry(its, &its_nodes, entry) {
> > +               if (!ret_pkvm_finalize && !ret)
> > +                       ret = its_switch_to_shadow_locked(its, cb);
> > +
> > +               raw_spin_unlock_irqrestore(&its->lock, flags[i++]);
> > +       }
> > +
> > +       kfree(flags);
> > +       raw_spin_unlock(&its_lock);
> > +
> > +       return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(its_end_depriviledge);
> > +
> >  static int __init its_probe_one(struct its_node *its)
> >  {
> >         u64 baser, tmp;
> > diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
> > index 0225121f3013..40457a4375d4 100644
> > --- a/include/linux/irqchip/arm-gic-v3.h
> > +++ b/include/linux/irqchip/arm-gic-v3.h
> > @@ -657,6 +657,30 @@ static inline bool gic_enable_sre(void)
> >         return !!(val & ICC_SRE_EL1_SRE);
> >  }
> >
> > +/*
> > + * The ITS_BASER structure - contains memory information, cached
> > + * value of BASER register configuration and ITS page size.
> > + */
> > +struct its_baser {
> > +       void            *base;
> > +       void            *shadow;
> > +       u64             val;
> > +       u32             order;
> > +       u32             psz;
> > +};
> > +
> > +struct its_shadow_tables {
> > +       struct its_baser        tables[GITS_BASER_NR_REGS];
> > +       void                    *cmd_shadow;
> > +       void                    *cmd_original;
> > +       size_t                  cmdq_len;
> > +};
> > +
> > +typedef int (*its_init_emulate)(phys_addr_t its_phys_base, struct its_shadow_tables *shadow);
> > +
> > +void *its_start_depriviledge(void);
> > +int its_end_depriviledge(int ret, unsigned long *flags, its_init_emulate cb);
> > +
> >  #endif
> >
> >  #endif
> > --
> > 2.53.0.473.g4a7958ca14-goog
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 01/14] KVM: arm64: Donate MMIO to the hypervisor
  2026-03-10 12:49 ` [PATCH 01/14] KVM: arm64: Donate MMIO to the hypervisor Sebastian Ene
  2026-03-12 17:57   ` Fuad Tabba
  2026-03-13 10:40   ` Suzuki K Poulose
@ 2026-03-24 10:39   ` Vincent Donnefort
  2 siblings, 0 replies; 36+ messages in thread
From: Vincent Donnefort @ 2026-03-24 10:39 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tabba, tglx, bgrzesik, will, yuzenghui

On Tue, Mar 10, 2026 at 12:49:20PM +0000, Sebastian Ene wrote:
> From: Mostafa Saleh <smostafa@google.com>
> 
> Add a function to donate MMIO to the hypervisor so IOMMU hypervisor
> drivers can use that to protect the MMIO of IOMMU.
> The initial attempt to implement this was to have a new flag to
> "___pkvm_host_donate_hyp" to accept MMIO. However that had many problems,
> it was quite intrusive for host/hyp to check/set page state to make it
> aware of MMIO and to encode the state in the page table in that case.
> Which is called in paths that can be sensitive to performance (FFA, VMs..)
> 
> As donating MMIO is very rare, and we don’t need to encode the full
> state, it’s reasonable to have a separate function to do this.
> It will init the host s2 page table with an invalid leaf with the owner ID
> to prevent the host from mapping the page on faults.

I am not sure I agree here:

* Differentiating between MMIO and Memory is just a fast binary search into the
  memory regions.

* host_donate_hyp isn't a fast path even for memory regions anyway.

* Having common functions for changing ownership is more and more helpful (see the
  SME dvmsync workaround).

* There's nothing preventing from having a range here which is safely handled
  already by host_donate_hyp()

> 
> Also, prevent kvm_pgtable_stage2_unmap() from removing owner ID from
> stage-2 PTEs, as this can be triggered from recycle logic under memory
> pressure. There is no code relying on this, as all ownership changes is
> done via kvm_pgtable_stage2_set_owner()
> 
> For error path in IOMMU drivers, add a function to donate MMIO back
> from hyp to host.
> 
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
>  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  2 +
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 90 +++++++++++++++++++
>  arch/arm64/kvm/hyp/pgtable.c                  |  9 +-
>  3 files changed, 94 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> index 5f9d56754e39..8b617e6fc0e0 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> @@ -31,6 +31,8 @@ enum pkvm_component_id {
>  };
>  
>  extern unsigned long hyp_nr_cpus;
> +int __pkvm_host_donate_hyp_mmio(u64 pfn);
> +int __pkvm_hyp_donate_host_mmio(u64 pfn);
>  
>  int __pkvm_prot_finalize(void);
>  int __pkvm_host_share_hyp(u64 pfn);
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 38f66a56a766..0808367c52e5 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -784,6 +784,96 @@ int __pkvm_host_unshare_hyp(u64 pfn)
>  	return ret;
>  }
>  
> +int __pkvm_host_donate_hyp_mmio(u64 pfn)
> +{
> +	u64 phys = hyp_pfn_to_phys(pfn);
> +	void *virt = __hyp_va(phys);
> +	int ret;
> +	kvm_pte_t pte;
> +
> +	if (addr_is_memory(phys))
> +		return -EINVAL;
> +
> +	host_lock_component();
> +	hyp_lock_component();
> +
> +	ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
> +	if (ret)
> +		goto unlock;
> +
> +	if (pte && !kvm_pte_valid(pte)) {
> +		ret = -EPERM;
> +		goto unlock;
> +	}
> +
> +	ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL);
> +	if (ret)
> +		goto unlock;
> +	if (pte) {
> +		ret = -EBUSY;
> +		goto unlock;
> +	}
> +
gg> +	ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, PAGE_HYP_DEVICE);
> gg+	if (ret)
> +		goto unlock;
> +	/*
> +	 * We set HYP as the owner of the MMIO pages in the host stage-2, for:
> +	 * - host aborts: host_stage2_adjust_range() would fail for invalid non zero PTEs.
> +	 * - recycle under memory pressure: host_stage2_unmap_dev_all() would call
> +	 *   kvm_pgtable_stage2_unmap() which will not clear non zero invalid ptes (counted).
> +	 * - other MMIO donation: Would fail as we check that the PTE is valid or empty.
> +	 */
> +	WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> +				PAGE_SIZE, &host_s2_pool, PKVM_ID_HYP));
> +unlock:
> +	hyp_unlock_component();
> +	host_unlock_component();
> +
> +	return ret;
> +}
> +
> +int __pkvm_hyp_donate_host_mmio(u64 pfn)
> +{
> +	u64 phys = hyp_pfn_to_phys(pfn);
> +	u64 virt = (u64)__hyp_va(phys);
> +	size_t size = PAGE_SIZE;
> +	int ret;
> +	kvm_pte_t pte;
> +
> +	if (addr_is_memory(phys))
> +		return -EINVAL;
> +
> +	host_lock_component();
> +	hyp_lock_component();
> +
> +	ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL);
> +	if (ret)
> +		goto unlock;
> +	if (!kvm_pte_valid(pte)) {
> +		ret = -ENOENT;
> +		goto unlock;
> +	}
> +
> +	ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
> +	if (ret)
> +		goto unlock;
> +
> +	if (FIELD_GET(KVM_INVALID_PTE_OWNER_MASK, pte) != PKVM_ID_HYP) {
> +		ret = -EPERM;
> +		goto unlock;
> +	}
> +
> +	WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, virt, size) != size);
> +	WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> +				PAGE_SIZE, &host_s2_pool, PKVM_ID_HOST));
> +unlock:
> +	hyp_unlock_component();
> +	host_unlock_component();
> +
> +	return ret;
> +}
> +
>  int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
>  {
>  	u64 phys = hyp_pfn_to_phys(pfn);
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 9b480f947da2..d954058e63ff 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1152,13 +1152,8 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
>  	kvm_pte_t *childp = NULL;
>  	bool need_flush = false;
>  
> -	if (!kvm_pte_valid(ctx->old)) {
> -		if (stage2_pte_is_counted(ctx->old)) {
> -			kvm_clear_pte(ctx->ptep);
> -			mm_ops->put_page(ctx->ptep);
> -		}
> -		return 0;
> -	}
> +	if (!kvm_pte_valid(ctx->old))
> +		return stage2_pte_is_counted(ctx->old) ? -EPERM : 0;
>  
>  	if (kvm_pte_table(ctx->old, ctx->level)) {
>  		childp = kvm_pte_follow(ctx->old, mm_ops);
> -- 
> 2.53.0.473.g4a7958ca14-goog
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/14] KVM: arm64: Track host-unmapped MMIO regions in a static array
  2026-03-10 12:49 ` [PATCH 02/14] KVM: arm64: Track host-unmapped MMIO regions in a static array Sebastian Ene
  2026-03-12 19:05   ` Fuad Tabba
@ 2026-03-24 10:46   ` Vincent Donnefort
  1 sibling, 0 replies; 36+ messages in thread
From: Vincent Donnefort @ 2026-03-24 10:46 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tabba, tglx, bgrzesik, will, yuzenghui

On Tue, Mar 10, 2026 at 12:49:21PM +0000, Sebastian Ene wrote:
> Introduce a registry to track protected MMIO regions that are unmapped
> from the host stage-2 page tables. These regions are stored in a
> fixed-size array and their ownership is donated to the hypervisor during
> initialization to ensure host-exclusion and persistent tracking.
> 
> Signed-off-by: Sebastian Ene <sebastianene@google.com>
> ---
>  arch/arm64/include/asm/kvm_pkvm.h     | 10 ++++++++++
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |  3 +++
>  arch/arm64/kvm/hyp/nvhe/setup.c       | 25 +++++++++++++++++++++++++
>  3 files changed, 38 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
> index 757076ad4ec9..48ec7d519399 100644
> --- a/arch/arm64/include/asm/kvm_pkvm.h
> +++ b/arch/arm64/include/asm/kvm_pkvm.h
> @@ -17,6 +17,16 @@
>  
>  #define HYP_MEMBLOCK_REGIONS 128
>  
> +#define PKVM_PROTECTED_REGS_NUM	8
> +
> +struct pkvm_protected_reg {
> +	u64 start_pfn; 
> +	size_t num_pages; 

nit: "u64 pfn, u64 nr_pages" to align with everywhere else.

> +};
> +
> +extern struct pkvm_protected_reg kvm_nvhe_sym(pkvm_protected_regs)[];
> +extern unsigned int kvm_nvhe_sym(num_protected_reg);
> +
>  int pkvm_init_host_vm(struct kvm *kvm);
>  int pkvm_create_hyp_vm(struct kvm *kvm);
>  bool pkvm_hyp_vm_is_created(struct kvm *kvm);
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 0808367c52e5..7c125836b533 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -23,6 +23,9 @@
>  
>  struct host_mmu host_mmu;
>  
> +struct pkvm_protected_reg pkvm_protected_regs[PKVM_PROTECTED_REGS_NUM];
> +unsigned int num_protected_reg;
> +
>  static struct hyp_pool host_s2_pool;
>  
>  static DEFINE_PER_CPU(struct pkvm_hyp_vm *, __current_vm);
> diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
> index 90bd014e952f..ad5b96085e1b 100644
> --- a/arch/arm64/kvm/hyp/nvhe/setup.c
> +++ b/arch/arm64/kvm/hyp/nvhe/setup.c
> @@ -284,6 +284,27 @@ static int fix_hyp_pgtable_refcnt(void)
>  				&walker);
>  }
>  
> +static int unmap_protected_regions(void)
> +{
> +	struct pkvm_protected_reg *reg;
> +	int i, ret, j = 0;
> +
> +	for (i = 0; i < num_protected_reg; i++) {
> +		reg = &pkvm_protected_regs[i];
> +		for (j = 0; j < reg->num_pages; j++) {
> +			ret = __pkvm_host_donate_hyp_mmio(reg->start_pfn + j);

If this is to make this static at boot, we don't even need __pkvm_host_donate_hyp_mmio()

We can just map the region early enough in the hypervisor pkvm_create_mappings()
in recreate_hyp_mappings() and then let fix_host_ownership() do the host
stage2 unmapping.

> +			if (ret)
> +				goto err_setup;
> +		}
> +	}
> +
> +	return 0;
> +err_setup:
> +	for (j = j - 1; j >= 0; j--)
> +		__pkvm_hyp_donate_host_mmio(reg->start_pfn + j);
> +	return ret;
> +}
> +
>  void __noreturn __pkvm_init_finalise(void)
>  {
>  	struct kvm_cpu_context *host_ctxt = host_data_ptr(host_ctxt);
> @@ -324,6 +345,10 @@ void __noreturn __pkvm_init_finalise(void)
>  	if (ret)
>  		goto out;
>  
> +	ret = unmap_protected_regions();
> +	if (ret)
> +		goto out;
> +
>  	ret = hyp_ffa_init(ffa_proxy_pages);
>  	if (ret)
>  		goto out;
> -- 
> 2.53.0.473.g4a7958ca14-goog
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/14] KVM: arm64: Support host MMIO trap handlers for unmapped devices
  2026-03-10 12:49 ` [PATCH 03/14] KVM: arm64: Support host MMIO trap handlers for unmapped devices Sebastian Ene
  2026-03-13  9:31   ` Fuad Tabba
@ 2026-03-24 10:59   ` Vincent Donnefort
  1 sibling, 0 replies; 36+ messages in thread
From: Vincent Donnefort @ 2026-03-24 10:59 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tabba, tglx, bgrzesik, will, yuzenghui

On Tue, Mar 10, 2026 at 12:49:22PM +0000, Sebastian Ene wrote:
> Introduce a mechanism to register callbacks for MMIO accesses to regions
> unmapped from the host Stage-2 page tables.
> 
> This infrastructure allows the hypervisor to intercept host accesses to
> protected or emulated devices. When a Stage-2 fault occurs on a
> registered device region, the hypervisor will invoke the associated
> callback to emulate the access.
> 
> Signed-off-by: Sebastian Ene <sebastianene@google.com>
> ---
>  arch/arm64/include/asm/kvm_arm.h      |  3 ++
>  arch/arm64/include/asm/kvm_pkvm.h     |  6 ++++
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c | 41 +++++++++++++++++++++++++++
>  arch/arm64/kvm/hyp/nvhe/setup.c       |  3 ++
>  4 files changed, 53 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 3f9233b5a130..8fe1e80ab3f4 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -304,6 +304,9 @@
>  
>  /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
>  #define HPFAR_MASK	(~UL(0xf))
> +
> +#define FAR_MASK	GENMASK_ULL(11, 0)
> +
>  /*
>   * We have
>   *	PAR	[PA_Shift - 1	: 12] = PA	[PA_Shift - 1 : 12]
> diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
> index 48ec7d519399..5321ced2f50a 100644
> --- a/arch/arm64/include/asm/kvm_pkvm.h
> +++ b/arch/arm64/include/asm/kvm_pkvm.h
> @@ -19,9 +19,15 @@
>  
>  #define PKVM_PROTECTED_REGS_NUM	8
>  
> +struct pkvm_protected_reg;
> +
> +typedef void (pkvm_emulate_handler)(struct pkvm_protected_reg *region, u64 offset, bool write,
> +				    u64 *reg, u8 reg_size);
> +
>  struct pkvm_protected_reg {
>  	u64 start_pfn;
>  	size_t num_pages;
> +	pkvm_emulate_handler *cb;
>  };
>  
>  extern struct pkvm_protected_reg kvm_nvhe_sym(pkvm_protected_regs)[];
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 7c125836b533..f405d2fbd88f 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -13,6 +13,7 @@
>  #include <asm/stage2_pgtable.h>
>  
>  #include <hyp/fault.h>
> +#include <hyp/adjust_pc.h>
>  
>  #include <nvhe/gfp.h>
>  #include <nvhe/memory.h>
> @@ -608,6 +609,41 @@ static int host_stage2_idmap(u64 addr)
>  	return ret;
>  }
>  
> +static bool handle_host_mmio_trap(struct kvm_cpu_context *host_ctxt, u64 esr, u64 addr)
> +{
> +	u64 offset, reg_value = 0, start, end;
> +	u8 reg_size, reg_index;
> +	bool write;
> +	int i;
> +
> +	for (i = 0; i < num_protected_reg; i++) {

This is potentially slow for a fast path. As this is an array, we could sort it
and do a binary search, just like find_mem_range?

> +		start = pkvm_protected_regs[i].start_pfn << PAGE_SHIFT;
> +		end = start + (pkvm_protected_regs[i].num_pages << PAGE_SHIFT);
> +
> +		if (start > addr || addr > end)
> +			continue;
> +
> +		reg_size = BIT((esr & ESR_ELx_SAS) >> ESR_ELx_SAS_SHIFT);
> +		reg_index = (esr & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
> +		write = (esr & ESR_ELx_WNR) == ESR_ELx_WNR;
> +		offset = addr - start;
> +
> +		if (write)
> +			reg_value = host_ctxt->regs.regs[reg_index];
> +
> +		pkvm_protected_regs[i].cb(&pkvm_protected_regs[i], offset, write,
> +					  &reg_value, reg_size);
> +
> +		if (!write)
> +			host_ctxt->regs.regs[reg_index] = reg_value;
> +
> +		kvm_skip_host_instr();
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
>  void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
>  {
>  	struct kvm_vcpu_fault_info fault;
> @@ -630,6 +666,11 @@ void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
>  	 */
>  	BUG_ON(!(fault.hpfar_el2 & HPFAR_EL2_NS));
>  	addr = FIELD_GET(HPFAR_EL2_FIPA, fault.hpfar_el2) << 12;
> +	addr |= fault.far_el2 & FAR_MASK;
> +
> +	if (ESR_ELx_EC(esr) == ESR_ELx_EC_DABT_LOW && !addr_is_memory(addr) &&
> +	    handle_host_mmio_trap(host_ctxt, esr, addr))
> +		return;
>  
>  	ret = host_stage2_idmap(addr);
>  	BUG_ON(ret && ret != -EAGAIN);
> diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
> index ad5b96085e1b..f91dfebe9980 100644
> --- a/arch/arm64/kvm/hyp/nvhe/setup.c
> +++ b/arch/arm64/kvm/hyp/nvhe/setup.c
> @@ -296,6 +296,9 @@ static int unmap_protected_regions(void)
>  			if (ret)
>  				goto err_setup;
>  		}
> +
> +		if (reg->cb)
> +			reg->cb = kern_hyp_va(reg->cb);
>  	}
>  
>  	return 0;
> -- 
> 2.53.0.473.g4a7958ca14-goog
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 05/14] irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege
  2026-03-20 15:11     ` Sebastian Ene
@ 2026-03-24 14:36       ` Fuad Tabba
  0 siblings, 0 replies; 36+ messages in thread
From: Fuad Tabba @ 2026-03-24 14:36 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta, smostafa,
	suzuki.poulose, tglx, vdonnefort, bgrzesik, will, yuzenghui

Hi Seb,

<snip>

> > You should decouple the synchronous pointer swapping (which must be
> > locked) from the hypervisor notification (which can be done outside
> > the lock). Instead of executing the callback inside the critical
> > section, its_end_deprivilege should:
> > - Lock everything.
> > - Perform the pointer swaps in the host driver structures.
> > - Save the hyp_shadow pointers to a temporary array.
> > - Unlock everything.
>
> I am afraid you can't do that because you can have dropped commands &
> timeouts between these two steps. The driver might put commands in the
> swapped queue and they will timeout.

You're right, that won't work. Simply releasing the lock between the
pointer swap and the hypercall (HVC) isn't safe for two reasons:
- Timeouts (your point): If we swap the pointers to the new shadow
queue and drop the lock, another CPU might immediately try to queue a
command. It will trap to EL2 (cwriter_write), but because the HVC
hasn't finished initializing the hypervisor's internal state
(region->priv), EL2 will
drop the MMIO write. The host will then spin in
its_wait_for_range_completion waiting for the hardware to process a
command it never saw, resulting in a timeout.
- Stage-2 Aborts: Conversely, if we try to run the HVC before swapping
the pointers, the HVC will actively unmap (donate) the original
queue's memory to EL2. If the host is not locked and tries to write to
the old queue during this window, it will trigger a Stage-2 Data Abort
and boom!

We must pause host command submission to the affected ITS while the
HVC is running. That said, I still don't think it's acceptable to do
what you propose in this patch. This holds the raw_spin_lock_irqsave
for way too long, keeping local interrupts disabled while performing
slow hypercalls for the entire system.

I had a bit of a think, and I have two ideas, one is an improvement,
but not a full solution. It's simple though, and it might be enough
for now and I am more confident that it works. The second one is a
better solution I think, assuming it works and I haven't missed
anything :)

Option A: Granular Locking (Per-ITS Lock & HVC)

In this patch its_start_depriviledge effectively locks *all* ITS nodes
in the system simultaneously. Then, its_end_depriviledge calls the HVC
for every single ITS sequentially, while the CPU is still holding all
the locks with interrupts globally disabled. This makes this critical
section very long. Making it shorter is pretty straightforward i
think...

Instead of trying to decouple the pointer swap from the HVC, we can
start by reducing the scope of the lock. We remove the global locking
in its_start_depriviledge. Inside its_end_depriviledge, we process one
ITS at a time:
1. Disable local interrupts and lock one specific ITS
(raw_spin_lock_irqsave(&its->lock)).
2. Perform the pointer swap AND the HVC for this specific ITS.
3. Unlock and re-enable local interrupts (raw_spin_unlock_irqrestore).
4. Move to the next ITS in the list.

This should be simple to implement. It guarantees zero dropped
commands and zero aborts because the swap and HVC remain atomic.
However, the CPU executing the deprivilege still holds a raw spinlock
during a hypercall, but the duration is reduced to a single ITS node
rather than the entire thing.

Option B: Software Quiescence (Driver-Level Pausing)
To make it so that the HVC runs outside of an atomic context (with
local interrupts enabled), maybe we can teach the ITS driver to
voluntarily pause command submission without holding the raw spinlock.

I was thinking that we introduce a new state flag, e.g.,
is_vmm_migrating, to struct its_node. Every ITS command goes through
the BUILD_SINGLE_CMD_FUNC macro. We modify this macro so that if a CPU
tries to send a command and sees this flag is true, it temporarily
drops the lock, re-enables its interrupts, spins (cpu_relax()), and
retries.

The deprivilege sequence per ITS then becomes:
1. Lock: Acquire its->lock.
2. Swap & Pause: Swap the pointers to the shadow queue and set
its->is_vmm_migrating = true.
3. Unlock: Drop its->lock and re-enable interrupts. (Any other CPU
trying to send a command to this ITS will now safely spin and wait).
4. The HVC: Execute the slow hypercall safely outside of atomic context.
5. Resume: Re-acquire its->lock, set its->is_vmm_migrating = false,
and drop the lock. (This wakes up any spinning CPUs, and they
immediately send their commands to the newly registered shadow queue).

The HVC runs safely with local interrupts enabled, guaranteeing that
no commands are dropped or sent to unmapped memory. If a hardware
interrupt fires on another CPU that requires sending an ITS command
exactly while the HVC is running, that CPU will be forced to spin.
However, this is no worse than aquiring locks, where that CPU would
have been spinning waiting for the raw spinlock anyway.

What do you think?

If you like, I could hack something and we could discuss it some more.

Cheers,
/fuad


>
> > - Loop through the temporary array and call the KVM cb to notify EL2.
> >
> > You should probably split this patch into two. The first patch would
> > implement the freeze/unfreeze locking mechanism, and the second would
> > swap the driver's internal memory pointers to the shadow structures,
> > and invoke the KVM callback to lock down the real hardware.
> >
> > Cheers,
> > /fuad
> >
>
> Thanks,
> Sebastian
>
> > > +       if (ret) {
> > > +               its_free_shadow_tables(hyp_shadow);
> > > +               return ret;
> > > +       }
> > > +
> > > +       /* Switch the driver command queue to use the shadow and save the original */
> > > +       its->cmd_write = (its->cmd_write - its->cmd_base) +
> > > +               (struct its_cmd_block *)shadow.cmd_shadow;
> > > +       its->cmd_base = shadow.cmd_shadow;
> > > +
> > > +       /* Shadow the first level of the indirect tables */
> > > +       for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> > > +               baser = shadow.tables[i].val;
> > > +
> > > +               if (!shadow.tables[i].shadow)
> > > +                       continue;
> > > +
> > > +               baser_phys = virt_to_phys(shadow.tables[i].shadow);
> > > +               if (IS_ENABLED(CONFIG_ARM64_64K_PAGES) && (baser_phys >> 48))
> > > +                       baser_phys = GITS_BASER_PHYS_52_to_48(baser_phys);
> > > +
> > > +               its->tables[i].val &= ~GENMASK(47, 12);
> > > +               its->tables[i].val |= baser_phys;
> > > +               its->tables[i].base = shadow.tables[i].shadow;
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +int its_end_depriviledge(int ret_pkvm_finalize, unsigned long *flags, its_init_emulate cb)
> > > +{
> > > +       struct its_node *its;
> > > +       int i = 0, ret = 0;
> > > +
> > > +       if (!flags || !cb)
> > > +               return -EINVAL;
> > > +
> > > +       list_for_each_entry(its, &its_nodes, entry) {
> > > +               if (!ret_pkvm_finalize && !ret)
> > > +                       ret = its_switch_to_shadow_locked(its, cb);
> > > +
> > > +               raw_spin_unlock_irqrestore(&its->lock, flags[i++]);
> > > +       }
> > > +
> > > +       kfree(flags);
> > > +       raw_spin_unlock(&its_lock);
> > > +
> > > +       return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(its_end_depriviledge);
> > > +
> > >  static int __init its_probe_one(struct its_node *its)
> > >  {
> > >         u64 baser, tmp;
> > > diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
> > > index 0225121f3013..40457a4375d4 100644
> > > --- a/include/linux/irqchip/arm-gic-v3.h
> > > +++ b/include/linux/irqchip/arm-gic-v3.h
> > > @@ -657,6 +657,30 @@ static inline bool gic_enable_sre(void)
> > >         return !!(val & ICC_SRE_EL1_SRE);
> > >  }
> > >
> > > +/*
> > > + * The ITS_BASER structure - contains memory information, cached
> > > + * value of BASER register configuration and ITS page size.
> > > + */
> > > +struct its_baser {
> > > +       void            *base;
> > > +       void            *shadow;
> > > +       u64             val;
> > > +       u32             order;
> > > +       u32             psz;
> > > +};
> > > +
> > > +struct its_shadow_tables {
> > > +       struct its_baser        tables[GITS_BASER_NR_REGS];
> > > +       void                    *cmd_shadow;
> > > +       void                    *cmd_original;
> > > +       size_t                  cmdq_len;
> > > +};
> > > +
> > > +typedef int (*its_init_emulate)(phys_addr_t its_phys_base, struct its_shadow_tables *shadow);
> > > +
> > > +void *its_start_depriviledge(void);
> > > +int its_end_depriviledge(int ret, unsigned long *flags, its_init_emulate cb);
> > > +
> > >  #endif
> > >
> > >  #endif
> > > --
> > > 2.53.0.473.g4a7958ca14-goog
> > >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 00/14] KVM: ITS hardening for pKVM
  2026-03-13 15:18 ` Mostafa Saleh
  2026-03-15 13:24   ` Fuad Tabba
@ 2026-03-25 16:26   ` Sebastian Ene
  1 sibling, 0 replies; 36+ messages in thread
From: Sebastian Ene @ 2026-03-25 16:26 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: alexandru.elisei, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm, catalin.marinas, dbrazdil, joey.gouly, kees,
	mark.rutland, maz, oupton, perlarsen, qperret, rananta,
	suzuki.poulose, tabba, tglx, vdonnefort, bgrzesik, will,
	yuzenghui

On Fri, Mar 13, 2026 at 03:18:01PM +0000, Mostafa Saleh wrote:

Hi Mostafa,

> Hi Seb,
> 
> On Tue, Mar 10, 2026 at 12:49:19PM +0000, Sebastian Ene wrote:
> > This series introduces the necessary machinery to perform trap & emulate
> > on device access in pKVM. Furthermore, it hardens the GIC/ITS controller to
> > prevent an attacker from tampering with the hypervisor protected memory
> > through this device. 
> > 
> > In pKVM, the host kernel is initially trusted to manage the boot process but
> > its permissions are revoked once KVM initializes. The GIC/ITS device is
> > configured before the kernel deprivileges itself. Once the hypervisor
> > becomes available, sanitize the accesses to the ITS controller by
> > trapping and emulating certain registers and by shadowing some memory
> > structures used by the ITS.
> > 
> > This is required because the ITS can issue transactions on the memory
> > bus *directly*, without having an SMMU in front of it, which makes it
> > an interesting target for crossing the hypervisor-established privilege
> > boundary.
> > 
> > 
> > Patch overview
> > ==============
> > 
> > The first patch is re-used from Mostafa's series[1] which brings SMMU-v3
> > support to pKVM.
> > 
> > [1] https://lore.kernel.org/linux-iommu/20251117184815.1027271-1-smostafa@google.com/#r
> > 
> > Some of the infrastructure built in that series might intersect and we
> > agreed to converge on some changes. The patches [1 - 3] allow unmapping
> > devices from the host address space and installing a handler to trap
> > accesses from the host. While executing in the handler, enough context
> > has to be given from mem-abort to perform the emulation of the device
> > such as: the offset, the access size, direction of the write and private
> > related data specific to the device. 
> > The unmapping of the device from the host address space is performed
> > after the host deprivilege (during _kvm_host_prot_finalize call).
> > 
> > The 4th patch looks up the ITS node from the device tree and adds it to
> > an array of unmapped devices. It install a handler that forwards all the
> > MMIO request to mediate the host access inside the emulation layer and
> > to prevent breaking ITS functionality. 
> > 
> > The 5th patch changes the GIC/ITS driver to exposes two new methods
> > which will be called from the KVM layer to setup the shadow state and
> > to take the appropriate locks. This one is the most intrusive as it
> > changes the current GIC/ITS driver. I tried to avoid creating a
> > dependency with KVM to keep the GIC driver agnostic of the virtualization
> > layer but I am happy to explore other options as well. 
> > To avoid re-programming the ITS device with new shadow structures after
> > pKVM is ready, I exposed two functions to change the
> > pointers inside the driver for the following structures:
> > - the command queue points to a newly allocated queue
> > - the GITS_BASER<n> tables configured with an indirect layout have the
> >   first layer shadowed and they point to a new memory region
> > 
> > Patch 6 adds the entry point into the emulation setup and sets up the
> > shadow command queue. It adds some helper macros to define the offset
> > register and the associate action that we want to execute in the
> > emulation. It also unmaps the state passed from the host kernel
> > to prevent it from playing nasty games later on. The patch
> > traps accesses to CWRITER register and copies the commands from the
> > host command queue to the shadow command queue. 
> > 
> > Patch 7 prevents the host from directly accessing the first layer of the
> > indirect tables held in GITS_BASER<n>. It also prevents the host from
> > directly accesssing the last layer of the Device Table (since the entries
> > in this table hold the address of the ITT table) and of the vPE Table
> > (since the vPE table entries hold the address of the virtual LPI pending
> > table.
> > 
> > Patches [8-10] sanitize the commands sent to the ITS and their
> > arguments.
> > 
> > Patches [11-13] restrict the access of the host to certain registers
> > and prevent undefined behaviour. Prevent the host from re-programming
> > the tables held in the GITS_BASER register.
> > 
> > The last patch introduces an hvc to setup the ITS emulation and calls
> > into the ITS driver to setup the shadow state. 
> > 
> > 
> > Design
> > ======
> > 
> > 
> > 1. Command queue shadowing
> > 
> > The ITS hardware supports a command queue which is programmed by the driver
> > in the GITS_CBASER register. To inform the hardware that a new command
> > has been added, the driver updates an index into the GITS_CWRITER
> > register. The driver then reads the GITS_CREADR register to see if the
> > command was processed or if the queue is stalled.
> >  
> > To create a new command, the emulation layer mirrors the behavior
> > as following:
> >  (i) The host ITS driver creates a command in the shadow queue:
> > 	its_allocate_entry() -> builder()
> >  (ii) Notifies the hardware that a new command is available:
> > 	its_post_commands()
> >  (iii) Hypervisor traps the write to GITS_CWRITER:
> > 	handle_host_mem_abort() -> handle_host_mmio_trap() ->
> >             pkvm_handle_gic_emulation()
> >  (iv) Hypervisor copies the command from the host command queue
> >       to the original queue which is not accessible to the host.
> >       It parses the command and updates the hardware write.
> > 
> > The driver allocates space for the original command queue and programs
> > the hardware (GITS_CWRITER). When pKVM becomes available, the driver
> > allocates a new (shadow) queue and replaces its original pointer to
> > the queue with this new one. This is to prevent a malicious host from
> > tampering with the commands sent to the ITS hardware.
> > 
> > The entry point of our emulation shares the memory of the newly
> > allocated queue with the hypervisor and donates the memory of the
> > original queue to make it inaccesible to the host.
> > 
> > 
> > 2. Indirect tables first level shadowing
> > 
> > The ITS hardware supports indirection to minimize the space required to
> > accommodate large tables (eg. deviceId space used to index the Device Table
> > is quite sparse). This is a 2-level indirection, with entries from the
> > first table pointing to a second table.
> > 
> > An attacker in control of the host can insert an address that points to
> > the hypervisor protected memory in the first level table and then use
> > subsequent ITS commands to write to this memory (MAPD).
> > 
> > To shadow this tables, we rely on the driver to allocate space for it
> > and we copy the original content from the table into the copy. When
> > pKVM becomes available we switch the pointers that hold the orginal
> > tables to point to the copy.
> > To keep the tables from the hypervisor in sync with what the host
> > has, we update the tables when commands are sent to the ITS.
> > 
> > 
> > 3. Hiding the last layer of the Device Table and vPE Table from the host
> > 
> > An attacker in control of the host kernel can alter the content of these
> > tables directly (the Arm IHI 0069H.b spec says that is undefined behavior
> > if entries are created by software). Normally these entries are created in
> > response of commands sent to the ITS.
> > 
> > A Device Table entry that has the following structure:
> > 
> > type DeviceTableEntry is (
> > 	boolean Valid,
> > 	Address ITT_base,
> > 	bits(5) ITT_size
> > ) 
> > 
> > This can be maliciously created by an attacker and the ITT_base can be
> > pointed to hypervisor protected memory. The MAPTI command can then be
> > used to write over the ITT_base with an ITE entry.
> > 
> > Similarly a vCPU Table entry has the following structure:
> > 
> > type VCPUTableEntry is (
> > 	boolean Valid,
> > 	bits(32) RDbase,
> > 	Address VPT_base,
> > 	bits(5) VPT_size
> > )
> > 
> > VPT_base can be pointed to hypervisor protected memory and then a
> > command can be used to raise interrupts and set the corresponding
> > bit. This would give a 1-bit write primitive so is not "as generous"
> > as the others.
> > 
> > 
> > Notes
> > =====
> > 
> > 
> > Performance impact is expected with this as the emulation dance is not
> > cost free.
> > I haven't implemented any ITS quirks in the emulation and I don't know
> > whether we will need it ? (some hardware needs explicit dcache flushing
> > ITS_FLAGS_CMDQ_NEEDS_FLUSHING). 
> > 
> > Please note that Redistributors trapping hasn't been addressed at all in
> > this series and the solution is not sufficient but this can be extended
> > afterwards. 
> > The current series has been tested with Qemu (-machine
> > virt,virtualization=true,gic-version=4) and with Pixel 10.
> > 
> > 
> > Thanks,
> > Sebastian E.
> > 
> > Mostafa Saleh (1):
> >   KVM: arm64: Donate MMIO to the hypervisor
> > 
> > Sebastian Ene (13):
> >   KVM: arm64: Track host-unmapped MMIO regions in a static array
> >   KVM: arm64: Support host MMIO trap handlers for unmapped devices
> >   KVM: arm64: Mediate host access to GIC/ITS MMIO via unmapping
> >   irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege
> >   KVM: arm64: Add infrastructure for ITS emulation setup
> >   KVM: arm64: Restrict host access to the ITS tables
> >   KVM: arm64: Trap & emulate the ITS MAPD command
> >   KVM: arm64: Trap & emulate the ITS VMAPP command
> >   KVM: arm64: Trap & emulate the ITS MAPC command
> >   KVM: arm64: Restrict host updates to GITS_CTLR
> >   KVM: arm64: Restrict host updates to GITS_CBASER
> >   KVM: arm64 Restrict host updates to GITS_BASER
> >   KVM: arm64: Implement HVC interface for ITS emulation setup
> 
> I tested the patches on Lenovo ideacenter Mini X Gen 10 Snapdragon,
> and the kernel hangs at boot for me with messags the following log:
> 
> [    2.735838] ITS queue timeout (1056 1024)
> [    2.739969] ITS cmd its_build_mapd_cmd failed
> [    4.776344] ITS queue timeout (1120 1024)
> [    4.780472] ITS cmd its_build_mapti_cmd failed
> [    6.816677] ITS queue timeout (1184 1024)
> [    6.820806] ITS cmd its_build_mapti_cmd failed
> [    8.857009] ITS queue timeout (1248 1024)
> [    8.861129] ITS cmd its_build_mapti_cmd failed
> 
> I am happy to do more debugging, let me know if I can try anything.

I managed to reproduce it on this Lenovo machine. I will have to dig a bit more
because I am not seeing this under Qemu. As a quick try I used
gic_flush_dcache_to_poc after adding commands to the ITS queue but it
didn't make any difference.

> 
> Thanks,
> Mostafa
> 

Thanks for trying it,
Sebastian
> > 
> >  arch/arm64/include/asm/kvm_arm.h              |   3 +
> >  arch/arm64/include/asm/kvm_asm.h              |   1 +
> >  arch/arm64/include/asm/kvm_pkvm.h             |  20 +
> >  arch/arm64/kvm/hyp/include/nvhe/its_emulate.h |  17 +
> >  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |   2 +
> >  arch/arm64/kvm/hyp/nvhe/Makefile              |   3 +-
> >  arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  14 +
> >  arch/arm64/kvm/hyp/nvhe/its_emulate.c         | 653 ++++++++++++++++++
> >  arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 134 ++++
> >  arch/arm64/kvm/hyp/nvhe/setup.c               |  28 +
> >  arch/arm64/kvm/hyp/pgtable.c                  |   9 +-
> >  arch/arm64/kvm/pkvm.c                         |  60 ++
> >  drivers/irqchip/irq-gic-v3-its.c              | 177 ++++-
> >  include/linux/irqchip/arm-gic-v3.h            |  36 +
> >  14 files changed, 1126 insertions(+), 31 deletions(-)
> >  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/its_emulate.h
> >  create mode 100644 arch/arm64/kvm/hyp/nvhe/its_emulate.c
> > 
> > -- 
> > 2.53.0.473.g4a7958ca14-goog
> > 

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2026-03-25 16:26 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-10 12:49 [RFC PATCH 00/14] KVM: ITS hardening for pKVM Sebastian Ene
2026-03-10 12:49 ` [PATCH 01/14] KVM: arm64: Donate MMIO to the hypervisor Sebastian Ene
2026-03-12 17:57   ` Fuad Tabba
2026-03-13 10:40   ` Suzuki K Poulose
2026-03-24 10:39   ` Vincent Donnefort
2026-03-10 12:49 ` [PATCH 02/14] KVM: arm64: Track host-unmapped MMIO regions in a static array Sebastian Ene
2026-03-12 19:05   ` Fuad Tabba
2026-03-24 10:46   ` Vincent Donnefort
2026-03-10 12:49 ` [PATCH 03/14] KVM: arm64: Support host MMIO trap handlers for unmapped devices Sebastian Ene
2026-03-13  9:31   ` Fuad Tabba
2026-03-24 10:59   ` Vincent Donnefort
2026-03-10 12:49 ` [PATCH 04/14] KVM: arm64: Mediate host access to GIC/ITS MMIO via unmapping Sebastian Ene
2026-03-13  9:58   ` Fuad Tabba
2026-03-10 12:49 ` [PATCH 05/14] irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege Sebastian Ene
2026-03-13 11:26   ` Fuad Tabba
2026-03-13 13:10     ` Fuad Tabba
2026-03-20 15:11     ` Sebastian Ene
2026-03-24 14:36       ` Fuad Tabba
2026-03-10 12:49 ` [PATCH 06/14] KVM: arm64: Add infrastructure for ITS emulation setup Sebastian Ene
2026-03-16 10:46   ` Fuad Tabba
2026-03-17  9:40     ` Fuad Tabba
2026-03-10 12:49 ` [PATCH 07/14] KVM: arm64: Restrict host access to the ITS tables Sebastian Ene
2026-03-16 16:13   ` Fuad Tabba
2026-03-10 12:49 ` [PATCH 08/14] KVM: arm64: Trap & emulate the ITS MAPD command Sebastian Ene
2026-03-17 10:20   ` Fuad Tabba
2026-03-10 12:49 ` [PATCH 09/14] KVM: arm64: Trap & emulate the ITS VMAPP command Sebastian Ene
2026-03-10 12:49 ` [PATCH 10/14] KVM: arm64: Trap & emulate the ITS MAPC command Sebastian Ene
2026-03-10 12:49 ` [PATCH 11/14] KVM: arm64: Restrict host updates to GITS_CTLR Sebastian Ene
2026-03-10 12:49 ` [PATCH 12/14] KVM: arm64: Restrict host updates to GITS_CBASER Sebastian Ene
2026-03-10 12:49 ` [PATCH 13/14] KVM: arm64: Restrict host updates to GITS_BASER Sebastian Ene
2026-03-10 12:49 ` [PATCH 14/14] KVM: arm64: Implement HVC interface for ITS emulation setup Sebastian Ene
2026-03-12 17:56 ` [RFC PATCH 00/14] KVM: ITS hardening for pKVM Fuad Tabba
2026-03-20 14:42   ` Sebastian Ene
2026-03-13 15:18 ` Mostafa Saleh
2026-03-15 13:24   ` Fuad Tabba
2026-03-25 16:26   ` Sebastian Ene

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox