* [PATCH 00/43] KVM: arm64: Add GICv5 IRS support
@ 2026-04-27 16:06 Sascha Bischoff
2026-04-27 16:06 ` [PATCH 01/43] arm64/sysreg: Add GICv5 GIC VDPEND and VDRCFG encodings Sascha Bischoff
` (42 more replies)
0 siblings, 43 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:06 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Hi all,
This series builds on the initial vGICv5 support [1] and adds support
for the GICv5 IRS, as described by the GICv5 (EAC0) specification [2].
With this, a GICv5 guest is no longer restricted to PPIs, and can
make use of SPIs and LPIs as well.
With SPIs and LPIs available, this series makes it possible to boot a
full Linux guest on the Arm FVP model [3], using the setup described
in Lorenzo's GICv5 Linux software enablement guide [4]. In addition,
GICv5 IPIs are typically implemented as LPIs, so LPI support is what
makes guests with more than one vCPU possible.
The corresponding kvmtool changes are available separately [5]. With
these changes, `--irqchip=gicv5` now works. The `--irqchip=gicv5-its`
configuration does not work with this series, as ITS/MSI support is not
included here.
The IRS is created as part of the vgic-v5 device, and is not a
separate KVM device. This is because implementing the CPU interface
alone would limit a guest to a single vCPU and PPIs only. Hence, the
IRS is bundled into the VGIC, much like the distributor for GICv3.
The series adds the host-side plumbing needed for KVM to interact with
the physical IRS, including VM and VPE table management, VPE residency,
and VPE doorbells. It also adds an emulated IRS MMIO interface for the
guest, together with the UAPI needed for userspace to configure the IRS
address and to save/restore the IRS and IST state.
The main pieces are:
* Host IRS capability discovery and configuration frame plumbing.
* Allocation and management of the GICv5 VM table, VPE table, VMTEs,
VPEs, and guest IST backing storage.
* VPE residency and doorbell handling, allowing the IRS to select SPIs
and LPIs for resident VPEs and wake non-resident VPEs when required.
* Emulation of the GICv5 IRS MMIO CONFIG_FRAME for guests.
Virtualisation, MPAM, MEC, and SWERR are not supported and are RAZ/WI.
* SPI injection support using GIC VDPEND, together with tracking of
in-flight SPIs so that deactivation notifiers can still be handled.
* Save/restore support for GICv5 EL1 system registers, IRS MMIO state,
and guest ISTs.
* Documentation for the new VGICv5 IRS userspace interfaces and the
required save/restore ordering.
Please pay close attention to the save/restore interface and ordering,
in particular to make sure that no guest state has been omitted.
LPIs are explicitly driven by the guest through the IRS/IST state. This
series does not add direct LPI injection support, and does not add MSI
support. It also does not add GICv5 ITS emulation.
These changes are based on v7.1-rc1 and depend on the still-under-review
changes in [6]. I've pushed the full set of changes
to a branch at [7].
Thanks for taking the time to look at these changes. Apologies for the
size of the series, but this is a rather hard set to disentangle and
review separately!
Thanks,
Sascha
[1] https://lore.kernel.org/all/20260319154937.3619520-1-sascha.bischoff@arm.com/
[2] https://developer.arm.com/documentation/aes0070/latest
[3] https://developer.arm.com/documentation/108086/latest
[4] https://linaro.atlassian.net/wiki/x/CQAF-wY
[5] https://lore.kernel.org/all/20260116182606.61856-1-sascha.bischoff@arm.com/
[6] https://lore.kernel.org/all/20260415115559.2227718-1-maz@kernel.org/
[7] https://gitlab.arm.com/linux-arm/linux-sb/-/tree/gicv5_kvm_irs_support
Sascha Bischoff (43):
arm64/sysreg: Add GICv5 GIC VDPEND and VDRCFG encodings
arm64/sysreg: Update ICC_CR0_EL1 with LINK and LINK_IDLE fields
KVM: arm64: gic-v5: Add resident/non-resident hyp calls
irqchip/gic-v5: Provide IRS config frame attrs to KVM
KVM: arm64: gic-v5: Extract host IRS caps from IRS config frame
KVM: arm64: gic-v5: Add VPE doorbell domain
KVM: arm64: gic-v5: Create & manage VM and VPE tables
KVM: arm64: gic-v5: Introduce guest IST alloc and management
KVM: arm64: gic-v5: Implement VMT/vIST IRS MMIO Ops
KVM: arm64: gic-v5: Implement VPE IRS MMIO Ops
KVM: arm64: gic-v5: Make VPEs valid in vgic_v5_reset()
KVM: arm64: gic-v5: Clear db_fired flag before making VPE non-resident
KVM: arm64: gic-v5: Make VPEs (non-)resident in vgic_load/put
KVM: arm64: gic-v5: Request VPE doorbells when going non-resident
KVM: arm64: gic-v5: Handle doorbells in kvm_vgic_vcpu_pending_irq()
KVM: arm64: gic-v5: Initialise and teardown VMTEs & doorbells
KVM: arm64: gic-v5: Enable VPE DBs on VPE reset and disable on
teardown
KVM: arm64: gic-v5: Define remaining IRS MMIO registers
KVM: arm64: gic-v5: Introduce struct vgic_v5_irs and IRS base address
KVM: arm64: gic-v5: Add IRS IODEV to iodev_types and generic MMIO
handlers
KVM: arm64: gic-v5: Add KVM_VGIC_V5_ADDR_TYPE_IRS to UAPI
KVM: arm64: gic-v5: Add GICv5 IRS IODEV and MMIO emulation
KVM: arm64: gic-v5: Set IRICHPPIDIS based on IRS enable state
KVM: arm64: gic-v5: Call IRS init/teardown from vgic_v5 init/teardown
KVM: arm64: gic-v5: Register the IRS IODEV
Documentation: KVM: Extend VGICv5 docs for KVM_VGIC_V5_ADDR_TYPE_IRS
KVM: arm64: selftests: Update vGICv5 selftest to set IRS address
KVM: arm64: gic-v5: Introduce SPI AP list
KVM: arm64: gic-v5: Add GIC VDPEND and GIC VDRCFG hyp calls
KVM: arm64: gic-v5: Track SPI state for in-flight SPIs
KVM: arm64: gic: Introduce set_pending_state() to irq_op
KVM: arm64: gic-v5: Support SPI injection
KVM: arm64: gic-v5: Add GICv5 SPI injection to irqfd
KVM: arm64: gic-v5: Mask per-vcpu PPI state in
vgic_v5_finalize_ppi_state()
KVM: arm64: gic-v5: Add GICv5 EL1 sysreg userspace set/get interface
KVM: arm64: gic-v5: Implement save/restore mechanisms for ISTs
KVM: arm64: gic-v5: Handle userspace accesses to IRS MMIO region
KVM: arm64: gic-v5: Add VGIC_GRP_IRS_REGS/VGIC_GRP_IST to UAPI
KVM: arm64: gic-v5: Plumb in has/set/get_attr for sysregs & IRS MMIO
regs
Documentation: KVM: Document KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS for
VGICv5
Documentation: KVM: Add KVM_DEV_ARM_VGIC_GRP_IRS_REGS to VGICv5 docs
Documentation: KVM: Add docs for KVM_DEV_ARM_VGIC_GRP_IST
Documentation: KVM: Add the VGICv5 IRS save/restore sequences
.../virt/kvm/devices/arm-vgic-v5.rst | 224 ++-
arch/arm64/include/asm/kvm_asm.h | 4 +
arch/arm64/include/asm/kvm_hyp.h | 4 +
arch/arm64/include/asm/sysreg.h | 18 +-
arch/arm64/include/uapi/asm/kvm.h | 7 +
arch/arm64/kvm/Makefile | 3 +-
arch/arm64/kvm/arm.c | 2 +-
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 33 +
arch/arm64/kvm/hyp/vgic-v5-sr.c | 54 +
arch/arm64/kvm/sys_regs.c | 6 +-
arch/arm64/kvm/vgic-sys-reg-v5.c | 510 +++++
arch/arm64/kvm/vgic/vgic-init.c | 57 +-
arch/arm64/kvm/vgic/vgic-irqfd.c | 13 +-
arch/arm64/kvm/vgic/vgic-irs-v5.c | 1075 ++++++++++
arch/arm64/kvm/vgic/vgic-kvm-device.c | 235 ++-
arch/arm64/kvm/vgic/vgic-mmio.c | 6 +
arch/arm64/kvm/vgic/vgic-mmio.h | 2 +
arch/arm64/kvm/vgic/vgic-v5-tables.c | 1766 +++++++++++++++++
arch/arm64/kvm/vgic/vgic-v5-tables.h | 212 ++
arch/arm64/kvm/vgic/vgic-v5.c | 1250 +++++++++++-
arch/arm64/kvm/vgic/vgic.c | 25 +-
arch/arm64/kvm/vgic/vgic.h | 20 +-
arch/arm64/tools/sysreg | 4 +-
drivers/irqchip/irq-gic-v5-irs.c | 26 +
drivers/irqchip/irq-gic-v5.c | 3 +
include/kvm/arm_vgic.h | 126 +-
include/linux/irqchip/arm-gic-v5.h | 199 +-
include/linux/irqchip/arm-vgic-info.h | 5 +
tools/arch/arm64/include/uapi/asm/kvm.h | 2 +
tools/testing/selftests/kvm/arm64/vgic_v5.c | 6 +
.../selftests/kvm/include/arm64/gic_v5.h | 3 +
31 files changed, 5791 insertions(+), 109 deletions(-)
create mode 100644 arch/arm64/kvm/vgic-sys-reg-v5.c
create mode 100644 arch/arm64/kvm/vgic/vgic-irs-v5.c
create mode 100644 arch/arm64/kvm/vgic/vgic-v5-tables.c
create mode 100644 arch/arm64/kvm/vgic/vgic-v5-tables.h
--
2.34.1
^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH 01/43] arm64/sysreg: Add GICv5 GIC VDPEND and VDRCFG encodings
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
@ 2026-04-27 16:06 ` Sascha Bischoff
2026-04-27 16:06 ` [PATCH 02/43] arm64/sysreg: Update ICC_CR0_EL1 with LINK and LINK_IDLE fields Sascha Bischoff
` (41 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:06 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Add the system register encodings for the GIC VDPEND and GIC VDRCFG
system instructions. These operate on the virtual interrupt domain,
and are used to make interrupts pending for a VM and to read back the
configuration of a VM's interrupts.
This is part of enabling GICv5 KVM support, and is required for
injection of SPIs and LPIs, and querying the state of in-flight SPIs
to detect their deactivation.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/include/asm/sysreg.h | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 736561480f365..e00b32f0a8d9f 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -1040,7 +1040,7 @@
#define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
GCS_CAP_VALID_TOKEN)
/*
- * Definitions for GICv5 instructions
+ * Definitions for GICv5 instructions for the Current Domain
*/
#define GICV5_OP_GIC_CDAFF sys_insn(1, 0, 12, 1, 3)
#define GICV5_OP_GIC_CDDI sys_insn(1, 0, 12, 2, 0)
@@ -1105,6 +1105,22 @@
#define GICV5_GICR_CDNMIA_TYPE_MASK GENMASK_ULL(31, 29)
#define GICV5_GICR_CDNMIA_ID_MASK GENMASK_ULL(23, 0)
+/*
+ * Definitions for GICv5 instructions for the Virtual Domain
+ */
+#define GICV5_OP_GIC_VDPEND sys_insn(1, 4, 12, 1, 4)
+#define GICV5_OP_GIC_VDRCFG sys_insn(1, 4, 12, 1, 5)
+
+/* Shift and mask definitions for GIC VDPEND */
+#define GICV5_GIC_VDPEND_PENDING_MASK BIT_ULL(63)
+#define GICV5_GIC_VDPEND_VM_MASK GENMASK_ULL(47, 32)
+#define GICV5_GIC_VDPEND_TYPE_MASK GENMASK_ULL(31, 29)
+#define GICV5_GIC_VDPEND_ID_MASK GENMASK_ULL(23, 0)
+
+/* Shift and mask definitions for GIC VDRCFG */
+#define GICV5_GIC_VDRCFG_TYPE_MASK GENMASK_ULL(31, 29)
+#define GICV5_GIC_VDRCFG_ID_MASK GENMASK_ULL(23, 0)
+
#define gicr_insn(insn) read_sysreg_s(GICV5_OP_GICR_##insn)
#define gic_insn(v, insn) write_sysreg_s(v, GICV5_OP_GIC_##insn)
--
2.34.1
* [PATCH 02/43] arm64/sysreg: Update ICC_CR0_EL1 with LINK and LINK_IDLE fields
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
2026-04-27 16:06 ` [PATCH 01/43] arm64/sysreg: Add GICv5 GIC VDPEND and VDRCFG encodings Sascha Bischoff
@ 2026-04-27 16:06 ` Sascha Bischoff
2026-04-27 16:07 ` [PATCH 03/43] KVM: arm64: gic-v5: Add resident/non-resident hyp calls Sascha Bischoff
` (40 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:06 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
The LINK and LINK_IDLE fields were added to the architecture after this
register was added to the sysreg generator, and were therefore missing.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/tools/sysreg | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 6c3ff14e561e6..57ab09404267c 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -3736,7 +3736,9 @@ Sysreg ICC_CR0_EL1 3 1 12 0 1
Res0 63:39
Field 38 PID
Field 37:32 IPPT
-Res0 31:1
+Res0 31:3
+Field 2 LINK_IDLE
+Field 1 LINK
Field 0 EN
EndSysreg
--
2.34.1
* [PATCH 03/43] KVM: arm64: gic-v5: Add resident/non-resident hyp calls
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
2026-04-27 16:06 ` [PATCH 01/43] arm64/sysreg: Add GICv5 GIC VDPEND and VDRCFG encodings Sascha Bischoff
2026-04-27 16:06 ` [PATCH 02/43] arm64/sysreg: Update ICC_CR0_EL1 with LINK and LINK_IDLE fields Sascha Bischoff
@ 2026-04-27 16:07 ` Sascha Bischoff
2026-04-28 14:28 ` Marc Zyngier
2026-04-27 16:07 ` [PATCH 04/43] irqchip/gic-v5: Provide IRS config frame attrs to KVM Sascha Bischoff
` (39 subsequent siblings)
42 siblings, 1 reply; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:07 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
So far the KVM GICv5 support has been limited to PPIs. These only go
as far out as the CPU interface, and have no interaction with the
host's IRS. Therefore, PPIs can be directly used for guests without
host IRS involvement. However, in order to support SPIs and LPIs, IRS
involvement is required.
GICv5 introduces the concept of VPE residency - a VPE can be either
resident or non-resident. When the VPE is resident, the IRS is allowed
to select interrupts that target that VPE (or the VM) as the HPPI
(Highest Priority Pending Interrupt). As the IRS handles both SPIs and
LPIs, these will only be picked as the IRS's HPPI when a VPE is
resident.
A GICv5 VPE is made resident by writing a valid VM and VPE ID to
ICH_CONTEXTR_EL2, marking the register valid in the process. This informs
the IRS that a specific VPE is running, and that it can begin HPPI
selection for that VPE. Making a VPE non-resident (by making the
ICH_CONTEXTR_EL2 invalid) informs the IRS that the VPE is no longer
running, and it stops HPPI selection for it.
This change introduces two new hyp calls - one to make a VPE resident
and its counterpart to make a VPE non-resident. As part of making a
VPE resident, the ICH_CONTEXTR_EL2.F bit is checked in order to catch
faults; if one is detected, the kernel WARNs. If everything is
configured correctly, this should not happen.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/include/asm/kvm_asm.h | 2 ++
arch/arm64/include/asm/kvm_hyp.h | 2 ++
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 15 +++++++++++++++
arch/arm64/kvm/hyp/vgic-v5-sr.c | 25 +++++++++++++++++++++++++
include/kvm/arm_vgic.h | 3 +++
5 files changed, 47 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index fa033be6141ad..8c69f1f4de534 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -79,6 +79,8 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff,
__KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs,
__KVM_HOST_SMCCC_FUNC___vgic_v3_restore_vmcr_aprs,
+ __KVM_HOST_SMCCC_FUNC___vgic_v5_make_resident,
+ __KVM_HOST_SMCCC_FUNC___vgic_v5_make_non_resident,
__KVM_HOST_SMCCC_FUNC___vgic_v5_save_apr,
__KVM_HOST_SMCCC_FUNC___vgic_v5_restore_vmcr_apr,
diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index 8d06b62e7188c..5f9184276b04e 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -88,6 +88,8 @@ void __vgic_v3_restore_vmcr_aprs(struct vgic_v3_cpu_if *cpu_if);
int __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu);
/* GICv5 */
+void __vgic_v5_make_resident(struct vgic_v5_cpu_if *cpu_if);
+void __vgic_v5_make_non_resident(struct vgic_v5_cpu_if *cpu_if);
void __vgic_v5_save_apr(struct vgic_v5_cpu_if *cpu_if);
void __vgic_v5_restore_vmcr_apr(struct vgic_v5_cpu_if *cpu_if);
/* No hypercalls for the following */
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 9e44c05cf780e..804a9ffdc8594 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -672,6 +672,19 @@ static void handle___tracing_write_event(struct kvm_cpu_context *host_ctxt)
trace_selftest(id);
}
+static void handle___vgic_v5_make_resident(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt, 1);
+
+ __vgic_v5_make_resident(kern_hyp_va(cpu_if));
+}
+static void handle___vgic_v5_make_non_resident(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt, 1);
+
+ __vgic_v5_make_non_resident(kern_hyp_va(cpu_if));
+}
+
static void handle___vgic_v5_save_apr(struct kvm_cpu_context *host_ctxt)
{
DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt, 1);
@@ -711,6 +724,8 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__kvm_timer_set_cntvoff),
HANDLE_FUNC(__vgic_v3_save_aprs),
HANDLE_FUNC(__vgic_v3_restore_vmcr_aprs),
+ HANDLE_FUNC(__vgic_v5_make_resident),
+ HANDLE_FUNC(__vgic_v5_make_non_resident),
HANDLE_FUNC(__vgic_v5_save_apr),
HANDLE_FUNC(__vgic_v5_restore_vmcr_apr),
diff --git a/arch/arm64/kvm/hyp/vgic-v5-sr.c b/arch/arm64/kvm/hyp/vgic-v5-sr.c
index 6d69dfe89a96c..04c5846b9abac 100644
--- a/arch/arm64/kvm/hyp/vgic-v5-sr.c
+++ b/arch/arm64/kvm/hyp/vgic-v5-sr.c
@@ -7,6 +7,31 @@
#include <asm/kvm_hyp.h>
+void __vgic_v5_make_resident(struct vgic_v5_cpu_if *cpu_if)
+{
+ write_sysreg_s(cpu_if->vgic_contextr, SYS_ICH_CONTEXTR_EL2);
+ isb();
+
+ /* Catch any faults */
+ cpu_if->vgic_contextr = read_sysreg_s(SYS_ICH_CONTEXTR_EL2);
+ if (WARN_ON(FIELD_GET(ICH_CONTEXTR_EL2_F, cpu_if->vgic_contextr)))
+ return;
+
+ cpu_if->gicv5_vpe.resident = true;
+}
+
+void __vgic_v5_make_non_resident(struct vgic_v5_cpu_if *cpu_if)
+{
+ /*
+ * Mark the VPE as non-resident before actually making it non-resident,
+ * to avoid racing with an arriving doorbell.
+ */
+ cpu_if->gicv5_vpe.resident = false;
+
+ write_sysreg_s(cpu_if->vgic_contextr, SYS_ICH_CONTEXTR_EL2);
+ isb();
+}
+
void __vgic_v5_save_apr(struct vgic_v5_cpu_if *cpu_if)
{
cpu_if->vgic_apr = read_sysreg_s(SYS_ICH_APR_EL2);
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index fe49fb56dc3c9..d14cf4771d606 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -495,6 +495,9 @@ struct vgic_v5_cpu_if {
*/
u64 vgic_icsr;
+ /* The contextr used to make VPEs resident and non-resident */
+ u64 vgic_contextr;
+
struct gicv5_vpe gicv5_vpe;
};
--
2.34.1
* [PATCH 04/43] irqchip/gic-v5: Provide IRS config frame attrs to KVM
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (2 preceding siblings ...)
2026-04-27 16:07 ` [PATCH 03/43] KVM: arm64: gic-v5: Add resident/non-resident hyp calls Sascha Bischoff
@ 2026-04-27 16:07 ` Sascha Bischoff
2026-04-28 14:56 ` Marc Zyngier
2026-04-27 16:07 ` [PATCH 05/43] KVM: arm64: gic-v5: Extract host IRS caps from IRS config frame Sascha Bischoff
` (38 subsequent siblings)
42 siblings, 1 reply; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:07 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
KVM needs to interact with the host IRS in order to, for example, make
VMs or VPEs valid. There are two potential approaches here. Either the
host irqchip driver can provide an interface, or KVM can interact
directly with the host IRS. The latter approach is chosen because the
set of MMIO registers that KVM needs to interact with is orthogonal to
the set used by the host irqchip driver (with the exception of some of
the read-only IRS_IDRx registers).
Pass KVM a pointer to an IRS config frame - the config frame belonging
to ANY IRS is fine as long as one IRS's config frame is used
consistently - in struct gic_kvm_info. Additionally, include a flag
telling KVM whether the IRS is coherent or non-coherent in order to
make sure that KVM can do the correct cache state management, if
required.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
drivers/irqchip/irq-gic-v5-irs.c | 26 ++++++++++++++++++++++++++
drivers/irqchip/irq-gic-v5.c | 3 +++
include/linux/irqchip/arm-gic-v5.h | 2 ++
include/linux/irqchip/arm-vgic-info.h | 5 +++++
4 files changed, 36 insertions(+)
diff --git a/drivers/irqchip/irq-gic-v5-irs.c b/drivers/irqchip/irq-gic-v5-irs.c
index f3fce0b1e25d9..5dfa043cf9e34 100644
--- a/drivers/irqchip/irq-gic-v5-irs.c
+++ b/drivers/irqchip/irq-gic-v5-irs.c
@@ -50,6 +50,32 @@ static void irs_writeq_relaxed(struct gicv5_irs_chip_data *irs_data,
writeq_relaxed(val, irs_data->irs_base + reg_offset);
}
+void __iomem *gicv5_irs_get_config_frame_base(void)
+{
+ struct gicv5_irs_chip_data *irs_data = per_cpu(per_cpu_irs_data,
+ smp_processor_id());
+
+ if (!irs_data)
+ return NULL;
+
+ return irs_data->irs_base;
+}
+
+bool gicv5_irs_is_non_coherent(void)
+{
+ struct gicv5_irs_chip_data *irs_data = per_cpu(per_cpu_irs_data,
+ smp_processor_id());
+
+ if (!irs_data) {
+ pr_err("Failed to look up IRS for CPU %d\n",
+ smp_processor_id());
+ return false;
+ }
+
+ return !!(irs_data->flags & IRS_FLAGS_NON_COHERENT);
+}
+
+
/*
* The polling wait (in gicv5_wait_for_op_s_atomic()) on a GIC register
* provides the memory barriers (through MMIO accessors)
diff --git a/drivers/irqchip/irq-gic-v5.c b/drivers/irqchip/irq-gic-v5.c
index 58e457d4c1476..3329019722360 100644
--- a/drivers/irqchip/irq-gic-v5.c
+++ b/drivers/irqchip/irq-gic-v5.c
@@ -1134,6 +1134,9 @@ static void __init gic_of_setup_kvm_info(struct device_node *node)
gic_v5_kvm_info.type = GIC_V5;
+ gic_v5_kvm_info.gicv5_irs.base = gicv5_irs_get_config_frame_base();
+ gic_v5_kvm_info.gicv5_irs.non_coherent = gicv5_irs_is_non_coherent();
+
/* GIC Virtual CPU interface maintenance interrupt */
gic_v5_kvm_info.no_maint_irq_mask = false;
gic_v5_kvm_info.maint_irq = irq_of_parse_and_map(node, 0);
diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
index 40d2fce682940..dfa4d29dc0012 100644
--- a/include/linux/irqchip/arm-gic-v5.h
+++ b/include/linux/irqchip/arm-gic-v5.h
@@ -377,6 +377,8 @@ void __init gicv5_free_lpi_domain(void);
int gicv5_irs_of_probe(struct device_node *parent);
int gicv5_irs_acpi_probe(void);
+void __iomem *gicv5_irs_get_config_frame_base(void);
+bool gicv5_irs_is_non_coherent(void);
void gicv5_irs_remove(void);
int gicv5_irs_enable(void);
void gicv5_irs_its_probe(void);
diff --git a/include/linux/irqchip/arm-vgic-info.h b/include/linux/irqchip/arm-vgic-info.h
index 67d9d960273b9..f05370e2debf4 100644
--- a/include/linux/irqchip/arm-vgic-info.h
+++ b/include/linux/irqchip/arm-vgic-info.h
@@ -38,6 +38,11 @@ struct gic_kvm_info {
bool has_v4_1;
/* Deactivation impared, subpar stuff */
bool no_hw_deactivation;
+ /* GICv5 IRS base */
+ struct {
+ void __iomem *base;
+ bool non_coherent;
+ } gicv5_irs;
};
#ifdef CONFIG_KVM
--
2.34.1
* [PATCH 05/43] KVM: arm64: gic-v5: Extract host IRS caps from IRS config frame
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (3 preceding siblings ...)
2026-04-27 16:07 ` [PATCH 04/43] irqchip/gic-v5: Provide IRS config frame attrs to KVM Sascha Bischoff
@ 2026-04-27 16:07 ` Sascha Bischoff
2026-04-28 15:20 ` Marc Zyngier
2026-04-27 16:08 ` [PATCH 06/43] KVM: arm64: gic-v5: Add VPE doorbell domain Sascha Bischoff
` (37 subsequent siblings)
42 siblings, 1 reply; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:07 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
The host irqchip driver provides KVM with a pointer to an IRS's config
frame, which allows KVM to directly interact with the host's IRS. The
MMIO registers in the config frame are used to configure VMs (in
addition to them being used by the host). The IRS's config frame also
includes a set of ID registers which describe the capabilities that
the IRS has.
Stash the pointer to the config frame, and extract the VM capabilities
(from IRS_IDR3 & IRS_IDR4), as well as the IST
capabilities/requirements (IRS_IDR2) from the IRS.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/vgic/vgic-v5-tables.c | 8 +++++
arch/arm64/kvm/vgic/vgic-v5-tables.h | 41 ++++++++++++++++++++++
arch/arm64/kvm/vgic/vgic-v5.c | 52 ++++++++++++++++++++++++++++
include/linux/irqchip/arm-gic-v5.h | 10 ++++++
5 files changed, 112 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/kvm/vgic/vgic-v5-tables.c
create mode 100644 arch/arm64/kvm/vgic/vgic-v5-tables.h
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 59612d2f277c1..431de9b145ca1 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -24,7 +24,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
vgic/vgic-its.o vgic/vgic-debug.o vgic/vgic-v3-nested.o \
- vgic/vgic-v5.o
+ vgic/vgic-v5.o vgic/vgic-v5-tables.o
kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c b/arch/arm64/kvm/vgic/vgic-v5-tables.c
new file mode 100644
index 0000000000000..30e2b108b1aa3
--- /dev/null
+++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025, 2026 Arm Ltd.
+ */
+
+#include "vgic-v5-tables.h"
+
+struct vgic_v5_host_ist_caps gicv5_host_ist_caps;
diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h b/arch/arm64/kvm/vgic/vgic-v5-tables.h
new file mode 100644
index 0000000000000..cf00a248eabd5
--- /dev/null
+++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2025, 2026 Arm Ltd.
+ */
+
+#ifndef __KVM_ARM_VGICV5_TABLES_H__
+#define __KVM_ARM_VGICV5_TABLES_H__
+
+#include <linux/irqchip/arm-gic-v5.h>
+
+struct vgic_v5_host_ist_caps {
+ /* IST Capabilities */
+
+ /* Apply to LPIs and SPIs */
+ u8 ist_id_bits;
+ bool ist_levels;
+ u8 ist_l2sz;
+ bool istmd;
+ u8 istmd_sz;
+
+ /* LPI only */
+ u8 min_lpi_id_bits;
+
+ /* VM Table, VPE Table */
+ bool two_level_vmt_support;
+ u32 max_vms;
+ u32 max_vpes;
+ u16 vmd_size;
+ u16 vped_size;
+
+ /* Is the IRS coherent with us, or not? */
+ bool irs_non_coherent;
+};
+
+extern struct vgic_v5_host_ist_caps gicv5_host_ist_caps;
+static inline struct vgic_v5_host_ist_caps *vgic_v5_host_caps(void)
+{
+ return &gicv5_host_ist_caps;
+}
+
+#endif
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index d4789ff3e7402..fd3d6299a2baa 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -9,6 +9,7 @@
#include <linux/irqchip/arm-vgic-info.h>
#include "vgic.h"
+#include "vgic-v5-tables.h"
#define ppi_caps kvm_vgic_global_state.vgic_v5_ppi_caps
@@ -34,6 +35,54 @@ static void vgic_v5_get_implemented_ppis(void)
__assign_bit(GICV5_ARCH_PPI_PMUIRQ, ppi_caps.impl_ppi_mask, system_supports_pmuv3());
}
+static void __iomem *irs_base;
+
+static u32 irs_readl_relaxed(const u32 reg_offset)
+{
+ return readl_relaxed(irs_base + reg_offset);
+}
+
+static int gicv5_irs_extract_vm_caps(const struct gic_kvm_info *info)
+{
+ u64 idr;
+
+ irs_base = info->gicv5_irs.base;
+ if (!irs_base) {
+ kvm_info("No GICv5 MMIO IRS address; no GICv5 support\n");
+ return -ENODEV;
+ }
+
+ vgic_v5_host_caps()->irs_non_coherent = info->gicv5_irs.non_coherent;
+
+ idr = irs_readl_relaxed(GICV5_IRS_IDR2);
+
+ /* We skip the LPI field as it only applies to physical LPIs */
+ vgic_v5_host_caps()->ist_id_bits = FIELD_GET(GICV5_IRS_IDR2_ID_BITS, idr);
+ vgic_v5_host_caps()->min_lpi_id_bits = FIELD_GET(GICV5_IRS_IDR2_MIN_LPI_ID_BITS, idr);
+ vgic_v5_host_caps()->ist_levels = !!FIELD_GET(GICV5_IRS_IDR2_IST_LEVELS, idr);
+ vgic_v5_host_caps()->ist_l2sz = FIELD_GET(GICV5_IRS_IDR2_IST_L2SZ, idr);
+ vgic_v5_host_caps()->istmd = !!FIELD_GET(GICV5_IRS_IDR2_ISTMD, idr);
+ vgic_v5_host_caps()->istmd_sz = FIELD_GET(GICV5_IRS_IDR2_ISTMD_SZ, idr);
+
+ idr = irs_readl_relaxed(GICV5_IRS_IDR3);
+
+ vgic_v5_host_caps()->max_vms = BIT(FIELD_GET(GICV5_IRS_IDR3_VM_ID_BITS, idr));
+ vgic_v5_host_caps()->two_level_vmt_support = !!FIELD_GET(GICV5_IRS_IDR3_VMT_LEVELS, idr);
+
+ if (FIELD_GET(GICV5_IRS_IDR3_VMD, idr))
+ vgic_v5_host_caps()->vmd_size = BIT(FIELD_GET(GICV5_IRS_IDR3_VMD_SZ, idr));
+ else
+ vgic_v5_host_caps()->vmd_size = 0;
+
+ idr = irs_readl_relaxed(GICV5_IRS_IDR4);
+
+ vgic_v5_host_caps()->vped_size = BIT(FIELD_GET(GICV5_IRS_IDR4_VPED_SZ, idr));
+ /* Field stores VPE_ID_BITS - 1 */
+ vgic_v5_host_caps()->max_vpes = BIT(FIELD_GET(GICV5_IRS_IDR4_VPE_ID_BITS, idr) + 1);
+
+ return 0;
+}
+
/*
* Probe for a vGICv5 compatible interrupt controller, returning 0 on success.
*/
@@ -61,6 +110,9 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
goto skip_v5;
}
+ if (gicv5_irs_extract_vm_caps(info))
+ goto skip_v5;
+
kvm_vgic_global_state.max_gic_vcpus = VGIC_V5_MAX_CPUS;
vgic_v5_get_implemented_ppis();
diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
index dfa4d29dc0012..76dcf414ffb20 100644
--- a/include/linux/irqchip/arm-gic-v5.h
+++ b/include/linux/irqchip/arm-gic-v5.h
@@ -68,6 +68,8 @@
#define GICV5_IRS_IDR0 0x0000
#define GICV5_IRS_IDR1 0x0004
#define GICV5_IRS_IDR2 0x0008
+#define GICV5_IRS_IDR3 0x000c
+#define GICV5_IRS_IDR4 0x0010
#define GICV5_IRS_IDR5 0x0014
#define GICV5_IRS_IDR6 0x0018
#define GICV5_IRS_IDR7 0x001c
@@ -105,6 +107,14 @@
#define GICV5_IRS_IDR2_LPI BIT(5)
#define GICV5_IRS_IDR2_ID_BITS GENMASK(4, 0)
+#define GICV5_IRS_IDR3_VMT_LEVELS BIT(10)
+#define GICV5_IRS_IDR3_VM_ID_BITS GENMASK(9, 5)
+#define GICV5_IRS_IDR3_VMD_SZ GENMASK(4, 1)
+#define GICV5_IRS_IDR3_VMD BIT(0)
+
+#define GICV5_IRS_IDR4_VPE_ID_BITS GENMASK(9, 6)
+#define GICV5_IRS_IDR4_VPED_SZ GENMASK(5, 0)
+
#define GICV5_IRS_IDR5_SPI_RANGE GENMASK(24, 0)
#define GICV5_IRS_IDR6_SPI_IRS_RANGE GENMASK(24, 0)
#define GICV5_IRS_IDR7_SPI_BASE GENMASK(23, 0)
--
2.34.1
* [PATCH 06/43] KVM: arm64: gic-v5: Add VPE doorbell domain
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (4 preceding siblings ...)
2026-04-27 16:07 ` [PATCH 05/43] KVM: arm64: gic-v5: Extract host IRS caps from IRS config frame Sascha Bischoff
@ 2026-04-27 16:08 ` Sascha Bischoff
2026-04-28 16:40 ` Marc Zyngier
2026-04-27 16:08 ` [PATCH 07/43] KVM: arm64: gic-v5: Create & manage VM and VPE tables Sascha Bischoff
` (36 subsequent siblings)
42 siblings, 1 reply; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:08 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
GICv5 supports two types of doorbells: VPE doorbells and VM
doorbells. In KVM we only support Targeted interrupts, and do not
support 1ofN target selection. This means that we only implement VPE
doorbells. These doorbells are implemented as host LPIs which are
generated when a non-resident VPE has a pending interrupt of
sufficient priority and the doorbell has been requested as part of
making the VPE non-resident.
VPE doorbells allow KVM to wake VPEs (i.e. vcpus) as soon as the
hardware determines that sufficient conditions for the interrupt to be
signalled have been met. This simplifies the wake-up path for vcpus
with GICv5 for LPIs and SPIs. NOTE: PPI pending state must still be
checked explicitly as the IRS never sees them.
This change introduces support for the vgic_v5 doorbell domain. One
doorbell domain is created per GICv5 VM, and all VPEs have their own
doorbell within this domain. When the doorbell fires, this is tracked
(in gicv5_vpe.db_fired) and the corresponding vcpu is kicked.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-init.c | 5 +-
arch/arm64/kvm/vgic/vgic-v5.c | 143 +++++++++++++++++++++++++++++
arch/arm64/kvm/vgic/vgic.h | 1 +
include/kvm/arm_vgic.h | 6 ++
include/linux/irqchip/arm-gic-v5.h | 2 +
5 files changed, 156 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index 907057881b26a..984908a271c8d 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -500,8 +500,11 @@ static void kvm_vgic_dist_destroy(struct kvm *kvm)
dist->vgic_cpu_base = VGIC_ADDR_UNDEF;
}
- if (vgic_supports_direct_irqs(kvm))
+ if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3 &&
+ vgic_supports_direct_irqs(kvm))
vgic_v4_teardown(kvm);
+ else if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V5)
+ vgic_v5_teardown(kvm);
xa_destroy(&dist->lpi_xa);
}
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index fd3d6299a2baa..4e0d52b309628 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -7,6 +7,7 @@
#include <linux/bitops.h>
#include <linux/irqchip/arm-vgic-info.h>
+#include <linux/irqdomain.h>
#include "vgic.h"
#include "vgic-v5-tables.h"
@@ -162,6 +163,138 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
return 0;
}
+/*
+ * This set of irq_chip functions is specific for doorbells.
+ */
+static struct irq_chip vgic_v5_db_irq_chip = {
+ .name = "GICv5-DB",
+ .irq_mask = irq_chip_mask_parent,
+ .irq_unmask = irq_chip_unmask_parent,
+ .irq_eoi = irq_chip_eoi_parent,
+ .irq_set_affinity = irq_chip_set_affinity_parent,
+ .irq_get_irqchip_state = irq_chip_get_parent_state,
+ .irq_set_irqchip_state = irq_chip_set_parent_state,
+ .flags = IRQCHIP_SET_TYPE_MASKED | IRQCHIP_SKIP_SET_WAKE |
+ IRQCHIP_MASK_ON_SUSPEND,
+};
+
+static int vgic_v5_irq_db_domain_map(struct irq_domain *d, unsigned int virq,
+ u16 vpe_id)
+{
+ int ret;
+ u32 lpi;
+ irq_hw_number_t hwirq;
+ struct irq_chip *chip = &vgic_v5_db_irq_chip;
+ struct irq_data *irqd = irq_desc_get_irq_data(irq_to_desc(virq));
+
+ /*
+ * For the DB domain, we don't use the same hwirq as for LPIs.
+ */
+ hwirq = vpe_id;
+
+ ret = gicv5_alloc_lpi();
+ if (ret < 0)
+ return ret;
+ lpi = ret;
+
+ ret = irq_domain_alloc_irqs_parent(d, virq, 1, &lpi);
+ if (ret) {
+ gicv5_free_lpi(lpi);
+ return ret;
+ }
+
+ irq_domain_set_hwirq_and_chip(d, virq, hwirq, chip, d->host_data);
+ irqd_set_single_target(irqd);
+
+ return 0;
+}
+
+static void vgic_v5_irq_db_domain_free(struct irq_domain *domain,
+ unsigned int virq, unsigned int nr_irqs)
+{
+ int i;
+
+ for (i = 0; i < nr_irqs; i++) {
+ struct irq_data *d = irq_domain_get_irq_data(domain, virq + i);
+
+ gicv5_free_lpi(d->parent_data->hwirq);
+ irq_set_handler(virq + i, NULL);
+ irq_domain_reset_irq_data(d);
+ }
+
+ irq_domain_free_irqs_parent(domain, virq, nr_irqs);
+}
+
+static int vgic_v5_irq_db_domain_alloc(struct irq_domain *domain,
+ unsigned int virq, unsigned int nr_irqs,
+ void *arg)
+{
+ struct vgic_v5_vm *vm = arg;
+ int ret;
+
+ if (vm == NULL) {
+		pr_err("invalid parameter for doorbell irq allocation\n");
+ return -EINVAL;
+ }
+
+ if (vm->nr_vpes != nr_irqs)
+ return -EINVAL;
+
+ for (int i = 0; i < nr_irqs; i++) {
+ ret = vgic_v5_irq_db_domain_map(domain, virq + i, i);
+ if (ret) {
+ /* Free what we've allocated so far before returning */
+ while (--i >= 0)
+ vgic_v5_irq_db_domain_free(domain, virq + i, 1);
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static const struct irq_domain_ops vgic_v5_irq_db_domain_ops = {
+ .alloc = vgic_v5_irq_db_domain_alloc,
+ .free = vgic_v5_irq_db_domain_free,
+};
+
+static int vgic_v5_create_per_vm_domain(struct vgic_v5_vm *vm)
+{
+ if (!gicv5_global_data.lpi_domain) {
+		pr_err("LPI domain uninitialized, can't set up KVM Doorbells\n");
+ return -ENODEV;
+ }
+
+	vm->fwnode = irq_domain_alloc_named_id_fwnode("GICv5-vpe-db",
+						      task_pid_nr(current));
+	if (!vm->fwnode)
+		return -ENOMEM;
+
+ /*
+ * KVM per-VM VPE DB domain; child of LPI domain; only ever handles
+ * doorbells. We know how many doorbells we have, and therefore we
+ * create a linear domain.
+ */
+ vm->domain = irq_domain_create_hierarchy(gicv5_global_data.lpi_domain,
+ 0, vm->nr_vpes, vm->fwnode,
+ &vgic_v5_irq_db_domain_ops, vm);
+
+	if (WARN_ON(!vm->domain)) {
+		irq_domain_free_fwnode(vm->fwnode);
+		vm->fwnode = NULL;
+		return -ENOMEM;
+	}
+
+ return 0;
+}
+
+static void vgic_v5_teardown_per_vm_domain(struct vgic_v5_vm *vm)
+{
+ if (!vm->domain)
+ return;
+
+ irq_domain_remove(vm->domain);
+ irq_domain_free_fwnode(vm->fwnode);
+ vm->domain = NULL;
+ vm->fwnode = NULL;
+}
+
void vgic_v5_reset(struct kvm_vcpu *vcpu)
{
/*
@@ -181,10 +314,15 @@ int vgic_v5_init(struct kvm *kvm)
{
struct kvm_vcpu *vcpu;
unsigned long idx;
+ int ret;
if (vgic_initialized(kvm))
return 0;
+ ret = vgic_v5_create_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
+ if (ret)
+ return ret;
+
kvm_for_each_vcpu(idx, vcpu, kvm) {
if (vcpu_has_nv(vcpu)) {
kvm_err("Nested GICv5 VMs are currently unsupported\n");
@@ -203,6 +341,11 @@ int vgic_v5_init(struct kvm *kvm)
return 0;
}
+void vgic_v5_teardown(struct kvm *kvm)
+{
+ vgic_v5_teardown_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
+}
+
int vgic_v5_map_resources(struct kvm *kvm)
{
if (!vgic_initialized(kvm))
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index f45f7e3ec4d6e..f2f5fdc3211d7 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -366,6 +366,7 @@ void vgic_debug_destroy(struct kvm *kvm);
int vgic_v5_probe(const struct gic_kvm_info *info);
void vgic_v5_reset(struct kvm_vcpu *vcpu);
int vgic_v5_init(struct kvm *kvm);
+void vgic_v5_teardown(struct kvm *kvm);
int vgic_v5_map_resources(struct kvm *kvm);
void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid);
bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu);
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index d14cf4771d606..05dbd01f6fd21 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -366,6 +366,12 @@ struct vgic_v5_vm {
* convenient way to do that).
*/
DECLARE_BITMAP(vgic_ppi_hmr, VGIC_V5_NR_PRIVATE_IRQS);
+
+ struct fwnode_handle *fwnode;
+ struct irq_domain *domain;
+ int vpe_db_base;
+ int nr_vpes;
+ u16 vm_id;
};
struct vgic_dist {
diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
index 76dcf414ffb20..087d94f739672 100644
--- a/include/linux/irqchip/arm-gic-v5.h
+++ b/include/linux/irqchip/arm-gic-v5.h
@@ -401,6 +401,8 @@ void gicv5_irs_syncr(void);
/* Embedded in kvm.arch */
struct gicv5_vpe {
+ int db;
+ bool db_fired;
bool resident;
};
--
2.34.1
* [PATCH 07/43] KVM: arm64: gic-v5: Create & manage VM and VPE tables
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (5 preceding siblings ...)
2026-04-27 16:08 ` [PATCH 06/43] KVM: arm64: gic-v5: Add VPE doorbell domain Sascha Bischoff
@ 2026-04-27 16:08 ` Sascha Bischoff
2026-04-28 14:54 ` Vladimir Murzin
` (2 more replies)
2026-04-27 16:08 ` [PATCH 08/43] KVM: arm64: gic-v5: Introduce guest IST alloc and management Sascha Bischoff
` (35 subsequent siblings)
42 siblings, 3 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:08 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
GICv5 uses a set of in-memory tables to track and manage VM
state. These must be allocated by the hypervisor and provided to the
IRS for its use.
The VMT (Virtual Machine Table) is a linear or two level table
comprising VMT Entries (VMTE). Each VMTE describes the state for a
single VM. This state includes things such as the SPI and LPI IST
configuration (coming in a future commit), an implementation-defined
VM Descriptor, and a VPE Table (VPET).
The VPET contains one entry per VPE belonging to a VM, and is used to
mark a VPE as valid, as well as providing the address of an
implementation-defined VPE Descriptor, which is used by the hardware
to track and manage VPE state.
This commit adds support for allocating the VMT and managing the
VMTEs. The VMTEs can be initialised, or released for re-use. Allocation
and tracking of VM IDs (and hence of unused VMTEs) is handled with an
IDA.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-v5-tables.c | 628 +++++++++++++++++++++++++++
arch/arm64/kvm/vgic/vgic-v5-tables.h | 108 +++++
include/kvm/arm_vgic.h | 2 +
include/linux/irqchip/arm-gic-v5.h | 13 +
4 files changed, 751 insertions(+)
diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c b/arch/arm64/kvm/vgic/vgic-v5-tables.c
index 30e2b108b1aa3..502d05d46cccf 100644
--- a/arch/arm64/kvm/vgic/vgic-v5-tables.c
+++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
@@ -3,6 +3,634 @@
* Copyright (C) 2025, 2026 Arm Ltd.
*/
+#include <kvm/arm_vgic.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/xarray.h>
+#include <asm/kvm_mmu.h>
+
+#include "vgic.h"
#include "vgic-v5-tables.h"
struct vgic_v5_host_ist_caps gicv5_host_ist_caps;
+
+static struct vgic_v5_vmt *vmt_info;
+static DEFINE_XARRAY(vm_info);
+
+static bool vgic_v5_vmt_allocated(void)
+{
+ return vmt_info != NULL;
+}
+
+static int vgic_v5_check_vm_id(u16 vm_id)
+{
+ if (vm_id >= vmt_info->num_entries)
+ return -EINVAL;
+
+ return 0;
+}
+
+/*
+ * Our IRS might be coherent or non-coherent. If coherent, we can just emit a
+ * DSB to ensure that we're in sync. However, when non-coherent, we need to
+ * manage our cached data explicitly.
+ *
+ * This helper is used to handle both coherent and non-coherent IRSes, and
+ * handles all combinations of cleaning and invalidating to the PoC.
+ */
+static void vgic_v5_clean_inval(void *va, size_t size, bool clean, bool inval)
+{
+ unsigned long base = (unsigned long)va;
+
+ /* Catch any accidental NOPs */
+ BUILD_BUG_ON(!(clean || inval));
+
+ /* Coherent; emit DSB. */
+ if (!gicv5_host_ist_caps.irs_non_coherent) {
+ dsb(ishst);
+ return;
+ }
+
+ if (clean && inval)
+ dcache_clean_inval_poc(base, base + size);
+ else if (clean)
+ dcache_clean_poc(base, base + size);
+ else if (inval)
+ dcache_inval_poc(base, base + size);
+}
+
+/*
+ * Create a linear VM table, rounding up the number of entries to at least one
+ * whole page to give us nicer alignment.
+ *
+ * Note: We don't update the number of entries tracked in our tracking structure
+ * as this might be higher than the number of bits supported by the HW.
+ */
+static int vgic_v5_alloc_vmt_linear(unsigned int num_entries)
+{
+ unsigned int l2_entries_per_page;
+ size_t alloc_size;
+
+ /* Potentially throw away a bit of memory for the sake of alignment! */
+ l2_entries_per_page = PAGE_SIZE / GICV5_VMTEL2E_SIZE;
+ if (num_entries < l2_entries_per_page)
+ num_entries = l2_entries_per_page;
+
+ alloc_size = num_entries * sizeof(struct vmtl2_entry);
+
+ vmt_info->linear.vmt_base = kzalloc(alloc_size, GFP_KERNEL);
+ if (vmt_info->linear.vmt_base == NULL)
+ return -ENOMEM;
+
+ vgic_v5_clean_inval(vmt_info->linear.vmt_base, alloc_size, true, true);
+
+ return 0;
+}
+
+/*
+ * Allocate the first level of a two-level VM table. The second-level VM tables
+ * are allocated on demand (by vgic_v5_alloc_l2_vmt()).
+ *
+ * Note: If there are too few entries, these are rounded up to the size of an L2
+ * table (4k) to ensure sane alignment. As with the linear table, the tracked
+ * number of entries is not increased to avoid the case of going above what the
+ * hardware supports.
+ */
+static int vgic_v5_alloc_vmt_two_level(unsigned int num_entries)
+{
+ size_t alloc_size;
+
+ /* Potentially throw away a bit of memory for the sake of alignment! */
+ if (num_entries < GICV5_VMT_L2_TABLE_ENTRIES)
+ num_entries = GICV5_VMT_L2_TABLE_ENTRIES;
+
+	/*
+	 * Make sure that we always allocate a whole power of 2 of
+	 * entries, so that the table size maps cleanly onto a number
+	 * of VM ID bits.
+	 */
+	num_entries = roundup_pow_of_two(num_entries);
+
+ vmt_info->l2.num_l1_ents = (num_entries / GICV5_VMT_L2_TABLE_ENTRIES);
+ alloc_size = vmt_info->l2.num_l1_ents * sizeof(vmtl1_entry);
+
+ vmt_info->l2.vmt_base = kzalloc(alloc_size, GFP_KERNEL);
+ if (vmt_info->l2.vmt_base == NULL)
+ return -ENOMEM;
+
+ vgic_v5_clean_inval(vmt_info->l2.vmt_base, alloc_size, true, true);
+
+ vmt_info->l2.l2ptrs = kzalloc_objs(*vmt_info->l2.l2ptrs,
+ vmt_info->l2.num_l1_ents,
+ GFP_KERNEL);
+ if (vmt_info->l2.l2ptrs == NULL) {
+ kfree(vmt_info->l2.vmt_base);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+/*
+ * Allocate a second level VMT, if required. This can be called eagerly, and
+ * will only perform the allocation if required.
+ */
+static int vgic_v5_alloc_l2_vmt(struct kvm *kvm)
+{
+ unsigned int l1_index;
+ struct vmtl2_entry *l2_table;
+ vmtl1_entry tmp;
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct kvm_vcpu *vcpu0 = kvm_get_vcpu(kvm, 0);
+ struct gicv5_cmd_info cmd_info;
+ int ret;
+
+ if (!vgic_v5_vmt_allocated())
+ return -ENXIO;
+
+ /* Nothing to do if we have linear tables! */
+ if (!vmt_info->two_level)
+ return 0;
+
+ ret = vgic_v5_check_vm_id(vm_id);
+ if (ret)
+ return ret;
+
+ /*
+ * We have 4k-sized L2 tables - this is mandated by the spec for
+ * two-level VMTs. This means that we have 128 entries per L1 VMTE.
+ */
+ l1_index = vm_id / GICV5_VMT_L2_TABLE_ENTRIES;
+
+	if (l1_index >= vmt_info->l2.num_l1_ents)
+		return -E2BIG;
+
+ /* Already valid? Great! */
+ if (vmt_info->l2.l2ptrs[l1_index] != NULL)
+ return 0;
+
+ l2_table = kzalloc(GICV5_VMT_L2_TABLE_SIZE, GFP_KERNEL);
+ if (l2_table == NULL)
+ return -ENOMEM;
+
+	/* The L2 table address must fit in the L1 entry's address field */
+	if (virt_to_phys(l2_table) & ~GICV5_VMTEL1E_L2_ADDR) {
+		kfree(l2_table);
+		return -EINVAL;
+	}
+
+	vmt_info->l2.l2ptrs[l1_index] = l2_table;
+
+ tmp = virt_to_phys(l2_table) & GICV5_VMTEL1E_L2_ADDR;
+ WRITE_ONCE(vmt_info->l2.vmt_base[l1_index], cpu_to_le64(tmp));
+
+ vgic_v5_clean_inval(l2_table, GICV5_VMT_L2_TABLE_SIZE, true, true);
+ /* Skip inval for now - wait until table is made valid by HW */
+ vgic_v5_clean_inval(vmt_info->l2.vmt_base + l1_index,
+ sizeof(vmtl1_entry), true, false);
+
+ /* VMAP in the L2 VMT via the IRS */
+ cmd_info.cmd_type = VMT_L2_MAP;
+ ret = irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu0), &cmd_info);
+
+ /* We've failed to make the L2 VMT valid - things are very broken! */
+ if (ret) {
+ /* Remove the pointer from L1 table */
+ WRITE_ONCE(vmt_info->l2.vmt_base[l1_index], 0);
+
+ kfree(l2_table);
+ vmt_info->l2.l2ptrs[l1_index] = NULL;
+
+ return ret;
+ }
+
+ /* Table updated; inval our copy */
+ vgic_v5_clean_inval(vmt_info->l2.vmt_base + l1_index,
+ sizeof(vmtl1_entry), false, true);
+
+ return ret;
+}
+
+/*
+ * Allocate the top-level VMT. This can either be linear or two-level.
+ */
+int vgic_v5_vmt_allocate(bool two_level, unsigned int num_entries,
+ size_t vmd_size, size_t vped_size,
+ unsigned int max_vpes)
+{
+ int ret = 0;
+
+ if (vgic_v5_vmt_allocated())
+ return -EBUSY;
+
+	/* VMD is optional; using 0 to signal that it is not needed. */
+ if (vmd_size != 0 &&
+ (vmd_size < VMD_MIN_SIZE || vmd_size > VMD_MAX_SIZE))
+ return -EINVAL;
+
+ if (vped_size < VPED_MIN_SIZE || vped_size > VPED_MAX_SIZE)
+ return -EINVAL;
+
+ /* Allocate the tracking structure */
+ vmt_info = kzalloc_obj(*vmt_info, GFP_KERNEL);
+ if (vmt_info == NULL)
+ return -ENOMEM;
+
+ ida_init(&vmt_info->vm_id_ida);
+ vmt_info->max_vpes = max_vpes;
+ vmt_info->vmd_size = vmd_size;
+ vmt_info->vped_size = vped_size;
+ vmt_info->two_level = two_level;
+ vmt_info->num_entries = num_entries;
+
+ if (!two_level)
+ ret = vgic_v5_alloc_vmt_linear(num_entries);
+ else
+ ret = vgic_v5_alloc_vmt_two_level(num_entries);
+
+ /* If anything failed, free our tracking structure before returning */
+ if (ret) {
+ kfree(vmt_info);
+ vmt_info = NULL;
+ }
+
+ return ret;
+}
+
+/*
+ * Free the VMT and associated tracking structures. This isn't strictly expected
+ * to be called in general operation, but instead exists for completeness.
+ */
+int vgic_v5_vmt_free(void)
+{
+ if (!vgic_v5_vmt_allocated())
+ return -EINVAL;
+
+ if (!vmt_info->two_level) {
+ kfree(vmt_info->linear.vmt_base);
+ } else {
+ /* Free the L2 tables; kfree(NULL) is safe */
+ for (int i = 0; i < vmt_info->l2.num_l1_ents; ++i)
+ kfree(vmt_info->l2.l2ptrs[i]);
+ kfree(vmt_info->l2.l2ptrs);
+
+ /* And now free the L1 table */
+ kfree(vmt_info->l2.vmt_base);
+ }
+
+ ida_destroy(&vmt_info->vm_id_ida);
+ kfree(vmt_info);
+ vmt_info = NULL;
+
+ return 0;
+}
+
+/*
+ * Look up a VMT Entry by VM ID.
+ */
+static int vgic_v5_get_l2_vmte(u16 vm_id, struct vmtl2_entry **vmte)
+{
+ unsigned int l1_index, l2_index;
+ struct vmtl2_entry *l2_table;
+ int ret;
+
+ ret = vgic_v5_check_vm_id(vm_id);
+ if (ret)
+ return ret;
+
+ if (!vmt_info->two_level) {
+ /* All entries always valid for Linear table */
+ *vmte = &vmt_info->linear.vmt_base[vm_id];
+ } else {
+ l1_index = vm_id / GICV5_VMT_L2_TABLE_ENTRIES;
+ l2_index = vm_id % GICV5_VMT_L2_TABLE_ENTRIES;
+
+		if (l1_index >= vmt_info->l2.num_l1_ents)
+			return -E2BIG;
+
+ if (vmt_info->l2.l2ptrs[l1_index] == NULL)
+ return -EINVAL;
+
+ l2_table = vmt_info->l2.l2ptrs[l1_index];
+ *vmte = &l2_table[l2_index];
+ }
+
+ return 0;
+}
+
+/*
+ * Zero a VMT Entry, and flush & invalidate to the PoC, if required.
+ */
+static int vgic_v5_reset_vmte(struct kvm *kvm)
+{
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vmtl2_entry *vmte;
+ int ret;
+
+ ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
+ if (ret)
+ return ret;
+
+ WRITE_ONCE(vmte->val[0], 0ULL);
+ WRITE_ONCE(vmte->val[1], 0ULL);
+ WRITE_ONCE(vmte->val[2], 0ULL);
+ WRITE_ONCE(vmte->val[3], 0ULL);
+
+ vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
+
+ return 0;
+}
+
+/*
+ * Use the IDA to allocate a new VM ID, and track it in the gicv5_vm data
+ * structure. If we're out of VM IDs, the IDA catches that, and we return the
+ * error (-ENOSPC).
+ */
+int vgic_v5_allocate_vm_id(struct kvm *kvm)
+{
+ int id;
+
+ id = ida_alloc_max(&vmt_info->vm_id_ida, vmt_info->num_entries - 1u,
+ GFP_KERNEL);
+ if (id < 0)
+ return id;
+
+ kvm->arch.vgic.gicv5_vm.vm_id = id;
+ kvm->arch.vgic.gicv5_vm.vm_id_valid = true;
+
+ return 0;
+}
+
+/*
+ * Release the VM ID to allow it to be reallocated in the future.
+ */
+void vgic_v5_release_vm_id(struct kvm *kvm)
+{
+ ida_free(&vmt_info->vm_id_ida, kvm->arch.vgic.gicv5_vm.vm_id);
+ kvm->arch.vgic.gicv5_vm.vm_id_valid = false;
+}
+
+/*
+ * Initialise an entry in the VMT based on the index of the VM.
+ *
+ * Note: We don't mark the VMTE as valid as this needs to be done by
+ * the hardware.
+ */
+int vgic_v5_vmte_init(struct kvm *kvm)
+{
+ int nr_cpus = atomic_read(&kvm->online_vcpus);
+ struct vgic_v5_vm_info *vmi = NULL;
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ void *vmd = NULL, *vpet = NULL;
+ struct vmtl2_entry *vmte;
+ void **vped_ptrs = NULL;
+ size_t vpet_alloc_size;
+ int ret;
+ u64 tmp;
+
+ if (nr_cpus > vmt_info->max_vpes)
+ return -E2BIG;
+
+	/*
+	 * If we're using two-level VMTs, L2 is allocated on demand. For linear
+	 * VMTs, this is a NOP.
+	 */
+	ret = vgic_v5_alloc_l2_vmt(kvm);
+	if (ret)
+		return ret;
+
+ ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
+ if (ret)
+ return ret;
+
+ /* If the entry is already valid, something went wrong */
+ if (FIELD_GET(GICV5_VMTEL2E_VALID, le64_to_cpu(READ_ONCE(vmte->val[0])))) {
+ vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
+ return -EINVAL;
+ }
+
+ ret = vgic_v5_reset_vmte(kvm);
+ if (ret)
+ return ret;
+
+	vmi = kzalloc_obj(*vmi, GFP_KERNEL);
+ if (vmi == NULL) {
+ ret = -ENOMEM;
+ goto out_fail;
+ }
+
+ ret = xa_insert(&vm_info, vm_id, vmi, GFP_KERNEL);
+ if (ret)
+ goto out_fail;
+
+ /* Allocate and assign the VM Descriptor, if required. */
+ if (vmt_info->vmd_size != 0) {
+ vmd = kzalloc(vmt_info->vmd_size, GFP_KERNEL);
+ if (!vmd) {
+ ret = -ENOMEM;
+ goto out_fail;
+ }
+
+ /* Stash the VA so we can free it later */
+ vmi->vmd_base = vmd;
+
+ tmp = FIELD_PREP(GICV5_VMTEL2E_VMD_ADDR,
+ virt_to_phys(vmd) >>
+ GICV5_VMTEL2E_VMD_ADDR_SHIFT);
+ WRITE_ONCE(vmte->val[0], cpu_to_le64(tmp));
+ }
+
+ /*
+ * Allocate and assign the VPE Table. We can only describe the number of
+ * VPE ID Bits in the VMTE, and therefore we round up the number of CPUs
+ * to a whole power of two.
+ */
+ nr_cpus = roundup_pow_of_two(nr_cpus);
+ vmi->vpe_id_bits = fls(nr_cpus) - 1;
+
+ vpet_alloc_size = sizeof(vpe_entry) * nr_cpus;
+ vpet = kzalloc(vpet_alloc_size, GFP_KERNEL);
+ if (!vpet) {
+ ret = -ENOMEM;
+ goto out_fail;
+ }
+
+ /* Stash the VA so we can free it later */
+ vmi->vpet_base = vpet;
+
+ tmp = FIELD_PREP(GICV5_VMTEL2E_VPET_ADDR,
+ virt_to_phys(vpet) >> GICV5_VMTEL2E_VPET_ADDR_SHIFT);
+ tmp |= FIELD_PREP(GICV5_VMTEL2E_VPE_ID_BITS, vmi->vpe_id_bits);
+ WRITE_ONCE(vmte->val[1], cpu_to_le64(tmp));
+
+ vped_ptrs = kzalloc_objs(*vped_ptrs, nr_cpus, GFP_KERNEL);
+ if (vped_ptrs == NULL) {
+ ret = -ENOMEM;
+ goto out_fail;
+ }
+ vmi->vped_ptrs = vped_ptrs;
+
+ if (vmd)
+ vgic_v5_clean_inval(vmd, vmt_info->vmd_size, true, true);
+ vgic_v5_clean_inval(vpet, vpet_alloc_size, true, true);
+ vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
+
+ kvm->arch.vgic.gicv5_vm.vmte_allocated = true;
+
+ return 0;
+
+out_fail:
+ /* kfree(NULL) is safe so we can just kfree() at leisure */
+ kfree(vmd);
+ kfree(vpet);
+ kfree(vped_ptrs);
+ if (vmi)
+ xa_erase(&vm_info, vm_id);
+ kfree(vmi);
+
+ vgic_v5_reset_vmte(kvm);
+
+ return ret;
+}
+
+/*
+ * Release the VMT Entry, freeing up any allocated data structures before
+ * zeroing the VMTE.
+ *
+ * The VMTE must be marked as invalid before it is released.
+ */
+int vgic_v5_vmte_release(struct kvm *kvm)
+{
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+ struct vmtl2_entry *vmte;
+ int ret;
+
+ ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
+ if (ret)
+ return ret;
+
+ /* Reject if the VMTE has not been marked as invalid! */
+ if (FIELD_GET(GICV5_VMTEL2E_VALID, le64_to_cpu(READ_ONCE(vmte->val[0])))) {
+ vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
+ return -EINVAL;
+ }
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ goto no_vmi;
+
+ kfree(vmi->vmd_base);
+ kfree(vmi->vpet_base);
+
+ xa_erase(&vm_info, vm_id);
+ kfree(vmi);
+
+no_vmi:
+ /*
+ * If we didn't get far enough into allocating a VMTE to create the VM
+ * info structure, then we just zero the VMTE and move on. There's
+ * nothing else we can realistically do here.
+ */
+ ret = vgic_v5_reset_vmte(kvm);
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+/*
+ * Allocate a VPE descriptor and provide it to the hardware via the VPE Table.
+ */
+int vgic_v5_vmte_alloc_vpe(struct kvm_vcpu *vcpu)
+{
+ u16 vm_id = vgic_v5_vm_id(vcpu->kvm);
+ u16 vpe_id = vgic_v5_vpe_id(vcpu);
+ struct vgic_v5_vm_info *vmi;
+ vpe_entry tmp, *vpet_base;
+ void *vped;
+
+ /* Make sure we're not over what the hardware supports */
+ if (vpe_id >= vmt_info->max_vpes)
+ return -E2BIG;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return -EINVAL;
+
+ if (vpe_id >= 1 << vmi->vpe_id_bits)
+ return -E2BIG;
+
+ vpet_base = vmi->vpet_base;
+
+ /* If the VPETE for this CPU is already valid we've gone wrong */
+ if (FIELD_GET(GICV5_VPE_VALID, le64_to_cpu(READ_ONCE(vpet_base[vpe_id])))) {
+ vgic_v5_clean_inval(&vpet_base[vpe_id], sizeof(*vpet_base), true, true);
+ return -EBUSY;
+ }
+
+ /* Alloc VPE Descriptor. Only used by IRS. */
+ vped = kzalloc(vmt_info->vped_size, GFP_KERNEL);
+ if (vped == NULL)
+ return -ENOMEM;
+
+ vmi->vped_ptrs[vpe_id] = vped;
+
+ tmp = FIELD_PREP(GICV5_VPED_ADDR, virt_to_phys(vped) >> GICV5_VPED_ADDR_SHIFT);
+ WRITE_ONCE(vpet_base[vpe_id], cpu_to_le64(tmp));
+
+ vgic_v5_clean_inval(vped, vmt_info->vped_size, true, true);
+ vgic_v5_clean_inval(vpet_base + vpe_id, sizeof(vpe_entry), true, true);
+
+ return 0;
+}
+
+/*
+ * Free the memory allocated for the VPE descriptor.
+ */
+int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu)
+{
+ u16 vm_id = vgic_v5_vm_id(vcpu->kvm);
+ u16 vpe_id = vgic_v5_vpe_id(vcpu);
+ struct vgic_v5_vm_info *vmi;
+ struct vmtl2_entry *vmte;
+ vpe_entry *vpet_base;
+ void *vped;
+ int ret;
+
+ ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
+ if (ret)
+ return ret;
+
+ if (FIELD_GET(GICV5_VMTEL2E_VALID, le64_to_cpu(READ_ONCE(vmte->val[0])))) {
+ vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
+ return -EBUSY;
+ }
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (!vmi)
+ return -EINVAL;
+
+ if (vpe_id >= 1 << vmi->vpe_id_bits)
+ return -E2BIG;
+
+ vpet_base = vmi->vpet_base;
+ WRITE_ONCE(vpet_base[vpe_id], 0ULL);
+
+ vgic_v5_clean_inval(vpet_base + vpe_id, sizeof(vpe_entry), true, true);
+
+ /* Free VPE Descriptor. Only used by IRS. */
+ vped = vmi->vped_ptrs[vpe_id];
+ vmi->vped_ptrs[vpe_id] = NULL;
+ kfree(vped);
+
+ return 0;
+}
diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h b/arch/arm64/kvm/vgic/vgic-v5-tables.h
index cf00a248eabd5..5501a44308362 100644
--- a/arch/arm64/kvm/vgic/vgic-v5-tables.h
+++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
@@ -8,6 +8,86 @@
#include <linux/irqchip/arm-gic-v5.h>
+#define VM_ID_BITS_MIN 8
+#define VM_ID_BITS_MAX 16
+#define VMD_MIN_SIZE 8
+#define VMD_MAX_SIZE 4096
+#define VPED_MIN_SIZE 8
+#define VPED_MAX_SIZE 4096
+#define VPE_ID_BITS_MIN 8
+#define VPE_ID_BITS_MAX 16
+
+/* Level 1 Virtual Machine Table Entry */
+typedef __le64 vmtl1_entry;
+#define GICV5_VMTEL1E_VALID BIT_ULL(0)
+/* Note that there is no shift for the address by design */
+#define GICV5_VMTEL1E_L2_ADDR	GENMASK_ULL(51, 12)
+
+#define GICV5_VMTEL2E_SIZE 32ULL
+/* An L2 table (two-level VMT) is ALWAYS 4kB! */
+#define GICV5_VMT_L2_TABLE_SIZE 4096ULL
+#define GICV5_VMT_L2_TABLE_ENTRIES (GICV5_VMT_L2_TABLE_SIZE / GICV5_VMTEL2E_SIZE)
+
+/* Level 2 Virtual Machine Table Entry */
+struct vmtl2_entry {
+ __le64 val[4];
+};
+
+/*
+ * As the L2 VMTE is a large data structure, we are splitting it into 4 parts.
+ * We only mask and shift WITHIN each part for simplicity.
+ */
+/* First 64-bit chunk */
+#define GICV5_VMTEL2E_VALID BIT_ULL(0)
+#define GICV5_VMTEL2E_VMD_ADDR_SHIFT 3ULL
+#define GICV5_VMTEL2E_VMD_ADDR GENMASK_ULL(55, 3)
+/* Second 64-bit chunk */
+#define GICV5_VMTEL2E_VPET_ADDR_SHIFT 3ULL
+#define GICV5_VMTEL2E_VPET_ADDR GENMASK_ULL(55, 3)
+#define GICV5_VMTEL2E_VPE_ID_BITS GENMASK_ULL(63, 59)
+/* Third & fourth 64-bit chunks (the encodings are the same for each) */
+#define GICV5_VMTEL2E_IST_VALID BIT_ULL(0)
+#define GICV5_VMTEL2E_IST_L2SZ GENMASK_ULL(2, 1)
+#define GICV5_VMTEL2E_IST_ADDR_SHIFT 6ULL
+#define GICV5_VMTEL2E_IST_ADDR GENMASK_ULL(55, 6)
+#define GICV5_VMTEL2E_IST_ISTSZ GENMASK_ULL(57, 56)
+#define GICV5_VMTEL2E_IST_STRUCTURE BIT_ULL(58)
+#define GICV5_VMTEL2E_IST_ID_BITS GENMASK_ULL(63, 59)
+
+/* Virtual PE Table Entry */
+typedef __le64 vpe_entry;
+#define GICV5_VPE_VALID BIT_ULL(0)
+/* Note that there is no shift for the address by design. */
+#define GICV5_VPED_ADDR_SHIFT 3ULL
+#define GICV5_VPED_ADDR GENMASK_ULL(55, 3)
+
+struct vgic_v5_vm_info {
+	void *vmd_base;
+	vpe_entry *vpet_base;
+	void **vped_ptrs;
+ u8 vpe_id_bits;
+};
+
+struct vgic_v5_vmt {
+ union {
+ struct {
+ struct vmtl2_entry *vmt_base;
+ unsigned int num_ents;
+ } linear;
+ struct {
+ vmtl1_entry *vmt_base;
+ struct vmtl2_entry **l2ptrs;
+ unsigned int num_l1_ents;
+ } l2;
+ };
+ bool two_level;
+ unsigned int num_entries;
+ unsigned int max_vpes;
+ size_t vmd_size;
+ size_t vped_size;
+ struct ida vm_id_ida;
+};
+
struct vgic_v5_host_ist_caps {
/* IST Capabilities */
@@ -38,4 +118,32 @@ static inline struct vgic_v5_host_ist_caps *vgic_v5_host_caps(void)
return &gicv5_host_ist_caps;
}
+static inline u16 vgic_v5_vm_id(struct kvm *kvm)
+{
+ return kvm->arch.vgic.gicv5_vm.vm_id;
+}
+
+static inline u16 vgic_v5_vpe_id(struct kvm_vcpu *vcpu)
+{
+ return vcpu->vcpu_id;
+}
+
+static inline int vgic_v5_vpe_db(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.vgic_cpu.vgic_v5.gicv5_vpe.db;
+}
+
+int vgic_v5_vmt_allocate(bool two_level, unsigned int num_entries,
+ size_t vmd_size, size_t vped_size,
+ unsigned int vpe_id_bits);
+int vgic_v5_vmt_free(void);
+
+int vgic_v5_allocate_vm_id(struct kvm *kvm);
+void vgic_v5_release_vm_id(struct kvm *kvm);
+
+int vgic_v5_vmte_init(struct kvm *kvm);
+int vgic_v5_vmte_release(struct kvm *kvm);
+int vgic_v5_vmte_alloc_vpe(struct kvm_vcpu *vcpu);
+int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu);
+
#endif
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 05dbd01f6fd21..0bcbc751593cc 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -372,6 +372,8 @@ struct vgic_v5_vm {
int vpe_db_base;
int nr_vpes;
u16 vm_id;
+ bool vm_id_valid;
+ bool vmte_allocated;
};
struct vgic_dist {
diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
index 087d94f739672..89579ee04f5d1 100644
--- a/include/linux/irqchip/arm-gic-v5.h
+++ b/include/linux/irqchip/arm-gic-v5.h
@@ -182,6 +182,7 @@
#define GICV5_IRS_MAP_L2_ISTR_ID GENMASK(23, 0)
#define GICV5_ISTL1E_VALID BIT_ULL(0)
+#define GICV5_IRS_ISTL1E_SIZE 8UL
#define GICV5_ISTL1E_L2_ADDR_MASK GENMASK_ULL(55, 12)
@@ -444,4 +445,16 @@ void gicv5_free_lpi(u32 lpi);
void __init gicv5_its_of_probe(struct device_node *parent);
void __init gicv5_its_acpi_probe(void);
+
+enum gicv5_vcpu_info_cmd_type {
+ VMT_L2_MAP, /* Map in a L2 VMT - *may* happen on VM init */
+ VMTE_MAKE_VALID, /* Make the VMTE valid */
+ VMTE_MAKE_INVALID, /* Make the VMTE (et al.) invalid */
+};
+
+struct gicv5_cmd_info {
+ enum gicv5_vcpu_info_cmd_type cmd_type;
+ u64 data;
+};
+
#endif
--
2.34.1
* [PATCH 08/43] KVM: arm64: gic-v5: Introduce guest IST alloc and management
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (6 preceding siblings ...)
2026-04-27 16:08 ` [PATCH 07/43] KVM: arm64: gic-v5: Create & manage VM and VPE tables Sascha Bischoff
@ 2026-04-27 16:08 ` Sascha Bischoff
2026-04-29 14:29 ` Marc Zyngier
2026-04-27 16:09 ` [PATCH 09/43] KVM: arm64: gic-v5: Implement VMT/vIST IRS MMIO Ops Sascha Bischoff
` (34 subsequent siblings)
42 siblings, 1 reply; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:08 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
GICv5 guests use Interrupt State Tables (ISTs) to track and manage the
interrupt state for SPIs and LPIs. These ISTs are provided to the
host's IRS via the VMTE.
On a host GICv5 system, SPIs do not require any up-front memory
allocation prior to their use, unlike LPIs, which require the OS to
allocate an IST. For a GICv5 guest, the same holds from the guest's
point of view - the SPIs should require no explicit memory allocation
by the guest. This means that the hypervisor must provision the memory
which it passes to the IRS for managing a guest's SPI state.
In light of the above, the hypervisor allocates the SPI IST prior to
running the guest for the first time. As only a small number of SPIs
are expected, this is always allocated as a linear IST. The host is
responsible for freeing this memory on guest teardown.
For LPIs, the OS needs to provision memory for state tracking. This
applies to both hosts and guests, and so the guest will provision some
memory for the LPI IST. However, this memory is not directly used by
KVM. Instead, KVM allocates a shadow LPI IST which is passed to the
IRS (in the VMTE), and which the hypervisor must likewise free on
guest teardown. The shadow LPI IST is allocated as a two-level
structure, as many more LPIs than SPIs are expected.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-v5-tables.c | 531 +++++++++++++++++++++++++++
arch/arm64/kvm/vgic/vgic-v5-tables.h | 22 ++
include/linux/irqchip/arm-gic-v5.h | 3 +
3 files changed, 556 insertions(+)
diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c b/arch/arm64/kvm/vgic/vgic-v5-tables.c
index 502d05d46cccf..de905f37b61a5 100644
--- a/arch/arm64/kvm/vgic/vgic-v5-tables.c
+++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
@@ -501,6 +501,25 @@ int vgic_v5_vmte_init(struct kvm *kvm)
return ret;
}
+/*
+ * The following set of forward declarations makes the code layout a *little*
+ * clearer as it lets us keep the IST-related code together.
+ */
+static int vgic_v5_alloc_linear_ist(struct kvm *kvm, bool spi_ist,
+ unsigned int id_bits,
+ unsigned int istsz);
+static int vgic_v5_alloc_l1_ist(struct kvm *kvm, unsigned int id_bits,
+ unsigned int istsz, unsigned int l2sz);
+static int vgic_v5_alloc_l2_ists(struct kvm *kvm, unsigned int id_bits,
+ unsigned int istsz, unsigned int l2sz);
+static int vgic_v5_alloc_two_level_lpi_ist(struct kvm *kvm,
+ unsigned int id_bits,
+ unsigned int istsz,
+ unsigned int l2sz);
+static int vgic_v5_linear_ist_free(struct kvm *kvm, bool spi);
+static int vgic_v5_two_level_ist_free(struct kvm *kvm, bool spi);
+static int vgic_v5_spi_ist_free(struct kvm *kvm);
+
/*
* Release the VMT Entry, freeing up any allocated data structures before
* zeroing the VMTE.
@@ -531,6 +550,18 @@ int vgic_v5_vmte_release(struct kvm *kvm)
kfree(vmi->vmd_base);
kfree(vmi->vpet_base);
+ /* If we have an LPI IST, free it */
+ if (vmi->h_lpi_ist) {
+ ret = vgic_v5_lpi_ist_free(kvm);
+ if (ret)
+ return ret;
+ }
+
+ /* If we have an SPI IST, free it */
+ if (vmi->h_spi_ist) {
+ ret = vgic_v5_spi_ist_free(kvm);
+ if (ret)
+ return ret;
+ }
+
xa_erase(&vm_info, vm_id);
kfree(vmi);
@@ -634,3 +665,503 @@ int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu)
return 0;
}
+
+/*
+ * Assign an already allocated IST to the VM by populating the fields in the
+ * corresponding VMTE. This code is reused for both the SPI IST and the LPI
+ * IST, even though the paths that reach it differ considerably.
+ */
+int vgic_v5_vmte_assign_ist(struct kvm *kvm, phys_addr_t ist_base,
+ bool two_level, unsigned int id_bits,
+ unsigned int l2sz, unsigned int istsz,
+ bool spi_ist)
+{
+ struct kvm_vcpu *vcpu0 = kvm_get_vcpu(kvm, 0);
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct gicv5_cmd_info cmd_info;
+ struct vmtl2_entry *vmte;
+ unsigned int section;
+ u64 tmp;
+ int ret;
+
+ section = spi_ist ? GICV5_VMTEL2_SPI_SECTION : GICV5_VMTEL2_LPI_SECTION;
+
+ if (ist_base & ~GICV5_VMTEL2E_IST_ADDR) {
+ kvm_err("IST alignment issue! Address: 0x%llx, Mask 0x%llx\n",
+ ist_base, GICV5_VMTEL2E_IST_ADDR);
+ return -EINVAL;
+ }
+
+ ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
+ if (ret)
+ return ret;
+
+ /* Bail if already allocated - something is broken! */
+ if (FIELD_GET(GICV5_VMTEL2E_IST_VALID, vmte->val[section])) {
+ vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
+ return -EINVAL;
+ }
+
+ tmp = FIELD_PREP(GICV5_VMTEL2E_IST_L2SZ, l2sz);
+ tmp |= FIELD_PREP(GICV5_VMTEL2E_IST_ADDR,
+ ist_base >> GICV5_VMTEL2E_IST_ADDR_SHIFT);
+ tmp |= FIELD_PREP(GICV5_VMTEL2E_IST_ISTSZ, istsz);
+ tmp |= FIELD_PREP(GICV5_VMTEL2E_IST_ID_BITS, id_bits);
+ tmp |= FIELD_PREP(GICV5_VMTEL2E_IST_STRUCTURE, two_level);
+
+ WRITE_ONCE(vmte->val[section], cpu_to_le64(tmp));
+ vgic_v5_clean_inval(vmte, sizeof(*vmte), true, false);
+
+ /* Finally, mark the entry as valid */
+ cmd_info.cmd_type = spi_ist ? SPI_VIST_MAKE_VALID : LPI_VIST_MAKE_VALID;
+ ret = irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu0), &cmd_info);
+
+ /* Any cached entries we now have are stale! */
+ vgic_v5_clean_inval(vmte, sizeof(*vmte), false, true);
+
+ return ret;
+}
+
+/*
+ * Helper to determine the correct l2sz to use based on the combination of
+ * PAGE_SIZE and whatever hardware supports.
+ */
+static unsigned int vgic_v5_ist_l2sz(void)
+{
+ switch (PAGE_SIZE) {
+ case SZ_64K:
+ if (gicv5_host_ist_caps.ist_l2sz & 0x4)
+ return GICV5_IRS_IST_CFGR_L2SZ_64K;
+ fallthrough;
+ case SZ_4K:
+ if (gicv5_host_ist_caps.ist_l2sz & 0x1)
+ return GICV5_IRS_IST_CFGR_L2SZ_4K;
+ fallthrough;
+ case SZ_16K:
+ if (gicv5_host_ist_caps.ist_l2sz & 0x2)
+ return GICV5_IRS_IST_CFGR_L2SZ_16K;
+ break;
+ }
+
+ if (gicv5_host_ist_caps.ist_l2sz & 0x1)
+ return GICV5_IRS_IST_CFGR_L2SZ_4K;
+
+ return GICV5_IRS_IST_CFGR_L2SZ_64K;
+}
+
+/* Helper to determine ISTE size based on metadata requirements */
+static unsigned int vgic_v5_ist_istsz(unsigned int id_bits)
+{
+ if (!gicv5_host_ist_caps.istmd)
+ return GICV5_IRS_IST_CFGR_ISTSZ_4;
+
+ if (id_bits >= gicv5_host_ist_caps.istmd_sz)
+ return GICV5_IRS_IST_CFGR_ISTSZ_16;
+
+ return GICV5_IRS_IST_CFGR_ISTSZ_8;
+}
+
+/*
+ * Allocate a Linear IST - always used for SPIs and potentially LPIs.
+ *
+ * The calculation for n has been taken from the GICv5 spec.
+ *
+ * NOTE: istsz is the FIELD used by GICv5, not the actual size (or log2() of the
+ * size).
+ */
+static int vgic_v5_alloc_linear_ist(struct kvm *kvm, bool spi_ist,
+ unsigned int id_bits, unsigned int istsz)
+{
+ const size_t n = id_bits + 1 + istsz;
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+ __le64 *ist;
+ u32 l1sz;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return -EINVAL;
+
+ /*
+ * Allocate the IST. We only have one level, so we just use the L2 ISTE.
+ */
+ l1sz = BIT(n + 1);
+ ist = kzalloc(l1sz, GFP_KERNEL);
+ if (!ist)
+ return -ENOMEM;
+
+ if (spi_ist) {
+ vmi->h_spi_ist = ist;
+ } else {
+ vmi->h_lpi_ist_structure = false;
+ vmi->h_lpi_ist = ist;
+ }
+
+ vgic_v5_clean_inval(ist, l1sz, true, true);
+
+ return 0;
+}
+
+/*
+ * Allocate the first level of a two-level IST - LPI, only.
+ *
+ * The calculations for n, l1_size have been taken from the GICv5 spec.
+ *
+ * NOTE: istsz and l2sz are the FIELDS used by GICv5, not the actual sizes (or
+ * log2() of the sizes).
+ */
+static int vgic_v5_alloc_l1_ist(struct kvm *kvm, unsigned int id_bits,
+ unsigned int istsz, unsigned int l2sz)
+{
+ const size_t n = max(5, id_bits - ((10 - istsz) + (2 * l2sz)) + 3 - 1);
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ const u32 l1_size = BIT(n + 1);
+ struct vgic_v5_vm_info *vmi;
+ __le64 *ist;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return -EINVAL;
+
+ ist = kzalloc(l1_size, GFP_KERNEL);
+ if (!ist)
+ return -ENOMEM;
+
+ vmi->h_lpi_ist_structure = true;
+ vmi->h_lpi_ist = ist;
+
+ vgic_v5_clean_inval(ist, l1_size, true, true);
+
+ return 0;
+}
+
+/*
+ * Allocate ALL of the second level ISTs for a two-level IST - LPI, only.
+ *
+ * The calculations for n, l1_entries, l2_size have been taken from the GICv5
+ * spec.
+ *
+ * NOTE: istsz and l2sz are the FIELDS used by GICv5, not the actual sizes (or
+ * log2() of the sizes).
+ */
+static int vgic_v5_alloc_l2_ists(struct kvm *kvm, unsigned int id_bits,
+ unsigned int istsz, unsigned int l2sz)
+{
+ const size_t n = max(5, id_bits - ((10 - istsz) + (2 * l2sz)) + 3 - 1);
+ const int l1_entries = BIT(n + 1) / GICV5_IRS_ISTL1E_SIZE;
+ const size_t l2_size = BIT(11 + (2 * l2sz) + 1);
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+ __le64 *l2ist;
+ __le64 *l1ist;
+ int index;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return -EINVAL;
+
+ l1ist = vmi->h_lpi_ist;
+
+ /*
+ * Allocate the storage for the pointers to the L2 ISTs (used when
+ * freeing later).
+ */
+ vmi->h_lpi_l2_ists = kzalloc_objs(*vmi->h_lpi_l2_ists, l1_entries,
+ GFP_KERNEL);
+ if (!vmi->h_lpi_l2_ists)
+ return -ENOMEM;
+
+ /* Allocate the L2 IST for each L1 IST entry */
+ for (index = 0; index < l1_entries; ++index) {
+ l2ist = kzalloc(l2_size, GFP_KERNEL);
+ if (!l2ist) {
+ while (--index >= 0)
+ kfree(vmi->h_lpi_l2_ists[index]);
+
+ kfree(vmi->h_lpi_l2_ists);
+ vmi->h_lpi_l2_ists = NULL;
+
+ return -ENOMEM;
+ }
+
+ /*
+ * We are not doing on-demand allocation of the L2 ISTs, and are
+ * instead provisioning the whole IST up front. This means that
+ * we can mark every L2 IST as valid in the L1 ISTEs now, while
+ * the overall IST is not yet valid.
+ */
+ l1ist[index] = cpu_to_le64((virt_to_phys(l2ist) &
+ GICV5_ISTL1E_L2_ADDR_MASK) |
+ GICV5_ISTL1E_VALID);
+
+ vmi->h_lpi_l2_ists[index] = l2ist;
+
+ vgic_v5_clean_inval(l2ist, l2_size, true, true);
+ }
+
+ /* Handle CMOs for the whole L1 IST in one go */
+ vgic_v5_clean_inval(l1ist, l1_entries * sizeof(*l1ist), true, false);
+
+ return 0;
+}
+
+/* Allocate a two-level IST - LPIs, only */
+static int vgic_v5_alloc_two_level_lpi_ist(struct kvm *kvm, unsigned int id_bits,
+ unsigned int istsz, unsigned int l2sz)
+{
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+ int ret;
+
+ /*
+ * Allocate the L1 IST first, then all of the L2s. Everything
+ * is preallocated and we do no on-demand IST allocation. This
+ * is to avoid needing to track if and when the guest is doing
+ * on-demand IST allocation.
+ */
+ ret = vgic_v5_alloc_l1_ist(kvm, id_bits, istsz, l2sz);
+ if (ret)
+ return ret;
+
+ ret = vgic_v5_alloc_l2_ists(kvm, id_bits, istsz, l2sz);
+ if (ret) {
+ /* Free the L1 IST again */
+ vmi = xa_load(&vm_info, vm_id);
+ kfree(vmi->h_lpi_ist);
+ vmi->h_lpi_ist = NULL;
+
+ return ret;
+ }
+
+ return 0;
+}
+
+static void vgic_v5_free_allocated_lpi_ist(struct vgic_v5_vm_info *vmi,
+ unsigned int id_bits,
+ unsigned int istsz,
+ unsigned int l2sz)
+{
+ if (!vmi->h_lpi_ist_structure) {
+ kfree(vmi->h_lpi_ist);
+ vmi->h_lpi_ist = NULL;
+ return;
+ }
+
+ if (vmi->h_lpi_l2_ists) {
+ const size_t n = max(5, id_bits - ((10 - istsz) + (2 * l2sz)) + 3 - 1);
+ const int l1_entries = BIT(n + 1) / GICV5_IRS_ISTL1E_SIZE;
+ int index;
+
+ for (index = 0; index < l1_entries; ++index)
+ kfree(vmi->h_lpi_l2_ists[index]);
+
+ kfree(vmi->h_lpi_l2_ists);
+ vmi->h_lpi_l2_ists = NULL;
+ }
+
+ kfree(vmi->h_lpi_ist);
+ vmi->h_lpi_ist = NULL;
+}
+
+void vgic_v5_free_allocated_spi_ist(struct kvm *kvm)
+{
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return;
+
+ kfree(vmi->h_spi_ist);
+ vmi->h_spi_ist = NULL;
+}
+
+/*
+ * Free a Linear IST. Can only happen once the VM is dead.
+ */
+static int vgic_v5_linear_ist_free(struct kvm *kvm, bool spi)
+{
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vmtl2_entry *vmte;
+ struct vgic_v5_vm_info *vmi;
+ int section, ret;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (!vmi)
+ return -EINVAL;
+
+ ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
+ if (ret)
+ return ret;
+
+ if (spi) {
+ section = GICV5_VMTEL2_SPI_SECTION;
+ vgic_v5_free_allocated_spi_ist(kvm);
+ } else {
+ section = GICV5_VMTEL2_LPI_SECTION;
+ vgic_v5_free_allocated_lpi_ist(vmi, 0, 0, 0);
+ }
+
+ /* The VM should be dead here, so we can just zero the VMT section */
+ WRITE_ONCE(vmte->val[section], 0ULL);
+ vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
+
+ return 0;
+}
+
+/*
+ * Free a Two-Level IST. Can only happen once the VM is dead.
+ */
+static int vgic_v5_two_level_ist_free(struct kvm *kvm, bool spi)
+{
+ unsigned int id_bits, istsz, l2sz;
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+ struct vmtl2_entry *vmte;
+ int section;
+ u64 tmp;
+ int ret;
+
+ /* We don't create two-level SPI ISTs, so freeing one is a bad idea! */
+ if (spi)
+ return -EINVAL;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (!vmi)
+ return -EINVAL;
+
+ if (!vmi->h_lpi_ist_structure)
+ return -EINVAL;
+
+ section = GICV5_VMTEL2_LPI_SECTION;
+
+ ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
+ if (ret)
+ return ret;
+
+ tmp = le64_to_cpu(READ_ONCE(vmte->val[section]));
+
+ id_bits = FIELD_GET(GICV5_VMTEL2E_IST_ID_BITS, tmp);
+ istsz = FIELD_GET(GICV5_VMTEL2E_IST_ISTSZ, tmp);
+ l2sz = FIELD_GET(GICV5_VMTEL2E_IST_L2SZ, tmp);
+
+ vgic_v5_free_allocated_lpi_ist(vmi, id_bits, istsz, l2sz);
+
+ /* The VM must be dead, so we can just zero the VMT section */
+ WRITE_ONCE(vmte->val[section], 0ULL);
+
+ vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
+
+ return 0;
+}
+
+/*
+ * Allocate an IST for SPIs.
+ *
+ * We don't anticipate a large number of SPIs being allocated. Therefore, we
+ * always allocate a Linear IST for SPIs. This will need to be revisited should
+ * that assumption no longer hold.
+ */
+int vgic_v5_spi_ist_allocate(struct kvm *kvm, phys_addr_t *base_addr,
+ unsigned int id_bits, unsigned int istsz)
+{
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+ int ret;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return -EINVAL;
+
+ ret = vgic_v5_alloc_linear_ist(kvm, true, id_bits, istsz);
+ if (ret)
+ return ret;
+
+ *base_addr = virt_to_phys(vmi->h_spi_ist);
+
+ return 0;
+}
+
+/*
+ * Free the IST for SPIs. Should only happen once the VM is dead.
+ */
+static int vgic_v5_spi_ist_free(struct kvm *kvm)
+{
+ return vgic_v5_linear_ist_free(kvm, true);
+}
+
+/*
+ * Allocate an IST for LPIs.
+ *
+ * Unlike with SPIs, we anticipate that the guest will allocate a relatively
+ * large number of LPIs. Therefore, while we support doing a linear LPI IST, it
+ * is expected that LPI ISTs will be two-level.
+ */
+int vgic_v5_lpi_ist_alloc(struct kvm *kvm, unsigned int id_bits)
+{
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+ unsigned int istsz, l2sz;
+ phys_addr_t phys_addr;
+ bool two_level;
+ int ret;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return -EINVAL;
+
+ istsz = vgic_v5_ist_istsz(id_bits);
+ l2sz = vgic_v5_ist_l2sz();
+
+ /*
+ * Determine if we want to create a Linear or a Two-Level IST.
+ *
+ * If we require more than one page for the IST, create a Two-Level IST
+ * (if the host supports it, which is likely).
+ *
+ * Note: GICv5's istsz is not the size of the ISTEs in log2(bytes). It
+ * is 2 less, hence the +2 below.
+ */
+ two_level = gicv5_host_ist_caps.ist_levels &&
+ id_bits > PAGE_SHIFT - (2 + istsz);
+
+ if (!two_level)
+ ret = vgic_v5_alloc_linear_ist(kvm, false /* LPIs, not SPIs */,
+ id_bits, istsz);
+ else
+ ret = vgic_v5_alloc_two_level_lpi_ist(kvm, id_bits, istsz,
+ l2sz);
+
+ if (ret)
+ return ret;
+
+ phys_addr = virt_to_phys(vmi->h_lpi_ist);
+ ret = vgic_v5_vmte_assign_ist(kvm, phys_addr, two_level, id_bits, l2sz,
+ istsz, false);
+ if (ret)
+ vgic_v5_free_allocated_lpi_ist(vmi, id_bits, istsz, l2sz);
+
+ return ret;
+}
+
+/* Free the LPI IST again */
+int vgic_v5_lpi_ist_free(struct kvm *kvm)
+{
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (!vmi)
+ return -ENXIO;
+
+ if (!vmi->h_lpi_ist_structure)
+ return vgic_v5_linear_ist_free(kvm, false);
+
+ return vgic_v5_two_level_ist_free(kvm, false);
+}
diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h b/arch/arm64/kvm/vgic/vgic-v5-tables.h
index 5501a44308362..37e220cda1987 100644
--- a/arch/arm64/kvm/vgic/vgic-v5-tables.h
+++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
@@ -54,6 +54,13 @@ struct vmtl2_entry {
#define GICV5_VMTEL2E_IST_STRUCTURE BIT_ULL(58)
#define GICV5_VMTEL2E_IST_ID_BITS GENMASK_ULL(63, 59)
+/*
+ * The LPI and SPI configuration is stored in the 2nd and 3rd 64-bit chunks of
+ * the VMTE (0-based).
+ */
+#define GICV5_VMTEL2_LPI_SECTION 2
+#define GICV5_VMTEL2_SPI_SECTION 3
+
/* Virtual PE Table Entry */
typedef __le64 vpe_entry;
#define GICV5_VPE_VALID BIT_ULL(0)
@@ -66,6 +73,12 @@ struct vgic_v5_vm_info {
vpe_entry __iomem *vpet_base;
void __iomem **vped_ptrs;
u8 vpe_id_bits;
+
+ /* Tracking for the hyp-owned ISTs */
+ bool h_lpi_ist_structure;
+ __le64 *h_lpi_ist;
+ __le64 **h_lpi_l2_ists;
+ __le64 *h_spi_ist;
};
struct vgic_v5_vmt {
@@ -146,4 +159,13 @@ int vgic_v5_vmte_release(struct kvm *kvm);
int vgic_v5_vmte_alloc_vpe(struct kvm_vcpu *vcpu);
int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu);
+int vgic_v5_vmte_assign_ist(struct kvm *kvm, phys_addr_t ist_base,
+ bool two_level, unsigned int id_bits,
+ unsigned int l2sz, unsigned int istsz, bool spi_ist);
+int vgic_v5_spi_ist_allocate(struct kvm *kvm, phys_addr_t *base_addr,
+ unsigned int id_bits, unsigned int istsz);
+void vgic_v5_free_allocated_spi_ist(struct kvm *kvm);
+int vgic_v5_lpi_ist_alloc(struct kvm *kvm, unsigned int id_bits);
+int vgic_v5_lpi_ist_free(struct kvm *kvm);
+
#endif
diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
index 89579ee04f5d1..ccec0a045927c 100644
--- a/include/linux/irqchip/arm-gic-v5.h
+++ b/include/linux/irqchip/arm-gic-v5.h
@@ -450,6 +450,9 @@ enum gicv5_vcpu_info_cmd_type {
VMT_L2_MAP, /* Map in a L2 VMT - *may* happen on VM init */
VMTE_MAKE_VALID, /* Make the VMTE valid */
VMTE_MAKE_INVALID, /* Make the VMTE (et al.) invalid */
+ SPI_VIST_MAKE_VALID, /* No corresponding invalid */
+ LPI_VIST_MAKE_VALID, /* Triggered by a guest */
+ LPI_VIST_MAKE_INVALID, /* Triggered by a guest */
};
struct gicv5_cmd_info {
--
2.34.1
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [PATCH 09/43] KVM: arm64: gic-v5: Implement VMT/vIST IRS MMIO Ops
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (7 preceding siblings ...)
2026-04-27 16:08 ` [PATCH 08/43] KVM: arm64: gic-v5: Introduce guest IST alloc and management Sascha Bischoff
@ 2026-04-27 16:09 ` Sascha Bischoff
2026-04-29 12:50 ` Joey Gouly
2026-04-29 16:04 ` Marc Zyngier
2026-04-27 16:09 ` [PATCH 10/43] KVM: arm64: gic-v5: Implement VPE " Sascha Bischoff
` (33 subsequent siblings)
42 siblings, 2 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:09 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
GICv5 has rules about which fields of a VMTE (or L1 VMT) may be
directly written by the host once the table is valid. This ensures
that no stale state is cached by the hardware, and provides a clear
interface for making VMs, ISTs, etc, valid.
The hypervisor is responsible for populating the VMTE for a
VM. However, it is not permitted to write the Valid bit (as the VM
table is already valid). Instead, the VM is made valid via an IRS MMIO
Op. The same applies to the ISTs - they must be made valid via the
host IRS.
This commit adds support for:
* Making level 2 VMTs valid (only), allowing for dynamic level 2 table
allocation.
* Making VMTEs (VMs) valid or invalid
* Making SPI/LPI ISTs valid or invalid for a specific VM
When (successfully) probing for a GICv5, the VMT is allocated, and is
made valid via the IRS's MMIO interface.
This commit also extends the doorbell domain to allow the doorbells
themselves to act as a conduit for issuing commands - this is similar
to what exists for GICv4 support. Effectively, irq_set_vcpu_affinity()
becomes an ioctl-like interface for issuing commands specific to
either a VM or the particular VPE that the doorbell belongs to. This
change adds support for the following via the VPE doorbells:
VMT_L2_MAP - Make a second level VM table valid
VMTE_MAKE_VALID - Make a single VMTE (and hence VM) valid
VMTE_MAKE_INVALID - Make a single VMTE (and hence VM) invalid
SPI_VIST_MAKE_VALID - Make the SPI IST valid
LPI_VIST_MAKE_VALID - Make the LPI IST valid
LPI_VIST_MAKE_INVALID - Make the LPI IST invalid
Note: It is intentional that there is no SPI_VIST_MAKE_INVALID - this
cannot happen while the VM is live, and given that the SPI IST is
allocated as part of VM creation, there is no need to make it invalid
again until the VM is destroyed, at which point the VMTE is already
invalid. Therefore, there is no need to go via the host's IRS MMIO
interface, as the SPI IST can be directly marked as invalid and
freed. LPIs, on the other hand, are driven by the guest itself, and
the guest is theoretically free to invalidate and free the LPI IST at
any point.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-v5-tables.c | 25 +++
arch/arm64/kvm/vgic/vgic-v5-tables.h | 2 +
arch/arm64/kvm/vgic/vgic-v5.c | 236 ++++++++++++++++++++++++++-
include/linux/irqchip/arm-gic-v5.h | 30 ++++
4 files changed, 290 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c b/arch/arm64/kvm/vgic/vgic-v5-tables.c
index de905f37b61a5..0120c3205dea6 100644
--- a/arch/arm64/kvm/vgic/vgic-v5-tables.c
+++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
@@ -666,6 +666,26 @@ int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu)
return 0;
}
+phys_addr_t vgic_v5_get_vmt_base(void)
+{
+ phys_addr_t vmt_base;
+
+ if (!vgic_v5_vmt_allocated())
+ return -ENXIO;
+
+ if (!vmt_info->two_level)
+ vmt_base = virt_to_phys(vmt_info->linear.vmt_base);
+ else
+ vmt_base = virt_to_phys(vmt_info->l2.vmt_base);
+
+ return vmt_base;
+}
+
+u8 vgic_v5_vmt_vpe_id_bits(void)
+{
+ return fls(vmt_info->max_vpes) - 1;
+}
+
/*
* Assign an already allocated IST to the VM by populating the fields in the
* corresponding VMTE. We re-use this code for both an SPI IST and LPI IST, even
@@ -715,6 +735,11 @@ int vgic_v5_vmte_assign_ist(struct kvm *kvm, phys_addr_t ist_base,
/* Finally, mark the entry as valid */
cmd_info.cmd_type = spi_ist ? SPI_VIST_MAKE_VALID : LPI_VIST_MAKE_VALID;
ret = irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu0), &cmd_info);
+ if (ret) {
+ WRITE_ONCE(vmte->val[section], 0ULL);
+ vgic_v5_clean_inval(vmte, sizeof(*vmte), true, false);
+ return ret;
+ }
/* Any cached entries we now have are stale! */
vgic_v5_clean_inval(vmte, sizeof(*vmte), false, true);
diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h b/arch/arm64/kvm/vgic/vgic-v5-tables.h
index 37e220cda1987..6a024337eba79 100644
--- a/arch/arm64/kvm/vgic/vgic-v5-tables.h
+++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
@@ -150,6 +150,8 @@ int vgic_v5_vmt_allocate(bool two_level, unsigned int num_entries,
size_t vmd_size, size_t vped_size,
unsigned int vpe_id_bits);
int vgic_v5_vmt_free(void);
+phys_addr_t vgic_v5_get_vmt_base(void);
+u8 vgic_v5_vmt_vpe_id_bits(void);
int vgic_v5_allocate_vm_id(struct kvm *kvm);
void vgic_v5_release_vm_id(struct kvm *kvm);
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index 4e0d52b309628..49eb01ca07961 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -36,6 +36,12 @@ static void vgic_v5_get_implemented_ppis(void)
__assign_bit(GICV5_ARCH_PPI_PMUIRQ, ppi_caps.impl_ppi_mask, system_supports_pmuv3());
}
+/*
+ * The IRS MMIO interface is shared between all VMs, so make sure we don't do
+ * anything stupid!
+ */
+static DEFINE_RAW_SPINLOCK(vm_config_lock);
+
static void __iomem *irs_base;
static u32 irs_readl_relaxed(const u32 reg_offset)
@@ -43,6 +49,21 @@ static u32 irs_readl_relaxed(const u32 reg_offset)
return readl_relaxed(irs_base + reg_offset);
}
+static void irs_writel_relaxed(const u32 val, const u32 reg_offset)
+{
+ writel_relaxed(val, irs_base + reg_offset);
+}
+
+static u64 irs_readq_relaxed(const u32 reg_offset)
+{
+ return readq_relaxed(irs_base + reg_offset);
+}
+
+static void irs_writeq_relaxed(const u64 val, const u32 reg_offset)
+{
+ writeq_relaxed(val, irs_base + reg_offset);
+}
+
static int gicv5_irs_extract_vm_caps(const struct gic_kvm_info *info)
{
u64 idr;
@@ -84,16 +105,22 @@ static int gicv5_irs_extract_vm_caps(const struct gic_kvm_info *info)
return 0;
}
+/* Forward declarations for cleaner code layout */
+static int vgic_v5_irs_assign_vmt(bool two_level, u8 vm_id_bits, phys_addr_t vmt_base);
+static int vgic_v5_irs_clear_vmt(void);
+
/*
* Probe for a vGICv5 compatible interrupt controller, returning 0 on success.
*/
int vgic_v5_probe(const struct gic_kvm_info *info)
{
+ struct vgic_v5_host_ist_caps *ist_caps;
bool v5_registered = false;
u64 ich_vtr_el2;
int ret;
kvm_vgic_global_state.type = VGIC_V5;
+ kvm_vgic_global_state.max_gic_vcpus = VGIC_V5_MAX_CPUS;
kvm_vgic_global_state.vcpu_base = 0;
kvm_vgic_global_state.vctrl_base = NULL;
@@ -114,13 +141,53 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
if (gicv5_irs_extract_vm_caps(info))
goto skip_v5;
- kvm_vgic_global_state.max_gic_vcpus = VGIC_V5_MAX_CPUS;
+ ist_caps = vgic_v5_host_caps();
+
+ /*
+ * Even if the HW supports more per-VM vCPUs, artificially cap as we
+ * can't use them all.
+ */
+ kvm_vgic_global_state.max_gic_vcpus = min(ist_caps->max_vpes,
+ VGIC_V5_MAX_CPUS);
+
+ /*
+ * GICv5 requires a set of tables to be allocated in order to manage
+ * VMs. We allocate them in advance here, which alas means that we
+ * already have to make a decision regarding the maximum number of VMs
+ * we want to run. For now, we match the maximum number offered by the
+ * hardware, but this might not be a wise choice in the long term.
+ */
+ ret = vgic_v5_vmt_allocate(ist_caps->two_level_vmt_support,
+ ist_caps->max_vms, ist_caps->vmd_size,
+ ist_caps->vped_size,
+ kvm_vgic_global_state.max_gic_vcpus);
+ if (ret) {
+ kvm_err("Failed to allocate the GICv5 VM tables; no GICv5 support\n");
+ goto skip_v5;
+ }
+
+ /*
+ * We've now allocated the VM table, but the host's IRS doesn't know
+ * about it yet. Provide the base address of the VMT to the IRS, as well
+ * as the number of ID bits that it covers and the structure used
+ * (linear/two-level).
+ */
+ ret = vgic_v5_irs_assign_vmt(ist_caps->two_level_vmt_support,
+ vgic_v5_vmt_vpe_id_bits(),
+ vgic_v5_get_vmt_base());
+ if (ret) {
+ kvm_err("Failed to assign the GICv5 VM tables to the IRS; no GICv5 support\n");
+ vgic_v5_vmt_free();
+ goto skip_v5;
+ }
vgic_v5_get_implemented_ppis();
ret = kvm_register_vgic_device(KVM_DEV_TYPE_ARM_VGIC_V5);
if (ret) {
kvm_err("Cannot register GICv5 KVM device.\n");
+ vgic_v5_irs_clear_vmt();
+ vgic_v5_vmt_free();
goto skip_v5;
}
@@ -148,12 +215,13 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
ret = kvm_register_vgic_device(KVM_DEV_TYPE_ARM_VGIC_V3);
if (ret) {
kvm_err("Cannot register GICv3-legacy KVM device.\n");
- return ret;
+ /* vGICv5 should still work */
+ return v5_registered ? 0 : ret;
}
/* We potentially limit the max VCPUs further than we need to here */
kvm_vgic_global_state.max_gic_vcpus = min(VGIC_V3_MAX_CPUS,
- VGIC_V5_MAX_CPUS);
+ kvm_vgic_global_state.max_gic_vcpus);
static_branch_enable(&kvm_vgic_global_state.gicv3_cpuif);
kvm_info("GCIE legacy system register CPU interface\n");
@@ -163,6 +231,167 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
return 0;
}
+/*
+ * Wait for completion of a change in any of IRS_VMT_BASER, IRS_VMAP_L2_VMTR,
+ * IRS_VMAP_VMR, IRS_VMAP_VPER, IRS_VMAP_VISTR, IRS_VMAP_L2_VISTR.
+ */
+static int vgic_v5_irs_wait_for_vm_op(void)
+{
+ u32 statusr;
+ int ret;
+
+ ret = readl_relaxed_poll_timeout_atomic(
+ irs_base + GICV5_IRS_VMT_STATUSR, statusr,
+ FIELD_GET(GICV5_IRS_VMT_STATUSR_IDLE, statusr), 1,
+ USEC_PER_SEC);
+
+ if (ret == -ETIMEDOUT) {
+ pr_err_ratelimited("Time out waiting for IRS VM Op\n");
+ return ret;
+ }
+
+ return 0;
+}
+
+static int vgic_v5_irs_assign_vmt(bool two_level, u8 vm_id_bits, phys_addr_t vmt_base)
+{
+ u64 vmt_baser;
+ u32 vmt_cfgr;
+
+ vmt_baser = irs_readq_relaxed(GICV5_IRS_VMT_BASER);
+ if (FIELD_GET(GICV5_IRS_VMT_BASER_VALID, vmt_baser))
+ return -EBUSY;
+
+ vmt_cfgr = FIELD_PREP(GICV5_IRS_VMT_CFGR_VM_ID_BITS, vm_id_bits);
+ if (two_level)
+ vmt_cfgr |= FIELD_PREP(GICV5_IRS_VMT_CFGR_STRUCTURE,
+ GICV5_IRS_VMT_CFGR_STRUCTURE_TWO_LEVEL);
+
+ irs_writel_relaxed(vmt_cfgr, GICV5_IRS_VMT_CFGR);
+
+ /* The base address is intentionally only masked and not shifted */
+ vmt_baser = FIELD_PREP(GICV5_IRS_VMT_BASER_VALID, true) |
+ (vmt_base & GICV5_IRS_VMT_BASER_ADDR);
+ irs_writeq_relaxed(vmt_baser, GICV5_IRS_VMT_BASER);
+
+ return vgic_v5_irs_wait_for_vm_op();
+}
+
+static int vgic_v5_irs_clear_vmt(void)
+{
+ irs_writeq_relaxed(0ULL, GICV5_IRS_VMT_BASER);
+
+ return vgic_v5_irs_wait_for_vm_op();
+}
+
+static int vgic_v5_irs_vmap_l2_vmt(int vm_id)
+{
+ u64 vmap_l2_vmtr;
+ int ret;
+
+ guard(raw_spinlock)(&vm_config_lock);
+
+ /* Make sure that we are idle to begin with */
+ ret = vgic_v5_irs_wait_for_vm_op();
+ if (ret)
+ return ret;
+
+ /* Map in the L2 VMT covering this VM ID */
+ vmap_l2_vmtr = FIELD_PREP(GICV5_IRS_VMAP_L2_VMTR_VM_ID, vm_id) |
+ FIELD_PREP(GICV5_IRS_VMAP_L2_VMTR_M, true);
+ irs_writeq_relaxed(vmap_l2_vmtr, GICV5_IRS_VMAP_L2_VMTR);
+
+ return vgic_v5_irs_wait_for_vm_op();
+}
+
+static int __vgic_v5_irs_vmap_vm(int vm_id, bool unmap)
+{
+ u64 vmap_vmr;
+ int ret;
+
+ guard(raw_spinlock)(&vm_config_lock);
+
+ /* Make sure that we are idle to begin with */
+ ret = vgic_v5_irs_wait_for_vm_op();
+ if (ret)
+ return ret;
+
+ /* Map or unmap the VM, as requested */
+ vmap_vmr = FIELD_PREP(GICV5_IRS_VMAP_VMR_VM_ID, vm_id) |
+ FIELD_PREP(GICV5_IRS_VMAP_VMR_U, unmap) |
+ FIELD_PREP(GICV5_IRS_VMAP_VMR_M, true);
+ irs_writeq_relaxed(vmap_vmr, GICV5_IRS_VMAP_VMR);
+
+ return vgic_v5_irs_wait_for_vm_op();
+}
+
+static int vgic_v5_irs_set_vm_valid(int vm_id)
+{
+ return __vgic_v5_irs_vmap_vm(vm_id, false);
+}
+
+static int vgic_v5_irs_set_vm_invalid(int vm_id)
+{
+ return __vgic_v5_irs_vmap_vm(vm_id, true);
+}
+
+static int __vgic_v5_irs_update_vist_validity(int vm_id, bool spi_ist, bool unmap)
+{
+ u8 type = spi_ist ? 0b011 : 0b010;
+ u64 vmap_vistr;
+ int ret;
+
+ guard(raw_spinlock)(&vm_config_lock);
+
+ /* Make sure that we are idle to begin with */
+ ret = vgic_v5_irs_wait_for_vm_op();
+ if (ret)
+ return ret;
+
+ /* Update the validity of the VM's IST */
+ vmap_vistr = FIELD_PREP(GICV5_IRS_VMAP_VISTR_TYPE, type) |
+ FIELD_PREP(GICV5_IRS_VMAP_VISTR_VM_ID, vm_id) |
+ FIELD_PREP(GICV5_IRS_VMAP_VISTR_U, unmap) |
+ FIELD_PREP(GICV5_IRS_VMAP_VISTR_M, true);
+ irs_writeq_relaxed(vmap_vistr, GICV5_IRS_VMAP_VISTR);
+
+ return vgic_v5_irs_wait_for_vm_op();
+}
+
+static int vgic_v5_irs_set_vist_valid(int vm_id, bool spi_ist)
+{
+ return __vgic_v5_irs_update_vist_validity(vm_id, spi_ist, false);
+}
+
+/* Note: Only used for LPI ISTs - the SPI IST is only torn down with the VM. */
+static int vgic_v5_irs_set_vist_invalid(int vm_id, bool spi_ist)
+{
+ return __vgic_v5_irs_update_vist_validity(vm_id, spi_ist, true);
+}
+
+static int vgic_v5_db_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
+{
+ struct vgic_v5_vm *vm = data->domain->host_data;
+ struct gicv5_cmd_info *cmd_info = vcpu_info;
+
+ switch (cmd_info->cmd_type) {
+ case VMT_L2_MAP:
+ return vgic_v5_irs_vmap_l2_vmt(vm->vm_id);
+ case VMTE_MAKE_VALID:
+ return vgic_v5_irs_set_vm_valid(vm->vm_id);
+ case VMTE_MAKE_INVALID:
+ return vgic_v5_irs_set_vm_invalid(vm->vm_id);
+ case SPI_VIST_MAKE_VALID:
+ return vgic_v5_irs_set_vist_valid(vm->vm_id, true);
+ case LPI_VIST_MAKE_VALID:
+ return vgic_v5_irs_set_vist_valid(vm->vm_id, false);
+ case LPI_VIST_MAKE_INVALID:
+ return vgic_v5_irs_set_vist_invalid(vm->vm_id, false);
+ default:
+ return -EINVAL;
+ }
+}
+
/*
* This set of irq_chip functions is specific for doorbells.
*/
@@ -174,6 +403,7 @@ static struct irq_chip vgic_v5_db_irq_chip = {
.irq_set_affinity = irq_chip_set_affinity_parent,
.irq_get_irqchip_state = irq_chip_get_parent_state,
.irq_set_irqchip_state = irq_chip_set_parent_state,
+ .irq_set_vcpu_affinity = vgic_v5_db_set_vcpu_affinity,
.flags = IRQCHIP_SET_TYPE_MASKED | IRQCHIP_SKIP_SET_WAKE |
IRQCHIP_MASK_ON_SUSPEND,
};
diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
index ccec0a045927c..ff5ad653252d2 100644
--- a/include/linux/irqchip/arm-gic-v5.h
+++ b/include/linux/irqchip/arm-gic-v5.h
@@ -87,6 +87,12 @@
#define GICV5_IRS_IST_CFGR 0x0190
#define GICV5_IRS_IST_STATUSR 0x0194
#define GICV5_IRS_MAP_L2_ISTR 0x01c0
+#define GICV5_IRS_VMT_BASER 0x0200
+#define GICV5_IRS_VMT_CFGR 0x0210
+#define GICV5_IRS_VMT_STATUSR 0x0214
+#define GICV5_IRS_VMAP_L2_VMTR 0x02c0
+#define GICV5_IRS_VMAP_VMR 0x02c8
+#define GICV5_IRS_VMAP_VISTR 0x02d0
#define GICV5_IRS_IDR0_VIRT BIT(6)
@@ -181,6 +187,30 @@
#define GICV5_IRS_MAP_L2_ISTR_ID GENMASK(23, 0)
+#define GICV5_IRS_VMT_BASER_ADDR GENMASK_ULL(51, 3)
+#define GICV5_IRS_VMT_BASER_ADDR_SHIFT 3ULL
+#define GICV5_IRS_VMT_BASER_VALID BIT_ULL(0)
+
+#define GICV5_IRS_VMT_CFGR_STRUCTURE_TWO_LEVEL 0b1
+#define GICV5_IRS_VMT_CFGR_STRUCTURE_LINEAR 0b0
+
+#define GICV5_IRS_VMT_CFGR_STRUCTURE BIT(16)
+#define GICV5_IRS_VMT_CFGR_VM_ID_BITS GENMASK(4, 0)
+
+#define GICV5_IRS_VMT_STATUSR_IDLE BIT(0)
+
+#define GICV5_IRS_VMAP_L2_VMTR_M BIT_ULL(63)
+#define GICV5_IRS_VMAP_L2_VMTR_VM_ID GENMASK_ULL(15, 0)
+
+#define GICV5_IRS_VMAP_VMR_M BIT_ULL(63)
+#define GICV5_IRS_VMAP_VMR_U BIT_ULL(62)
+#define GICV5_IRS_VMAP_VMR_VM_ID GENMASK_ULL(15, 0)
+
+#define GICV5_IRS_VMAP_VISTR_M BIT_ULL(63)
+#define GICV5_IRS_VMAP_VISTR_U BIT_ULL(62)
+#define GICV5_IRS_VMAP_VISTR_VM_ID GENMASK_ULL(47, 32)
+#define GICV5_IRS_VMAP_VISTR_TYPE GENMASK_ULL(31, 29)
+
#define GICV5_ISTL1E_VALID BIT_ULL(0)
#define GICV5_IRS_ISTL1E_SIZE 8UL
--
2.34.1
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [PATCH 10/43] KVM: arm64: gic-v5: Implement VPE IRS MMIO Ops
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (8 preceding siblings ...)
2026-04-27 16:09 ` [PATCH 09/43] KVM: arm64: gic-v5: Implement VMT/vIST IRS MMIO Ops Sascha Bischoff
@ 2026-04-27 16:09 ` Sascha Bischoff
2026-04-30 8:46 ` Marc Zyngier
2026-04-27 16:09 ` [PATCH 11/43] KVM: arm64: gic-v5: Make VPEs valid in vgic_v5_reset() Sascha Bischoff
` (32 subsequent siblings)
42 siblings, 1 reply; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:09 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Introduce interfaces to make VPEs valid, and to configure them, via
the host's IRS. As with the other valid bits in the GICv5 VM tables,
VPEs cannot be made valid directly; they are instead made valid via an
IRS MMIO Op.
Additionally, some of the VPE configuration also takes place via the
IRS MMIO interface (the IRS_VPE_CR0 and IRS_VPE_DBR registers); VPE
doorbells, for example, are configured this way.
The existing VPE-doorbell-based commands are extended with:
VPE_MAKE_VALID - Make the VPE valid in the VPET
VPE_CR0_READ - Handle a guest read from IRS_PE_CR0
VPE_CR0_WRITE - Handle a guest write to IRS_PE_CR0
Note: There is no VPE_MAKE_INVALID as VPEs are only made invalid on
teardown, at which point the whole VMTE is marked as invalid. Hence,
it is not required.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-v5.c | 164 +++++++++++++++++++++++++++++
include/linux/irqchip/arm-gic-v5.h | 27 +++++
2 files changed, 191 insertions(+)
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index 49eb01ca07961..0649729f6b834 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -253,6 +253,25 @@ static int vgic_v5_irs_wait_for_vm_op(void)
return 0;
}
+/* Wait for completion of a VPE_STATUSR change */
+static int vgic_v5_irs_wait_for_vpe_op(void)
+{
+ int ret;
+ u32 statusr;
+
+ ret = readl_relaxed_poll_timeout_atomic(
+ irs_base + GICV5_IRS_VPE_STATUSR, statusr,
+ FIELD_GET(GICV5_IRS_VPE_STATUSR_IDLE, statusr), 1,
+ USEC_PER_SEC);
+
+ if (ret == -ETIMEDOUT) {
+ pr_err_ratelimited("Timed out waiting for IRS VPE Op\n");
+ return ret;
+ }
+
+ return 0;
+}
+
static int vgic_v5_irs_assign_vmt(bool two_level, u8 vm_id_bits, phys_addr_t vmt_base)
{
u64 vmt_baser;
@@ -369,10 +388,142 @@ static int vgic_v5_irs_set_vist_invalid(int vm_id, bool spi_ist)
return __vgic_v5_irs_update_vist_validity(vm_id, spi_ist, true);
}
+static int vgic_v5_irs_set_up_vpe(int vm_id, int vpe_id, irq_hw_number_t db_hwirq)
+{
+ u64 vmap_vper, dbr, selr;
+ u32 statusr, cr0;
+ int ret;
+
+ guard(raw_spinlock)(&vm_config_lock);
+
+ /* Make sure that we are idle to begin with */
+ ret = vgic_v5_irs_wait_for_vm_op();
+ if (ret)
+ return ret;
+
+ /* Mark the VPE as valid */
+ vmap_vper = FIELD_PREP(GICV5_IRS_VMAP_VPER_VPE_ID, vpe_id) |
+ FIELD_PREP(GICV5_IRS_VMAP_VPER_VM_ID, vm_id) |
+ FIELD_PREP(GICV5_IRS_VMAP_VPER_M, true);
+ irs_writeq_relaxed(vmap_vper, GICV5_IRS_VMAP_VPER);
+
+ /* Wait for the VPE to be marked valid in the VPET */
+ ret = vgic_v5_irs_wait_for_vm_op();
+ if (ret)
+ return ret;
+
+ selr = FIELD_PREP(GICV5_IRS_VPE_SELR_VPE_ID, vpe_id) |
+ FIELD_PREP(GICV5_IRS_VPE_SELR_VM_ID, vm_id) |
+ FIELD_PREP(GICV5_IRS_VPE_SELR_S, true);
+ irs_writeq_relaxed(selr, GICV5_IRS_VPE_SELR);
+
+ ret = vgic_v5_irs_wait_for_vpe_op();
+ if (ret)
+ return ret;
+
+ statusr = irs_readl_relaxed(GICV5_IRS_VPE_STATUSR);
+ if (!FIELD_GET(GICV5_IRS_VPE_STATUSR_V, statusr))
+ return -EINVAL;
+
+ /* Set targeted only routing (disable 1ofN vPE selection) */
+ cr0 = FIELD_PREP(GICV5_IRS_VPE_CR0_DPS, true);
+ irs_writel_relaxed(cr0, GICV5_IRS_VPE_CR0);
+
+ ret = vgic_v5_irs_wait_for_vpe_op();
+ if (ret)
+ return ret;
+
+ statusr = irs_readl_relaxed(GICV5_IRS_VPE_STATUSR);
+ if (FIELD_GET(GICV5_IRS_VPE_STATUSR_F, statusr))
+ ret = -EINVAL;
+
+ /*
+ * The VPE has not yet run. Therefore, make sure that all interrupts
+ * will generate a doorbell.
+ */
+ dbr = FIELD_PREP(GICV5_IRS_VPE_DBR_LPI_ID, db_hwirq) |
+ FIELD_PREP(GICV5_IRS_VPE_DBR_DBPM, 0b11111) |
+ FIELD_PREP(GICV5_IRS_VPE_DBR_REQ_DB, false) |
+ FIELD_PREP(GICV5_IRS_VPE_DBR_DBV, true);
+ irs_writeq_relaxed(dbr, GICV5_IRS_VPE_DBR);
+
+ ret = vgic_v5_irs_wait_for_vpe_op();
+ if (ret)
+ return ret;
+
+ statusr = irs_readl_relaxed(GICV5_IRS_VPE_STATUSR);
+ if (FIELD_GET(GICV5_IRS_VPE_STATUSR_F, statusr))
+ return -EINVAL;
+
+ return 0;
+}
+
+static int vgic_v5_irs_vpe_cr0_read(int vm_id, int vpe_id, u64 *cr0)
+{
+ u32 statusr;
+ u64 selr;
+ int ret;
+
+ guard(raw_spinlock)(&vm_config_lock);
+
+ selr = FIELD_PREP(GICV5_IRS_VPE_SELR_VPE_ID, vpe_id) |
+ FIELD_PREP(GICV5_IRS_VPE_SELR_VM_ID, vm_id) |
+ FIELD_PREP(GICV5_IRS_VPE_SELR_S, true);
+ irs_writeq_relaxed(selr, GICV5_IRS_VPE_SELR);
+
+ ret = vgic_v5_irs_wait_for_vpe_op();
+ if (ret)
+ return ret;
+
+ statusr = irs_readl_relaxed(GICV5_IRS_VPE_STATUSR);
+ if (!FIELD_GET(GICV5_IRS_VPE_STATUSR_V, statusr))
+ return -EINVAL;
+
+ *cr0 = irs_readl_relaxed(GICV5_IRS_VPE_CR0);
+
+ return 0;
+}
+
+static int vgic_v5_irs_vpe_cr0_update(int vm_id, int vpe_id, u32 cr0)
+{
+ u32 statusr;
+ u64 selr;
+ int ret;
+
+ guard(raw_spinlock)(&vm_config_lock);
+
+ selr = FIELD_PREP(GICV5_IRS_VPE_SELR_VPE_ID, vpe_id) |
+ FIELD_PREP(GICV5_IRS_VPE_SELR_VM_ID, vm_id) |
+ FIELD_PREP(GICV5_IRS_VPE_SELR_S, true);
+ irs_writeq_relaxed(selr, GICV5_IRS_VPE_SELR);
+
+ ret = vgic_v5_irs_wait_for_vpe_op();
+ if (ret)
+ return ret;
+
+ statusr = irs_readl_relaxed(GICV5_IRS_VPE_STATUSR);
+ if (!FIELD_GET(GICV5_IRS_VPE_STATUSR_V, statusr))
+ return -EINVAL;
+
+ irs_writel_relaxed(cr0, GICV5_IRS_VPE_CR0);
+
+ ret = vgic_v5_irs_wait_for_vpe_op();
+ if (ret)
+ return ret;
+
+ statusr = irs_readl_relaxed(GICV5_IRS_VPE_STATUSR);
+ if (FIELD_GET(GICV5_IRS_VPE_STATUSR_F, statusr))
+ return -EINVAL;
+
+ return 0;
+}
+
static int vgic_v5_db_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
{
struct vgic_v5_vm *vm = data->domain->host_data;
struct gicv5_cmd_info *cmd_info = vcpu_info;
+ /* Our VPE ID is the index within the doorbell domain */
+ u16 vpe_id = data->hwirq;
switch (cmd_info->cmd_type) {
case VMT_L2_MAP:
@@ -381,6 +532,19 @@ static int vgic_v5_db_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
return vgic_v5_irs_set_vm_valid(vm->vm_id);
case VMTE_MAKE_INVALID:
return vgic_v5_irs_set_vm_invalid(vm->vm_id);
+ case VPE_MAKE_VALID:
+ /*
+ * We need the actual LPI ID which lives in the top-most parent
+ * domain. This hwirq won't include the type (LPI) but that's
+ * not required for the IRS_VPE_DBR.
+ */
+ while (data->parent_data != NULL)
+ data = data->parent_data;
+ return vgic_v5_irs_set_up_vpe(vm->vm_id, vpe_id, data->hwirq);
+ case VPE_CR0_READ:
+ return vgic_v5_irs_vpe_cr0_read(vm->vm_id, vpe_id, &cmd_info->data);
+ case VPE_CR0_WRITE:
+ return vgic_v5_irs_vpe_cr0_update(vm->vm_id, vpe_id, cmd_info->data);
case SPI_VIST_MAKE_VALID:
return vgic_v5_irs_set_vist_valid(vm->vm_id, true);
case LPI_VIST_MAKE_VALID:
diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
index ff5ad653252d2..54b573783cd75 100644
--- a/include/linux/irqchip/arm-gic-v5.h
+++ b/include/linux/irqchip/arm-gic-v5.h
@@ -90,9 +90,14 @@
#define GICV5_IRS_VMT_BASER 0x0200
#define GICV5_IRS_VMT_CFGR 0x0210
#define GICV5_IRS_VMT_STATUSR 0x0214
+#define GICV5_IRS_VPE_SELR 0x0240
+#define GICV5_IRS_VPE_DBR 0x0248
+#define GICV5_IRS_VPE_CR0 0x0258
+#define GICV5_IRS_VPE_STATUSR 0x025c
#define GICV5_IRS_VMAP_L2_VMTR 0x02c0
#define GICV5_IRS_VMAP_VMR 0x02c8
#define GICV5_IRS_VMAP_VISTR 0x02d0
+#define GICV5_IRS_VMAP_VPER 0x02e0
#define GICV5_IRS_IDR0_VIRT BIT(6)
@@ -199,6 +204,21 @@
#define GICV5_IRS_VMT_STATUSR_IDLE BIT(0)
+#define GICV5_IRS_VPE_SELR_S BIT_ULL(63)
+#define GICV5_IRS_VPE_SELR_VPE_ID GENMASK_ULL(47, 32)
+#define GICV5_IRS_VPE_SELR_VM_ID GENMASK_ULL(15, 0)
+
+#define GICV5_IRS_VPE_DBR_DBV BIT_ULL(63)
+#define GICV5_IRS_VPE_DBR_REQ_DB BIT_ULL(62)
+#define GICV5_IRS_VPE_DBR_DBPM GENMASK_ULL(36, 32)
+#define GICV5_IRS_VPE_DBR_LPI_ID GENMASK_ULL(23, 0)
+
+#define GICV5_IRS_VPE_CR0_DPS BIT(0)
+
+#define GICV5_IRS_VPE_STATUSR_F BIT(2)
+#define GICV5_IRS_VPE_STATUSR_V BIT(1)
+#define GICV5_IRS_VPE_STATUSR_IDLE BIT(0)
+
#define GICV5_IRS_VMAP_L2_VMTR_M BIT_ULL(63)
#define GICV5_IRS_VMAP_L2_VMTR_VM_ID GENMASK_ULL(15, 0)
@@ -211,6 +231,10 @@
#define GICV5_IRS_VMAP_VISTR_VM_ID GENMASK_ULL(47, 32)
#define GICV5_IRS_VMAP_VISTR_TYPE GENMASK_ULL(31, 29)
+#define GICV5_IRS_VMAP_VPER_M BIT_ULL(63)
+#define GICV5_IRS_VMAP_VPER_VM_ID GENMASK_ULL(47, 32)
+#define GICV5_IRS_VMAP_VPER_VPE_ID GENMASK_ULL(15, 0)
+
#define GICV5_ISTL1E_VALID BIT_ULL(0)
#define GICV5_IRS_ISTL1E_SIZE 8UL
@@ -480,6 +504,9 @@ enum gicv5_vcpu_info_cmd_type {
VMT_L2_MAP, /* Map in a L2 VMT - *may* happen on VM init */
VMTE_MAKE_VALID, /* Make the VMTE valid */
VMTE_MAKE_INVALID, /* Make the VMTE (et al.) invalid */
+ VPE_MAKE_VALID, /* No corresponding invalid */
+ VPE_CR0_READ, /* Read of VPE_CR0 (guest read from PE_CR0) */
+ VPE_CR0_WRITE, /* Write to VPE_CR0 (guest write to PE_CR0) */
SPI_VIST_MAKE_VALID, /* No corresponding invalid */
LPI_VIST_MAKE_VALID, /* Triggered by a guest */
LPI_VIST_MAKE_INVALID, /* Triggered by a guest */
--
2.34.1
* [PATCH 11/43] KVM: arm64: gic-v5: Make VPEs valid in vgic_v5_reset()
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (9 preceding siblings ...)
2026-04-27 16:09 ` [PATCH 10/43] KVM: arm64: gic-v5: Implement VPE " Sascha Bischoff
@ 2026-04-27 16:09 ` Sascha Bischoff
2026-04-30 9:37 ` Marc Zyngier
2026-04-27 16:10 ` [PATCH 12/43] KVM: arm64: gic-v5: Clear db_fired flag before making VPE non-resident Sascha Bischoff
` (31 subsequent siblings)
42 siblings, 1 reply; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:09 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
When resetting VPEs, also mark them as valid in the VM VPE Table. This
is required as it informs the IRS that a specific VPE may be made
resident, and without this the IRS will treat the VPE as invalid.
As part of this change, we also introduce a wrapper around the VPE
doorbells - vgic_v5_send_command(). This takes a struct kvm_vcpu
pointer, and the command to run, and triggers the function bound to
the command via that vcpu's doorbell. This is a convenience function
to simplify the code.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-v5.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index 0649729f6b834..92bb63b6dd6bb 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -518,6 +518,18 @@ static int vgic_v5_irs_vpe_cr0_update(int vm_id, int vpe_id, u32 cr0)
return 0;
}
+static int vgic_v5_send_command(struct kvm_vcpu *vcpu,
+ enum gicv5_vcpu_info_cmd_type type)
+{
+ struct gicv5_cmd_info cmd_info;
+
+ if (!vcpu)
+ return -EINVAL;
+
+ cmd_info.cmd_type = type;
+ return irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu), &cmd_info);
+}
+
static int vgic_v5_db_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
{
struct vgic_v5_vm *vm = data->domain->host_data;
@@ -691,6 +703,8 @@ static void vgic_v5_teardown_per_vm_domain(struct vgic_v5_vm *vm)
void vgic_v5_reset(struct kvm_vcpu *vcpu)
{
+ int rc;
+
/*
* We always present 16-bits of ID space to the guest, irrespective of
* the host allowing more.
@@ -702,6 +716,14 @@ void vgic_v5_reset(struct kvm_vcpu *vcpu)
* CPUIF (but potentially fewer in the IRS).
*/
vcpu->arch.vgic_cpu.num_pri_bits = 5;
+
+ /* Make the VPE valid in the VPET */
+ rc = vgic_v5_send_command(vcpu, VPE_MAKE_VALID);
+ if (rc) {
+ /* We can't continue, so mark the VM as dead */
+ kvm_vm_dead(vcpu->kvm);
+ return;
+ }
}
int vgic_v5_init(struct kvm *kvm)
--
2.34.1
* [PATCH 12/43] KVM: arm64: gic-v5: Clear db_fired flag before making VPE non-resident
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (10 preceding siblings ...)
2026-04-27 16:09 ` [PATCH 11/43] KVM: arm64: gic-v5: Make VPEs valid in vgic_v5_reset() Sascha Bischoff
@ 2026-04-27 16:10 ` Sascha Bischoff
2026-04-27 16:10 ` [PATCH 13/43] KVM: arm64: gic-v5: Make VPEs (non-)resident in vgic_load/put Sascha Bischoff
` (30 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:10 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
The GICv5 VPE doorbells are used to notify KVM that a non-resident VPE
has an interrupt to handle. We use the db_fired flag to track this,
and therefore need to ensure that the flag is cleared once it has been
consumed.
Explicitly set the db_fired flag to false before making a VPE
non-resident as any state is stale at this stage. This also ensures
that any incoming doorbells, if requested, will be observed.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/hyp/vgic-v5-sr.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/arm64/kvm/hyp/vgic-v5-sr.c b/arch/arm64/kvm/hyp/vgic-v5-sr.c
index 04c5846b9abac..d27fe2911df3f 100644
--- a/arch/arm64/kvm/hyp/vgic-v5-sr.c
+++ b/arch/arm64/kvm/hyp/vgic-v5-sr.c
@@ -22,6 +22,15 @@ void __vgic_v5_make_resident(struct vgic_v5_cpu_if *cpu_if)
void __vgic_v5_make_non_resident(struct vgic_v5_cpu_if *cpu_if)
{
+ /*
+ * Clear the db_fired state to ensure that we're ready for the next
+ * doorbell when it is requested. If a doorbell firing caused us to
+ * enter the guest, then we've already consumed that state at this
+ * point, so this is safe to clear. Use WRITE_ONCE() to ensure we're not
+ * racing with the doorbell firing and setting the state true again.
+ */
+ WRITE_ONCE(cpu_if->gicv5_vpe.db_fired, false);
+
/*
* Mark as non-resident before actually making the VPE non-resident
* in hardware. Avoids racing with an arriving doorbell.
--
2.34.1
* [PATCH 13/43] KVM: arm64: gic-v5: Make VPEs (non-)resident in vgic_load/put
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (11 preceding siblings ...)
2026-04-27 16:10 ` [PATCH 12/43] KVM: arm64: gic-v5: Clear db_fired flag before making VPE non-resident Sascha Bischoff
@ 2026-04-27 16:10 ` Sascha Bischoff
2026-04-30 10:26 ` Marc Zyngier
2026-04-27 16:10 ` [PATCH 14/43] KVM: arm64: gic-v5: Request VPE doorbells when going non-resident Sascha Bischoff
` (29 subsequent siblings)
42 siblings, 1 reply; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:10 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Extend vgic_v5_load and vgic_v5_put to make the VPEs resident and
non-resident, respectively. This makes the IRS aware of which VPE is
currently resident, and therefore allows it to perform HPPI selection
for LPIs and SPIs, which would otherwise never be signalled to the
VPE.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-v5.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index 92bb63b6dd6bb..11a1a491b7e0a 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -1053,6 +1053,8 @@ void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu)
void vgic_v5_load(struct kvm_vcpu *vcpu)
{
struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+ u16 vm = vgic_v5_vm_id(vcpu->kvm);
+ u16 vpe = vgic_v5_vpe_id(vcpu);
/*
* On the WFI path, vgic_load is called a second time. The first is when
@@ -1065,7 +1067,11 @@ void vgic_v5_load(struct kvm_vcpu *vcpu)
kvm_call_hyp(__vgic_v5_restore_vmcr_apr, cpu_if);
- cpu_if->gicv5_vpe.resident = true;
+ cpu_if->vgic_contextr = FIELD_PREP(ICH_CONTEXTR_EL2_V, true) |
+ FIELD_PREP(ICH_CONTEXTR_EL2_VPE, vpe) |
+ FIELD_PREP(ICH_CONTEXTR_EL2_VM, vm);
+
+ kvm_call_hyp(__vgic_v5_make_resident, cpu_if);
}
void vgic_v5_put(struct kvm_vcpu *vcpu)
@@ -1083,7 +1089,9 @@ void vgic_v5_put(struct kvm_vcpu *vcpu)
kvm_call_hyp(__vgic_v5_save_apr, cpu_if);
- cpu_if->gicv5_vpe.resident = false;
+ cpu_if->vgic_contextr = 0;
+
+ kvm_call_hyp(__vgic_v5_make_non_resident, cpu_if);
/* The shadow priority is only updated on entering WFI */
if (vcpu_get_flag(vcpu, IN_WFI))
--
2.34.1
* [PATCH 14/43] KVM: arm64: gic-v5: Request VPE doorbells when going non-resident
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (12 preceding siblings ...)
2026-04-27 16:10 ` [PATCH 13/43] KVM: arm64: gic-v5: Make VPEs (non-)resident in vgic_load/put Sascha Bischoff
@ 2026-04-27 16:10 ` Sascha Bischoff
2026-04-30 10:37 ` Marc Zyngier
2026-04-27 16:11 ` [PATCH 15/43] KVM: arm64: gic-v5: Handle doorbells in kvm_vgic_vcpu_pending_irq() Sascha Bischoff
` (28 subsequent siblings)
42 siblings, 1 reply; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:10 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
When a VPE is made non-resident and is entering WFI, a doorbell should
be requested for the VPE. This allows the VPE to be easily woken once
an SPI/LPI interrupt is pending for it. This is tracked by the IRS,
which will signal the specific VPE doorbell for the VPE once such an
interrupt arrives.
Requesting a doorbell involves calculating the DBPM - DoorBell
Priority Mask - which ensures that the DB is only signalled by the
hardware if the pending interrupt is of sufficient priority. This
avoids waking a VPE that can't process the incoming interrupt.
Doorbells are NOT requested if a VPE is not entering WFI, as we expect
to re-enter the guest imminently.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-v5.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index 11a1a491b7e0a..2fc6fa4df034f 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -1077,6 +1077,9 @@ void vgic_v5_load(struct kvm_vcpu *vcpu)
void vgic_v5_put(struct kvm_vcpu *vcpu)
{
struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+ bool req_db = !!vcpu_get_flag(vcpu, IN_WFI);
+ u32 priority_mask;
+ int dbpm;
/*
* Do nothing if we're not resident. This can happen in the WFI path
@@ -1090,6 +1093,31 @@ void vgic_v5_put(struct kvm_vcpu *vcpu)
kvm_call_hyp(__vgic_v5_save_apr, cpu_if);
cpu_if->vgic_contextr = 0;
+ if (req_db) {
+ /*
+ * Find the virtual running priority and use this to calculate
+ * the doorbell priority mask. We combine the highest active
+ * priority and the CPU's priority mask. The guest can't handle
+ * interrupts with priorities less than or equal to the virtual
+ * running priority, so there's literally no point in waking the
+ * guest for these.
+ *
+ * The priority needs to be higher than the mask to signal, so
+ * pick the next higher priority (subtract 1).
+ */
+ priority_mask = vgic_v5_get_effective_priority_mask(vcpu);
+
+ /* Don't request a doorbell if the max priority is masked */
+ if (priority_mask) {
+ dbpm = priority_mask - 1;
+ cpu_if->vgic_contextr = FIELD_PREP(ICH_CONTEXTR_EL2_DB, 1) |
+ FIELD_PREP(ICH_CONTEXTR_EL2_DBPM, dbpm);
+ }
+
+ /* Make the doorbell affine to this CPU */
+ WARN_ON(irq_set_affinity(vgic_v5_vpe_db(vcpu),
+ cpumask_of(smp_processor_id())));
+ }
kvm_call_hyp(__vgic_v5_make_non_resident, cpu_if);
--
2.34.1
* [PATCH 15/43] KVM: arm64: gic-v5: Handle doorbells in kvm_vgic_vcpu_pending_irq()
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (13 preceding siblings ...)
2026-04-27 16:10 ` [PATCH 14/43] KVM: arm64: gic-v5: Request VPE doorbells when going non-resident Sascha Bischoff
@ 2026-04-27 16:11 ` Sascha Bischoff
2026-04-27 16:11 ` [PATCH 16/43] KVM: arm64: gic-v5: Initialise and teardown VMTEs & doorbells Sascha Bischoff
` (27 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:11 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Extend the GICv5 check in kvm_vgic_vcpu_pending_irq() to treat a vcpu
as having pending interrupts if the doorbell has fired. The hardware
itself knows which vcpu the incoming interrupt is targeting, and
therefore it is immediately known which vcpu thread needs to be
scheduled to process the interrupt.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
index b697678d68b01..d56e87a0d2acc 100644
--- a/arch/arm64/kvm/vgic/vgic.c
+++ b/arch/arm64/kvm/vgic/vgic.c
@@ -1229,8 +1229,12 @@ int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu)
unsigned long flags;
struct vgic_vmcr vmcr;
- if (vgic_is_v5(vcpu->kvm))
+ if (vgic_is_v5(vcpu->kvm)) {
+ if (READ_ONCE(vcpu->arch.vgic_cpu.vgic_v5.gicv5_vpe.db_fired))
+ return true;
+
return vgic_v5_has_pending_ppi(vcpu);
+ }
if (!vcpu->kvm->arch.vgic.enabled)
return false;
--
2.34.1
* [PATCH 16/43] KVM: arm64: gic-v5: Initialise and teardown VMTEs & doorbells
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (14 preceding siblings ...)
2026-04-27 16:11 ` [PATCH 15/43] KVM: arm64: gic-v5: Handle doorbells in kvm_vgic_vcpu_pending_irq() Sascha Bischoff
@ 2026-04-27 16:11 ` Sascha Bischoff
2026-04-30 12:23 ` Marc Zyngier
2026-04-27 16:11 ` [PATCH 17/43] KVM: arm64: gic-v5: Enable VPE DBs on VPE reset and disable on teardown Sascha Bischoff
` (26 subsequent siblings)
42 siblings, 1 reply; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:11 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Each GICv5 VM requires a valid VM Table Entry (VMTE). The VM Table
itself is allocated at probe time, but a VM needs to provision a VMTE
before it can properly run (PPIs will work, but nothing else will, and
PPIs alone are of limited use).
The correct time to set up the VMTE is during VM initialisation, which
for GICv5 is vgic_v5_init(). Each VM needs a VM ID; this is the index
into the VM Table, and is therefore also how a specific VMTE is
selected. As part of vgic_v5_init(), obtain a VM ID via
vgic_v5_allocate_vm_id(), which internally uses an IDA to pick an
unused VM ID (and hence VMTE) within the range of allowed VM IDs.
Once the VM ID has been allocated, the VM's doorbell domain is
created, and each doorbell is allocated and assigned to a vcpu.
Assuming everything up until this point has succeeded, initialise the
VMTE. Internally this allocates the additional data structures
required by the hardware - the VM Descriptor, VPE Table, etc. This
VMTE is then made valid via the IRS's MMIO interface. Finally, all
VPEs are allocated within the VPET.
On teardown, this process is reversed again. The VMTE is made invalid,
the VPEs are freed, the doorbells are released and the domain torn
down, and finally the VM ID is released. The latter allows the VM ID
and VMTE to be reused for a future VM.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-v5.c | 146 +++++++++++++++++++++++++++++-----
1 file changed, 128 insertions(+), 18 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index 2fc6fa4df034f..9347bc6895223 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -518,6 +518,18 @@ static int vgic_v5_irs_vpe_cr0_update(int vm_id, int vpe_id, u32 cr0)
return 0;
}
+static irqreturn_t db_handler(int irq, void *data)
+{
+ struct kvm_vcpu *vcpu = data;
+
+ WRITE_ONCE(vcpu->arch.vgic_cpu.vgic_v5.gicv5_vpe.db_fired, true);
+
+ kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu);
+ kvm_vcpu_kick(vcpu);
+
+ return IRQ_HANDLED;
+}
+
static int vgic_v5_send_command(struct kvm_vcpu *vcpu,
enum gicv5_vcpu_info_cmd_type type)
{
@@ -726,26 +738,46 @@ void vgic_v5_reset(struct kvm_vcpu *vcpu)
}
}
-int vgic_v5_init(struct kvm *kvm)
+int vgic_v5_map_resources(struct kvm *kvm)
{
- struct kvm_vcpu *vcpu;
- unsigned long idx;
- int ret;
+ if (!vgic_initialized(kvm))
+ return -EBUSY;
- if (vgic_initialized(kvm))
- return 0;
+ return 0;
+}
- ret = vgic_v5_create_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
- if (ret)
- return ret;
+/*
+ * Claim and populate a VMTE (optionally making a new L2 VMT valid), create VPE
+ * doorbells, allocate VPET and populate for each VPE. Finally, we also init the
+ * vIRS, which means allocating and making the virtual SPI IST valid.
+ *
+ * Note: We do need to put the cart before the horse here. The VPE doorbells are
+ * our conduit for communication with the IRS, which means we need to have those
+ * before making the VMTE valid.
+ *
+ * On failure, we clean up in the teardown path (vgic_v5_teardown()).
+ */
+int vgic_v5_init(struct kvm *kvm)
+{
+ int nr_vcpus, ret = 0;
+ struct kvm_vcpu *vcpu, *vcpu0;
+ unsigned long i;
+ struct irq_data *d;
+ unsigned int db_virq;
+
+ nr_vcpus = atomic_read(&kvm->online_vcpus);
+ if (nr_vcpus == 0)
+ return -ENODEV;
- kvm_for_each_vcpu(idx, vcpu, kvm) {
+ kvm_for_each_vcpu(i, vcpu, kvm) {
if (vcpu_has_nv(vcpu)) {
kvm_err("Nested GICv5 VMs are currently unsupported\n");
return -EINVAL;
}
}
+ kvm->arch.vgic.gicv5_vm.nr_vpes = nr_vcpus;
+
/* We only allow userspace to drive the SW_PPI, if it is implemented. */
bitmap_zero(kvm->arch.vgic.gicv5_vm.userspace_ppis,
VGIC_V5_NR_PRIVATE_IRQS);
@@ -754,20 +786,98 @@ int vgic_v5_init(struct kvm *kvm)
kvm->arch.vgic.gicv5_vm.userspace_ppis,
ppi_caps.impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS);
- return 0;
+ ret = vgic_v5_allocate_vm_id(kvm);
+ if (ret) {
+ kvm_err("Maximum number of GICv5 VMs reached!\n");
+ return ret;
+ }
+
+ ret = vgic_v5_create_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
+ if (ret)
+ return ret;
+
+ /*
+ * Allocate VPE doorbells first - these are our conduit for
+ * communicating with the host irqchip driver.
+ */
+ db_virq = irq_domain_alloc_irqs(kvm->arch.vgic.gicv5_vm.domain,
+ nr_vcpus, NUMA_NO_NODE,
+ &kvm->arch.vgic.gicv5_vm);
+ if (db_virq < 0) {
+ /* Simplify teardown by doing this early! */
+ vgic_v5_teardown_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
+ return db_virq;
+ }
+
+ kvm->arch.vgic.gicv5_vm.vpe_db_base = db_virq;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ d = irq_domain_get_irq_data(kvm->arch.vgic.gicv5_vm.domain,
+ db_virq + i);
+ irq_set_status_flags(db_virq + i, IRQ_NOAUTOEN);
+
+ ret = request_irq(db_virq + i, db_handler, 0, "vcpu", vcpu);
+ if (ret)
+ return ret;
+
+ /* Stash it with the VCPU for easy retrieval */
+ vcpu->arch.vgic_cpu.vgic_v5.gicv5_vpe.db = db_virq + i;
+ }
+
+ /* Populate VMTE (with VPET and VM descriptor) */
+ ret = vgic_v5_vmte_init(kvm);
+ if (ret)
+ return ret;
+
+ /* We pick the first vcpu to make the VMTE valid - any would do */
+ vcpu0 = kvm_get_vcpu(kvm, 0);
+ ret = vgic_v5_send_command(vcpu0, VMTE_MAKE_VALID);
+ if (ret)
+ return ret;
+
+ /* Loop over all VPEs, allocate/populate their data structures */
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ ret = vgic_v5_vmte_alloc_vpe(vcpu);
+ if (ret)
+ return ret;
+ }
+
+ return ret;
}
void vgic_v5_teardown(struct kvm *kvm)
{
- vgic_v5_teardown_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
-}
+ struct kvm_vcpu *vcpu, *vcpu0;
+ struct vgic_dist *dist = &kvm->arch.vgic;
+ unsigned long i;
+ int rc;
-int vgic_v5_map_resources(struct kvm *kvm)
-{
- if (!vgic_initialized(kvm))
- return -EBUSY;
+ /*
+ * If the VM's ID isn't valid, then we failed init very early. Nothing
+ * to do here.
+ */
+ if (!kvm->arch.vgic.gicv5_vm.vm_id_valid)
+ return;
- return 0;
+ if (kvm->arch.vgic.gicv5_vm.vmte_allocated) {
+ /* Make the VM invalid */
+ vcpu0 = kvm_get_vcpu(kvm, 0);
+ rc = vgic_v5_send_command(vcpu0, VMTE_MAKE_INVALID);
+ if (rc)
+ kvm_err("could not make VMTE invalid\n");
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (vgic_v5_vmte_free_vpe(vcpu))
+ kvm_err("Failed to free VPE\n");
+ }
+
+ if (vgic_v5_vmte_release(kvm))
+ kvm_err("Failed to release VM 0x%x\n", dist->gicv5_vm.vm_id);
+ }
+
+ vgic_v5_teardown_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
+
+ vgic_v5_release_vm_id(kvm);
}
int vgic_v5_finalize_ppi_state(struct kvm *kvm)
--
2.34.1
* [PATCH 17/43] KVM: arm64: gic-v5: Enable VPE DBs on VPE reset and disable on teardown
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (15 preceding siblings ...)
2026-04-27 16:11 ` [PATCH 16/43] KVM: arm64: gic-v5: Initialise and teardown VMTEs & doorbells Sascha Bischoff
@ 2026-04-27 16:11 ` Sascha Bischoff
2026-04-27 16:12 ` [PATCH 18/43] KVM: arm64: gic-v5: Define remaining IRS MMIO registers Sascha Bischoff
` (25 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:11 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Ensure that each VPE doorbell is enabled on the host before the guest
runs by enabling it as part of vgic_v5_reset(). Add a counterpart,
vgic_v5_disable_vcpu(), which disables the doorbell again as part of
VM teardown and also frees the IRQ.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-v5.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index 9347bc6895223..48ffcb2494db3 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -736,6 +736,31 @@ void vgic_v5_reset(struct kvm_vcpu *vcpu)
kvm_vm_dead(vcpu->kvm);
return;
}
+
+ enable_irq(vgic_v5_vpe_db(vcpu));
+}
+
+static void vgic_v5_disable_vcpu(struct kvm_vcpu *vcpu)
+{
+ int virq = vgic_v5_vpe_db(vcpu);
+
+ if (!vcpu->kvm->arch.vgic.gicv5_vm.domain)
+ return;
+
+ if (!virq)
+ return;
+
+ /*
+ * We are called in the vgic_v5_teardown path. We no longer need the
+ * doorbell virqs.
+ */
+ disable_irq(virq);
+
+ /* Free the doorbell irq (counterpart to request_irq) */
+ free_irq(virq, vcpu);
+
+ /* Remove the irq from the domain too */
+ irq_domain_free_irqs(virq, 1);
}
int vgic_v5_map_resources(struct kvm *kvm)
@@ -875,6 +900,10 @@ void vgic_v5_teardown(struct kvm *kvm)
kvm_err("Failed to release VM 0x%x\n", dist->gicv5_vm.vm_id);
}
+ /* Goodbye doorbells */
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ vgic_v5_disable_vcpu(vcpu);
+
vgic_v5_teardown_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
vgic_v5_release_vm_id(kvm);
--
2.34.1
* [PATCH 18/43] KVM: arm64: gic-v5: Define remaining IRS MMIO registers
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (16 preceding siblings ...)
2026-04-27 16:11 ` [PATCH 17/43] KVM: arm64: gic-v5: Enable VPE DBs on VPE reset and disable on teardown Sascha Bischoff
@ 2026-04-27 16:12 ` Sascha Bischoff
2026-04-27 16:12 ` [PATCH 19/43] KVM: arm64: gic-v5: Introduce struct vgic_v5_irs and IRS base address Sascha Bischoff
` (24 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:12 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Complete the set of IRS MMIO register definitions in the GICv5 header
file. Until now, only the registers required by existing code were
defined. However, properly emulating the IRS MMIO interface in KVM
requires the complete set of IRS MMIO registers.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
include/linux/irqchip/arm-gic-v5.h | 105 ++++++++++++++++++++++++++---
1 file changed, 96 insertions(+), 9 deletions(-)
diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
index 54b573783cd75..9ea3674a6613b 100644
--- a/include/linux/irqchip/arm-gic-v5.h
+++ b/include/linux/irqchip/arm-gic-v5.h
@@ -62,6 +62,14 @@
#define GICV5_OUTER_SHARE 0b10
#define GICV5_INNER_SHARE 0b11
+#define GICV5_AIDR_COMPONENT_IRS 0b00
+#define GICV5_AIDR_COMPONENT_ITS 0b01
+#define GICV5_AIDR_COMPONENT_IWB 0b10
+
+#define GICV5_AIDR_ARCH_MAJ_REV_V5 0
+#define GICV5_AIDR_ARCH_MIN_REV_V0 0
+#define GICV5_IIDR_IMPLEMENTER_ARM 0x43b
+
/*
* IRS registers and tables structures
*/
@@ -73,11 +81,16 @@
#define GICV5_IRS_IDR5 0x0014
#define GICV5_IRS_IDR6 0x0018
#define GICV5_IRS_IDR7 0x001c
+#define GICV5_IRS_IIDR 0x0040
+#define GICV5_IRS_AIDR 0x0044
#define GICV5_IRS_CR0 0x0080
#define GICV5_IRS_CR1 0x0084
#define GICV5_IRS_SYNCR 0x00c0
#define GICV5_IRS_SYNC_STATUSR 0x00c4
+#define GICV5_IRS_SPI_VMR 0x0100
#define GICV5_IRS_SPI_SELR 0x0108
+#define GICV5_IRS_SPI_DOMAINR 0x010c
+#define GICV5_IRS_SPI_RESAMPLER 0x0110
#define GICV5_IRS_SPI_CFGR 0x0114
#define GICV5_IRS_SPI_STATUSR 0x0118
#define GICV5_IRS_PE_SELR 0x0140
@@ -92,17 +105,48 @@
#define GICV5_IRS_VMT_STATUSR 0x0214
#define GICV5_IRS_VPE_SELR 0x0240
#define GICV5_IRS_VPE_DBR 0x0248
+#define GICV5_IRS_VPE_HPPIR 0x0250
#define GICV5_IRS_VPE_CR0 0x0258
#define GICV5_IRS_VPE_STATUSR 0x025c
+#define GICV5_IRS_VM_DBR 0x0280
+#define GICV5_IRS_VM_SELR 0x0288
+#define GICV5_IRS_VM_STATUSR 0x028c
#define GICV5_IRS_VMAP_L2_VMTR 0x02c0
#define GICV5_IRS_VMAP_VMR 0x02c8
#define GICV5_IRS_VMAP_VISTR 0x02d0
+#define GICV5_IRS_VMAP_L2_VISTR 0x02d8
#define GICV5_IRS_VMAP_VPER 0x02e0
-
+#define GICV5_IRS_SAVE_VMR 0x0300
+#define GICV5_IRS_SAVE_VM_STATUSR 0x0308
+#define GICV5_IRS_MEC_IDR 0x0340
+#define GICV5_IRS_MEC_MECID_R 0x0344
+#define GICV5_IRS_MPAM_IDR 0x0380
+#define GICV5_IRS_MPAM_PARTID_R 0x0384
+#define GICV5_IRS_SWERR_STATUSR 0x03c0
+#define GICV5_IRS_SWERR_SYNDROMER0 0x03c8
+#define GICV5_IRS_SWERR_SYNDROMER1 0x03d0
+
+#define GICV5_IRS_SETLPIR 0x0000
+
+#define GICV5_IRS_IDR0_IRSID GENMASK(31, 16)
+#define GICV5_IRS_IDR0_SWE BIT(12)
+#define GICV5_IRS_IDR0_MPAM BIT(11)
+#define GICV5_IRS_IDR0_MEC BIT(10)
+#define GICV5_IRS_IDR0_SETLPI BIT(9)
+#define GICV5_IRS_IDR0_VIRT1OFN BIT(8)
+#define GICV5_IRS_IDR0_ONEOFN BIT(7)
#define GICV5_IRS_IDR0_VIRT BIT(6)
+#define GICV5_IRS_IDR0_PA_RANGE GENMASK(5, 2)
+#define GICV5_IRS_IDR0_DOM GENMASK(1, 0)
+
+#define GICV5_IRS_IDR0_DOMAIN_SECURE 0b00
+#define GICV5_IRS_IDR0_DOMAIN_NON_SECURE 0b01
+#define GICV5_IRS_IDR0_DOMAIN_EL3 0b10
+#define GICV5_IRS_IDR0_DOMAIN_REALM 0b11
#define GICV5_IRS_IDR1_PRIORITY_BITS GENMASK(22, 20)
#define GICV5_IRS_IDR1_IAFFID_BITS GENMASK(19, 16)
+#define GICV5_IRS_IDR1_PE_CNT GENMASK(15, 0)
#define GICV5_IRS_IDR1_PRIORITY_BITS_1BITS 0b000
#define GICV5_IRS_IDR1_PRIORITY_BITS_2BITS 0b001
@@ -118,6 +162,10 @@
#define GICV5_IRS_IDR2_LPI BIT(5)
#define GICV5_IRS_IDR2_ID_BITS GENMASK(4, 0)
+#define GICV5_IRS_IST_L2SZ_SUPPORT_4KB(r) FIELD_GET(BIT(11), (r))
+#define GICV5_IRS_IST_L2SZ_SUPPORT_16KB(r) FIELD_GET(BIT(12), (r))
+#define GICV5_IRS_IST_L2SZ_SUPPORT_64KB(r) FIELD_GET(BIT(13), (r))
+
#define GICV5_IRS_IDR3_VMT_LEVELS BIT(10)
#define GICV5_IRS_IDR3_VM_ID_BITS GENMASK(9, 5)
#define GICV5_IRS_IDR3_VMD_SZ GENMASK(4, 1)
@@ -130,9 +178,14 @@
#define GICV5_IRS_IDR6_SPI_IRS_RANGE GENMASK(24, 0)
#define GICV5_IRS_IDR7_SPI_BASE GENMASK(23, 0)
-#define GICV5_IRS_IST_L2SZ_SUPPORT_4KB(r) FIELD_GET(BIT(11), (r))
-#define GICV5_IRS_IST_L2SZ_SUPPORT_16KB(r) FIELD_GET(BIT(12), (r))
-#define GICV5_IRS_IST_L2SZ_SUPPORT_64KB(r) FIELD_GET(BIT(13), (r))
+#define GICV5_IRS_IIDR_PRODUCT_ID GENMASK(31, 20)
+#define GICV5_IRS_IIDR_VARIANT GENMASK(19, 16)
+#define GICV5_IRS_IIDR_REVISION GENMASK(15, 12)
+#define GICV5_IRS_IIDR_IMPLEMENTER GENMASK(11, 0)
+
+#define GICV5_IRS_AIDR_COMPONENT GENMASK(11, 8)
+#define GICV5_IRS_AIDR_ARCHMAJORREV GENMASK(7, 4)
+#define GICV5_IRS_AIDR_ARCHMINORREV GENMASK(3, 0)
#define GICV5_IRS_CR0_IDLE BIT(1)
#define GICV5_IRS_CR0_IRSEN BIT(0)
@@ -155,21 +208,39 @@
#define GICV5_IRS_SYNC_STATUSR_IDLE BIT(0)
-#define GICV5_IRS_SPI_STATUSR_V BIT(1)
-#define GICV5_IRS_SPI_STATUSR_IDLE BIT(0)
+#define GICV5_IRS_SPI_VMR_VIRT BIT_ULL(63)
+#define GICV5_IRS_SPI_VMR_VM_ID GENMASK_ULL(15, 0)
#define GICV5_IRS_SPI_SELR_ID GENMASK(23, 0)
+#define GICV5_IRS_SPI_DOMAINR_DOMAIN GENMASK(1, 0)
+
+#define GICV5_IRS_SPI_DOMAINR_DOMAIN_SECURE 0b00
+#define GICV5_IRS_SPI_DOMAINR_DOMAIN_NON_SECURE 0b01
+#define GICV5_IRS_SPI_DOMAINR_DOMAIN_EL3 0b10
+#define GICV5_IRS_SPI_DOMAINR_DOMAIN_REALM 0b11
+
+#define GICV5_IRS_SPI_RESAMPLER_RESAMPLE BIT(0)
+
#define GICV5_IRS_SPI_CFGR_TM BIT(0)
+#define GICV5_IRS_SPI_CFGR_TM_EDGE 0b0
+#define GICV5_IRS_SPI_CFGR_TM_LEVEL 0b1
+
+#define GICV5_IRS_SPI_STATUSR_V BIT(1)
+#define GICV5_IRS_SPI_STATUSR_IDLE BIT(0)
+
#define GICV5_IRS_PE_SELR_IAFFID GENMASK(15, 0)
+#define GICV5_IRS_PE_STATUSR_ONLINE BIT(2)
#define GICV5_IRS_PE_STATUSR_V BIT(1)
#define GICV5_IRS_PE_STATUSR_IDLE BIT(0)
#define GICV5_IRS_PE_CR0_DPS BIT(0)
-#define GICV5_IRS_IST_STATUSR_IDLE BIT(0)
+#define GICV5_IRS_IST_BASER_ADDR_MASK GENMASK_ULL(55, 6)
+#define GICV5_IRS_IST_BASER_VALID BIT_ULL(0)
+#define GICV5_IRS_IST_BASER_ADDR_SHIFT 6ULL
#define GICV5_IRS_IST_CFGR_STRUCTURE BIT(16)
#define GICV5_IRS_IST_CFGR_ISTSZ GENMASK(8, 7)
@@ -187,8 +258,7 @@
#define GICV5_IRS_IST_CFGR_L2SZ_16K 0b01
#define GICV5_IRS_IST_CFGR_L2SZ_64K 0b10
-#define GICV5_IRS_IST_BASER_ADDR_MASK GENMASK_ULL(55, 6)
-#define GICV5_IRS_IST_BASER_VALID BIT_ULL(0)
+#define GICV5_IRS_IST_STATUSR_IDLE BIT(0)
#define GICV5_IRS_MAP_L2_ISTR_ID GENMASK(23, 0)
@@ -213,12 +283,24 @@
#define GICV5_IRS_VPE_DBR_DBPM GENMASK_ULL(36, 32)
#define GICV5_IRS_VPE_DBR_LPI_ID GENMASK_ULL(23, 0)
+#define GICv5_VPE_HPPIR_HPPIV BIT_ULL(32)
+#define GICv5_VPE_HPPIR_TYPE GENMASK_ULL(31, 29)
+#define GICv5_VPE_HPPIR_ID GENMASK_ULL(23, 0)
+
#define GICV5_IRS_VPE_CR0_DPS BIT(0)
#define GICV5_IRS_VPE_STATUSR_F BIT(2)
#define GICV5_IRS_VPE_STATUSR_V BIT(1)
#define GICV5_IRS_VPE_STATUSR_IDLE BIT(0)
+#define GICV5_IRS_VM_DBR_EN BIT_ULL(63)
+#define GICV5_IRS_VM_DBR_VPE_ID GENMASK_ULL(15, 0)
+
+#define GICV5_IRS_VM_SELR_VM_ID GENMASK(15, 0)
+
+#define GICV5_IRS_VM_STATUSR_V BIT(1)
+#define GICV5_IRS_VM_STATUSR_IDLE BIT(0)
+
#define GICV5_IRS_VMAP_L2_VMTR_M BIT_ULL(63)
#define GICV5_IRS_VMAP_L2_VMTR_VM_ID GENMASK_ULL(15, 0)
@@ -231,6 +313,11 @@
#define GICV5_IRS_VMAP_VISTR_VM_ID GENMASK_ULL(47, 32)
#define GICV5_IRS_VMAP_VISTR_TYPE GENMASK_ULL(31, 29)
+#define GICV5_IRS_VMAP_L2_VISTR_M BIT_ULL(63)
+#define GICV5_IRS_VMAP_L2_VISTR_VM_ID GENMASK_ULL(47, 32)
+#define GICV5_IRS_VMAP_L2_VISTR_TYPE GENMASK_ULL(31, 29)
+#define GICV5_IRS_VMAP_L2_VISTR_ID GENMASK_ULL(23, 0)
+
#define GICV5_IRS_VMAP_VPER_M BIT_ULL(63)
#define GICV5_IRS_VMAP_VPER_VM_ID GENMASK_ULL(47, 32)
#define GICV5_IRS_VMAP_VPER_VPE_ID GENMASK_ULL(15, 0)
--
2.34.1
* [PATCH 19/43] KVM: arm64: gic-v5: Introduce struct vgic_v5_irs and IRS base address
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (17 preceding siblings ...)
2026-04-27 16:12 ` [PATCH 18/43] KVM: arm64: gic-v5: Define remaining IRS MMIO registers Sascha Bischoff
@ 2026-04-27 16:12 ` Sascha Bischoff
2026-04-27 16:12 ` [PATCH 20/43] KVM: arm64: gic-v5: Add IRS IODEV to iodev_types and generic MMIO handlers Sascha Bischoff
` (23 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:12 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
In order to properly emulate the operation of the IRS from KVM, we
require storage for the MMIO register state. This change introduces
struct vgic_v5_irs, and adds a pointer to it to the struct vgic_dist.
This new data structure holds the IRS MMIO state required for
emulating the MMIO interface in KVM. It provides persistent storage
and a way to track data across MMIO writes; e.g., selecting an SPI
and then updating its configuration takes two separate writes.
Note that struct vgic_dist only gains a pointer to the new structure:
the structure is large, so it is dynamically allocated rather than
embedded directly.
In addition to adding a structure to store the MMIO state for the IRS,
we add the base address in GPA space to struct vgic_dist.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
include/kvm/arm_vgic.h | 89 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 89 insertions(+)
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 0bcbc751593cc..d8f416f1f46a1 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -376,6 +376,90 @@ struct vgic_v5_vm {
bool vmte_allocated;
};
+/*** GICv5 ***/
+struct vgic_v5_irs {
+ /* base addresses in guest physical address space: */
+ gpa_t vgic_v5_irs_base;
+
+ bool enabled;
+ struct vgic_io_device iodev;
+ struct kvm_device *dev;
+
+ /* IRS state - used for registers etc */
+ struct irs_idr0 {
+ u8 domain;
+ u8 pa_range;
+ bool virt;
+ bool one_of_n;
+ bool virt_one_of_n;
+ bool setlpi;
+ bool mec;
+ bool mpam;
+ bool swe;
+ u16 irs_id;
+ } idr0;
+
+ struct irs_idr1 {
+ /* num_pes is populated from online_vcpus at runtime */
+ u8 priority_bits;
+ } idr1;
+
+ struct irs_idr2 {
+ u8 id_bits;
+ u8 min_lpi_id_bits;
+ bool ist_levels;
+ u8 ist_l2sz;
+ bool istmd;
+ u8 istmd_sz;
+ } idr2;
+
+ struct irs_idr5 {
+ u32 spi_range;
+ } idr5;
+
+ struct irs_idr6 {
+ u32 spi_irs_range;
+ } idr6;
+
+ struct irs_idr7 {
+ u32 spi_base;
+ } idr7;
+
+ struct irs_cr1 {
+ u8 sh;
+ u8 oc;
+ u8 ic;
+ bool ist_ra;
+ bool ist_wa;
+ bool vmt_ra;
+ bool vpet_ra;
+ bool vmd_ra;
+ bool vmd_wa;
+ bool vped_ra;
+ bool vped_wa;
+ } cr1;
+
+ struct irs_spi_selr {
+ u32 id;
+ } spi_selr;
+
+ struct irs_pe_selr {
+ u32 iaffid;
+ } pe_selr;
+
+ struct irs_ist_cfgr {
+ u8 lpi_id_bits;
+ u8 l2sz;
+ u8 istsz;
+ bool structure;
+ } ist_cfgr;
+
+ struct irs_ist_baser {
+ bool valid;
+ u64 addr;
+ } ist_baser;
+};
+
struct vgic_dist {
bool in_kernel;
bool ready;
@@ -453,6 +537,11 @@ struct vgic_dist {
* GICv5 per-VM data.
*/
struct vgic_v5_vm gicv5_vm;
+
+ /*
+ * GICv5 IRS data. Dynamically allocated due to the size.
+ */
+ struct vgic_v5_irs *vgic_v5_irs_data;
};
struct vgic_v2_cpu_if {
--
2.34.1
* [PATCH 20/43] KVM: arm64: gic-v5: Add IRS IODEV to iodev_types and generic MMIO handlers
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (18 preceding siblings ...)
2026-04-27 16:12 ` [PATCH 19/43] KVM: arm64: gic-v5: Introduce struct vgic_v5_irs and IRS base address Sascha Bischoff
@ 2026-04-27 16:12 ` Sascha Bischoff
2026-04-27 16:13 ` [PATCH 21/43] KVM: arm64: gic-v5: Add KVM_VGIC_V5_ADDR_TYPE_IRS to UAPI Sascha Bischoff
` (22 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:12 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
To support full GICv5 VMs (i.e., VMs not limited to PPIs), the GICv5
IRS must be emulated too. The IRS includes an MMIO interface which is
used to interact with and configure the IRS.
As part of providing the emulated IRS MMIO interface in KVM, extend
the iodev_type enum to include IODEV_GICV5_IRS, and extend the MMIO
code to handle reads and writes to that type of IO device. This
allows the creation of a GICv5 IRS IO device in KVM.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-mmio.c | 6 ++++++
arch/arm64/kvm/vgic/vgic-mmio.h | 2 ++
include/kvm/arm_vgic.h | 3 ++-
3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/vgic/vgic-mmio.c b/arch/arm64/kvm/vgic/vgic-mmio.c
index 74d76dec97304..fddb9da0403d5 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio.c
@@ -1065,6 +1065,9 @@ static int dispatch_mmio_read(struct kvm_vcpu *vcpu, struct kvm_io_device *dev,
case IODEV_ITS:
data = region->its_read(vcpu->kvm, iodev->its, addr, len);
break;
+ case IODEV_GICV5_IRS:
+ data = region->read(vcpu, addr, len);
+ break;
}
vgic_data_host_to_mmio_bus(val, len, data);
@@ -1095,6 +1098,9 @@ static int dispatch_mmio_write(struct kvm_vcpu *vcpu, struct kvm_io_device *dev,
case IODEV_ITS:
region->its_write(vcpu->kvm, iodev->its, addr, len, data);
break;
+ case IODEV_GICV5_IRS:
+ region->write(vcpu, addr, len, data);
+ break;
}
return 0;
diff --git a/arch/arm64/kvm/vgic/vgic-mmio.h b/arch/arm64/kvm/vgic/vgic-mmio.h
index 50dc80220b0f3..38ed730d68ac3 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio.h
+++ b/arch/arm64/kvm/vgic/vgic-mmio.h
@@ -217,6 +217,8 @@ unsigned int vgic_v2_init_cpuif_iodev(struct vgic_io_device *dev);
unsigned int vgic_v3_init_dist_iodev(struct vgic_io_device *dev);
+unsigned int vgic_v5_init_irs_iodev(struct vgic_io_device *dev);
+
u64 vgic_sanitise_outer_cacheability(u64 reg);
u64 vgic_sanitise_inner_cacheability(u64 reg);
u64 vgic_sanitise_shareability(u64 reg);
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index d8f416f1f46a1..f8ecd19acc06d 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -290,7 +290,8 @@ enum iodev_type {
IODEV_CPUIF,
IODEV_DIST,
IODEV_REDIST,
- IODEV_ITS
+ IODEV_ITS,
+ IODEV_GICV5_IRS
};
struct vgic_io_device {
--
2.34.1
* [PATCH 21/43] KVM: arm64: gic-v5: Add KVM_VGIC_V5_ADDR_TYPE_IRS to UAPI
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (19 preceding siblings ...)
2026-04-27 16:12 ` [PATCH 20/43] KVM: arm64: gic-v5: Add IRS IODEV to iodev_types and generic MMIO handlers Sascha Bischoff
@ 2026-04-27 16:13 ` Sascha Bischoff
2026-04-27 16:13 ` [PATCH 22/43] KVM: arm64: gic-v5: Add GICv5 IRS IODEV and MMIO emulation Sascha Bischoff
` (21 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:13 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Add an interface to the UAPI to set the address of the emulated IRS
in GPA space. As per the GICv5 specification, the IRS has one
CONFIG_FRAME and optionally one SETLPI_FRAME per interrupt
domain. Within a KVM VM we are only concerned with one interrupt
domain. Each of these frames is 64KB in size, so we reserve 2x64KB of
contiguous memory in the GPA space for a GICv5 IRS.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/include/uapi/asm/kvm.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 1c13bfa2d38aa..d1b2ca317f586 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -97,6 +97,11 @@ struct kvm_regs {
#define KVM_VGIC_V3_REDIST_SIZE (2 * SZ_64K)
#define KVM_VGIC_V3_ITS_SIZE (2 * SZ_64K)
+/* Supported VGICv5 address types */
+#define KVM_VGIC_V5_ADDR_TYPE_IRS 6
+
+#define KVM_VGIC_V5_IRS_SIZE (2 * SZ_64K)
+
#define KVM_ARM_VCPU_POWER_OFF 0 /* CPU is started in OFF state */
#define KVM_ARM_VCPU_EL1_32BIT 1 /* CPU running a 32bit VM */
#define KVM_ARM_VCPU_PSCI_0_2 2 /* CPU uses PSCI v0.2 */
--
2.34.1
* [PATCH 22/43] KVM: arm64: gic-v5: Add GICv5 IRS IODEV and MMIO emulation
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (20 preceding siblings ...)
2026-04-27 16:13 ` [PATCH 21/43] KVM: arm64: gic-v5: Add KVM_VGIC_V5_ADDR_TYPE_IRS to UAPI Sascha Bischoff
@ 2026-04-27 16:13 ` Sascha Bischoff
2026-04-27 16:13 ` [PATCH 23/43] KVM: arm64: gic-v5: Set IRICHPPIDIS based on IRS enable state Sascha Bischoff
` (20 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:13 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
In order to properly support GICv5-based VMs in KVM, we need to
emulate the CONFIG_FRAME for a virtual IRS. This emulation needs to
handle all guest accesses to the MMIO region, and mimic the behaviour
of a real IRS.
Introduce an IODEV for the GICv5 IRS, and an associated init function
that sets up the SPIs and initial state for the IRS. The MMIO emulation
provides support for the guest to query the IRS_IDx registers,
manipulate SPIs, configure ISTs, and so forth.
Some of the guest's interactions with the MMIO region require KVM to
interact with the host IRS to complete the operation. One example is
a guest write to the emulated IRS_PE_CR0. The guest first writes to
the IRS_PE_SELR register to select a PE by IAFFID (for a VM this is
the VPE ID, although the guest doesn't know that), and the selection
is stashed. Ideally, the guest then reads IRS_PE_STATUSR to check
that the written IAFFID is valid; the IRS emulation code performs
this check and sets the V bit accordingly. Finally, when the guest
writes to the emulated IRS_PE_CR0, we again check that the selected
VPE is valid, and then relay the write to the host IRS via a VPE
doorbell.
Similar interactions take place for SPIs too.
The LPI IST likewise requires KVM to perform actions on behalf of the
guest. When the emulated IRS_IST_BASER is written,
KVM re-allocates the IST on the host, matching the guest's
configuration (from the emulated IRS_IST_CFGR) where appropriate. This
is then provided to the physical IRS via the VMTE. As far as the guest
is concerned, the IST it allocated is being used by the hardware, but
in reality the host IST is used instead.
This change provides the IRS IODEV as a whole, but this is not plumbed
into the rest of KVM yet.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/vgic/vgic-irs-v5.c | 823 +++++++++++++++++++++++++++
arch/arm64/kvm/vgic/vgic-v5-tables.c | 16 +
arch/arm64/kvm/vgic/vgic-v5-tables.h | 1 +
arch/arm64/kvm/vgic/vgic.h | 2 +
5 files changed, 843 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/kvm/vgic/vgic-irs-v5.c
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 431de9b145ca1..92dda57c08766 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -24,7 +24,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
vgic/vgic-its.o vgic/vgic-debug.o vgic/vgic-v3-nested.o \
- vgic/vgic-v5.o vgic/vgic-v5-tables.o
+ vgic/vgic-v5.o vgic/vgic-v5-tables.o vgic/vgic-irs-v5.o
kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
diff --git a/arch/arm64/kvm/vgic/vgic-irs-v5.c b/arch/arm64/kvm/vgic/vgic-irs-v5.c
new file mode 100644
index 0000000000000..729a3a3aca3a3
--- /dev/null
+++ b/arch/arm64/kvm/vgic/vgic-irs-v5.c
@@ -0,0 +1,823 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025 ARM Limited, All Rights Reserved.
+ */
+#include <linux/bitops.h>
+#include <linux/bsearch.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <kvm/iodev.h>
+#include <kvm/arm_arch_timer.h>
+#include <kvm/arm_vgic.h>
+
+#include "vgic.h"
+#include "vgic-mmio.h"
+#include "vgic-v5-tables.h"
+
+static struct vgic_dist *vgic_v5_get_vgic(struct kvm_vcpu *vcpu)
+{
+ return &vcpu->kvm->arch.vgic;
+}
+
+static struct vgic_v5_irs *vgic_v5_get_irs(struct kvm_vcpu *vcpu)
+{
+ return vcpu->kvm->arch.vgic.vgic_v5_irs_data;
+}
+
+static unsigned long vgic_v5_mmio_read_irs_misc(struct kvm_vcpu *vcpu,
+ gpa_t addr, unsigned int len)
+{
+ struct vgic_v5_irs *irs = vgic_v5_get_irs(vcpu);
+ const size_t offset = addr & (SZ_64K - 1);
+ struct gicv5_cmd_info cmd_info;
+ struct kvm_vcpu *target_vcpu;
+ u64 value = 0;
+ int rc;
+
+ switch (offset) {
+ case GICV5_IRS_IDR0:
+ value = FIELD_PREP(GICV5_IRS_IDR0_DOM, irs->idr0.domain);
+ value |= FIELD_PREP(GICV5_IRS_IDR0_PA_RANGE, irs->idr0.pa_range);
+ value |= FIELD_PREP(GICV5_IRS_IDR0_VIRT, irs->idr0.virt);
+ value |= FIELD_PREP(GICV5_IRS_IDR0_ONEOFN, irs->idr0.one_of_n);
+ value |= FIELD_PREP(GICV5_IRS_IDR0_VIRT1OFN, irs->idr0.virt_one_of_n);
+ value |= FIELD_PREP(GICV5_IRS_IDR0_SETLPI, irs->idr0.setlpi);
+ value |= FIELD_PREP(GICV5_IRS_IDR0_MEC, irs->idr0.mec);
+ value |= FIELD_PREP(GICV5_IRS_IDR0_MPAM, irs->idr0.mpam);
+ value |= FIELD_PREP(GICV5_IRS_IDR0_SWE, irs->idr0.swe);
+ value |= FIELD_PREP(GICV5_IRS_IDR0_IRSID, irs->idr0.irs_id);
+ break;
+ case GICV5_IRS_IDR1:
+ value = FIELD_PREP(GICV5_IRS_IDR1_PE_CNT,
+ atomic_read(&vcpu->kvm->online_vcpus));
+ value |= FIELD_PREP(GICV5_IRS_IDR1_IAFFID_BITS, vgic_v5_vmte_vpe_id_bits(vcpu));
+ value |= FIELD_PREP(GICV5_IRS_IDR1_PRIORITY_BITS, irs->idr1.priority_bits);
+ break;
+ case GICV5_IRS_IDR2:
+ value = FIELD_PREP(GICV5_IRS_IDR2_ISTMD_SZ, irs->idr2.istmd_sz);
+ value |= FIELD_PREP(GICV5_IRS_IDR2_ISTMD, irs->idr2.istmd);
+ value |= FIELD_PREP(GICV5_IRS_IDR2_IST_L2SZ, irs->idr2.ist_l2sz);
+ value |= FIELD_PREP(GICV5_IRS_IDR2_IST_LEVELS, irs->idr2.ist_levels);
+ value |= FIELD_PREP(GICV5_IRS_IDR2_MIN_LPI_ID_BITS, irs->idr2.min_lpi_id_bits);
+ value |= GICV5_IRS_IDR2_LPI; /* We always support LPIs */
+ value |= FIELD_PREP(GICV5_IRS_IDR2_ID_BITS, irs->idr2.id_bits);
+ break;
+ case GICV5_IRS_IDR5:
+ value = FIELD_PREP(GICV5_IRS_IDR5_SPI_RANGE, irs->idr5.spi_range);
+ break;
+ case GICV5_IRS_IDR6:
+ value = FIELD_PREP(GICV5_IRS_IDR6_SPI_IRS_RANGE, irs->idr6.spi_irs_range);
+ break;
+ case GICV5_IRS_IDR7:
+ value = FIELD_PREP(GICV5_IRS_IDR7_SPI_BASE, irs->idr7.spi_base);
+ break;
+ case GICV5_IRS_IIDR:
+ /* Revision, Variant, ProductID are implementation defined */
+ value = FIELD_PREP(GICV5_IRS_IIDR_PRODUCT_ID, PRODUCT_ID_KVM);
+ value |= FIELD_PREP(GICV5_IRS_IIDR_VARIANT, 0);
+ value |= FIELD_PREP(GICV5_IRS_IIDR_REVISION, 0);
+ value |= FIELD_PREP(GICV5_IRS_IIDR_IMPLEMENTER, IMPLEMENTER_ARM);
+ break;
+ case GICV5_IRS_AIDR:
+ value = FIELD_PREP(GICV5_IRS_AIDR_COMPONENT,
+ GICV5_AIDR_COMPONENT_IRS);
+ value |= FIELD_PREP(GICV5_IRS_AIDR_ARCHMAJORREV,
+ GICV5_AIDR_ARCH_MAJ_REV_V5);
+ value |= FIELD_PREP(GICV5_IRS_AIDR_ARCHMINORREV,
+ GICV5_AIDR_ARCH_MIN_REV_V0);
+ break;
+ case GICV5_IRS_CR0:
+ /*
+ * The IRS is ALWAYS idle as we handle things instantaneously
+ * from a guest's viewpoint.
+ */
+ value = GICV5_IRS_CR0_IDLE;
+ value |= FIELD_PREP(GICV5_IRS_CR0_IRSEN,
+ irs->enabled);
+ break;
+ case GICV5_IRS_CR1:
+ value = FIELD_PREP(GICV5_IRS_CR1_VPED_WA, irs->cr1.vped_wa);
+ value |= FIELD_PREP(GICV5_IRS_CR1_VPED_RA, irs->cr1.vped_ra);
+ value |= FIELD_PREP(GICV5_IRS_CR1_VMD_WA, irs->cr1.vmd_wa);
+ value |= FIELD_PREP(GICV5_IRS_CR1_VMD_RA, irs->cr1.vmd_ra);
+ value |= FIELD_PREP(GICV5_IRS_CR1_VPET_RA, irs->cr1.vpet_ra);
+ value |= FIELD_PREP(GICV5_IRS_CR1_VMT_RA, irs->cr1.vmt_ra);
+ value |= FIELD_PREP(GICV5_IRS_CR1_IST_WA, irs->cr1.ist_wa);
+ value |= FIELD_PREP(GICV5_IRS_CR1_IST_RA, irs->cr1.ist_ra);
+ value |= FIELD_PREP(GICV5_IRS_CR1_IC, irs->cr1.ic);
+ value |= FIELD_PREP(GICV5_IRS_CR1_OC, irs->cr1.oc);
+ value |= FIELD_PREP(GICV5_IRS_CR1_SH, irs->cr1.sh);
+ break;
+ case GICV5_IRS_SYNC_STATUSR:
+ value = GICV5_IRS_SYNC_STATUSR_IDLE;
+ break;
+ case GICV5_IRS_PE_SELR:
+ value = FIELD_PREP(GICV5_IRS_PE_SELR_IAFFID, irs->pe_selr.iaffid);
+ break;
+ case GICV5_IRS_PE_STATUSR:
+ /* We assume that the PE is Online if present. Always IDLE too */
+ value = GICV5_IRS_PE_STATUSR_IDLE;
+
+ /* Set ONLINE and V if IAFFID selects a present PE */
+ if (kvm_get_vcpu_by_id(vcpu->kvm, irs->pe_selr.iaffid)) {
+ value |= GICV5_IRS_PE_STATUSR_ONLINE;
+ value |= GICV5_IRS_PE_STATUSR_V;
+ }
+ break;
+ case GICV5_IRS_PE_CR0:
+ /*
+ * Make sure that we are doing something reasonable first.
+ * Remember, the IAFFID is the same as the VPE_ID
+ */
+ target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, irs->pe_selr.iaffid);
+ if (!target_vcpu) {
+ kvm_err("Guest programmed invalid IAFFID (0x%x) into the IRS_PE_SELR\n",
+ irs->pe_selr.iaffid);
+ break;
+ }
+
+ mutex_lock(&vcpu->kvm->arch.config_lock);
+
+ /*
+ * Read the corresponding IRS_VPE_CR0. We do so via the doorbell
+ * for the specific vcpu we have in the PE_SELR.
+ */
+ cmd_info.cmd_type = VPE_CR0_READ;
+ rc = irq_set_vcpu_affinity(vgic_v5_vpe_db(target_vcpu), &cmd_info);
+ if (rc)
+ kvm_err("Could not read VPE_CR0 in IRS: %d\n", rc);
+ else
+ value = cmd_info.data;
+
+ mutex_unlock(&vcpu->kvm->arch.config_lock);
+
+ break;
+ default:
+ return 0;
+ }
+
+ return value;
+}
+
+static void vgic_v5_mmio_write_irs_misc(struct kvm_vcpu *vcpu, gpa_t addr,
+ unsigned int len, unsigned long val)
+{
+ struct vgic_v5_irs *irs = vgic_v5_get_irs(vcpu);
+ struct vgic_dist *vgic = vgic_v5_get_vgic(vcpu);
+ const size_t offset = addr & (SZ_64K - 1);
+ struct gicv5_cmd_info cmd_info;
+ struct kvm_vcpu *target_vcpu;
+ int rc;
+
+ switch (offset) {
+ case GICV5_IRS_CR0:
+ mutex_lock(&vcpu->kvm->arch.config_lock);
+ /*
+ * We need to make sure that the IRS coming online (or
+ * going offline) is visible to all vCPUs, even if
+ * they are currently resident. Halt all of the vCPUs
+ * now, and resume once we've done the update.
+ */
+ kvm_arm_halt_guest(vcpu->kvm);
+
+ if (FIELD_GET(GICV5_IRS_CR0_IRSEN, val)) {
+ irs->enabled = true;
+ /*
+ * This second enable is the one used by the existing,
+ * non-GICv5 code.
+ */
+ vgic->enabled = true;
+ } else {
+ irs->enabled = false;
+ /* Ditto */
+ vgic->enabled = false;
+ }
+
+ kvm_arm_resume_guest(vcpu->kvm);
+ mutex_unlock(&vcpu->kvm->arch.config_lock);
+
+ return;
+ case GICV5_IRS_CR1:
+ irs->cr1.sh = FIELD_GET(GICV5_IRS_CR1_SH, val);
+ irs->cr1.oc = FIELD_GET(GICV5_IRS_CR1_OC, val);
+ irs->cr1.ic = FIELD_GET(GICV5_IRS_CR1_IC, val);
+ irs->cr1.ist_ra = FIELD_GET(GICV5_IRS_CR1_IST_RA, val);
+ irs->cr1.ist_wa = FIELD_GET(GICV5_IRS_CR1_IST_WA, val);
+ irs->cr1.vmt_ra = FIELD_GET(GICV5_IRS_CR1_VMT_RA, val);
+ irs->cr1.vpet_ra = FIELD_GET(GICV5_IRS_CR1_VPET_RA, val);
+ irs->cr1.vmd_ra = FIELD_GET(GICV5_IRS_CR1_VMD_RA, val);
+ irs->cr1.vmd_wa = FIELD_GET(GICV5_IRS_CR1_VMD_WA, val);
+ irs->cr1.vped_ra = FIELD_GET(GICV5_IRS_CR1_VPED_RA, val);
+ irs->cr1.vped_wa = FIELD_GET(GICV5_IRS_CR1_VPED_WA, val);
+ return;
+ case GICV5_IRS_PE_SELR:
+ irs->pe_selr.iaffid = FIELD_GET(GICV5_IRS_PE_SELR_IAFFID, val);
+ return;
+ case GICV5_IRS_PE_CR0:
+ /*
+ * Make sure that we are doing something reasonable first.
+ * Remember, the IAFFID is the same as the VPE_ID.
+ */
+ target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, irs->pe_selr.iaffid);
+ if (!target_vcpu)
+ return;
+
+ mutex_lock(&vcpu->kvm->arch.config_lock);
+
+ /*
+ * Write the corresponding IRS_VPE_CR0. We do so via the
+ * doorbell for the specific vcpu we have in the PE_SELR.
+ */
+ cmd_info.cmd_type = VPE_CR0_WRITE;
+ cmd_info.data = val;
+ rc = irq_set_vcpu_affinity(vgic_v5_vpe_db(target_vcpu), &cmd_info);
+ if (rc)
+ kvm_err("Could not update VPE_CR0 in IRS: %d\n", rc);
+
+ mutex_unlock(&vcpu->kvm->arch.config_lock);
+ return;
+ default:
+ return;
+ }
+}
+
+static bool vgic_v5_is_spi_selr_valid(struct vgic_v5_irs *irs)
+{
+ /* Invalid - we don't have any SPIs at all */
+ if (irs->idr5.spi_range == 0)
+ return false;
+
+ /* Invalid - we don't have any on this IRS */
+ if (irs->idr6.spi_irs_range == 0)
+ return false;
+
+ /* Invalid - ID is less than min */
+ if (irs->spi_selr.id < irs->idr7.spi_base)
+ return false;
+
+ /* Invalid - ID is greater than max */
+ if (irs->spi_selr.id >=
+ (irs->idr7.spi_base + irs->idr6.spi_irs_range))
+ return false;
+
+ return true;
+}
+
+static unsigned long vgic_v5_mmio_read_irs_spi(struct kvm_vcpu *vcpu,
+ gpa_t addr, unsigned int len)
+{
+ struct vgic_v5_irs *irs = vgic_v5_get_irs(vcpu);
+ struct vgic_dist *vgic = vgic_v5_get_vgic(vcpu);
+ const size_t offset = addr & (SZ_64K - 1);
+ u64 value = 0;
+
+ switch (offset) {
+ case GICV5_IRS_SPI_SELR:
+ /* Return whatever was last written */
+ value = FIELD_PREP(GICV5_IRS_SPI_SELR_ID, irs->spi_selr.id);
+ break;
+ case GICV5_IRS_SPI_STATUSR:
+ /* We assume that we can always claim to be idle */
+ value = GICV5_IRS_SPI_STATUSR_IDLE;
+ value |= FIELD_PREP(GICV5_IRS_SPI_STATUSR_V, vgic_v5_is_spi_selr_valid(irs));
+ break;
+ case GICV5_IRS_SPI_DOMAINR:
+ value = FIELD_PREP(GICV5_IRS_SPI_DOMAINR_DOMAIN,
+ GICV5_IRS_SPI_DOMAINR_DOMAIN_NON_SECURE);
+ break;
+ case GICV5_IRS_SPI_CFGR:
+ if (!vgic_v5_is_spi_selr_valid(irs)) {
+			/* Invalid IRS_SPI_SELR; return 0 */
+ value = 0;
+ break;
+ }
+
+ /* Sanity check for KVM's sake */
+ if (irs->spi_selr.id >= vgic->nr_spis) {
+ kvm_err("Guest trying to access SPI not backed by KVM\n");
+ value = 0;
+ break;
+ }
+
+ if (vgic->spis[irs->spi_selr.id].config == VGIC_CONFIG_EDGE)
+ value = FIELD_PREP(GICV5_IRS_SPI_CFGR_TM, GICV5_IRS_SPI_CFGR_TM_EDGE);
+ else
+ value = FIELD_PREP(GICV5_IRS_SPI_CFGR_TM, GICV5_IRS_SPI_CFGR_TM_LEVEL);
+
+ break;
+ default:
+ return 0;
+ }
+
+ return value;
+}
+
+static void vgic_v5_mmio_write_irs_spi(struct kvm_vcpu *vcpu, gpa_t addr,
+ unsigned int len, unsigned long val)
+{
+ struct vgic_v5_irs *irs = vgic_v5_get_irs(vcpu);
+ const size_t offset = addr & (SZ_64K - 1);
+ struct vgic_irq *irq;
+
+ switch (offset) {
+ case GICV5_IRS_SPI_SELR:
+ irs->spi_selr.id = FIELD_GET(GICV5_IRS_SPI_SELR_ID, val);
+ return;
+ case GICV5_IRS_SPI_CFGR:
+ if (!vgic_v5_is_spi_selr_valid(irs))
+ return;
+
+ /*
+ * Find KVM's representation of the interrupt - we need to make
+ * sure that KVM's view agrees with the guest's, else interrupt
+ * injection won't work properly for level-triggered interrupts
+ * (we fail to handle the clearing of the pending state if KVM
+ * thinks that the interrupt is edge-triggered, which is the
+ * default.)
+ */
+ irq = vgic_get_irq(vcpu->kvm, vgic_v5_make_spi(irs->spi_selr.id));
+ if (!irq)
+ return;
+
+ scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) {
+ if (FIELD_GET(GICV5_IRS_SPI_CFGR_TM, val))
+ irq->config = VGIC_CONFIG_LEVEL;
+ else
+ irq->config = VGIC_CONFIG_EDGE;
+ }
+
+ vgic_put_irq(vcpu->kvm, irq);
+
+ return;
+ default:
+ return;
+ }
+}
+
+static bool vgic_v5_ist_cfgr_valid(struct vgic_v5_irs *irs)
+{
+ unsigned int expected_istsz;
+
+ if (irs->ist_cfgr.lpi_id_bits < irs->idr2.min_lpi_id_bits ||
+ irs->ist_cfgr.lpi_id_bits > irs->idr2.id_bits)
+ return false;
+
+ if (!irs->idr2.istmd)
+ expected_istsz = GICV5_IRS_IST_CFGR_ISTSZ_4;
+ else if (irs->ist_cfgr.lpi_id_bits >= irs->idr2.istmd_sz)
+ expected_istsz = GICV5_IRS_IST_CFGR_ISTSZ_16;
+ else
+ expected_istsz = GICV5_IRS_IST_CFGR_ISTSZ_8;
+
+ if (irs->ist_cfgr.istsz != expected_istsz)
+ return false;
+
+ if (irs->ist_cfgr.structure && !irs->idr2.ist_levels)
+ return false;
+
+ if (!irs->ist_cfgr.structure)
+ return true;
+
+ return irs->ist_cfgr.l2sz == irs->idr2.ist_l2sz;
+}
+
+static unsigned long vgic_v5_mmio_read_irs_ist(struct kvm_vcpu *vcpu,
+ gpa_t addr, unsigned int len)
+{
+ struct vgic_v5_irs *irs = vgic_v5_get_irs(vcpu);
+ const size_t offset = addr & (SZ_64K - 1);
+ u64 value = 0;
+
+ switch (offset) {
+ case GICV5_IRS_IST_STATUSR:
+ return GICV5_IRS_IST_STATUSR_IDLE;
+ case GICV5_IRS_IST_CFGR:
+ value = FIELD_PREP(GICV5_IRS_IST_CFGR_STRUCTURE, irs->ist_cfgr.structure);
+ value |= FIELD_PREP(GICV5_IRS_IST_CFGR_ISTSZ, irs->ist_cfgr.istsz);
+ value |= FIELD_PREP(GICV5_IRS_IST_CFGR_L2SZ, irs->ist_cfgr.l2sz);
+ value |= FIELD_PREP(GICV5_IRS_IST_CFGR_LPI_ID_BITS, irs->ist_cfgr.lpi_id_bits);
+ break;
+ case GICV5_IRS_IST_BASER:
+ value = FIELD_PREP(GICV5_IRS_IST_BASER_ADDR_MASK,
+ irs->ist_baser.addr >> GICV5_IRS_IST_BASER_ADDR_SHIFT);
+ value |= FIELD_PREP(GICV5_IRS_IST_BASER_VALID, irs->ist_baser.valid);
+ break;
+ default:
+ return 0;
+ }
+
+ return value;
+}
+
+static void vgic_v5_mmio_write_irs_ist(struct kvm_vcpu *vcpu, gpa_t addr,
+ unsigned int len, unsigned long val)
+{
+ struct vgic_v5_irs *irs = vgic_v5_get_irs(vcpu);
+ const size_t offset = addr & (SZ_64K - 1);
+ struct gicv5_cmd_info cmd_info;
+ int rc;
+
+ switch (offset) {
+ case GICV5_IRS_IST_CFGR:
+ irs->ist_cfgr.lpi_id_bits = FIELD_GET(GICV5_IRS_IST_CFGR_LPI_ID_BITS, val);
+ irs->ist_cfgr.l2sz = FIELD_GET(GICV5_IRS_IST_CFGR_L2SZ, val);
+ irs->ist_cfgr.istsz = FIELD_GET(GICV5_IRS_IST_CFGR_ISTSZ, val);
+ irs->ist_cfgr.structure = FIELD_GET(GICV5_IRS_IST_CFGR_STRUCTURE, val);
+ return;
+ case GICV5_IRS_IST_BASER: {
+ bool valid = FIELD_GET(GICV5_IRS_IST_BASER_VALID, val);
+
+ guard(mutex)(&vcpu->kvm->arch.config_lock);
+
+ /* Valid -> Invalid */
+ if (irs->ist_baser.valid && !valid) {
+ /* Make the LPI IST invalid and then ... */
+ cmd_info.cmd_type = LPI_VIST_MAKE_INVALID;
+ rc = irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu), &cmd_info);
+ if (WARN_ON_ONCE(rc))
+ break;
+
+ /*
+ * ... free the host IST if we successfully marked the
+		 * IST as invalid. Frankly, if we failed to mark the
+		 * guest's IST invalid, we're cooked, because it means
+		 * that the IRS may still be using the memory that we
+		 * want to free. Hence, we leave it allocated and skip
+		 * clearing the valid bit in the baser.
+ */
+ rc = vgic_v5_lpi_ist_free(vcpu->kvm);
+ if (WARN_ON_ONCE(rc))
+ break;
+ } else if (!irs->ist_baser.valid && valid) { /* Invalid -> Valid */
+ if (!vgic_v5_ist_cfgr_valid(irs)) {
+ kvm_err("Guest programmed invalid IRS_IST_CFGR\n");
+ break;
+ }
+
+ rc = vgic_v5_lpi_ist_alloc(vcpu->kvm,
+ irs->ist_cfgr.lpi_id_bits);
+ if (WARN_ON_ONCE(rc))
+ break;
+ }
+
+ /* Now that we've handled the edges, update the valid bit and addr */
+ irs->ist_baser.valid = FIELD_GET(GICV5_IRS_IST_BASER_VALID, val);
+ irs->ist_baser.addr = FIELD_GET(GICV5_IRS_IST_BASER_ADDR_MASK, val)
+ << GICV5_IRS_IST_BASER_ADDR_SHIFT;
+
+ return;
+ }
+ default:
+ return;
+ }
+}
+
+static const struct vgic_register_region vgic_v5_irs_registers[] = {
+ /*
+ * This is the IRS_CONFIG_FRAME.
+ */
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR0, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR1, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR2, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR3, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR4, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR5, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR6, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR7, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IIDR, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_AIDR, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_CR0, vgic_v5_mmio_read_irs_misc,
+ vgic_v5_mmio_write_irs_misc, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_CR1, vgic_v5_mmio_read_irs_misc,
+ vgic_v5_mmio_write_irs_misc, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_SYNCR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_SYNC_STATUSR,
+ vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_SPI_VMR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8,
+ VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_SPI_SELR, vgic_v5_mmio_read_irs_spi,
+ vgic_v5_mmio_write_irs_spi, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_SPI_DOMAINR, vgic_v5_mmio_read_irs_spi,
+ vgic_v5_mmio_write_irs_spi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_SPI_RESAMPLER, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_SPI_CFGR, vgic_v5_mmio_read_irs_spi,
+ vgic_v5_mmio_write_irs_spi, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_SPI_STATUSR,
+ vgic_v5_mmio_read_irs_spi, vgic_mmio_write_wi,
+ 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_PE_SELR, vgic_v5_mmio_read_irs_misc,
+ vgic_v5_mmio_write_irs_misc, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_PE_STATUSR,
+ vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_PE_CR0, vgic_v5_mmio_read_irs_misc,
+ vgic_v5_mmio_write_irs_misc, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_IST_BASER, vgic_v5_mmio_read_irs_ist,
+ vgic_v5_mmio_write_irs_ist, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IST_CFGR, vgic_v5_mmio_read_irs_ist,
+ vgic_v5_mmio_write_irs_ist, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IST_STATUSR,
+ vgic_v5_mmio_read_irs_ist, vgic_mmio_write_wi,
+ 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_MAP_L2_ISTR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+
+ /*
+ * The following registers are only for running VMs. They are not yet
+ * supported as we don't currently support nested, so expose them as
+ * read-as-zero/write-ignored.
+ */
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VMT_BASER, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VMT_CFGR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VMT_STATUSR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VPE_SELR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VPE_DBR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VPE_HPPIR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VPE_CR0, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VPE_STATUSR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VM_DBR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VM_SELR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VM_STATUSR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VMAP_L2_VMTR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VMAP_VMR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VMAP_VISTR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VMAP_L2_VISTR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_VMAP_VPER, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_SAVE_VMR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_SAVE_VM_STATUSR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+
+ /* MEC, MPAM, SWERR - all unimplemented */
+
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_MEC_IDR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_MEC_MECID_R, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_MPAM_IDR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_MPAM_PARTID_R, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_SWERR_STATUSR, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_SWERR_SYNDROMER0, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+ REGISTER_DESC_WITH_LENGTH(
+ GICV5_IRS_SWERR_SYNDROMER1, vgic_mmio_read_raz,
+ vgic_mmio_write_wi, 8, VGIC_ACCESS_64bit),
+};
+
+unsigned int vgic_v5_init_irs_iodev(struct vgic_io_device *dev)
+{
+ dev->regions = vgic_v5_irs_registers;
+ dev->nr_regions = ARRAY_SIZE(vgic_v5_irs_registers);
+
+ kvm_iodevice_init(&dev->dev, &kvm_io_gic_ops);
+
+ /* We represent both of the IRS frames back to back, so this is 128K */
+ return KVM_VGIC_V5_IRS_SIZE;
+}
+
+int vgic_v5_register_irs_iodev(struct kvm *kvm, gpa_t irs_base_address)
+{
+ struct vgic_io_device *io_device = &kvm->arch.vgic.vgic_v5_irs_data->iodev;
+ unsigned int len;
+
+ /*
+ * Design choice: Force MMIO region to be 64k aligned. Simplifies
+ * pulling out registers.
+ */
+ if (!IS_ALIGNED(irs_base_address, SZ_64K)) {
+ kvm_err("IRS Base address is not aligned to 64k\n");
+ return -EINVAL;
+ }
+
+ len = vgic_v5_init_irs_iodev(io_device);
+
+ io_device->base_addr = irs_base_address;
+ io_device->iodev_type = IODEV_GICV5_IRS;
+ io_device->redist_vcpu = NULL;
+
+ return kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, irs_base_address, len,
+ &io_device->dev);
+}
+
+/**
+ * kvm_vgic_v5_irs_init() - initialize the IRS data structures
+ * @kvm: kvm struct pointer
+ * @nr_spis: number of SPIs, frozen by the caller
+ */
+int kvm_vgic_v5_irs_init(struct kvm *kvm, unsigned int nr_spis)
+{
+ struct vgic_dist *dist = &kvm->arch.vgic;
+ struct vgic_v5_irs *irs = dist->vgic_v5_irs_data;
+ struct kvm_vcpu *vcpu0 = kvm_get_vcpu(kvm, 0);
+ size_t istsz, nr_spi_bits, istmd_sz;
+ phys_addr_t spi_ist_phys_base;
+ u64 mmfr0;
+ int ret;
+ int i;
+
+ /*
+ * We (KVM) allocate an Interrupt State Table (IST) for SPIs. The
+	 * hardware mandates that the lower 6 bits of the address are 0. Each ISTE
+ * is 4 bytes in size (or larger if metadata storage is required). In
+ * order to simplify the allocation logic, we round up the minimum
+ * number of SPIs to 16 (2^6 = 64, 64/4 = 16).
+ */
+ if (nr_spis && nr_spis < 16)
+ nr_spis = 16;
+
+ if (nr_spis) {
+ dist->spis = kcalloc(nr_spis, sizeof(struct vgic_irq),
+ GFP_KERNEL_ACCOUNT);
+ if (!dist->spis)
+ return -ENOMEM;
+
+ /*
+ * In the following code we do not take the irq struct lock since
+ * no other action on irq structs can happen while the VGIC is
+ * not initialized yet.
+ */
+ for (i = 0; i < nr_spis; i++) {
+ struct vgic_irq *irq = &dist->spis[i];
+
+ irq->intid = vgic_v5_make_spi(i);
+ INIT_LIST_HEAD(&irq->ap_list);
+ raw_spin_lock_init(&irq->irq_lock);
+ irq->vcpu = NULL;
+ irq->target_vcpu = vcpu0;
+ refcount_set(&irq->refcount, 0);
+ /*
+ * The guest controls the enable state, and again it is
+ * directly handled by the hardware. From our point of
+ * view it is always enabled.
+ */
+ irq->enabled = 1;
+ }
+
+ nr_spi_bits = fls(roundup_pow_of_two(nr_spis)) - 1;
+
+ istsz = GICV5_IRS_IST_CFGR_ISTSZ_4;
+ if (vgic_v5_host_caps()->istmd) {
+ istmd_sz = vgic_v5_host_caps()->istmd_sz;
+
+ if (nr_spi_bits < istmd_sz)
+ istsz = GICV5_IRS_IST_CFGR_ISTSZ_8;
+ else
+ istsz = GICV5_IRS_IST_CFGR_ISTSZ_16;
+ }
+
+ ret = vgic_v5_spi_ist_allocate(kvm, &spi_ist_phys_base,
+ nr_spi_bits, istsz);
+ if (ret)
+ return ret;
+
+ ret = vgic_v5_vmte_assign_ist(kvm, spi_ist_phys_base, false,
+ nr_spi_bits, 0, istsz, true);
+ if (ret) {
+ vgic_v5_free_allocated_spi_ist(kvm);
+ return ret;
+ }
+ }
+
+ /* Set sane initial state for the IRS MMIO registers */
+
+ irs->idr0.domain = GICV5_IRS_IDR0_DOMAIN_NON_SECURE;
+
+ mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
+ irs->idr0.pa_range = cpuid_feature_extract_unsigned_field(
+ mmfr0, ID_AA64MMFR0_EL1_PARANGE_SHIFT);
+
+ irs->idr0.virt = 0;
+ irs->idr0.one_of_n = 0;
+ irs->idr0.virt_one_of_n = 0;
+ irs->idr0.setlpi = 0;
+ irs->idr0.mec = 0;
+ irs->idr0.mpam = 0;
+ irs->idr0.swe = 0;
+ irs->idr0.irs_id = 0;
+
+ irs->idr1.priority_bits = gicv5_global_data.irs_pri_bits - 1;
+
+ /*
+	 * Support 16 bits of ID space for the IRS. This should be sufficient
+	 * for most applications, and the CPUIF is guaranteed to support at
+	 * least 16 bits of ID space (we present exactly 16 bits there, even
+	 * if the hardware supports more). Warn if the hardware doesn't support
+	 * 16 bits, and use the smaller value. YMMV!
+ *
+ * As for the minimum number of ID bits, we match the hardware's
+ * capability.
+ */
+ if (vgic_v5_host_caps()->ist_id_bits < 16)
+ pr_warn("Host IRS supports fewer than 16 ID bits for ISTs (%u)\n",
+ vgic_v5_host_caps()->ist_id_bits);
+
+ irs->idr2.id_bits = min(16, vgic_v5_host_caps()->ist_id_bits);
+ irs->idr2.min_lpi_id_bits = vgic_v5_host_caps()->min_lpi_id_bits;
+
+ /* Only allow the guest to create Linear ISTs - simplifies Save/Restore */
+ irs->idr2.ist_levels = 0;
+ irs->idr2.ist_l2sz = GICV5_IRS_IST_CFGR_L2SZ_4K;
+ irs->idr2.istmd = 0;
+ irs->idr2.istmd_sz = 0;
+
+	/* We have only a single IRS; all SPIs reside here! */
+ irs->idr5.spi_range = nr_spis;
+ irs->idr6.spi_irs_range = nr_spis;
+ irs->idr7.spi_base = 0;
+
+ irs->cr1.sh = 0;
+ irs->cr1.oc = 0;
+ irs->cr1.ic = 0;
+ irs->cr1.ist_ra = 0;
+ irs->cr1.ist_wa = 0;
+ irs->cr1.vmt_ra = 0;
+ irs->cr1.vpet_ra = 0;
+ irs->cr1.vmd_ra = 0;
+ irs->cr1.vmd_wa = 0;
+ irs->cr1.vped_ra = 0;
+ irs->cr1.vped_wa = 0;
+
+ irs->spi_selr.id = -1;
+
+ irs->pe_selr.iaffid = -1;
+
+ irs->ist_cfgr.lpi_id_bits = 0;
+ irs->ist_cfgr.l2sz = 0;
+ irs->ist_cfgr.istsz = 0;
+ irs->ist_cfgr.structure = 0;
+
+ irs->ist_baser.valid = 0;
+ irs->ist_baser.addr = 0;
+
+ return 0;
+}
diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c b/arch/arm64/kvm/vgic/vgic-v5-tables.c
index 0120c3205dea6..77fc5fb27f30d 100644
--- a/arch/arm64/kvm/vgic/vgic-v5-tables.c
+++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
@@ -578,6 +578,22 @@ int vgic_v5_vmte_release(struct kvm *kvm)
return 0;
}
+/*
+ * Provide a way for the IRS MMIO emulation to correctly populate the number of
+ * IAFFID bits (which corresponds to our vpe_id_bits).
+ */
+u8 vgic_v5_vmte_vpe_id_bits(struct kvm_vcpu *vcpu)
+{
+ u16 vm_id = vgic_v5_vm_id(vcpu->kvm);
+ struct vgic_v5_vm_info *vmi;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return 0;
+
+ return vmi->vpe_id_bits;
+}
+
/*
* Allocate a VPE descriptor and provide it to the hardware via the VPE Table.
*/
diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h b/arch/arm64/kvm/vgic/vgic-v5-tables.h
index 6a024337eba79..25e1c9fff87b4 100644
--- a/arch/arm64/kvm/vgic/vgic-v5-tables.h
+++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
@@ -158,6 +158,7 @@ void vgic_v5_release_vm_id(struct kvm *kvm);
int vgic_v5_vmte_init(struct kvm *kvm);
int vgic_v5_vmte_release(struct kvm *kvm);
+u8 vgic_v5_vmte_vpe_id_bits(struct kvm_vcpu *vcpu);
int vgic_v5_vmte_alloc_vpe(struct kvm_vcpu *vcpu);
int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index f2f5fdc3211d7..282278e4a6c19 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -366,6 +366,7 @@ void vgic_debug_destroy(struct kvm *kvm);
int vgic_v5_probe(const struct gic_kvm_info *info);
void vgic_v5_reset(struct kvm_vcpu *vcpu);
int vgic_v5_init(struct kvm *kvm);
+int kvm_vgic_v5_irs_init(struct kvm *kvm, unsigned int nr_spis);
void vgic_v5_teardown(struct kvm *kvm);
int vgic_v5_map_resources(struct kvm *kvm);
void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid);
@@ -378,6 +379,7 @@ void vgic_v5_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr);
void vgic_v5_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr);
void vgic_v5_restore_state(struct kvm_vcpu *vcpu);
void vgic_v5_save_state(struct kvm_vcpu *vcpu);
+int vgic_v5_register_irs_iodev(struct kvm *kvm, gpa_t irs_base_address);
#define for_each_visible_v5_ppi(__i, __k) \
for_each_set_bit(__i, (__k)->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS)
--
2.34.1
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [PATCH 23/43] KVM: arm64: gic-v5: Set IRICHPPIDIS based on IRS enable state
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (21 preceding siblings ...)
2026-04-27 16:13 ` [PATCH 22/43] KVM: arm64: gic-v5: Add GICv5 IRS IODEV and MMIO emulation Sascha Bischoff
@ 2026-04-27 16:13 ` Sascha Bischoff
2026-04-27 16:14 ` [PATCH 24/43] KVM: arm64: gic-v5: Call IRS init/teardown from vgic_v5 init/teardown Sascha Bischoff
` (19 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:13 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
The GICv5 ICH_CONTEXTR_EL2 register has an IRICHPPIDIS field, which
allows the hypervisor to enable/disable HPPI selection for SPIs and
LPIs. This can be used to emulate the guest enabling/disabling the
IRS. Therefore, drive this field from the IRS enable state. As a
result, SPIs and LPIs cannot be delivered to the guest until it
enables the emulated IRS, which matches the behaviour of the real
hardware.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-v5.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index 48ffcb2494db3..f12e96bc5fa21 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -1191,6 +1191,7 @@ void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu)
void vgic_v5_load(struct kvm_vcpu *vcpu)
{
+ bool irichppidis = !vcpu->kvm->arch.vgic.vgic_v5_irs_data->enabled;
struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
u16 vm = vgic_v5_vm_id(vcpu->kvm);
u16 vpe = vgic_v5_vpe_id(vcpu);
@@ -1207,6 +1208,7 @@ void vgic_v5_load(struct kvm_vcpu *vcpu)
kvm_call_hyp(__vgic_v5_restore_vmcr_apr, cpu_if);
cpu_if->vgic_contextr = FIELD_PREP(ICH_CONTEXTR_EL2_V, true) |
+ FIELD_PREP(ICH_CONTEXTR_EL2_IRICHPPIDIS, irichppidis) |
FIELD_PREP(ICH_CONTEXTR_EL2_VPE, vpe) |
FIELD_PREP(ICH_CONTEXTR_EL2_VM, vm);
--
2.34.1
* [PATCH 24/43] KVM: arm64: gic-v5: Call IRS init/teardown from vgic_v5 init/teardown
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (22 preceding siblings ...)
2026-04-27 16:13 ` [PATCH 23/43] KVM: arm64: gic-v5: Set IRICHPPIDIS based on IRS enable state Sascha Bischoff
@ 2026-04-27 16:14 ` Sascha Bischoff
2026-04-27 16:14 ` [PATCH 25/43] KVM: arm64: gic-v5: Register the IRS IODEV Sascha Bischoff
` (18 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:14 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
As has been the case thus far with the GICv5 KVM implementation, only
PPIs can be supported without both interacting with the host IRS and
providing an emulated IRS for the guest to interact with. Once an
emulated IRS is added, it becomes possible to support both SPIs and
LPIs in the guest, and hence it becomes possible to run something
real.
It makes little sense to allow VM creation with only PPI
support. Hence, as part of initialising or tearing down the GIC, also
do the same for the emulated IRS.
As of this change, GICv5 VMs support all three GICv5 interrupt types:
PPIs, SPIs, and LPIs.
Note that further work is required to properly support SPIs and LPIs,
but this change represents a big step towards supporting a full guest
OS such as Linux, rather than just supporting simple PPI-based unit
tests.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-init.c | 29 ++++++++++++++++++++++-------
arch/arm64/kvm/vgic/vgic-v5.c | 6 ++++++
include/kvm/arm_vgic.h | 1 +
3 files changed, 29 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index 984908a271c8d..cea8e963ade66 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -182,14 +182,26 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
if (type == KVM_DEV_TYPE_ARM_VGIC_V3)
kvm->arch.vgic.nassgicap = system_supports_direct_sgis();
- /*
- * We now know that we have a GICv5. The Arch Timer PPI interrupts may
- * have been initialised at this stage, but will have done so assuming
- * that we have an older GIC, meaning that the IntIDs won't be
- * correct. We init them again, and this time they will be correct.
- */
- if (type == KVM_DEV_TYPE_ARM_VGIC_V5)
+ if (type == KVM_DEV_TYPE_ARM_VGIC_V5) {
+ /* Allocate a vIRS for GICv5 systems */
+ kvm->arch.vgic.vgic_v5_irs_data = kzalloc_obj(struct vgic_v5_irs,
+ GFP_KERNEL_ACCOUNT);
+ if (!kvm->arch.vgic.vgic_v5_irs_data) {
+ kvm->arch.vgic.vgic_model = 0;
+ ret = -ENOMEM;
+ goto out_unlock;
+ }
+
+
+ /*
+ * We now know that we have a GICv5. The Arch Timer PPI
+ * interrupts may have been initialised at this stage, but will
+ * have done so assuming that we have an older GIC, meaning that
+ * the IntIDs won't be correct. We init them again, and this
+ * time they will be correct.
+ */
kvm_timer_init_vm(kvm);
+ }
out_unlock:
mutex_unlock(&kvm->arch.config_lock);
@@ -461,6 +473,9 @@ int vgic_init(struct kvm *kvm)
return ret;
}
} else {
+ if (!dist->nr_spis)
+ dist->nr_spis = VGIC_V5_DEFAULT_NR_SPIS;
+
ret = vgic_v5_init(kvm);
if (ret)
return ret;
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index f12e96bc5fa21..86cfc08c557ba 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -867,6 +867,9 @@ int vgic_v5_init(struct kvm *kvm)
return ret;
}
+ /* Init IRS (and alloc SPI IST) */
+ ret = kvm_vgic_v5_irs_init(kvm, kvm->arch.vgic.nr_spis);
+
return ret;
}
@@ -877,6 +880,9 @@ void vgic_v5_teardown(struct kvm *kvm)
unsigned long i;
int rc;
+ kfree(dist->vgic_v5_irs_data);
+ dist->vgic_v5_irs_data = NULL;
+
/*
* If the VM's ID isn't valid, then we failed init very early. Nothing
* to do here.
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index f8ecd19acc06d..3a2dfbd5210f2 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -25,6 +25,7 @@
#define VGIC_V3_MAX_CPUS 512
#define VGIC_V2_MAX_CPUS 8
#define VGIC_NR_IRQS_LEGACY 256
+#define VGIC_V5_DEFAULT_NR_SPIS 32
#define VGIC_NR_SGIS 16
#define VGIC_NR_PPIS 16
#define VGIC_NR_PRIVATE_IRQS (VGIC_NR_SGIS + VGIC_NR_PPIS)
--
2.34.1
* [PATCH 25/43] KVM: arm64: gic-v5: Register the IRS IODEV
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (23 preceding siblings ...)
2026-04-27 16:14 ` [PATCH 24/43] KVM: arm64: gic-v5: Call IRS init/teardown from vgic_v5 init/teardown Sascha Bischoff
@ 2026-04-27 16:14 ` Sascha Bischoff
2026-04-27 16:14 ` [PATCH 26/43] Documentation: KVM: Extend VGICv5 docs for KVM_VGIC_V5_ADDR_TYPE_IRS Sascha Bischoff
` (17 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:14 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Now that we have an emulated IRS, it needs to be registered, which
ensures that guest accesses to its MMIO regions are routed to the
emulation in KVM. Therefore, as part of vgic_map_resources, the GICv5
IRS IODEV is registered. If no address for the IRS has been provided,
bail out with an error - this is not a supported config.
As part of this change, expose setting the address of the emulated IRS
via KVM_VGIC_V5_ADDR_TYPE_IRS to userspace, as well as setting the
number of SPIs handled by the emulated GICv5 implementation.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-init.c | 23 ++++++-
arch/arm64/kvm/vgic/vgic-kvm-device.c | 97 ++++++++++++++++++---------
2 files changed, 87 insertions(+), 33 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index cea8e963ade66..865d6b125c2b2 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -192,6 +192,11 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
goto out_unlock;
}
+ /*
+	 * Initialization happens later; for now, just mark the
+	 * base address as undefined.
+ */
+ kvm->arch.vgic.vgic_v5_irs_data->vgic_v5_irs_base = VGIC_ADDR_UNDEF;
/*
* We now know that we have a GICv5. The Arch Timer PPI
@@ -638,7 +643,6 @@ int vgic_lazy_init(struct kvm *kvm)
int kvm_vgic_map_resources(struct kvm *kvm)
{
struct vgic_dist *dist = &kvm->arch.vgic;
- bool needs_dist = true;
enum vgic_type type;
gpa_t dist_base;
int ret = 0;
@@ -663,13 +667,12 @@ int kvm_vgic_map_resources(struct kvm *kvm)
} else {
ret = vgic_v5_map_resources(kvm);
type = VGIC_V5;
- needs_dist = false;
}
if (ret)
goto out;
- if (needs_dist) {
+ if (type != VGIC_V5) {
dist_base = dist->vgic_dist_base;
mutex_unlock(&kvm->arch.config_lock);
@@ -680,6 +683,20 @@ int kvm_vgic_map_resources(struct kvm *kvm)
}
} else {
mutex_unlock(&kvm->arch.config_lock);
+
+ if (IS_VGIC_ADDR_UNDEF(
+ dist->vgic_v5_irs_data->vgic_v5_irs_base)) {
+ kvm_err("No IRS address provided\n");
+ ret = -ENXIO;
+ goto out_slots;
+ }
+
+ ret = vgic_v5_register_irs_iodev(
+ kvm, dist->vgic_v5_irs_data->vgic_v5_irs_base);
+ if (ret) {
+ kvm_err("Unable to register VGIC IRS MMIO regions\n");
+ goto out_slots;
+ }
}
smp_store_release(&dist->ready, true);
diff --git a/arch/arm64/kvm/vgic/vgic-kvm-device.c b/arch/arm64/kvm/vgic/vgic-kvm-device.c
index 90be99443df3b..fd3b9cfbc2660 100644
--- a/arch/arm64/kvm/vgic/vgic-kvm-device.c
+++ b/arch/arm64/kvm/vgic/vgic-kvm-device.c
@@ -181,6 +181,14 @@ static int kvm_vgic_addr(struct kvm *kvm, struct kvm_device_attr *attr, bool wri
addr |= (u64)rdreg->count << KVM_VGIC_V3_RDIST_COUNT_SHIFT;
goto out;
}
+ case KVM_VGIC_V5_ADDR_TYPE_IRS:
+ r = vgic_check_type(kvm, KVM_DEV_TYPE_ARM_VGIC_V5);
+ if (r)
+ break;
+ addr_ptr = &vgic->vgic_v5_irs_data->vgic_v5_irs_base;
+ alignment = SZ_64K;
+ size = KVM_VGIC_V5_IRS_SIZE;
+ break;
default:
r = -ENODEV;
}
@@ -224,31 +232,39 @@ static int vgic_set_common_attr(struct kvm_device *dev,
if (get_user(val, uaddr))
return -EFAULT;
- /*
- * We require:
- * - at least 32 SPIs on top of the 16 SGIs and 16 PPIs
- * - at most 1024 interrupts
- * - a multiple of 32 interrupts
- */
- if (val < (VGIC_NR_PRIVATE_IRQS + 32) ||
- val > VGIC_MAX_RESERVED ||
- (val & 31))
- return -EINVAL;
+ if (!vgic_is_v5(dev->kvm)) {
+ /*
+ * We require:
+ * - at least 32 SPIs on top of the 16 SGIs and 16 PPIs
+ * - at most 1024 interrupts
+ * - a multiple of 32 interrupts
+ */
+ if (val < (VGIC_NR_PRIVATE_IRQS + 32) ||
+ val > VGIC_MAX_RESERVED || (val & 31))
+ return -EINVAL;
- mutex_lock(&dev->kvm->arch.config_lock);
+ mutex_lock(&dev->kvm->arch.config_lock);
- /*
- * Either userspace has already configured NR_IRQS or
- * the vgic has already been initialized and vgic_init()
- * supplied a default amount of SPIs.
- */
- if (dev->kvm->arch.vgic.nr_spis)
- ret = -EBUSY;
- else
- dev->kvm->arch.vgic.nr_spis =
- val - VGIC_NR_PRIVATE_IRQS;
+ /*
+ * Either userspace has already configured NR_IRQS or
+ * the vgic has already been initialized and vgic_init()
+ * supplied a default amount of SPIs.
+ */
+ if (dev->kvm->arch.vgic.nr_spis)
+ ret = -EBUSY;
+ else
+ dev->kvm->arch.vgic.nr_spis =
+ val - VGIC_NR_PRIVATE_IRQS;
- mutex_unlock(&dev->kvm->arch.config_lock);
+ mutex_unlock(&dev->kvm->arch.config_lock);
+ } else {
+ mutex_lock(&dev->kvm->arch.config_lock);
+ if (vgic_initialized(dev->kvm) || dev->kvm->arch.vgic.nr_spis)
+ ret = -EBUSY;
+ else
+ dev->kvm->arch.vgic.nr_spis = val;
+ mutex_unlock(&dev->kvm->arch.config_lock);
+ }
return ret;
}
@@ -299,9 +315,14 @@ static int vgic_get_common_attr(struct kvm_device *dev,
return (r == -ENODEV) ? -ENXIO : r;
case KVM_DEV_ARM_VGIC_GRP_NR_IRQS: {
u32 __user *uaddr = (u32 __user *)(long)attr->addr;
-
- r = put_user(dev->kvm->arch.vgic.nr_spis +
- VGIC_NR_PRIVATE_IRQS, uaddr);
+ /* Older GICs */
+ if (!vgic_is_v5(dev->kvm)) {
+ r = put_user(dev->kvm->arch.vgic.nr_spis +
+ VGIC_NR_PRIVATE_IRQS,
+ uaddr);
+ } else {
+ r = put_user(dev->kvm->arch.vgic.nr_spis, uaddr);
+ }
break;
}
}
@@ -748,21 +769,25 @@ static int vgic_v5_set_attr(struct kvm_device *dev,
{
switch (attr->group) {
case KVM_DEV_ARM_VGIC_GRP_ADDR:
+ break;
case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
- case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
return -ENXIO;
+ case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
+ break;
case KVM_DEV_ARM_VGIC_GRP_CTRL:
switch (attr->attr) {
case KVM_DEV_ARM_VGIC_CTRL_INIT:
- return vgic_set_common_attr(dev, attr);
+ break;
case KVM_DEV_ARM_VGIC_USERSPACE_PPIS:
default:
return -ENXIO;
}
+ break;
default:
return -ENXIO;
}
+ return vgic_set_common_attr(dev, attr);
}
static int vgic_v5_get_attr(struct kvm_device *dev,
@@ -770,21 +795,26 @@ static int vgic_v5_get_attr(struct kvm_device *dev,
{
switch (attr->group) {
case KVM_DEV_ARM_VGIC_GRP_ADDR:
+ break;
case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
- case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
return -ENXIO;
+ case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
+ break;
case KVM_DEV_ARM_VGIC_GRP_CTRL:
switch (attr->attr) {
case KVM_DEV_ARM_VGIC_CTRL_INIT:
- return vgic_get_common_attr(dev, attr);
+ break;
case KVM_DEV_ARM_VGIC_USERSPACE_PPIS:
return vgic_v5_get_userspace_ppis(dev, attr);
default:
return -ENXIO;
}
+ break;
default:
return -ENXIO;
}
+
+ return vgic_get_common_attr(dev, attr);
}
static int vgic_v5_has_attr(struct kvm_device *dev,
@@ -792,15 +822,22 @@ static int vgic_v5_has_attr(struct kvm_device *dev,
{
switch (attr->group) {
case KVM_DEV_ARM_VGIC_GRP_ADDR:
+ switch (attr->attr) {
+ case KVM_VGIC_V5_ADDR_TYPE_IRS:
+ return 0;
+ }
+ return -ENXIO;
case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
- case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
return -ENXIO;
+ case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
+ return 0;
case KVM_DEV_ARM_VGIC_GRP_CTRL:
switch (attr->attr) {
case KVM_DEV_ARM_VGIC_CTRL_INIT:
return 0;
case KVM_DEV_ARM_VGIC_USERSPACE_PPIS:
return 0;
+ case KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES:
default:
return -ENXIO;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [PATCH 26/43] Documentation: KVM: Extend VGICv5 docs for KVM_VGIC_V5_ADDR_TYPE_IRS
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (24 preceding siblings ...)
2026-04-27 16:14 ` [PATCH 25/43] KVM: arm64: gic-v5: Register the IRS IODEV Sascha Bischoff
@ 2026-04-27 16:14 ` Sascha Bischoff
2026-04-27 16:15 ` [PATCH 27/43] KVM: arm64: selftests: Update vGICv5 selftest to set IRS address Sascha Bischoff
` (16 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:14 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Now that it is possible and required to set the address of the GICv5
IRS in GPA space, update the documentation accordingly. This region
must be 64KByte-aligned, and covers a total range of 128KBytes.
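For illustration, the two documented constraints (64K alignment, a 128KByte window that must fit in the guest's IPA space) could be validated by a userspace VMM along the following lines. This is only a sketch: the constants and the IPA-limit check are assumptions, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_64K		0x10000ULL
#define IRS_SIZE	(2 * SZ_64K)	/* CONFIG_FRAME + SETLPI_FRAME, 64KB each */

/* Check that an IRS base is 64K-aligned and the 128KB region fits below
 * the guest's IPA limit (the limit parameter is an illustrative stand-in). */
static bool irs_base_is_valid(uint64_t base, uint64_t ipa_limit)
{
	if (base & (SZ_64K - 1))
		return false;	/* not 64K-aligned */

	return base + IRS_SIZE <= ipa_limit;
}
```

A VMM would run such a check before issuing the KVM_DEV_ARM_VGIC_GRP_ADDR set_attr, since the kernel rejects a misaligned or out-of-range base anyway.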
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
.../virt/kvm/devices/arm-vgic-v5.rst | 24 +++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/Documentation/virt/kvm/devices/arm-vgic-v5.rst b/Documentation/virt/kvm/devices/arm-vgic-v5.rst
index 1985b2d880322..2417eba01830c 100644
--- a/Documentation/virt/kvm/devices/arm-vgic-v5.rst
+++ b/Documentation/virt/kvm/devices/arm-vgic-v5.rst
@@ -12,12 +12,28 @@ Only one VGIC instance may be instantiated through this API. The created VGIC
will act as the VM interrupt controller, requiring emulated user-space devices
to inject interrupts to the VGIC instead of directly to CPUs.
-Creating a guest GICv5 device requires a GICv5 host. The current VGICv5 device
-only supports PPI interrupts. These can either be injected from emulated
-in-kernel devices (such as the Arch Timer, or PMU), or via the KVM_IRQ_LINE
-ioctl.
+Creating a guest GICv5 device requires a GICv5 host. The VGICv5 device supports
+PPI, SPI, and LPI interrupts. The PPI and SPI interrupts can either be injected
+from emulated in-kernel devices (such as the Arch Timer, or PMU), or via the
+KVM_IRQ_LINE ioctl. LPIs are not externally injected, but are handled in
+hardware via the LPI IST. Their pending state is driven directly by the guest.
Groups:
+ KVM_DEV_ARM_VGIC_GRP_ADDR
+ Attributes:
+
+ KVM_VGIC_V5_ADDR_TYPE_IRS (rw, 64-bit)
+ Base address in the guest physical address space of the GICv5 IRS
+ (Interrupt Routing Service) register mappings. Only valid for
+ KVM_DEV_TYPE_ARM_VGIC_V5. This address needs to be 64K aligned and the
+ region covers 128 KByte - the IRS has a CONFIG_FRAME and a SETLPI_FRAME,
+ each of which is 64KByte in size.
+
+ Setting the address of the IRS in GPA space is mandatory for initialising
+ a virtual GICv5 as the IRS is responsible for handling SPIs and
+ LPIs. Failure to set the IRS address prior to initialising the GIC results
+ in an error.
+
KVM_DEV_ARM_VGIC_GRP_CTRL
Attributes:
--
2.34.1
* [PATCH 27/43] KVM: arm64: selftests: Update vGICv5 selftest to set IRS address
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (25 preceding siblings ...)
2026-04-27 16:14 ` [PATCH 26/43] Documentation: KVM: Extend VGICv5 docs for KVM_VGIC_V5_ADDR_TYPE_IRS Sascha Bischoff
@ 2026-04-27 16:15 ` Sascha Bischoff
2026-04-27 16:15 ` [PATCH 28/43] KVM: arm64: gic-v5: Introduce SPI AP list Sascha Bischoff
` (15 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:15 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
This selftest was added before the GICv5 IRS was supported in
KVM. Therefore, there was no address to set, and the specific UAPI
didn't even exist.
Now that the IRS is supported, and the setting of the address is
mandatory, set the address of the emulated IRS in GPA space before
initialising the vgic. Initialising a GICv5 irqchip will fail if the
IRS address has not been set by userspace.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
tools/testing/selftests/kvm/arm64/vgic_v5.c | 6 ++++++
tools/testing/selftests/kvm/include/arm64/gic_v5.h | 3 +++
2 files changed, 9 insertions(+)
diff --git a/tools/testing/selftests/kvm/arm64/vgic_v5.c b/tools/testing/selftests/kvm/arm64/vgic_v5.c
index 49c57ea72eaa9..f9df49784a5ca 100644
--- a/tools/testing/selftests/kvm/arm64/vgic_v5.c
+++ b/tools/testing/selftests/kvm/arm64/vgic_v5.c
@@ -100,6 +100,7 @@ static void test_vgic_v5_ppis(uint32_t gic_dev_type)
struct ucall uc;
u64 user_ppis[2];
struct vm_gic v;
+ uint64_t attr;
int ret, i;
v.gic_dev_type = gic_dev_type;
@@ -116,6 +117,11 @@ static void test_vgic_v5_ppis(uint32_t gic_dev_type)
for (i = 0; i < NR_VCPUS; i++)
vcpu_init_descriptor_tables(vcpus[i]);
+ /* Set the address of the IRS before initialising the GIC */
+ attr = GICV5_IRS_CONFIG_BASE_GPA;
+ kvm_device_attr_set(v.gic_fd, KVM_DEV_ARM_VGIC_GRP_ADDR,
+ KVM_VGIC_V5_ADDR_TYPE_IRS, &attr);
+
kvm_device_attr_set(v.gic_fd, KVM_DEV_ARM_VGIC_GRP_CTRL,
KVM_DEV_ARM_VGIC_CTRL_INIT, NULL);
diff --git a/tools/testing/selftests/kvm/include/arm64/gic_v5.h b/tools/testing/selftests/kvm/include/arm64/gic_v5.h
index eb523d9277cf1..c388df8f2a2b4 100644
--- a/tools/testing/selftests/kvm/include/arm64/gic_v5.h
+++ b/tools/testing/selftests/kvm/include/arm64/gic_v5.h
@@ -10,6 +10,9 @@
#include "processor.h"
+/* GIC component base address in guest PA space */
+#define GICV5_IRS_CONFIG_BASE_GPA 0x8000000ULL
+
/*
* Definitions for GICv5 instructions for the Current Domain
*/
--
2.34.1
* [PATCH 28/43] KVM: arm64: gic-v5: Introduce SPI AP list
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (26 preceding siblings ...)
2026-04-27 16:15 ` [PATCH 27/43] KVM: arm64: selftests: Update vGICv5 selftest to set IRS address Sascha Bischoff
@ 2026-04-27 16:15 ` Sascha Bischoff
2026-04-27 16:15 ` [PATCH 29/43] KVM: arm64: gic-v5: Add GIC VDPEND and GIC VDRCFG hyp calls Sascha Bischoff
` (14 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:15 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
As a general rule, GICv5 works a bit differently from previous
generation GICs. When it comes to virtual interrupts, as much as
possible is handled directly by the hardware and requires minimal
software interaction.
So far, the GICv5 support has been limited to PPIs. These are handled
via a set of ICH_PPI_*_EL2 registers, which are used by the hypervisor
to manage the PPI state exposed to the guest. They effectively take
the role of the ICH_LR*_EL2 registers found in earlier GICs, but do so
for EVERY PPI in parallel. For this reason, the GICv5 PPI support
doesn't use AP lists at all - all PPI state is always presented to the
guest.
The lifecycle of a virtual SPI is largely handled by the hardware with
GICv5. GICv5 itself provides a set of system instructions that act
upon the virtual domain. One of these, GIC VDPEND, can be used to make
a specified interrupt pending for a guest. The state of guest
interrupts is tracked by ISTs, which are allocated by the hypervisor
and provided directly by the hardware. The enable state for SPIs and
LPIs is driven directly by the guest (using the GIC CDEN/CDDIS system
instructions). Priority and affinity are also driven by the guest.
All of the above means that it is in theory possible to handle virtual
SPIs from KVM by just executing GIC VDPEND whenever new state is to be
injected into the guest. Of course, reality is a little bit more
complicated.
KVM itself provides an interface to register a notifier on interrupt
deactivation - specifically intended for use with SPIs on Arm-based
systems. This notifier requires KVM to track when an interrupt has
been consumed by the guest, so that the notifier can be called.
SPIs are not per-vcpu - they are effectively global to the VM (even if
they are affine to a specific VCPU, KVM doesn't need to know this
information). Therefore, this change introduces a per-VM AP list
specifically for tracking SPIs for a GICv5 guest. The intent is that
while an SPI is in-flight (pending/active) it remains on this list,
such that KVM knows to track the state of said SPI. Once the interrupt
has been consumed by the guest, it can be popped off the list.
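The in-flight tracking described above can be sketched as follows. This is not the kernel implementation: the list primitives are a hand-rolled stand-in for the kernel's list_head, and the locking and real field names (irq->vcpu, ap_list) are simplified away.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal intrusive doubly-linked list, mimicking the kernel's list_head */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

struct spi {
	struct list_head ap_list;
	bool on_list;	/* stand-in for irq->vcpu being non-NULL */
};

/* An SPI joins the VM-global AP list when it becomes pending/active... */
static void spi_inject(struct spi *s, struct list_head *vm_ap_list)
{
	if (!s->on_list) {
		list_add_tail(&s->ap_list, vm_ap_list);
		s->on_list = true;
	}
}

/* ...and is popped off again once the guest has deactivated it */
static void spi_deactivate(struct spi *s)
{
	if (s->on_list) {
		list_del(&s->ap_list);
		s->on_list = false;
	}
}

/* Scripted demo: how many SPIs remain in flight after the sequence */
static int demo_inflight_count(void)
{
	struct list_head vm_ap_list;
	struct spi a = { .on_list = false }, b = { .on_list = false };
	struct list_head *p;
	int n = 0;

	list_init(&vm_ap_list);
	spi_inject(&a, &vm_ap_list);
	spi_inject(&b, &vm_ap_list);
	spi_inject(&a, &vm_ap_list);	/* re-injection while in flight: no-op */
	spi_deactivate(&b);		/* guest consumed b */

	for (p = vm_ap_list.next; p != &vm_ap_list; p = p->next)
		n++;
	return n;
}
```

The key property is the same as in the patch: membership on the list is what tells KVM which SPIs need their hardware state polled on vcpu exit.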
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-irs-v5.c | 3 +++
include/kvm/arm_vgic.h | 14 ++++++++++++++
2 files changed, 17 insertions(+)
diff --git a/arch/arm64/kvm/vgic/vgic-irs-v5.c b/arch/arm64/kvm/vgic/vgic-irs-v5.c
index 729a3a3aca3a3..8e69b624194d5 100644
--- a/arch/arm64/kvm/vgic/vgic-irs-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-irs-v5.c
@@ -685,6 +685,9 @@ int kvm_vgic_v5_irs_init(struct kvm *kvm, unsigned int nr_spis)
int ret;
int i;
+ INIT_LIST_HEAD(&dist->vgic_v5_spi_ap_list_head);
+ raw_spin_lock_init(&dist->vgic_v5_spi_ap_list_lock);
+
/*
* We (KVM) allocate an Interrupt State Table (IST) for SPIs. The
* hardware mandates that lower 6 bits of the address are 0. Each ISTE
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 3a2dfbd5210f2..812ad325b01be 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -544,6 +544,20 @@ struct vgic_dist {
* GICv5 IRS data. Dynamically allocated due to the size.
*/
struct vgic_v5_irs *vgic_v5_irs_data;
+
+ /*
+ * The GICv5 SPI AP list is global to the VM. This spinlock ensures that
+ * we don't do anything untoward!
+ */
+ raw_spinlock_t vgic_v5_spi_ap_list_lock;
+
+ /*
+ * List of global (non-private) IRQs that must be tracked because they
+ * are either Active or Pending (hence the name; AP list). This list
+ * will only ever contain SPIs or LPIs. All private IRQs must go into a
+ * specific vcpu's AP list.
+ */
+ struct list_head vgic_v5_spi_ap_list_head;
};
struct vgic_v2_cpu_if {
--
2.34.1
* [PATCH 29/43] KVM: arm64: gic-v5: Add GIC VDPEND and GIC VDRCFG hyp calls
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (27 preceding siblings ...)
2026-04-27 16:15 ` [PATCH 28/43] KVM: arm64: gic-v5: Introduce SPI AP list Sascha Bischoff
@ 2026-04-27 16:15 ` Sascha Bischoff
2026-04-27 16:16 ` [PATCH 30/43] KVM: arm64: gic-v5: Track SPI state for in-flight SPIs Sascha Bischoff
` (13 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:15 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
PPI state is injected via the ICH_PPI_x_EL2 system
registers. For SPIs and LPIs, there are no such registers, as these
would limit the number of interrupts significantly. Instead, SPI and
LPI pending state can be managed from the hypervisor using the GIC
VDPEND instruction. This provides a way to set an SPI or LPI for a VM
as pending or non-pending, i.e., to inject interrupts into a guest.
At times, it is important to detect when there is an interrupt that
has been "consumed" by the guest (deactivated). For PPIs, it was
possible to do this via the ICH_PPI_x_EL2 registers, but for SPIs and
LPIs this needs to be done using the GIC VDRCFG instruction. This, in
combination with a read of the ICC_ICSR_EL1, allows the hypervisor to
query the state of any valid SPIs/LPIs for a guest.
These system instructions are only executable from EL2, and therefore
they must be wrapped in hypercalls for NVHE/hVHE configurations. In
the case of the GIC VDRCFG, this hypercall also does the read of the
ICSR to ensure that it snapshots the correct state. Not doing this
could result in reading incorrect state from the ICSR as there is no
guarantee that someone else didn't sneak in meanwhile.
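The operand composition done by the VDPEND wrapper can be sketched as below. The field positions are made-up placeholders, not the architected GICV5_GIC_VDPEND_* layout from the kernel headers; only the FIELD_PREP-style packing is illustrated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VDPEND_ID_MASK		0x00000000ffffffffULL	/* hypothetical */
#define VDPEND_PENDING_BIT	(1ULL << 32)		/* hypothetical */
#define VDPEND_VM_SHIFT		48			/* hypothetical */

/* Compose the 64-bit operand: interrupt ID/type, pending state, VM ID */
static uint64_t pack_vdpend(uint32_t intid, bool pending, uint16_t vm)
{
	uint64_t v = intid & VDPEND_ID_MASK;

	if (pending)
		v |= VDPEND_PENDING_BIT;
	v |= (uint64_t)vm << VDPEND_VM_SHIFT;

	return v;	/* would be handed to the GIC VDPEND instruction at EL2 */
}
```

In the real hypercall, the NVHE handler unpacks the three arguments from the host context registers and performs this packing at EL2.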
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/include/asm/kvm_asm.h | 2 ++
arch/arm64/include/asm/kvm_hyp.h | 2 ++
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 18 ++++++++++++++++++
arch/arm64/kvm/hyp/vgic-v5-sr.c | 20 ++++++++++++++++++++
4 files changed, 42 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 8c69f1f4de534..453fc063eb61b 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -81,6 +81,8 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___vgic_v3_restore_vmcr_aprs,
__KVM_HOST_SMCCC_FUNC___vgic_v5_make_resident,
__KVM_HOST_SMCCC_FUNC___vgic_v5_make_non_resident,
+ __KVM_HOST_SMCCC_FUNC___vgic_v5_vdpend,
+ __KVM_HOST_SMCCC_FUNC___vgic_v5_vdrcfg,
__KVM_HOST_SMCCC_FUNC___vgic_v5_save_apr,
__KVM_HOST_SMCCC_FUNC___vgic_v5_restore_vmcr_apr,
diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index 5f9184276b04e..20aeb29a4adf1 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -97,6 +97,8 @@ void __vgic_v5_save_ppi_state(struct vgic_v5_cpu_if *cpu_if);
void __vgic_v5_restore_ppi_state(struct vgic_v5_cpu_if *cpu_if);
void __vgic_v5_save_state(struct vgic_v5_cpu_if *cpu_if);
void __vgic_v5_restore_state(struct vgic_v5_cpu_if *cpu_if);
+void __vgic_v5_vdpend(u32 intid, bool pending, u16 vm);
+u64 __vgic_v5_vdrcfg(u32 intid);
#ifdef __KVM_NVHE_HYPERVISOR__
void __timer_enable_traps(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 804a9ffdc8594..5c4dc2e71fcbe 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -699,6 +699,22 @@ static void handle___vgic_v5_restore_vmcr_apr(struct kvm_cpu_context *host_ctxt)
__vgic_v5_restore_vmcr_apr(kern_hyp_va(cpu_if));
}
+static void handle___vgic_v5_vdpend(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(u32, intid, host_ctxt, 1);
+ DECLARE_REG(bool, pending, host_ctxt, 2);
+ DECLARE_REG(u16, vm, host_ctxt, 3);
+
+ __vgic_v5_vdpend(intid, pending, vm);
+}
+
+static void handle___vgic_v5_vdrcfg(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(u32, intid, host_ctxt, 1);
+
+ cpu_reg(host_ctxt, 1) = __vgic_v5_vdrcfg(intid);
+}
+
typedef void (*hcall_t)(struct kvm_cpu_context *);
#define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x
@@ -726,6 +742,8 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__vgic_v3_restore_vmcr_aprs),
HANDLE_FUNC(__vgic_v5_make_resident),
HANDLE_FUNC(__vgic_v5_make_non_resident),
+ HANDLE_FUNC(__vgic_v5_vdpend),
+ HANDLE_FUNC(__vgic_v5_vdrcfg),
HANDLE_FUNC(__vgic_v5_save_apr),
HANDLE_FUNC(__vgic_v5_restore_vmcr_apr),
diff --git a/arch/arm64/kvm/hyp/vgic-v5-sr.c b/arch/arm64/kvm/hyp/vgic-v5-sr.c
index d27fe2911df3f..05090f5a0d9b6 100644
--- a/arch/arm64/kvm/hyp/vgic-v5-sr.c
+++ b/arch/arm64/kvm/hyp/vgic-v5-sr.c
@@ -148,3 +148,23 @@ void __vgic_v5_restore_state(struct vgic_v5_cpu_if *cpu_if)
{
write_sysreg_s(cpu_if->vgic_icsr, SYS_ICC_ICSR_EL1);
}
+
+void __vgic_v5_vdpend(u32 intid, bool pending, u16 vm)
+{
+ u64 value;
+
+ value = intid & (GICV5_GIC_VDPEND_ID_MASK | GICV5_GIC_VDPEND_TYPE_MASK);
+ value |= FIELD_PREP(GICV5_GIC_VDPEND_PENDING_MASK, pending);
+ value |= FIELD_PREP(GICV5_GIC_VDPEND_VM_MASK, vm);
+ gic_insn(value, VDPEND);
+}
+
+u64 __vgic_v5_vdrcfg(u32 intid)
+{
+ u64 value;
+
+ value = intid & (GICV5_GIC_VDRCFG_ID_MASK | GICV5_GIC_VDRCFG_TYPE_MASK);
+ gic_insn(value, VDRCFG);
+ isb();
+ return read_sysreg_s(SYS_ICC_ICSR_EL1);
+}
--
2.34.1
* [PATCH 30/43] KVM: arm64: gic-v5: Track SPI state for in-flight SPIs
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (28 preceding siblings ...)
2026-04-27 16:15 ` [PATCH 29/43] KVM: arm64: gic-v5: Add GIC VDPEND and GIC VDRCFG hyp calls Sascha Bischoff
@ 2026-04-27 16:16 ` Sascha Bischoff
2026-04-27 16:16 ` [PATCH 31/43] KVM: arm64: gic: Introduce set_pending_state() to irq_op Sascha Bischoff
` (12 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:16 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
GICv5 interrupt state is largely managed by the hardware
itself. However, it is possible to register a notifier for the
deactivation of an SPI, and hence KVM is required to track when such
an SPI has been consumed by the guest in order to trigger the
notifier. This allows the code that registered the notifier to be
informed when an SPI has been consumed and deactivated by a guest, and
hence that the guest is ready to receive the next interrupt, if required.
As part of folding interrupt state for GICv5, which until now just
included PPIs, check the SPI state. For each in-flight SPI (an SPI
that is on the VM's SPI AP list), use GIC VDRCFG to retrieve the state
of the SPI, and track the active and pending states to determine when
the SPI has been deactivated by the guest. This needs to happen on
*every* vcpu exit for *all* vcpus belonging to the VM whenever any SPI
is in flight. When no SPIs are in flight, it is skipped altogether.
When an SPI deactivation is detected, kvm_notify_acked_irq() is called
which triggers any registered notifiers for the SPI (and is a NOP
otherwise). Additionally, the SPI itself is popped off the AP list.
NOTE: there is currently no way to query if an SPI has a notification
requirement or not. This could be optimised by introducing that and
only tracking the state of SPIs that actually have notifiers attached.
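The fold-on-exit deactivation check can be sketched as follows. The ICSR bit positions are illustrative placeholders, not the architected ICC_ICSR_EL1 layout, and the shadow state is reduced to the two bits that matter here (the level-triggered line level is omitted).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ICSR_PENDING	(1ULL << 0)	/* hypothetical bit position */
#define ICSR_ACTIVE	(1ULL << 1)	/* hypothetical bit position */

struct spi_state {
	bool edge;		/* edge- vs level-triggered */
	bool pending_latch;
	bool active;
};

/* Fold one VDRCFG/ICSR snapshot into the shadow state; returns true when
 * the SPI has been consumed (neither pending nor active any more). */
static bool fold_spi_state(struct spi_state *s, uint64_t icsr)
{
	bool pending = !!(icsr & ICSR_PENDING);

	s->active = !!(icsr & ICSR_ACTIVE);

	if (s->edge)
		s->pending_latch = pending;
	else if (!pending && !s->active)
		s->pending_latch = false;

	return !s->active && !s->pending_latch;
}

/* Edge SPI: acked by the guest (active only), then deactivated */
static bool demo_edge_spi_deactivated(void)
{
	struct spi_state s = { .edge = true, .pending_latch = true };

	if (fold_spi_state(&s, ICSR_ACTIVE))	/* still active: not done */
		return false;
	return fold_spi_state(&s, 0);		/* both bits clear: done */
}
```

Only when the fold returns true would the notifier fire and the SPI be removed from the AP list.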
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-v5.c | 81 ++++++++++++++++++++++++++++++++++-
arch/arm64/kvm/vgic/vgic.c | 2 +-
arch/arm64/kvm/vgic/vgic.h | 2 +-
3 files changed, 82 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index 86cfc08c557ba..f36d37c694d71 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -1123,7 +1123,7 @@ bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu)
* Detect any PPIs state changes, and propagate the state with KVM's
* shadow structures.
*/
-void vgic_v5_fold_ppi_state(struct kvm_vcpu *vcpu)
+static void vgic_v5_fold_ppi_state(struct kvm_vcpu *vcpu)
{
struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
unsigned long *activer, *pendr;
@@ -1195,6 +1195,85 @@ void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu)
VGIC_V5_NR_PRIVATE_IRQS);
}
+struct vgic_v5_acked_irq {
+ struct list_head node;
+ u32 intid;
+};
+
+void vgic_v5_fold_irq_state(struct kvm_vcpu *vcpu)
+{
+ struct vgic_dist *vgic_dist = &vcpu->kvm->arch.vgic;
+ struct vgic_v5_acked_irq *acked, *tmp_acked;
+ struct vgic_irq *irq, *tmp;
+ LIST_HEAD(acked_irqs);
+
+ /* Sync back the guest PPI state to the KVM shadow state */
+ vgic_v5_fold_ppi_state(vcpu);
+
+ /*
+ * For SPIs, which are on the global AP list, we synchronise their state
+ * with the hardware state. If they have been deactivated, immediately
+ * pop them off the list and call the notifier.
+ */
+ raw_spin_lock(&vgic_dist->vgic_v5_spi_ap_list_lock);
+ list_for_each_entry_safe(irq, tmp, &vgic_dist->vgic_v5_spi_ap_list_head, ap_list) {
+ bool pending, deactivated = false;
+ u64 icsr;
+
+ raw_spin_lock(&irq->irq_lock);
+
+ icsr = kvm_call_hyp_ret(__vgic_v5_vdrcfg, irq->intid);
+
+ irq->active = !!FIELD_GET(ICC_ICSR_EL1_Active, icsr);
+ pending = !!FIELD_GET(ICC_ICSR_EL1_Pending, icsr);
+
+ if (irq->config == VGIC_CONFIG_EDGE)
+ irq->pending_latch = pending;
+
+ if (irq->config == VGIC_CONFIG_LEVEL && !(pending || irq->active))
+ irq->pending_latch = false;
+
+ /* Deactivated? */
+ if (!irq->active && !irq_is_pending(irq)) {
+ deactivated = true;
+
+ acked = kzalloc_obj(*acked, GFP_ATOMIC);
+ if (!acked) {
+ raw_spin_unlock(&irq->irq_lock);
+ raw_spin_unlock(&vgic_dist->vgic_v5_spi_ap_list_lock);
+
+ /* Notify what we have queued before returning */
+ goto out_notify_acked_irqs;
+ }
+
+ /* Use raw SPI index without type for the GSI */
+ acked->intid = FIELD_GET(GICV5_HWIRQ_ID, irq->intid);
+ list_add_tail(&acked->node, &acked_irqs);
+
+ /* And we're done with this SPI */
+ list_del(&irq->ap_list);
+ irq->vcpu = NULL;
+ }
+
+ raw_spin_unlock(&irq->irq_lock);
+
+ if (deactivated)
+ vgic_put_irq(vcpu->kvm, irq);
+ }
+ raw_spin_unlock(&vgic_dist->vgic_v5_spi_ap_list_lock);
+
+ /*
+ * Call the notifiers for the deactivated SPIs without holding the SPI
+ * AP List lock to avoid deadlocks!
+ */
+out_notify_acked_irqs:
+ list_for_each_entry_safe(acked, tmp_acked, &acked_irqs, node) {
+ kvm_notify_acked_irq(vcpu->kvm, 0, acked->intid);
+ list_del(&acked->node);
+ kfree(acked);
+ }
+}
+
void vgic_v5_load(struct kvm_vcpu *vcpu)
{
bool irichppidis = !vcpu->kvm->arch.vgic.vgic_v5_irs_data->enabled;
diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
index d56e87a0d2acc..d628eea4cfa4e 100644
--- a/arch/arm64/kvm/vgic/vgic.c
+++ b/arch/arm64/kvm/vgic/vgic.c
@@ -855,7 +855,7 @@ static void vgic_prune_ap_list(struct kvm_vcpu *vcpu)
static void vgic_fold_state(struct kvm_vcpu *vcpu)
{
if (vgic_is_v5(vcpu->kvm)) {
- vgic_v5_fold_ppi_state(vcpu);
+ vgic_v5_fold_irq_state(vcpu);
return;
}
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index 282278e4a6c19..7eef8ece52dde 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -372,7 +372,7 @@ int vgic_v5_map_resources(struct kvm *kvm);
void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid);
bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu);
void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu);
-void vgic_v5_fold_ppi_state(struct kvm_vcpu *vcpu);
+void vgic_v5_fold_irq_state(struct kvm_vcpu *vcpu);
void vgic_v5_load(struct kvm_vcpu *vcpu);
void vgic_v5_put(struct kvm_vcpu *vcpu);
void vgic_v5_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr);
--
2.34.1
* [PATCH 31/43] KVM: arm64: gic: Introduce set_pending_state() to irq_op
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (29 preceding siblings ...)
2026-04-27 16:16 ` [PATCH 30/43] KVM: arm64: gic-v5: Track SPI state for in-flight SPIs Sascha Bischoff
@ 2026-04-27 16:16 ` Sascha Bischoff
2026-04-27 16:16 ` [PATCH 32/43] KVM: arm64: gic-v5: Support SPI injection Sascha Bischoff
` (11 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:16 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
There are times, such as with GICv5 SPIs and LPIs, where the hardware
itself handles the interrupt lifecycle. This means that interrupt
state can be directly communicated to the hardware, and no AP lists or
similar are required to do so. As a result, such interrupts don't
fit into the existing flows.
In order to accommodate cases where the hardware handles pending state
directly, a new function is added via a function pointer:
set_pending_state(). The intent is for this to be used to directly set
the pending state in hardware, and can be used when the hardware
itself is able to handle the lifecycle of an interrupt (for example,
for GICv5's SPIs).
This new function is plumbed into kvm_vgic_inject_irq(), and is only
called if irq_ops are provided and this function pointer is explicitly
set. In the general case, this has no effect.
This is part of preparatory work for GICv5 SPI support in KVM.
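The optional-callback pattern described above can be sketched as below. The types are simplified stand-ins, not the actual KVM structures; the point is that both the ops pointer and the hook may be absent, so the default path is unaffected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct irq;

struct irq_ops {
	/* optional: push pending state straight to hardware */
	bool (*set_pending_state)(struct irq *irq);
};

struct irq {
	const struct irq_ops *ops;
	bool hw_pending;	/* stand-in for hardware-held pending state */
};

static bool hw_set_pending(struct irq *irq)
{
	irq->hw_pending = true;
	return true;
}

static const struct irq_ops hw_managed_ops = {
	.set_pending_state = hw_set_pending,
};

/* Core injection path: the hook is only called when explicitly wired up */
static void inject(struct irq *irq)
{
	if (irq->ops && irq->ops->set_pending_state)
		irq->ops->set_pending_state(irq);
}

static bool demo_hook_called(void)
{
	struct irq plain = { .ops = NULL };
	struct irq hw = { .ops = &hw_managed_ops };

	inject(&plain);	/* no ops attached: nothing happens */
	inject(&hw);	/* pushes pending state "to hardware" */
	return !plain.hw_pending && hw.hw_pending;
}
```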
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic.c | 3 +++
include/kvm/arm_vgic.h | 6 ++++++
2 files changed, 9 insertions(+)
diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
index d628eea4cfa4e..b35833a4e2bf9 100644
--- a/arch/arm64/kvm/vgic/vgic.c
+++ b/arch/arm64/kvm/vgic/vgic.c
@@ -565,6 +565,9 @@ int kvm_vgic_inject_irq(struct kvm *kvm, struct kvm_vcpu *vcpu,
else
irq->pending_latch = true;
+ if (irq->ops && irq->ops->set_pending_state)
+ WARN_ON_ONCE(!irq->ops->set_pending_state(vcpu, irq));
+
vgic_queue_irq_unlock(kvm, irq, flags);
vgic_put_irq(kvm, irq);
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 812ad325b01be..a47e5c742aac8 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -217,6 +217,12 @@ struct irq_ops {
*/
bool (*get_input_level)(int vintid);
+ /*
+ * Function pointer to directly set the pending state for interrupts
+	 * that don't need per-VCPU AP list queuing (for example, GICv5 SPIs).
+ */
+ bool (*set_pending_state)(struct kvm_vcpu *vcpu, struct vgic_irq *irq);
+
/*
* Function pointer to override the queuing of an IRQ.
*/
--
2.34.1
* [PATCH 32/43] KVM: arm64: gic-v5: Support SPI injection
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (30 preceding siblings ...)
2026-04-27 16:16 ` [PATCH 31/43] KVM: arm64: gic: Introduce set_pending_state() to irq_op Sascha Bischoff
@ 2026-04-27 16:16 ` Sascha Bischoff
2026-04-27 16:17 ` [PATCH 33/43] KVM: arm64: gic-v5: Add GICv5 SPI injection to irqfd Sascha Bischoff
` (10 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:16 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
GICv5 SPI lifecycle is handled by the GICv5 hardware once the pending
state has been injected.
This change adds support for injecting and managing SPIs to the core
VGIC code and GICv5 code. First of all, allow GICv5 SPIs to be looked
up by ID via vgic_get_irq(). Previously, only PPIs were supported.
Two irq_ops are used to inject the SPI pending state into the
hardware, and to append the SPI to the VM's global SPI AP list. The
set_pending_state() irq_op is used to inject the SPI's pending state
into the guest. The queue_irq_unlock irq_op is used to append the SPI
to the SPI AP list - they are not added to a per-VCPU AP list as they
are global to the VM. Also, this would require KVM to track the
affinity of individual interrupts, which would negate much of the
benefit of their lifecycle's being hardware managed.
While the SPIs are on the global AP list, their state is checked on
every vcpu exit, and once they've been consumed they are removed from
the AP list again.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-irs-v5.c | 1 +
arch/arm64/kvm/vgic/vgic-v5.c | 91 +++++++++++++++++++++++++++++++
arch/arm64/kvm/vgic/vgic.c | 14 +++--
arch/arm64/kvm/vgic/vgic.h | 2 +
4 files changed, 103 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-irs-v5.c b/arch/arm64/kvm/vgic/vgic-irs-v5.c
index 8e69b624194d5..3a2539bf99d0d 100644
--- a/arch/arm64/kvm/vgic/vgic-irs-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-irs-v5.c
@@ -724,6 +724,7 @@ int kvm_vgic_v5_irs_init(struct kvm *kvm, unsigned int nr_spis)
* view it is always enabled.
*/
irq->enabled = 1;
+ vgic_v5_set_spi_ops(irq);
}
nr_spi_bits = fls(roundup_pow_of_two(nr_spis)) - 1;
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index f36d37c694d71..38e37a03ac951 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -1274,6 +1274,97 @@ void vgic_v5_fold_irq_state(struct kvm_vcpu *vcpu)
}
}
+static bool vgic_v5_set_spi_pending_state(struct kvm_vcpu *vcpu,
+ struct vgic_irq *irq)
+{
+ vgic_v5_set_irq_pend(irq->target_vcpu, irq);
+ return true;
+}
+
+/*
+ * Put the SPI on the SPI AP list. No need to kick the VCPU. If it is running,
+ * the interrupt will signal at some point, and if not, then a VPE doorbell will
+ * fire (based on the IAFFID the guest has configured).
+ */
+static bool vgic_v5_spi_queue_irq_unlock(struct kvm *kvm,
+ struct vgic_irq *irq,
+ unsigned long flags)
+ __releases(&irq->irq_lock)
+{
+ struct vgic_dist *vgic_dist = &kvm->arch.vgic;
+
+ lockdep_assert_held(&irq->irq_lock);
+
+ if (WARN_ON(!__irq_is_spi(KVM_DEV_TYPE_ARM_VGIC_V5, irq->intid)))
+ return false;
+
+retry:
+ /*
+ * We're already on the AP list or don't need to be on
+ * one; nothing more to do.
+ */
+ if (irq->vcpu) {
+ raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
+ return true;
+ }
+
+ raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
+
+ /* someone can do stuff here, which we re-check below */
+ raw_spin_lock_irqsave(&vgic_dist->vgic_v5_spi_ap_list_lock, flags);
+ raw_spin_lock(&irq->irq_lock);
+
+ /*
+ * We've lost the race; and have already been queued. Unlock
+ * global AP list, relock IRQ, and retry.
+ */
+ if (unlikely(irq->vcpu)) {
+ raw_spin_unlock(&irq->irq_lock);
+ raw_spin_unlock_irqrestore(&vgic_dist->vgic_v5_spi_ap_list_lock, flags);
+
+ raw_spin_lock_irqsave(&irq->irq_lock, flags);
+
+ goto retry;
+ }
+
+ list_add_tail(&irq->ap_list, &vgic_dist->vgic_v5_spi_ap_list_head);
+
+ /*
+ * Use the VCPU we've been given as the target VCPU to track
+ * that we're on an AP list. We're not queued on that VCPU's AP
+ * list, but in lieu of an AP flag, this will do.
+ */
+ irq->vcpu = irq->target_vcpu;
+
+ raw_spin_unlock(&irq->irq_lock);
+ raw_spin_unlock_irqrestore(&vgic_dist->vgic_v5_spi_ap_list_lock, flags);
+
+ return true;
+}
+
+static struct irq_ops vgic_v5_spi_irq_ops = {
+ .set_pending_state = vgic_v5_set_spi_pending_state,
+ .queue_irq_unlock = vgic_v5_spi_queue_irq_unlock,
+};
+
+void vgic_v5_set_spi_ops(struct vgic_irq *irq)
+{
+ if (WARN_ON(!irq) || WARN_ON(irq->ops))
+ return;
+
+ irq->ops = &vgic_v5_spi_irq_ops;
+}
+
+/* Set the pending state for GICv5 SPIs and LPIs */
+void vgic_v5_set_irq_pend(struct kvm_vcpu *vcpu, struct vgic_irq *irq)
+{
+ if (WARN_ON(__irq_is_ppi(KVM_DEV_TYPE_ARM_VGIC_V5, irq->intid)))
+ return;
+
+ kvm_call_hyp(__vgic_v5_vdpend, irq->intid, irq_is_pending(irq),
+ vcpu->kvm->arch.vgic.gicv5_vm.vm_id);
+}
+
void vgic_v5_load(struct kvm_vcpu *vcpu)
{
bool irichppidis = !vcpu->kvm->arch.vgic.vgic_v5_irs_data->enabled;
diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
index b35833a4e2bf9..3c8655aee8bbf 100644
--- a/arch/arm64/kvm/vgic/vgic.c
+++ b/arch/arm64/kvm/vgic/vgic.c
@@ -86,12 +86,16 @@ static struct vgic_irq *vgic_get_lpi(struct kvm *kvm, u32 intid)
*/
struct vgic_irq *vgic_get_irq(struct kvm *kvm, u32 intid)
{
- /* Non-private IRQs are not yet implemented for GICv5 */
- if (vgic_is_v5(kvm))
- return NULL;
-
/* SPIs */
- if (intid >= VGIC_NR_PRIVATE_IRQS &&
+ if (vgic_is_v5(kvm) && __irq_is_spi(KVM_DEV_TYPE_ARM_VGIC_V5, intid)) {
+ u32 int_num = FIELD_GET(GICV5_HWIRQ_ID, intid);
+
+ if (int_num >= kvm->arch.vgic.nr_spis)
+ return NULL;
+
+ int_num = array_index_nospec(int_num, kvm->arch.vgic.nr_spis);
+ return &kvm->arch.vgic.spis[int_num];
+ } else if (intid >= VGIC_NR_PRIVATE_IRQS &&
intid < (kvm->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS)) {
intid = array_index_nospec(intid, kvm->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS);
return &kvm->arch.vgic.spis[intid - VGIC_NR_PRIVATE_IRQS];
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index 7eef8ece52dde..b5036170430dd 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -370,6 +370,8 @@ int kvm_vgic_v5_irs_init(struct kvm *kvm, unsigned int nr_spis);
void vgic_v5_teardown(struct kvm *kvm);
int vgic_v5_map_resources(struct kvm *kvm);
void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid);
+void vgic_v5_set_spi_ops(struct vgic_irq *irq);
+void vgic_v5_set_irq_pend(struct kvm_vcpu *vcpu, struct vgic_irq *irq);
bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu);
void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu);
void vgic_v5_fold_irq_state(struct kvm_vcpu *vcpu);
--
2.34.1
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [PATCH 33/43] KVM: arm64: gic-v5: Add GICv5 SPI injection to irqfd
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (31 preceding siblings ...)
2026-04-27 16:16 ` [PATCH 32/43] KVM: arm64: gic-v5: Support SPI injection Sascha Bischoff
@ 2026-04-27 16:17 ` Sascha Bischoff
2026-04-27 16:17 ` [PATCH 34/43] KVM: arm64: gic-v5: Mask per-vcpu PPI state in vgic_v5_finalize_ppi_state() Sascha Bischoff
` (9 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:17 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Now that there is support for GICv5 SPIs in KVM, update
vgic_irqfd_set_irq() to handle GICv5 SPIs. This allows their state to
be injected via that interface.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-irqfd.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c b/arch/arm64/kvm/vgic/vgic-irqfd.c
index b9b86e3a6c862..3b9ebd8e367be 100644
--- a/arch/arm64/kvm/vgic/vgic-irqfd.c
+++ b/arch/arm64/kvm/vgic/vgic-irqfd.c
@@ -9,6 +9,8 @@
#include <kvm/arm_vgic.h>
#include "vgic.h"
+#include <linux/irqchip/arm-gic-v5.h>
+
/*
* vgic_irqfd_set_irq: inject the IRQ corresponding to the
* irqchip routing entry
@@ -19,7 +21,12 @@ static int vgic_irqfd_set_irq(struct kvm_kernel_irq_routing_entry *e,
struct kvm *kvm, int irq_source_id,
int level, bool line_status)
{
- unsigned int spi_id = e->irqchip.pin + VGIC_NR_PRIVATE_IRQS;
+ unsigned int spi_id;
+
+ if (kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V5)
+ spi_id = vgic_v5_make_spi(e->irqchip.pin & GICV5_HWIRQ_ID);
+ else
+ spi_id = e->irqchip.pin + VGIC_NR_PRIVATE_IRQS;
if (!vgic_valid_spi(kvm, spi_id))
return -EINVAL;
@@ -39,6 +46,8 @@ int kvm_set_routing_entry(struct kvm *kvm,
struct kvm_kernel_irq_routing_entry *e,
const struct kvm_irq_routing_entry *ue)
{
+ unsigned int nr_pins = vgic_is_v5(kvm) ? kvm->arch.vgic.nr_spis :
+ KVM_IRQCHIP_NUM_PINS;
int r = -EINVAL;
switch (ue->type) {
@@ -46,7 +55,7 @@ int kvm_set_routing_entry(struct kvm *kvm,
e->set = vgic_irqfd_set_irq;
e->irqchip.irqchip = ue->u.irqchip.irqchip;
e->irqchip.pin = ue->u.irqchip.pin;
- if ((e->irqchip.pin >= KVM_IRQCHIP_NUM_PINS) ||
+ if ((e->irqchip.pin >= nr_pins) ||
(e->irqchip.irqchip >= KVM_NR_IRQCHIPS))
goto out;
break;
--
2.34.1
* [PATCH 34/43] KVM: arm64: gic-v5: Mask per-vcpu PPI state in vgic_v5_finalize_ppi_state()
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (32 preceding siblings ...)
2026-04-27 16:17 ` [PATCH 33/43] KVM: arm64: gic-v5: Add GICv5 SPI injection to irqfd Sascha Bischoff
@ 2026-04-27 16:17 ` Sascha Bischoff
2026-04-27 16:17 ` [PATCH 35/43] KVM: arm64: gic-v5: Add GICv5 EL1 sysreg userspace set/get interface Sascha Bischoff
` (8 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:17 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Only a subset of the possible PPIs are exposed to a guest when running
with a vGICv5. First of all, only the architected PPIs are considered
by KVM. Secondly, only a subset of those is exposed to a guest - those
corresponding to devices that KVM emulates (timers, PMU) and the GICv5
SW_PPI.
The finalisation of exposed PPIs happens on first vCPU run as this is
the first time when the full set of exposed devices is known. At this
stage a mask is calculated, and this mask is applied to both hide
non-exposed PPI state from the guest and to reduce overhead when
iterating over the PPIs.
As part of introducing support for userspace accesses to the GICv5
system registers, it has become apparent that userspace writes to the
GICv5 PPI registers can result in a mismatch between the state exposed
to the guest and what KVM expects to be exposed. Effectively,
userspace can set the Enable, Active, and Pending state of PPIs that
KVM has chosen to hide from a guest.
Under the assumption that on a VM restore userspace will set the PPI
state prior to running the vCPU(s) for the first time, rework
vgic_v5_finalize_ppi_state() to not only calculate the mask of exposed
PPIs, but also to clear any state for the non-exposed PPIs. This
ensures that only the state that KVM intends to expose to the guest is
exposed.
Note: If userspace chooses to set the state of PPI registers after
running a vCPU for the first time, then no masking takes place and
that state is directly exposed to a guest.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/arm.c | 2 +-
arch/arm64/kvm/vgic/vgic-v5.c | 71 +++++++++++++++++++++++++----------
include/kvm/arm_vgic.h | 2 +-
3 files changed, 53 insertions(+), 22 deletions(-)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 176cbe8baad30..28dfb6e079ee7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -953,7 +953,7 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
return ret;
}
- ret = vgic_v5_finalize_ppi_state(kvm);
+ ret = vgic_v5_finalize_ppi_state(vcpu);
if (ret)
return ret;
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index 38e37a03ac951..3e435a31b463e 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -915,9 +915,10 @@ void vgic_v5_teardown(struct kvm *kvm)
vgic_v5_release_vm_id(kvm);
}
-int vgic_v5_finalize_ppi_state(struct kvm *kvm)
+int vgic_v5_finalize_ppi_state(struct kvm_vcpu *vcpu)
{
- struct kvm_vcpu *vcpu0;
+ struct kvm *kvm = vcpu->kvm;
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
int i;
if (!vgic_is_v5(kvm))
@@ -926,35 +927,65 @@ int vgic_v5_finalize_ppi_state(struct kvm *kvm)
guard(mutex)(&kvm->arch.config_lock);
/*
- * If SW_PPI has been advertised, then we know we already
- * initialised the whole thing, and we can return early. Yes,
- * this is pretty hackish as far as state tracking goes...
+ * Discover the set of PPIs that are exposed to the guest once per VM.
+ * Once known, apply that mask to each VCPU's restored PPI state as the
+ * VCPUs are first run.
*/
- if (test_bit(GICV5_ARCH_PPI_SW_PPI, kvm->arch.vgic.gicv5_vm.vgic_ppi_mask))
- return 0;
-
- /* The PPI state for all VCPUs should be the same. Pick the first. */
- vcpu0 = kvm_get_vcpu(kvm, 0);
+ if (!test_bit(GICV5_ARCH_PPI_SW_PPI, kvm->arch.vgic.gicv5_vm.vgic_ppi_mask)) {
+ bitmap_zero(kvm->arch.vgic.gicv5_vm.vgic_ppi_mask,
+ VGIC_V5_NR_PRIVATE_IRQS);
+ bitmap_zero(kvm->arch.vgic.gicv5_vm.vgic_ppi_hmr,
+ VGIC_V5_NR_PRIVATE_IRQS);
+
+ for_each_set_bit(i, ppi_caps.impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS) {
+ const u32 intid = vgic_v5_make_ppi(i);
+ struct vgic_irq *irq;
+
+ irq = vgic_get_vcpu_irq(vcpu, intid);
+
+ /* Expose PPIs with an owner or the SW_PPI, only */
+ scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) {
+ if (irq->owner || i == GICV5_ARCH_PPI_SW_PPI) {
+ __set_bit(i, kvm->arch.vgic.gicv5_vm.vgic_ppi_mask);
+ __assign_bit(i, kvm->arch.vgic.gicv5_vm.vgic_ppi_hmr,
+ irq->config == VGIC_CONFIG_LEVEL);
+ }
+ }
- bitmap_zero(kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS);
- bitmap_zero(kvm->arch.vgic.gicv5_vm.vgic_ppi_hmr, VGIC_V5_NR_PRIVATE_IRQS);
+ vgic_put_irq(kvm, irq);
+ }
+ }
- for_each_set_bit(i, ppi_caps.impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS) {
+ /*
+ * Apply the mask to Enable, Active. Skip pending as that's calculated
+ * on guest entry.
+ */
+ bitmap_and(cpu_if->vgic_ppi_enabler, cpu_if->vgic_ppi_enabler,
+ kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS);
+ bitmap_and(cpu_if->vgic_ppi_activer, cpu_if->vgic_ppi_activer,
+ kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS);
+
+ /* Also update the vgic_irqs */
+ for (i = 0; i < VGIC_V5_NR_PRIVATE_IRQS; i++) {
+ bool visible = test_bit(i, kvm->arch.vgic.gicv5_vm.vgic_ppi_mask);
const u32 intid = vgic_v5_make_ppi(i);
struct vgic_irq *irq;
- irq = vgic_get_vcpu_irq(vcpu0, intid);
+ irq = vgic_get_vcpu_irq(vcpu, intid);
- /* Expose PPIs with an owner or the SW_PPI, only */
scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) {
- if (irq->owner || i == GICV5_ARCH_PPI_SW_PPI) {
- __set_bit(i, kvm->arch.vgic.gicv5_vm.vgic_ppi_mask);
- __assign_bit(i, kvm->arch.vgic.gicv5_vm.vgic_ppi_hmr,
- irq->config == VGIC_CONFIG_LEVEL);
+ if (!visible) {
+ irq->enabled = false;
+ irq->active = false;
+ irq->pending_latch = false;
+ irq->line_level = false;
+ } else {
+ irq->enabled = test_bit(i, cpu_if->vgic_ppi_enabler);
+ irq->active = test_bit(i, cpu_if->vgic_ppi_activer);
}
}
- vgic_put_irq(vcpu0->kvm, irq);
+ vgic_put_irq(kvm, irq);
}
return 0;
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index a47e5c742aac8..778ecc761fe79 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -752,7 +752,7 @@ int vgic_v4_load(struct kvm_vcpu *vcpu);
void vgic_v4_commit(struct kvm_vcpu *vcpu);
int vgic_v4_put(struct kvm_vcpu *vcpu);
-int vgic_v5_finalize_ppi_state(struct kvm *kvm);
+int vgic_v5_finalize_ppi_state(struct kvm_vcpu *vcpu);
bool vgic_v5_ppi_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq,
unsigned long flags);
void vgic_v5_set_ppi_dvi(struct kvm_vcpu *vcpu, struct vgic_irq *irq, bool dvi);
--
2.34.1
* [PATCH 35/43] KVM: arm64: gic-v5: Add GICv5 EL1 sysreg userspace set/get interface
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (33 preceding siblings ...)
2026-04-27 16:17 ` [PATCH 34/43] KVM: arm64: gic-v5: Mask per-vcpu PPI state in vgic_v5_finalize_ppi_state() Sascha Bischoff
@ 2026-04-27 16:17 ` Sascha Bischoff
2026-04-27 16:18 ` [PATCH 36/43] KVM: arm64: gic-v5: Implement save/restore mechanisms for ISTs Sascha Bischoff
` (7 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:17 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Now that KVM is able to run meaningful VMs with GICv5, it must be
possible to save/restore the GICv5 state in order to support VM
migration.
Add functions to handle the set/get for GICv5 EL1 system registers to
facilitate the save/restore of these. These access the stored
hypervisor state for the guest, rather than the guest's registers
themselves. Much of the state that is read out is generated at this
point as it is stored across a range of registers. When writing the
system registers, the state is merged back into the appropriate
places.
The save/restore mechanism introduced here matches that already
present for GICv3 - there are no changes required to that interface,
so we are able to reuse the existing UAPI for GICv5.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/Makefile | 3 +-
arch/arm64/kvm/sys_regs.c | 6 +-
arch/arm64/kvm/vgic-sys-reg-v5.c | 510 +++++++++++++++++++++++++++++++
arch/arm64/kvm/vgic/vgic.h | 5 +
4 files changed, 522 insertions(+), 2 deletions(-)
create mode 100644 arch/arm64/kvm/vgic-sys-reg-v5.c
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 92dda57c08766..7aaeeb84e788e 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -24,7 +24,8 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
vgic/vgic-its.o vgic/vgic-debug.o vgic/vgic-v3-nested.o \
- vgic/vgic-v5.o vgic/vgic-v5-tables.o vgic/vgic-irs-v5.o
+ vgic/vgic-v5.o vgic/vgic-v5-tables.o vgic/vgic-irs-v5.o \
+ vgic-sys-reg-v5.o
kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 6083a1b23dbf9..af0d8357003be 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -5831,7 +5831,7 @@ int kvm_finalize_sys_regs(struct kvm_vcpu *vcpu)
int __init kvm_sys_reg_table_init(void)
{
- const struct sys_reg_desc *gicv3_regs;
+ const struct sys_reg_desc *gicv3_regs, *gicv5_regs;
bool valid = true;
unsigned int i, sz;
int ret = 0;
@@ -5844,8 +5844,12 @@ int __init kvm_sys_reg_table_init(void)
valid &= check_sysreg_table(cp15_64_regs, ARRAY_SIZE(cp15_64_regs), false);
valid &= check_sysreg_table(sys_insn_descs, ARRAY_SIZE(sys_insn_descs), false);
+ /* The GICv3 system registers... */
gicv3_regs = vgic_v3_get_sysreg_table(&sz);
valid &= check_sysreg_table(gicv3_regs, sz, false);
+ /* ...and the GICv5 system registers. */
+ gicv5_regs = vgic_v5_get_sysreg_table(&sz);
+ valid &= check_sysreg_table(gicv5_regs, sz, false);
if (!valid)
return -EINVAL;
diff --git a/arch/arm64/kvm/vgic-sys-reg-v5.c b/arch/arm64/kvm/vgic-sys-reg-v5.c
new file mode 100644
index 0000000000000..7623edd01dce5
--- /dev/null
+++ b/arch/arm64/kvm/vgic-sys-reg-v5.c
@@ -0,0 +1,510 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025, 2026 Arm Ltd.
+ */
+
+/*
+ * VGICv5 system registers handling functions for AArch64 mode
+ */
+
+#include <linux/irqchip/arm-gic-v5.h>
+
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+
+#include <asm/kvm_emulate.h>
+
+#include "vgic/vgic.h"
+#include "sys_regs.h"
+
+static int set_gic_apr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
+ u64 val)
+{
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+
+ /* The upper 32 bits are RES0 */
+ cpu_if->vgic_apr = val & ~ICC_APR_EL1_RES0;
+
+ return 0;
+}
+
+static int get_gic_apr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
+ u64 *val)
+{
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+
+ *val = cpu_if->vgic_apr;
+
+ return 0;
+}
+
+static int set_gic_cr0(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
+ u64 val)
+{
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+
+ /*
+ * We only support setting the ICC_CR0_EL1.En bit, which is actually
+ * stored in the VMCR.
+ */
+ FIELD_MODIFY(FEAT_GCIE_ICH_VMCR_EL2_EN, &cpu_if->vgic_vmcr,
+ FIELD_GET(ICC_CR0_EL1_EN, val));
+
+ return 0;
+}
+
+static int get_gic_cr0(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
+ u64 *val)
+{
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+
+ /*
+ * PID only applies if EL3 is present. Same applies to IPPT. Hence,
+ * those fields are always presented as 0.
+ *
+ * We always present the link as connected and idle:
+ * (LINK = 1, LINK_IDLE = 1).
+ */
+ *val = FIELD_PREP(ICC_CR0_EL1_EN,
+ FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_EN, cpu_if->vgic_vmcr));
+ *val |= ICC_CR0_EL1_LINK_MASK;
+ *val |= ICC_CR0_EL1_LINK_IDLE_MASK;
+
+ return 0;
+}
+
+static int set_gic_pcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
+ u64 val)
+{
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+
+ /* Set the VPMR field in the VMCR */
+ FIELD_MODIFY(FEAT_GCIE_ICH_VMCR_EL2_VPMR, &cpu_if->vgic_vmcr,
+ FIELD_GET(ICC_PCR_EL1_PRIORITY, val));
+
+ return 0;
+}
+
+static int get_gic_pcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
+ u64 *val)
+{
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+
+ *val = FIELD_PREP(ICC_PCR_EL1_PRIORITY,
+ FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_VPMR, cpu_if->vgic_vmcr));
+
+ return 0;
+}
+
+static int set_gic_icsr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
+ u64 val)
+{
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+
+ cpu_if->vgic_icsr = val & ~ICC_ICSR_EL1_RES0;
+
+ return 0;
+}
+
+static int get_gic_icsr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
+ u64 *val)
+{
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+
+ *val = cpu_if->vgic_icsr;
+
+ return 0;
+}
+
+/*
+ * Helper macro to iterate over a range of PPIs and execute some code (to either
+ * extract or set the vgic_irq state). This is used when `get`-ing the PPI
+ * ENABLER, ACTIVER, PENDR and when setting the PRIORITYR state.
+ *
+ * vcpu: Pointer to struct kvm_vcpu (to which these PPIs belong)
+ * r: The register index. 0 or 1 for all except PRIORITYR (which is 0-15)
+ * nr: The number of PPIs iterated over. 64 for all but PRIORITYR (which is 8)
+ * code: The code snippet to execute for each vgic_irq
+ */
+#define for_ppi_state(vcpu, r, nr, code) \
+ ({ \
+ for (int i = 0; i < nr; i++) { \
+ u32 id = vgic_v5_make_ppi(r * nr + i); \
+ struct vgic_irq *irq; \
+ \
+ irq = vgic_get_vcpu_irq(vcpu, id); \
+ scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) { \
+ code; \
+ } \
+ vgic_put_irq(vcpu->kvm, irq); \
+ } \
+ })
+
+static int set_gic_ppi_enabler(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *r, u64 val)
+{
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+ int i, start, end, reg = r->Op2 % 2;
+
+ /*
+ * If we're only handling architected PPIs and userspace writes to the
+ * enable for the non-architected PPIs, we just return as there's
+ * nothing to do at all. We don't even allocate the storage for them in
+ * this case.
+ */
+ if (VGIC_V5_NR_PRIVATE_IRQS == 64 && reg == 1)
+ return 0;
+
+ /*
+ * Merge the raw guest write into our bitmap at an offset of either 0 or
+ * 64.
+ *
+ * Note that there is *NO* masking applied - the enable state is written
+ * unfiltered. The assumption is that userspace uses this interface to
+ * set initial state before the guest runs, and then the exposed PPI
+ * mask is applied later, when vgic_v5_finalize_ppi_state() runs on
+ * first entry to each vCPU. If userspace chooses to set the enabler
+ * state later, it is fully capable of breaking the illusion we provided
+ * to the guest by exposing register state (and PPIs) to the guest that
+ * were not initially exposed. Good luck!
+ */
+ bitmap_write(cpu_if->vgic_ppi_enabler, val, 64 * reg, 64);
+
+ /*
+ * Sync the change in enable states to the vgic_irqs for the written
+ * register slice.
+ */
+ start = VGIC_V5_NR_PRIVATE_IRQS * reg;
+ end = start + VGIC_V5_NR_PRIVATE_IRQS;
+ for (i = start; i < end; i++) {
+ u32 intid = vgic_v5_make_ppi(i);
+ struct vgic_irq *irq;
+
+ irq = vgic_get_vcpu_irq(vcpu, intid);
+
+ scoped_guard(raw_spinlock_irqsave, &irq->irq_lock)
+ irq->enabled = test_bit(i, cpu_if->vgic_ppi_enabler);
+
+ vgic_put_irq(vcpu->kvm, irq);
+ }
+
+ return 0;
+}
+
+static int get_gic_ppi_enabler(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *r, u64 *val)
+{
+ unsigned long enabler = 0;
+ int reg = r->Op2 % 2;
+
+ /* If we only support architected PPIs, return 0 */
+ if (VGIC_V5_NR_PRIVATE_IRQS == 64 && reg == 1) {
+ *val = 0;
+ return 0;
+ }
+
+ /* Iterate over each struct vgic_irq to build the ENABLER value. */
+ for_ppi_state(vcpu, reg, 64, __assign_bit(i % 64, &enabler, irq->enabled));
+
+ *val = enabler;
+
+ return 0;
+}
+
+static int set_gic_ppi_activer(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *r, u64 val)
+{
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+ int i, start, end, reg = r->Op2 % 2;
+
+ if (VGIC_V5_NR_PRIVATE_IRQS == 64 && reg == 1)
+ return 0;
+
+ /*
+ * Store the raw guest write. The exposed PPI mask is applied later,
+ * when vgic_v5_finalize_ppi_state() runs on first entry to each
+ * vCPU. See comment on set_gic_ppi_enabler() for details.
+ */
+ bitmap_write(cpu_if->vgic_ppi_activer, val, 64 * reg, 64);
+
+ start = VGIC_V5_NR_PRIVATE_IRQS * reg;
+ end = start + VGIC_V5_NR_PRIVATE_IRQS;
+ for (i = start; i < end; i++) {
+ u32 intid = vgic_v5_make_ppi(i);
+ struct vgic_irq *irq;
+
+ irq = vgic_get_vcpu_irq(vcpu, intid);
+
+ scoped_guard(raw_spinlock_irqsave, &irq->irq_lock)
+ irq->active = test_bit(i, cpu_if->vgic_ppi_activer);
+
+ vgic_put_irq(vcpu->kvm, irq);
+ }
+
+ return 0;
+}
+
+static int get_gic_ppi_activer(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *r, u64 *val)
+{
+ unsigned long activer = 0;
+ int reg = r->Op2 % 2;
+
+ /* If we only support architected PPIs, return 0 */
+ if (VGIC_V5_NR_PRIVATE_IRQS == 64 && reg == 1) {
+ *val = 0;
+ return 0;
+ }
+
+ /* Iterate over each struct vgic_irq to build the ACTIVER value. */
+ for_ppi_state(vcpu, reg, 64, __assign_bit(i % 64, &activer, irq->active));
+
+ *val = activer;
+
+ return 0;
+}
+
+static int set_gic_ppi_pendr(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *r, u64 val)
+{
+ int i, start, end, reg = r->Op2 % 2;
+
+ /* If we only support architected PPIs, return */
+ if (VGIC_V5_NR_PRIVATE_IRQS == 64 && reg == 1)
+ return 0;
+
+ /*
+ * Update each struct vgic_irq with the pending state, treating Level
+ * and Edge interrupts differently. The exposed PPI mask is applied
+ * later, when vgic_v5_finalize_ppi_state() runs on first entry to each
+ * vCPU. See comment on set_gic_ppi_enabler() for details.
+ */
+ start = VGIC_V5_NR_PRIVATE_IRQS * reg;
+ end = start + VGIC_V5_NR_PRIVATE_IRQS;
+ for (i = start; i < end; i++) {
+ u32 intid = vgic_v5_make_ppi(i);
+ struct vgic_irq *irq;
+
+ irq = vgic_get_vcpu_irq(vcpu, intid);
+
+ scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) {
+ bool level = !!(val & BIT_ULL(i));
+
+ if (irq->config == VGIC_CONFIG_LEVEL)
+ irq->line_level = level;
+ else
+ irq->pending_latch = level;
+ }
+
+ vgic_put_irq(vcpu->kvm, irq);
+ }
+
+ /*
+ * The pending state is generated from the vgic_irqs on each guest
+ * entry. Therefore, we don't store the raw value written anywhere in
+ * the case of userspace PPI_PENDRx_EL1 writes.
+ */
+
+ return 0;
+}
+
+static int get_gic_ppi_pendr(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *r, u64 *val)
+{
+ unsigned long pendr = 0;
+ int reg = r->Op2 % 2;
+
+ /* If we only support architected PPIs, return 0 */
+ if (VGIC_V5_NR_PRIVATE_IRQS == 64 && reg == 1) {
+ *val = 0;
+ return 0;
+ }
+
+ /* Iterate over each struct vgic_irq to build the PENDR value. */
+ for_ppi_state(vcpu, reg, 64,
+ if (irq_is_pending(irq))
+ __assign_bit(i % 64, &pendr, 1);
+ );
+
+ *val = pendr;
+
+ return 0;
+}
+
+static int set_gic_ppi_priorityr(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *r, u64 val)
+{
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+ int reg = ((r->CRm & 0x1) << 3) + r->Op2;
+
+ /* If we only support architected PPIs, return */
+ if (VGIC_V5_NR_PRIVATE_IRQS == 64 && reg > 7)
+ return 0;
+
+ /*
+ * Although priorities are not regularly synced back to the vgic_irq
+ * state, they are explicitly synced back here. This is to ensure that
+ * any pending PPIs are evaluated correctly when first running the guest
+ * after setting the state.
+ */
+ for_ppi_state(vcpu, reg, 8,
+ irq->priority = (val >> 8 * i) & ICC_PPI_PRIORITYRx_EL1_Priority0;
+ );
+
+ /*
+ * Update the state that will be written to the ICH_PPI_PRIORITYRx_EL2
+ * on next guest entry.
+ */
+ cpu_if->vgic_ppi_priorityr[reg] = val;
+
+ return 0;
+}
+
+static int get_gic_ppi_priorityr(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *r, u64 *val)
+{
+ struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+ int reg = ((r->CRm & 0x1) << 3) + r->Op2;
+
+ /* If we only support architected PPIs, return 0 */
+ if (VGIC_V5_NR_PRIVATE_IRQS == 64 && reg > 7) {
+ *val = 0;
+ return 0;
+ }
+
+ /*
+ * The priorities are only synced back to the vgic_irq state when the
+ * vcpu is entering WFI (KVM only needs to know the priorities when
+ * evaluating if there are pending PPI interrupts for a vcpu). The raw
+ * register ICH_PPI_PRIORITYRx_EL1 state is simply saved and restored
+ * blindly. This state is just returned as it contains the most recent
+ * priorities written by the guest.
+ */
+ *val = cpu_if->vgic_ppi_priorityr[reg];
+
+ return 0;
+}
+
+/*
+ * The following registers are NOT supported:
+ *
+ * - ICC_HAPR_EL1
+ * The value of this is directly generated by the GICv5 hardware based on
+ * the ICC_APR_EL1 when the guest is running.
+ * - ICC_IAFFIDR_EL1
+ * The IAFFID for a GICv5 VPE is the same as the VPE ID, which is the index
+ * into the in-memory VPE Table. This is not configurable, and instead we
+ * rely on userspace recreating the VPEs in the same order prior to
+ * restoring guest state.
+ * - ICC_PPI_CACTIVER<n>_EL1
+ * Only raw state writes are supported via the S(et) variant.
+ * - ICC_PPI_CPENDR<n>_EL1
+ * Only raw state writes are supported via the S(et) variant.
+ */
+static const struct sys_reg_desc gic_v5_icc_reg_descs[] = {
+ { SYS_DESC(SYS_ICC_ICSR_EL1),
+ .set_user = set_gic_icsr, .get_user = get_gic_icsr, },
+ { SYS_DESC(SYS_ICC_PPI_ENABLER0_EL1),
+ .set_user = set_gic_ppi_enabler, .get_user = get_gic_ppi_enabler, },
+ { SYS_DESC(SYS_ICC_PPI_ENABLER1_EL1),
+ .set_user = set_gic_ppi_enabler, .get_user = get_gic_ppi_enabler, },
+ /*
+ * Only ICC_PPI_SACTIVER<n>_EL1 is exposed to the guest. Userspace
+ * writes are treated as *RAW* register state writes.
+ */
+ { SYS_DESC(SYS_ICC_PPI_SACTIVER0_EL1),
+ .set_user = set_gic_ppi_activer, .get_user = get_gic_ppi_activer, },
+ { SYS_DESC(SYS_ICC_PPI_SACTIVER1_EL1),
+ .set_user = set_gic_ppi_activer, .get_user = get_gic_ppi_activer, },
+ /*
+ * Only ICC_PPI_SPENDR<n>_EL1 is exposed to the guest. Userspace
+ * writes are treated as *RAW* register state writes.
+ */
+ { SYS_DESC(SYS_ICC_PPI_SPENDR0_EL1),
+ .set_user = set_gic_ppi_pendr, .get_user = get_gic_ppi_pendr, },
+ { SYS_DESC(SYS_ICC_PPI_SPENDR1_EL1),
+ .set_user = set_gic_ppi_pendr, .get_user = get_gic_ppi_pendr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR0_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR1_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR2_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR3_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR4_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR5_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR6_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR7_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR8_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR9_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR10_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR11_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR12_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR13_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR14_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_PPI_PRIORITYR15_EL1),
+ .set_user = set_gic_ppi_priorityr, .get_user = get_gic_ppi_priorityr, },
+ { SYS_DESC(SYS_ICC_APR_EL1),
+ .set_user = set_gic_apr, .get_user = get_gic_apr, },
+ { SYS_DESC(SYS_ICC_CR0_EL1),
+ .set_user = set_gic_cr0, .get_user = get_gic_cr0, },
+ { SYS_DESC(SYS_ICC_PCR_EL1),
+ .set_user = set_gic_pcr, .get_user = get_gic_pcr, },
+};
+
+const struct sys_reg_desc *vgic_v5_get_sysreg_table(unsigned int *sz)
+{
+ *sz = ARRAY_SIZE(gic_v5_icc_reg_descs);
+ return gic_v5_icc_reg_descs;
+}
+
+static u64 attr_to_id(u64 attr)
+{
+ return ARM64_SYS_REG(FIELD_GET(KVM_REG_ARM_VGIC_SYSREG_OP0_MASK, attr),
+ FIELD_GET(KVM_REG_ARM_VGIC_SYSREG_OP1_MASK, attr),
+ FIELD_GET(KVM_REG_ARM_VGIC_SYSREG_CRN_MASK, attr),
+ FIELD_GET(KVM_REG_ARM_VGIC_SYSREG_CRM_MASK, attr),
+ FIELD_GET(KVM_REG_ARM_VGIC_SYSREG_OP2_MASK, attr));
+}
+
+int vgic_v5_has_cpu_sysregs_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+ const struct sys_reg_desc *r;
+
+ r = get_reg_by_id(attr_to_id(attr->attr), gic_v5_icc_reg_descs,
+ ARRAY_SIZE(gic_v5_icc_reg_descs));
+
+ if (r && !sysreg_hidden(vcpu, r))
+ return 0;
+
+ return -ENXIO;
+}
+
+int vgic_v5_cpu_sysregs_uaccess(struct kvm_vcpu *vcpu,
+ struct kvm_device_attr *attr,
+ bool is_write)
+{
+ struct kvm_one_reg reg = {
+ .id = attr_to_id(attr->attr),
+ .addr = attr->addr,
+ };
+
+ if (is_write)
+ return kvm_sys_reg_set_user(vcpu, ®, gic_v5_icc_reg_descs,
+ ARRAY_SIZE(gic_v5_icc_reg_descs));
+ else
+ return kvm_sys_reg_get_user(vcpu, ®, gic_v5_icc_reg_descs,
+ ARRAY_SIZE(gic_v5_icc_reg_descs));
+}
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index b5036170430dd..36604f911e089 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -383,6 +383,11 @@ void vgic_v5_restore_state(struct kvm_vcpu *vcpu);
void vgic_v5_save_state(struct kvm_vcpu *vcpu);
int vgic_v5_register_irs_iodev(struct kvm *kvm, gpa_t irs_base_address);
+int vgic_v5_cpu_sysregs_uaccess(struct kvm_vcpu *vcpu,
+ struct kvm_device_attr *attr, bool is_write);
+int vgic_v5_has_cpu_sysregs_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
+const struct sys_reg_desc *vgic_v5_get_sysreg_table(unsigned int *sz);
+
#define for_each_visible_v5_ppi(__i, __k) \
for_each_set_bit(__i, (__k)->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS)
--
2.34.1
* [PATCH 36/43] KVM: arm64: gic-v5: Implement save/restore mechanisms for ISTs
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (34 preceding siblings ...)
2026-04-27 16:17 ` [PATCH 35/43] KVM: arm64: gic-v5: Add GICv5 EL1 sysreg userspace set/get interface Sascha Bischoff
@ 2026-04-27 16:18 ` Sascha Bischoff
2026-05-01 18:54 ` Vladimir Murzin
2026-04-27 16:18 ` [PATCH 37/43] KVM: arm64: gic-v5: Handle userspace accesses to IRS MMIO region Sascha Bischoff
` (6 subsequent siblings)
42 siblings, 1 reply; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:18 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
When running a GICv5 VM, there are up to two ISTs that must be saved
or restored when migrating a VM.
The SPI IST is allocated by the hypervisor, as the guest presumes the
memory for the SPI state is allocated by the hardware. The LPI IST, on
the other hand, is allocated by the guest in the event that it wishes
to use LPIs. We shadow the guest's LPI IST in KVM, and therefore the
guest's memory is never directly used by the GICv5 hardware. Hence, in
both cases, the in-use ISTs are allocated by the hypervisor.
As there is no guest-allocated memory for the SPI IST, the state of
this must be saved by the VMM. Therefore, the VMM must provide a
memory buffer large enough to store/restore the SPI IST (32-bits per
SPI).
The LPI IST, if present, is stored into guest memory as the guest has
already allocated storage under the assumption that it would be used
by the GIC. Each IST Entry is written back to guest memory (skipping
metadata sections) on a save, or restored from guest memory on a
restore. The guest is only allowed to create a linear IST, so there's
a sufficiently large region of memory that is contiguous in GPA space.
On a save, the VM itself is quiesced using IRS_SAVE_VMR - this ensures
that the hardware has written all interrupt state back to the
ISTs. Following the save operation, the IRS_SAVE_VM_STATUSR is checked
to ensure that the guest has remained quiescent. In the event that it
has not, an error is propagated back to the VMM such that it can retry
the save.
On restore, the VM is first made invalid - it is not allowed to write
to any of the tables while they are valid - and then the SPI and LPI
ISTs are restored (if required) before making the VM valid again. As
part of restoring the ISTs, any pending interrupts are tracked, and
IST pending state is cleared. Once the VM is made valid, these valid
interrupts are made pending again via the GIC VDPEND system
instruction.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-v5-tables.c | 564 ++++++++++++++++++++++++++-
arch/arm64/kvm/vgic/vgic-v5-tables.h | 38 ++
arch/arm64/kvm/vgic/vgic-v5.c | 183 +++++++++
arch/arm64/kvm/vgic/vgic.h | 2 +
include/linux/irqchip/arm-gic-v5.h | 7 +
5 files changed, 791 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c b/arch/arm64/kvm/vgic/vgic-v5-tables.c
index 77fc5fb27f30d..8e909100485bf 100644
--- a/arch/arm64/kvm/vgic/vgic-v5-tables.c
+++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
@@ -431,6 +431,13 @@ int vgic_v5_vmte_init(struct kvm *kvm)
if (ret)
goto out_fail;
+ /*
+ * If we are restoring the state of a guest, we need to re-inject any
+ * IRQs which were pending when the state of the guest was originally
+ * saved. We use the pending_irqs list for this.
+ */
+ INIT_LIST_HEAD(&vmi->pending_irqs);
+
/* Allocate and assign the VM Descriptor, if required. */
if (vmt_info->vmd_size != 0) {
vmd = kzalloc(vmt_info->vmd_size, GFP_KERNEL);
@@ -547,9 +554,6 @@ int vgic_v5_vmte_release(struct kvm *kvm)
if (WARN_ON_ONCE(!vmi))
goto no_vmi;
- kfree(vmi->vmd_base);
- kfree(vmi->vpet_base);
-
/* If we have an LPI IST, free it */
if (vmi->h_lpi_ist)
ret = vgic_v5_lpi_ist_free(kvm);
@@ -562,6 +566,19 @@ int vgic_v5_vmte_release(struct kvm *kvm)
if (ret)
return ret;
+ kfree(vmi->vmd_base);
+ kfree(vmi->vpet_base);
+
+ /* Unlikely, but possible. Avoid leaking the memory. */
+ if (!list_empty(&vmi->pending_irqs)) {
+ struct pending_irq *pirq, *tmp;
+
+ list_for_each_entry_safe(pirq, tmp, &vmi->pending_irqs, next) {
+ list_del(&pirq->next);
+ kfree(pirq);
+ }
+ }
+
xa_erase(&vm_info, vm_id);
kfree(vmi);
@@ -1191,6 +1208,7 @@ int vgic_v5_lpi_ist_alloc(struct kvm *kvm, unsigned int id_bits)
return ret;
}
+
/* Free the LPI IST again */
int vgic_v5_lpi_ist_free(struct kvm *kvm)
{
@@ -1206,3 +1224,543 @@ int vgic_v5_lpi_ist_free(struct kvm *kvm)
else
return vgic_v5_two_level_ist_free(kvm, false);
}
+
+/*
+ * Save the SPI IST to userspace-provided memory.
+ *
+ * Userspace should have provided us with an appropriately sized buffer
+ * that we can dump the SPI IST to. We only need to write out the architected
+ * 32-bits of the IST, and can skip any and all metadata as that is
+ * implementation specific.
+ *
+ * We only ever allocate linear ISTs for SPIs, so we stride through the IST on
+ * the host (taking metadata into account, i.e., skipping it) and write the
+ * lower 32-bits of each ISTE to the host provided buffer.
+ */
+int vgic_v5_save_spi_ist(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+ u32 __user *uaddr = (u32 __user *)(unsigned long)attr->addr;
+ unsigned int host_id_bits, host_istsz, host_l2sz;
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+ struct vmtl2_entry *vmte;
+ void *host_ist_base;
+ __le32 h_iste;
+ __le64 tmp;
+ int ret;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return -ENXIO;
+
+ host_ist_base = vmi->h_spi_ist;
+
+ /* We don't have SPIs, but userspace is trying to save them. */
+ if (!host_ist_base && attr->addr)
+ return -ENOENT;
+
+ /* We have SPIs but userspace isn't trying to save them. */
+ if (host_ist_base && !attr->addr)
+ return -EINVAL;
+
+ /* No SPIs and no userspace buffer: nothing to do. */
+ if (!host_ist_base && !attr->addr)
+ return 0;
+
+ ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
+ if (ret)
+ return ret;
+
+ tmp = le64_to_cpu(READ_ONCE(vmte->val[3]));
+ host_id_bits = FIELD_GET(GICV5_VMTEL2E_IST_ID_BITS, tmp);
+ host_istsz = FIELD_GET(GICV5_VMTEL2E_IST_ISTSZ, tmp);
+ host_l2sz = FIELD_GET(GICV5_VMTEL2E_IST_L2SZ, tmp);
+
+ /* We always use a Linear SPI IST on the host */
+ for (int i = 0; i < BIT(host_id_bits); ++i) {
+ /*
+ * We're explicitly using a void pointer here, and reinterpret
+ * each entry as __le32 as we only care about the lower 32 bits of the
+ * entry, and not the metadata if present. This lets us stride
+ * through the IST while skipping the metadata.
+ */
+ __le32 *h_iste_addr = host_ist_base + i * BIT(host_istsz + 2);
+
+ h_iste = READ_ONCE(*h_iste_addr);
+ ret = put_user(h_iste, uaddr);
+ if (ret)
+ return ret;
+
+ uaddr++;
+ }
+
+ return ret;
+}
+
+/*
+ * Save the LPI IST to guest memory
+ *
+ * When a guest is using LPIs, it has allocated memory for the LPI IST. We don't
+ * let the IRS directly use that memory, and instead reallocate the IST
+ * on the host. However, we're able to use the memory that the guest has
+ * allocated to save the LPI IST. There should be sufficient storage there, and
+ * if the guest hasn't done things properly, then that's on the guest - there's
+ * nothing we can do.
+ *
+ * We only store the lower 32-bits of each host ISTE as the upper bits contain
+ * the metadata, which needs to be explicitly zeroed on restore anyhow.
+ *
+ * This is a bit more complex than for the SPIs. We intentionally don't tell the
+ * guest that it is allowed to create two-level ISTs, so it should have created
+ * a linear IST for LPIs. This means that we have a contiguous range in GPA
+ * space that we can iterate over when writing. HOWEVER, we (KVM) have the
+ * option of allocating a linear IST or a two-level IST. Hence, iteration is a
+ * little more complex.
+ */
+int vgic_v5_save_lpi_ist(struct kvm *kvm)
+{
+ unsigned int host_id_bits, host_istsz, host_l2sz;
+ size_t n, l2bits, h_l1_index, h_l2_index;
+ int ret, h_l1_entries, h_l2_entries;
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+ struct vmtl2_entry *vmte;
+ void *h_l2_ist_base;
+ void *host_ist_base;
+ gpa_t g_entry_addr;
+ __le32 h_iste;
+ __le64 tmp;
+
+ ret = vgic_v5_check_vm_id(vm_id);
+ if (ret)
+ return ret;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return -ENXIO;
+
+ ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
+ if (ret)
+ return ret;
+
+ /* If there is no IST to save, return without error */
+ if (!kvm->arch.vgic.vgic_v5_irs_data->ist_baser.valid &&
+ !FIELD_GET(GICV5_VMTEL2E_VALID, vmte->val[2])) {
+ return 0;
+ }
+
+ /* Host says an LPI IST exists, but we have no backing object. */
+ if (FIELD_GET(GICV5_VMTEL2E_IST_VALID, vmte->val[2]) && !vmi->h_lpi_ist)
+ return -ENXIO;
+
+ if (vmi->h_lpi_ist_structure && !vmi->h_lpi_l2_ists)
+ return -ENXIO;
+
+ /*
+ * Assumption: the guest IST is Linear. This gives us a simple way to iterate
+ * over the guest's memory.
+ *
+ * Get the base address of the IST in GPA space.
+ */
+ g_entry_addr = kvm->arch.vgic.vgic_v5_irs_data->ist_baser.addr;
+
+ tmp = le64_to_cpu(READ_ONCE(vmte->val[2]));
+ host_id_bits = FIELD_GET(GICV5_VMTEL2E_IST_ID_BITS, tmp);
+ host_istsz = FIELD_GET(GICV5_VMTEL2E_IST_ISTSZ, tmp);
+ host_l2sz = FIELD_GET(GICV5_VMTEL2E_IST_L2SZ, tmp);
+
+ /* Linear IST on the host - the simple case */
+ if (!vmi->h_lpi_ist_structure) {
+ h_l2_entries = BIT(host_id_bits);
+ host_ist_base = vmi->h_lpi_ist;
+
+ for (h_l2_index = 0; h_l2_index < h_l2_entries; ++h_l2_index) {
+ __le32 *h_iste_addr = host_ist_base + h_l2_index * BIT(host_istsz + 2);
+
+ h_iste = *h_iste_addr;
+
+ ret = vgic_write_guest_lock(kvm, g_entry_addr, &h_iste, sizeof(h_iste));
+ if (ret)
+ return ret;
+
+ /* Advance to the next guest entry */
+ g_entry_addr += sizeof(h_iste);
+ }
+ } else {
+ /* And the two level case */
+ n = max(2, host_id_bits - ((10 - host_istsz) + (2 * host_l2sz)) + 3 - 1);
+ l2bits = (10 - host_istsz) + (2 * host_l2sz);
+ h_l1_entries = BIT(n + 1) / GICV5_IRS_ISTL1E_SIZE;
+ h_l2_entries = BIT(l2bits);
+
+ /* For each L1 ISTE */
+ for (h_l1_index = 0; h_l1_index < h_l1_entries; ++h_l1_index) {
+ /*
+ * We don't do dynamic L2 IST allocation for guest ISTs
+ * - all of the memory is provisioned up-front to
+ * simplify the process. If we encounter an invalid L1
+ * ISTE things have gone wrong!
+ */
+ if (!FIELD_GET(GICV5_ISTL1E_VALID, vmi->h_lpi_ist[h_l1_index]))
+ return -ENXIO;
+
+ /* If valid, process the L2 table. For each L2 ISTE. */
+ for (h_l2_index = 0; h_l2_index < h_l2_entries; ++h_l2_index) {
+
+ h_l2_ist_base = vmi->h_lpi_l2_ists[h_l1_index];
+ if (!h_l2_ist_base)
+ return -ENXIO;
+
+ h_iste = *(__le32 *)(h_l2_ist_base +
+ h_l2_index *
+ BIT(2 + host_l2sz));
+
+ ret = vgic_write_guest_lock(kvm, g_entry_addr,
+ &h_iste, sizeof(h_iste));
+ if (ret)
+ return ret;
+
+ /* Advance to the next guest entry */
+ g_entry_addr += sizeof(__le32);
+ }
+ }
+ }
+
+ return 0;
+}
+
+/*
+ * Track any SPIs and LPIs which were marked as pending at the point where the
+ * IST was restored.
+ *
+ * Append any previously pending IRQs to the pending list as we need to mark
+ * them as non-pending when restoring the ISTs. These are then reinjected
+ * using VDPEND prior to running the guest for the first time.
+ */
+static int vgic_v5_track_pending_irq(struct list_head *pending_irqs, u32 intid,
+ u32 type)
+{
+ struct pending_irq *pirq;
+
+ pirq = kzalloc_obj(*pirq, GFP_KERNEL);
+ if (pirq == NULL)
+ return -ENOMEM;
+
+ /* Make it into a proper GICv5 IntID */
+ pirq->irq = FIELD_PREP(GICV5_HWIRQ_TYPE, type) |
+ FIELD_PREP(GICV5_HWIRQ_ID, intid);
+
+ INIT_LIST_HEAD(&pirq->next);
+ list_add_tail(&pirq->next, pending_irqs);
+
+ return 0;
+}
+
+/*
+ * Process and sanitise each restored ISTE.
+ *
+ * When restoring the ISTs, each ISTE needs to be processed. The HWU field needs
+ * to be explicitly zeroed - it is for hardware usage, and we might well be on
+ * different hardware now, which may use the field differently.
+ *
+ * If interrupts are marked as pending on restore, then they need to be tracked
+ * as such, and the pending state cleared. The alternative would be that the
+ * hardware needs to iterate over the whole IST post restoring, but this way is
+ * cleaner and ensures that everything is tracked correctly. The pending state
+ * for each interrupt is restored prior to running the guest for the first time.
+ */
+static int vgic_v5_process_iste(__le32 *iste, struct list_head *pending_irqs,
+ u32 intid, u32 type)
+{
+ u32 tmp = le32_to_cpu(READ_ONCE(*iste));
+ int ret = 0;
+
+ /* Clean up the ISTE - Zero the HWU field. */
+ tmp &= ~GICV5_ISTL2E_HWU;
+
+ if (FIELD_GET(GICV5_ISTL2E_PENDING, tmp)) {
+ ret = vgic_v5_track_pending_irq(pending_irqs, intid, type);
+ if (ret)
+ return ret;
+
+ /* Now that we've tracked it, clear the pending state */
+ tmp &= ~GICV5_ISTL2E_PENDING;
+ }
+
+ WRITE_ONCE(*iste, cpu_to_le32(tmp));
+
+ return ret;
+}
+
+/*
+ * Restore the SPI IST from userspace-provided buffer to the host-allocated IST.
+ *
+ * The SPI IST has previously been saved to userspace-provided memory. Now,
+ * userspace has provided us with a buffer containing the SPI IST to restore. We
+ * need to iterate over this, and restore it to the linear SPI IST allocated by
+ * the host.
+ */
+int vgic_v5_restore_spi_ist(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+ u32 __user *uaddr = (u32 __user *)(unsigned long)attr->addr;
+ unsigned int host_id_bits, host_istsz, host_l2sz;
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+ struct vmtl2_entry *vmte;
+ void *host_ist_base;
+ __le32 h_iste;
+ int ret = 0;
+ u64 tmp;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return -ENXIO;
+
+ ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
+ if (ret)
+ return ret;
+
+ host_ist_base = vmi->h_spi_ist;
+
+ /* We don't have SPIs, but userspace is trying to restore them. */
+ if (!host_ist_base && attr->addr)
+ return -ENOENT;
+
+ /* We have SPIs but userspace isn't trying to restore them. */
+ if (host_ist_base && !attr->addr)
+ return -EINVAL;
+
+ /* No SPIs and no userspace buffer: nothing to do. */
+ if (!host_ist_base && !attr->addr)
+ return 0;
+
+ tmp = le64_to_cpu(READ_ONCE(vmte->val[3]));
+ host_id_bits = FIELD_GET(GICV5_VMTEL2E_IST_ID_BITS, tmp);
+ host_istsz = FIELD_GET(GICV5_VMTEL2E_IST_ISTSZ, tmp);
+ host_l2sz = FIELD_GET(GICV5_VMTEL2E_IST_L2SZ, tmp);
+
+ /*
+ * The guest's SPI IST is always linear. When the SPI IST is saved, only
+ * the architected 4 bytes for the ISTE are saved, and metadata is
+ * not. This means that we can just linearly read the memory provided by
+ * userspace when restoring the IST. We stride through the
+ * host-allocated memory using the actual ISTE size, i.e, skipping
+ * metadata sections, if present.
+ */
+ for (int i = 0; i < BIT(host_id_bits); ++i) {
+ size_t host_iste_size = BIT(host_istsz + 2);
+ void *h_iste_addr = host_ist_base + i * host_iste_size;
+
+ /* Read the entry from userspace memory */
+ ret = get_user(h_iste, uaddr);
+ if (ret)
+ return ret;
+
+ /*
+ * Clean up the entry (zeroing HWU, pending state) and track if
+ * the interrupt was pending so that it can be re-injected
+ * later.
+ */
+ ret = vgic_v5_process_iste(&h_iste, &vmi->pending_irqs,
+ i, GICV5_HWIRQ_TYPE_SPI);
+ if (ret)
+ return ret;
+
+ /* Finally, write the entry to the host IST, and flush it. */
+ memset(h_iste_addr, 0, host_iste_size);
+ WRITE_ONCE(*(__le32 *)h_iste_addr, h_iste);
+ vgic_v5_clean_inval(h_iste_addr, host_iste_size, true, true);
+
+ /* Advance to the next entry in userspace memory */
+ uaddr++;
+ }
+
+ return ret;
+}
+
+/*
+ * Restore the LPI IST from guest memory to the host-allocated LPI IST.
+ *
+ * We iterate over the guest's memory to read out the saved LPI IST. KVM tells
+ * the guest that it is only allowed to create a linear IST, so the guest memory
+ * for the IST should be linear in GPA space.
+ *
+ * The host IST, on the other hand, is allowed to be two-level (but doesn't need
+ * to be). Therefore, some care needs to be taken when restoring the entries to
+ * the host's IST.
+ *
+ * Only the lower 32-bits of each ISTE are restored.
+ */
+int vgic_v5_restore_lpi_ist(struct kvm *kvm)
+{
+ unsigned int host_id_bits, host_istsz, host_l2sz;
+ size_t h_l1_index, h_l2_index, l2bits, n;
+ void *h_l2_ist_base, *host_ist_base;
+ int h_l1_entries, h_l2_entries, ret;
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct vgic_v5_vm_info *vmi;
+ struct vmtl2_entry *vmte;
+ gpa_t g_entry_addr;
+ __le32 h_iste;
+ __le64 tmp;
+
+ ret = vgic_v5_check_vm_id(vm_id);
+ if (ret)
+ return ret;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return -ENXIO;
+
+ ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
+ if (ret)
+ return ret;
+
+ /* If there is no IST to restore, return without error */
+ if (!kvm->arch.vgic.vgic_v5_irs_data->ist_baser.valid &&
+ !FIELD_GET(GICV5_VMTEL2E_VALID, vmte->val[2])) {
+ return 0;
+ }
+
+ /* Host says an LPI IST exists, but we have no backing object. */
+ if (FIELD_GET(GICV5_VMTEL2E_IST_VALID, vmte->val[2]) && !vmi->h_lpi_ist)
+ return -ENXIO;
+
+ if (!vmi->h_lpi_ist)
+ return -ENXIO;
+
+ if (vmi->h_lpi_ist_structure && !vmi->h_lpi_l2_ists)
+ return -ENXIO;
+
+ /* The GPA of the guest's Linear LPI IST */
+ g_entry_addr = kvm->arch.vgic.vgic_v5_irs_data->ist_baser.addr;
+
+ tmp = le64_to_cpu(READ_ONCE(vmte->val[2]));
+ host_id_bits = FIELD_GET(GICV5_VMTEL2E_IST_ID_BITS, tmp);
+ host_istsz = FIELD_GET(GICV5_VMTEL2E_IST_ISTSZ, tmp);
+ host_l2sz = FIELD_GET(GICV5_VMTEL2E_IST_L2SZ, tmp);
+
+ /* We have a Linear IST on the host */
+ if (!vmi->h_lpi_ist_structure) {
+ h_l2_entries = BIT(host_id_bits);
+ host_ist_base = vmi->h_lpi_ist;
+
+ for (h_l2_index = 0; h_l2_index < h_l2_entries; ++h_l2_index) {
+ size_t host_iste_size = BIT(host_istsz + 2);
+ void *h_iste_addr = host_ist_base + h_l2_index * host_iste_size;
+
+ ret = kvm_read_guest_lock(kvm, g_entry_addr, &h_iste, sizeof(h_iste));
+ if (ret)
+ return ret;
+
+
+ /* Clear HWU, pending, and track if it WAS pending */
+ ret = vgic_v5_process_iste(&h_iste, &vmi->pending_irqs,
+ h_l2_index, GICV5_HWIRQ_TYPE_LPI);
+ if (ret)
+ return ret;
+
+ /* Restore the entry to the host IST */
+ memset(h_iste_addr, 0, host_iste_size);
+ WRITE_ONCE(*(__le32 *)h_iste_addr, h_iste);
+ vgic_v5_clean_inval(h_iste_addr, host_iste_size, true, true);
+
+ /* Advance to the next guest entry */
+ g_entry_addr += sizeof(h_iste);
+ }
+ } else {
+ /* A two-level host IST - the harder case */
+ n = max(2, host_id_bits - ((10 - host_istsz) + (2 * host_l2sz)) + 3 - 1);
+ l2bits = (10 - host_istsz) + (2 * host_l2sz);
+ h_l1_entries = BIT(n + 1) / GICV5_IRS_ISTL1E_SIZE;
+ h_l2_entries = BIT(l2bits);
+
+ for (h_l1_index = 0; h_l1_index < h_l1_entries; ++h_l1_index) {
+ /*
+ * If the L1 ISTE is not marked valid, something is
+ * wrong; we don't do dynamic L2 IST allocation! Give up
+ * immediately.
+ */
+ if (!FIELD_GET(GICV5_ISTL1E_VALID, vmi->h_lpi_ist[h_l1_index]))
+ return -ENXIO;
+
+ h_l2_ist_base = vmi->h_lpi_l2_ists[h_l1_index];
+
+ for (h_l2_index = 0; h_l2_index < h_l2_entries; ++h_l2_index) {
+ size_t host_iste_size = BIT(host_istsz + 2);
+ void *h_iste_addr = h_l2_ist_base + h_l2_index * host_iste_size;
+
+ /* Read the guest's ISTE */
+ ret = kvm_read_guest_lock(kvm, g_entry_addr,
+ &h_iste, sizeof(h_iste));
+ if (ret)
+ return ret;
+
+ /*
+ * Clear HWU, pending, and track if it WAS
+ * pending.
+ */
+ ret = vgic_v5_process_iste(&h_iste, &vmi->pending_irqs,
+ h_l1_index * h_l2_entries + h_l2_index,
+ GICV5_HWIRQ_TYPE_LPI);
+ if (ret)
+ return ret;
+
+ /* Write the entry to the host's IST */
+ memset(h_iste_addr, 0, host_iste_size);
+ WRITE_ONCE(*(__le32 *)h_iste_addr, h_iste);
+ vgic_v5_clean_inval(h_iste_addr, host_iste_size, true, true);
+
+ /* Advance to the next guest entry */
+ g_entry_addr += sizeof(h_iste);
+ }
+ }
+ }
+
+ return 0;
+}
+
+/*
+ * Any previously pending IRQs were made non-pending when restoring guest IST
+ * state. Now that we're ready to run, we reinject that pending state for each
+ * using VDPEND.
+ */
+int vgic_v5_restore_pending_irqs(struct kvm *kvm)
+{
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct pending_irq *pirq, *tmp;
+ struct vgic_v5_vm_info *vmi;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return -ENXIO;
+
+ list_for_each_entry_safe(pirq, tmp, &vmi->pending_irqs, next) {
+ kvm_call_hyp(__vgic_v5_vdpend, pirq->irq, 1,
+ kvm->arch.vgic.gicv5_vm.vm_id);
+
+ list_del(&pirq->next);
+ kfree(pirq);
+ }
+
+ return 0;
+}
+
+/*
+ * Called on restore failure to clean up straggling pending state.
+ */
+void vgic_v5_scrap_pending_irqs(struct kvm *kvm)
+{
+ u16 vm_id = vgic_v5_vm_id(kvm);
+ struct pending_irq *pirq, *tmp;
+ struct vgic_v5_vm_info *vmi;
+
+ vmi = xa_load(&vm_info, vm_id);
+ if (WARN_ON_ONCE(!vmi))
+ return;
+
+ list_for_each_entry_safe(pirq, tmp, &vmi->pending_irqs, next) {
+ list_del(&pirq->next);
+ kfree(pirq);
+ }
+}
diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h b/arch/arm64/kvm/vgic/vgic-v5-tables.h
index 25e1c9fff87b4..23417e68ee24f 100644
--- a/arch/arm64/kvm/vgic/vgic-v5-tables.h
+++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
@@ -7,6 +7,7 @@
#define __KVM_ARM_VGICV5_TABLES_H__
#include <linux/irqchip/arm-gic-v5.h>
+#include <linux/list.h>
#define VM_ID_BITS_MIN 8
#define VM_ID_BITS_MAX 16
@@ -68,6 +69,33 @@ typedef __le64 vpe_entry;
#define GICV5_VPED_ADDR_SHIFT 3ULL
#define GICV5_VPED_ADDR GENMASK_ULL(55, 3)
+// L2 IST Entry
+#define GICV5_ISTL2E_PENDING BIT(0)
+#define GICV5_ISTL2E_ACTIVE BIT(1)
+#define GICV5_ISTL2E_HM BIT(2)
+#define GICV5_ISTL2E_ENABLE BIT(3)
+#define GICV5_ISTL2E_IRM BIT(4)
+#define GICV5_ISTL2E_HWU GENMASK(10, 9)
+#define GICV5_ISTL2E_PRIORITY GENMASK(15, 11)
+#define GICV5_ISTL2E_IAFFID GENMASK(31, 16)
+
+/*
+ * Save Restore Header Format
+ *
+ * Track what has been saved into the guest's IST. Specifically, we track if the
+ * SPI and LPI ISTs have been stored, and the number of ID bits for each. This
+ * can be used to figure out where these start and end in the guest's memory.
+ */
+#define GICV5_SAVE_TABLES_IRS_IST_HEADER_SPI_IST BIT(0)
+#define GICV5_SAVE_TABLES_IRS_IST_HEADER_SPI_ID_BITS GENMASK(5, 1)
+#define GICV5_SAVE_TABLES_IRS_IST_HEADER_LPI_IST BIT(6)
+#define GICV5_SAVE_TABLES_IRS_IST_HEADER_LPI_ID_BITS GENMASK(11, 7)
+
+struct pending_irq {
+ u32 irq;
+ struct list_head next;
+};
+
struct vgic_v5_vm_info {
void __iomem *vmd_base;
vpe_entry __iomem *vpet_base;
@@ -79,6 +107,9 @@ struct vgic_v5_vm_info {
__le64 *h_lpi_ist;
__le64 **h_lpi_l2_ists;
__le64 *h_spi_ist;
+
+ /* Tracking of pending interrupts as part of IST restore */
+ struct list_head pending_irqs;
};
struct vgic_v5_vmt {
@@ -171,4 +202,11 @@ void vgic_v5_free_allocated_spi_ist(struct kvm *kvm);
int vgic_v5_lpi_ist_alloc(struct kvm *kvm, unsigned int id_bits);
int vgic_v5_lpi_ist_free(struct kvm *kvm);
+int vgic_v5_save_spi_ist(struct kvm *kvm, struct kvm_device_attr *attr);
+int vgic_v5_save_lpi_ist(struct kvm *kvm);
+int vgic_v5_restore_spi_ist(struct kvm *kvm, struct kvm_device_attr *attr);
+int vgic_v5_restore_lpi_ist(struct kvm *kvm);
+int vgic_v5_restore_pending_irqs(struct kvm *kvm);
+void vgic_v5_scrap_pending_irqs(struct kvm *kvm);
+
#endif
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index 3e435a31b463e..ff3500a634b62 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -580,6 +580,189 @@ static int vgic_v5_db_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
}
}
+/*
+ * Wait for completion of a write to IRS_SAVE_VMR.
+ */
+static int vgic_v5_irs_wait_for_save_vm_op(void)
+{
+ int ret;
+ u32 statusr;
+
+ ret = readl_relaxed_poll_timeout_atomic(
+ irs_base + GICV5_IRS_SAVE_VM_STATUSR, statusr,
+ FIELD_GET(GICV5_IRS_SAVE_VM_STATUSR_IDLE, statusr), 1,
+ USEC_PER_SEC);
+
+ if (ret == -ETIMEDOUT) {
+ pr_err_ratelimited("Timed out waiting for IRS Save VM Op\n");
+ return ret;
+ }
+
+ return 0;
+}
+
+static bool vgic_v5_irs_is_quiesced(u16 vm_id)
+{
+ int err;
+ u64 save_vmr;
+ u32 statusr;
+
+ save_vmr = FIELD_PREP(GICV5_IRS_SAVE_VMR_VM_ID, vm_id);
+ save_vmr |= FIELD_PREP(GICV5_IRS_SAVE_VMR_Q, 1);
+ save_vmr |= FIELD_PREP(GICV5_IRS_SAVE_VMR_S, 0);
+ irs_writeq_relaxed(save_vmr, GICV5_IRS_SAVE_VMR);
+
+ /* Wait for the operation */
+ err = vgic_v5_irs_wait_for_save_vm_op();
+ if (err)
+ return false;
+
+ statusr = irs_readl_relaxed(GICV5_IRS_SAVE_VM_STATUSR);
+
+ return statusr & GICV5_IRS_SAVE_VM_STATUSR_Q;
+}
+
+int vgic_v5_irs_save_ists(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+ int ret = 0;
+ u64 save_vmr;
+ u16 vm_id = vgic_v5_vm_id(kvm);
+
+
+ mutex_lock(&kvm->lock);
+
+ if (kvm_trylock_all_vcpus(kvm)) {
+ mutex_unlock(&kvm->lock);
+ pr_err("Failed to lock vCPUs\n");
+ return -EBUSY;
+ }
+
+ mutex_lock(&kvm->arch.config_lock);
+
+ save_vmr = FIELD_PREP(GICV5_IRS_SAVE_VMR_VM_ID, vm_id);
+ save_vmr |= FIELD_PREP(GICV5_IRS_SAVE_VMR_Q, 1);
+ save_vmr |= FIELD_PREP(GICV5_IRS_SAVE_VMR_S, 1);
+ irs_writeq_relaxed(save_vmr, GICV5_IRS_SAVE_VMR);
+
+ /* Wait for the operation */
+ ret = vgic_v5_irs_wait_for_save_vm_op();
+ if (ret) {
+ pr_err("Timed out\n");
+ goto out_unlock;
+ }
+
+ if (!vgic_v5_irs_is_quiesced(vm_id)) {
+ pr_err("Cannot save; VM not quiesced after IRS_SAVE_VMR write\n");
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ /*
+ * Serialise the SPI IST to the userspace-provided memory (address in
+ * attr).
+ */
+ ret = vgic_v5_save_spi_ist(kvm, attr);
+ if (ret) {
+ pr_err("Failed to save the SPI IST!\n");
+ goto out_unlock;
+ }
+
+ if (!vgic_v5_irs_is_quiesced(vm_id)) {
+ pr_err("VM is not quiesced; failed to save IST(s)\n");
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ /* Serialise the LPI IST to the guest's IST */
+ ret = vgic_v5_save_lpi_ist(kvm);
+ if (ret) {
+ pr_err("Failed to save the LPI IST!\n");
+ goto out_unlock;
+ }
+
+ if (!vgic_v5_irs_is_quiesced(vm_id)) {
+ pr_err("VM is not quiesced; failed to save IST(s)\n");
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+out_unlock:
+ mutex_unlock(&kvm->arch.config_lock);
+ kvm_unlock_all_vcpus(kvm);
+ mutex_unlock(&kvm->lock);
+
+ return ret;
+}
+
+int vgic_v5_irs_restore_ists(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+ int ret = 0;
+ struct kvm_vcpu *vcpu0 = kvm_get_vcpu(kvm, 0);
+
+ mutex_lock(&kvm->lock);
+
+ if (kvm_trylock_all_vcpus(kvm)) {
+ mutex_unlock(&kvm->lock);
+ return -EBUSY;
+ }
+
+ mutex_lock(&kvm->arch.config_lock);
+
+ /*
+ * The ISTs should not be written by us while the VM (or IST) is
+ * valid. In order to safely restore, and make sure that the GIC sees
+ * the latest and greatest state, make the VM invalid prior to
+ * restoring.
+ */
+ ret = vgic_v5_send_command(vcpu0, VMTE_MAKE_INVALID);
+ if (ret) {
+ /*
+ * If we go wrong here, things are rather broken. VM is likely
+ * unrunnable.
+ */
+ goto out_unlock;
+ }
+
+ /*
+ * Unserialise the SPI IST from the userspace-provided memory (address
+ * in attr).
+ */
+ ret = vgic_v5_restore_spi_ist(kvm, attr);
+ if (ret) {
+ pr_err("Failed to restore the SPI IST!\n");
+ goto out_unlock;
+ }
+
+ /* Unserialise the ISTs from the guest's IST */
+ ret = vgic_v5_restore_lpi_ist(kvm);
+ if (ret) {
+ pr_err("Failed to restore the LPI IST!\n");
+ goto out_unlock;
+ }
+
+ /* ... and make the VM Valid again */
+ ret = vgic_v5_send_command(vcpu0, VMTE_MAKE_VALID);
+ if (ret)
+ goto out_unlock;
+
+ /*
+ * As part of restoring the ISTs, any previously pending interrupts have
+ * been tracked and made non-pending. Now that the ISTs have been
+ * restored, and the VM is valid again, restore the pending interrupts.
+ */
+ ret = vgic_v5_restore_pending_irqs(kvm);
+
+out_unlock:
+ if (ret)
+ vgic_v5_scrap_pending_irqs(kvm);
+
+ mutex_unlock(&kvm->arch.config_lock);
+ kvm_unlock_all_vcpus(kvm);
+ mutex_unlock(&kvm->lock);
+
+ return ret;
+}
+
/*
* This set of irq_chip functions is specific for doorbells.
*/
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index 36604f911e089..35eb048b5a0f7 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -387,6 +387,8 @@ int vgic_v5_cpu_sysregs_uaccess(struct kvm_vcpu *vcpu,
struct kvm_device_attr *attr, bool is_write);
int vgic_v5_has_cpu_sysregs_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
const struct sys_reg_desc *vgic_v5_get_sysreg_table(unsigned int *sz);
+int vgic_v5_irs_save_ists(struct kvm *kvm, struct kvm_device_attr *attr);
+int vgic_v5_irs_restore_ists(struct kvm *kvm, struct kvm_device_attr *attr);
#define for_each_visible_v5_ppi(__i, __k) \
for_each_set_bit(__i, (__k)->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS)
diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
index 9ea3674a6613b..431aca67f4d5f 100644
--- a/include/linux/irqchip/arm-gic-v5.h
+++ b/include/linux/irqchip/arm-gic-v5.h
@@ -322,6 +322,13 @@
#define GICV5_IRS_VMAP_VPER_VM_ID GENMASK_ULL(47, 32)
#define GICV5_IRS_VMAP_VPER_VPE_ID GENMASK_ULL(15, 0)
+#define GICV5_IRS_SAVE_VMR_VM_ID GENMASK_ULL(15, 0)
+#define GICV5_IRS_SAVE_VMR_Q BIT_ULL(62)
+#define GICV5_IRS_SAVE_VMR_S BIT_ULL(63)
+
+#define GICV5_IRS_SAVE_VM_STATUSR_IDLE BIT(0)
+#define GICV5_IRS_SAVE_VM_STATUSR_Q BIT(1)
+
#define GICV5_ISTL1E_VALID BIT_ULL(0)
#define GICV5_IRS_ISTL1E_SIZE 8UL
--
2.34.1
* [PATCH 37/43] KVM: arm64: gic-v5: Handle userspace accesses to IRS MMIO region
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (35 preceding siblings ...)
2026-04-27 16:18 ` [PATCH 36/43] KVM: arm64: gic-v5: Implement save/restore mechanisms for ISTs Sascha Bischoff
@ 2026-04-27 16:18 ` Sascha Bischoff
2026-04-27 16:19 ` [PATCH 38/43] KVM: arm64: gic-v5: Add VGIC_GRP_IRS_REGS/VGIC_GRP_IST to UAPI Sascha Bischoff
` (5 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:18 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
As part of saving and restoring state of a GICv5-based system,
userspace is required to save/restore the IRS MMIO registers. These
include important information such as guest IST configuration, and in
general KVM needs to present consistent state to the guest.
Provide accessors to read and write the IRS MMIO state. This is
modelled on what is already done for the GICv3 ITS as the idea is
broadly the same.
Where possible, the existing access mechanisms are used, but for some
registers the access is handled a bit differently as they have wider
effects. For example, some writes need to be sanitised to make sure
that the hardware is capable (IST capabilities presented to the guest,
for example). Similar things apply to the SPI config where we block
userspace from setting anything that doesn't match what has been set
already.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-irs-v5.c | 494 +++++++++++++++++++-------
arch/arm64/kvm/vgic/vgic-kvm-device.c | 24 ++
arch/arm64/kvm/vgic/vgic.h | 6 +
3 files changed, 401 insertions(+), 123 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-irs-v5.c b/arch/arm64/kvm/vgic/vgic-irs-v5.c
index 3a2539bf99d0d..761878fc956de 100644
--- a/arch/arm64/kvm/vgic/vgic-irs-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-irs-v5.c
@@ -26,15 +26,58 @@ static struct vgic_v5_irs *vgic_v5_get_irs(struct kvm_vcpu *vcpu)
return vcpu->kvm->arch.vgic.vgic_v5_irs_data;
}
+static int vgic_v5_read_irs_pe_cr0(struct kvm_vcpu *vcpu, u64 *val)
+{
+ struct vgic_v5_irs *irs = vgic_v5_get_irs(vcpu);
+ struct gicv5_cmd_info cmd_info;
+ struct kvm_vcpu *target_vcpu;
+ int rc;
+
+ target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, irs->pe_selr.iaffid);
+ if (!target_vcpu)
+ return -EINVAL;
+
+ /*
+ * Read the corresponding IRS_VPE_CR0. We do so via the doorbell for the
+ * specific vcpu we have in the PE_SELR.
+ */
+ cmd_info.cmd_type = VPE_CR0_READ;
+ rc = irq_set_vcpu_affinity(vgic_v5_vpe_db(target_vcpu), &cmd_info);
+ if (rc)
+ return rc;
+
+ *val = cmd_info.data;
+ return 0;
+}
+
+static int vgic_v5_write_irs_pe_cr0(struct kvm_vcpu *vcpu, unsigned long val)
+{
+ struct vgic_v5_irs *irs = vgic_v5_get_irs(vcpu);
+ struct gicv5_cmd_info cmd_info;
+ struct kvm_vcpu *target_vcpu;
+ int rc;
+
+ target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, irs->pe_selr.iaffid);
+ if (!target_vcpu)
+ return -EINVAL;
+
+ /*
+ * Write the corresponding IRS_VPE_CR0. We do so via the doorbell for
+ * the specific vcpu we have in the PE_SELR.
+ */
+ cmd_info.cmd_type = VPE_CR0_WRITE;
+ cmd_info.data = val;
+ rc = irq_set_vcpu_affinity(vgic_v5_vpe_db(target_vcpu), &cmd_info);
+
+ return rc;
+}
+
static unsigned long vgic_v5_mmio_read_irs_misc(struct kvm_vcpu *vcpu,
gpa_t addr, unsigned int len)
{
struct vgic_v5_irs *irs = vgic_v5_get_irs(vcpu);
const size_t offset = addr & (SZ_64K - 1);
- struct gicv5_cmd_info cmd_info;
- struct kvm_vcpu *target_vcpu;
u64 value = 0;
- int rc;
switch (offset) {
case GICV5_IRS_IDR0:
@@ -127,32 +170,8 @@ static unsigned long vgic_v5_mmio_read_irs_misc(struct kvm_vcpu *vcpu,
}
break;
case GICV5_IRS_PE_CR0:
- /*
- * Make sure that we are doing something reasonable first.
- * Remember, the IAFFID is the same as the VPE_ID
- */
- target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, irs->pe_selr.iaffid);
- if (!target_vcpu) {
- kvm_err("Guest programmed invalid IAFFID (0x%x) into the IRS_PE_SELR\n",
- irs->pe_selr.iaffid);
- break;
- }
-
- mutex_lock(&vcpu->kvm->arch.config_lock);
-
- /*
- * Read the corresponding IRS_VPE_CR0. We do so via the doorbell
- * for the specific vcpu we have in the PE_SELR.
- */
- cmd_info.cmd_type = VPE_CR0_READ;
- rc = irq_set_vcpu_affinity(vgic_v5_vpe_db(target_vcpu), &cmd_info);
- if (rc)
- kvm_err("Could not read VPE_CR0 in IRS: %d\n", rc);
- else
- value = cmd_info.data;
-
- mutex_unlock(&vcpu->kvm->arch.config_lock);
-
+ scoped_guard(mutex, &vcpu->kvm->arch.config_lock)
+ WARN_ON_ONCE(vgic_v5_read_irs_pe_cr0(vcpu, &value));
break;
default:
return 0;
@@ -167,9 +186,6 @@ static void vgic_v5_mmio_write_irs_misc(struct kvm_vcpu *vcpu, gpa_t addr,
struct vgic_v5_irs *irs = vgic_v5_get_irs(vcpu);
struct vgic_dist *vgic = vgic_v5_get_vgic(vcpu);
const size_t offset = addr & (SZ_64K - 1);
- struct gicv5_cmd_info cmd_info;
- struct kvm_vcpu *target_vcpu;
- int rc;
switch (offset) {
case GICV5_IRS_CR0:
@@ -216,27 +232,8 @@ static void vgic_v5_mmio_write_irs_misc(struct kvm_vcpu *vcpu, gpa_t addr,
irs->pe_selr.iaffid = FIELD_GET(GICV5_IRS_PE_SELR_IAFFID, val);
return;
case GICV5_IRS_PE_CR0:
- /*
- * Make sure that we are doing something reasonable first.
- * Remember, the IAFFID is the same as the VPE_ID.
- */
- target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, irs->pe_selr.iaffid);
- if (!target_vcpu)
- return;
-
- mutex_lock(&vcpu->kvm->arch.config_lock);
-
- /*
- * Write the corresponding IRS_VPE_CR0. We do so via the
- * doorbell for the specific vcpu we have in the PE_SELR.
- */
- cmd_info.cmd_type = VPE_CR0_WRITE;
- cmd_info.data = val;
- rc = irq_set_vcpu_affinity(vgic_v5_vpe_db(target_vcpu), &cmd_info);
- if (rc)
- kvm_err("Could not update VPE_CR0 in IRS: %d\n", rc);
-
- mutex_unlock(&vcpu->kvm->arch.config_lock);
+ scoped_guard(mutex, &vcpu->kvm->arch.config_lock)
+ WARN_ON_ONCE(vgic_v5_write_irs_pe_cr0(vcpu, val));
return;
default:
return;
@@ -411,13 +408,56 @@ static unsigned long vgic_v5_mmio_read_irs_ist(struct kvm_vcpu *vcpu,
return value;
}
+static int vgic_v5_write_irs_ist_baser(struct kvm_vcpu *vcpu, unsigned long val)
+{
+ struct vgic_v5_irs *irs = vgic_v5_get_irs(vcpu);
+ struct gicv5_cmd_info cmd_info;
+ bool valid = FIELD_GET(GICV5_IRS_IST_BASER_VALID, val);
+ int rc;
+
+ /* Valid -> Invalid */
+ if (irs->ist_baser.valid && !valid) {
+ /* Make the LPI IST invalid and then ... */
+ cmd_info.cmd_type = LPI_VIST_MAKE_INVALID;
+ rc = irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu), &cmd_info);
+ if (rc)
+ return rc;
+
+ /*
+ * ... free the host IST if we successfully marked the
+ * IST as invalid. Frankly, if we failed to mark the
+ * guest's IST invalid, we're cooked because it means
+ * that the IRS may still be using the memory that we
+ * want to free. Hence, we leave it allocated and skip
+ * the clearing of valid bit in the baser.
+ */
+ rc = vgic_v5_lpi_ist_free(vcpu->kvm);
+ if (rc)
+ return rc;
+ } else if (!irs->ist_baser.valid && valid) { /* Invalid -> Valid */
+ if (!vgic_v5_ist_cfgr_valid(irs)) {
+ kvm_err("Guest programmed invalid IRS_IST_CFGR\n");
+ return -EINVAL;
+ }
+
+ rc = vgic_v5_lpi_ist_alloc(vcpu->kvm, irs->ist_cfgr.lpi_id_bits);
+ if (rc)
+ return rc;
+ }
+
+ /* Now that we've handled the edges, update the valid bit and addr */
+ irs->ist_baser.valid = FIELD_GET(GICV5_IRS_IST_BASER_VALID, val);
+ irs->ist_baser.addr = FIELD_GET(GICV5_IRS_IST_BASER_ADDR_MASK, val)
+ << GICV5_IRS_IST_BASER_ADDR_SHIFT;
+
+ return 0;
+}
+
static void vgic_v5_mmio_write_irs_ist(struct kvm_vcpu *vcpu, gpa_t addr,
unsigned int len, unsigned long val)
{
struct vgic_v5_irs *irs = vgic_v5_get_irs(vcpu);
const size_t offset = addr & (SZ_64K - 1);
- struct gicv5_cmd_info cmd_info;
- int rc;
switch (offset) {
case GICV5_IRS_IST_CFGR:
@@ -427,77 +467,191 @@ static void vgic_v5_mmio_write_irs_ist(struct kvm_vcpu *vcpu, gpa_t addr,
irs->ist_cfgr.structure = FIELD_GET(GICV5_IRS_IST_CFGR_STRUCTURE, val);
return;
case GICV5_IRS_IST_BASER: {
- bool valid = FIELD_GET(GICV5_IRS_IST_BASER_VALID, val);
-
guard(mutex)(&vcpu->kvm->arch.config_lock);
+ if (WARN_ON_ONCE(vgic_v5_write_irs_ist_baser(vcpu, val)))
+ break;
+ return;
+ }
+ default:
+ return;
+ }
+}
- /* Valid -> Invalid */
- if (irs->ist_baser.valid && !valid) {
- /* Make the LPI IST invalid and then ... */
- cmd_info.cmd_type = LPI_VIST_MAKE_INVALID;
- rc = irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu), &cmd_info);
- if (WARN_ON_ONCE(rc))
- break;
+static unsigned long vgic_v5_mmio_uaccess_read_irs(struct kvm_vcpu *vcpu,
+ gpa_t addr, unsigned int len)
+{
+ const size_t offset = addr & (SZ_64K - 1);
+ u64 val = 0;
- /*
- * ... free the host IST if we successfully marked the
- * IST as invalid. Frankly, if we failed to make the
- * guest's IST as invalid, we're cooked because it means
- * that the IRS may still be using the memory that we
- * want to free. Hence, we leave it allocated and skip
- * the clearing of valid bit in the baser.
- */
- rc = vgic_v5_lpi_ist_free(vcpu->kvm);
- if (WARN_ON_ONCE(rc))
- break;
- } else if (!irs->ist_baser.valid && valid) { /* Invalid -> Valid */
- if (!vgic_v5_ist_cfgr_valid(irs)) {
- kvm_err("Guest programmed invalid IRS_IST_CFGR\n");
- break;
- }
-
- rc = vgic_v5_lpi_ist_alloc(vcpu->kvm,
- irs->ist_cfgr.lpi_id_bits);
- if (WARN_ON_ONCE(rc))
- break;
- }
+ switch (offset) {
+ case GICV5_IRS_PE_CR0:
+ /* Read without trying to take the config_lock again */
+ if (vgic_v5_read_irs_pe_cr0(vcpu, &val))
+ return 0;
+ return val;
+ default:
+ return 0;
+ }
+}
- /* Now that we've handled the edges, update the valid bit and addr */
- irs->ist_baser.valid = FIELD_GET(GICV5_IRS_IST_BASER_VALID, val);
- irs->ist_baser.addr = FIELD_GET(GICV5_IRS_IST_BASER_ADDR_MASK, val)
- << GICV5_IRS_IST_BASER_ADDR_SHIFT;
+static int vgic_v5_mmio_uaccess_write_irs(struct kvm_vcpu *vcpu, gpa_t addr,
+ unsigned int len, unsigned long val)
+{
+ struct vgic_dist *vgic = &vcpu->kvm->arch.vgic;
+ struct vgic_v5_irs *irs_data = vgic->vgic_v5_irs_data;
+ size_t offset = addr & (SZ_64K - 1);
- return;
- }
+ /*
+ * The following registers are ONLY settable via uaccesses. The guest
+ * cannot write them!
+ */
+
+ switch (offset) {
+ case GICV5_IRS_IDR0:
+ if (FIELD_GET(GICV5_IRS_IDR0_DOM, val) !=
+ GICV5_IRS_IDR0_DOMAIN_NON_SECURE)
+ return -EINVAL;
+
+ if (FIELD_GET(GICV5_IRS_IDR0_VIRT, val) ||
+ FIELD_GET(GICV5_IRS_IDR0_ONEOFN, val) ||
+ FIELD_GET(GICV5_IRS_IDR0_VIRT1OFN, val) ||
+ FIELD_GET(GICV5_IRS_IDR0_SETLPI, val) ||
+ FIELD_GET(GICV5_IRS_IDR0_MEC, val) ||
+ FIELD_GET(GICV5_IRS_IDR0_MPAM, val) ||
+ FIELD_GET(GICV5_IRS_IDR0_SWE, val))
+ return -EINVAL;
+
+ irs_data->idr0.domain = FIELD_GET(GICV5_IRS_IDR0_DOM, val);
+ irs_data->idr0.pa_range = FIELD_GET(GICV5_IRS_IDR0_PA_RANGE, val);
+ irs_data->idr0.virt = FIELD_GET(GICV5_IRS_IDR0_VIRT, val);
+ irs_data->idr0.one_of_n = FIELD_GET(GICV5_IRS_IDR0_ONEOFN, val);
+ irs_data->idr0.virt_one_of_n = FIELD_GET(GICV5_IRS_IDR0_VIRT1OFN, val);
+ irs_data->idr0.setlpi = FIELD_GET(GICV5_IRS_IDR0_SETLPI, val);
+ irs_data->idr0.mec = FIELD_GET(GICV5_IRS_IDR0_MEC, val);
+ irs_data->idr0.mpam = FIELD_GET(GICV5_IRS_IDR0_MPAM, val);
+ irs_data->idr0.swe = FIELD_GET(GICV5_IRS_IDR0_SWE, val);
+ irs_data->idr0.irs_id = FIELD_GET(GICV5_IRS_IDR0_IRSID, val);
+ break;
+ case GICV5_IRS_IDR1:
+ /* Ignore writes to PE_CNT as this is populated from num vcpus */
+
+ /*
+ * The number of IAFFID bits supported. If userspace tries to
+ * set something more than what we support, reject the write.
+ */
+ if (FIELD_GET(GICV5_IRS_IDR1_IAFFID_BITS, val) > vgic_v5_vmte_vpe_id_bits(vcpu))
+ return -EINVAL;
+
+ if (FIELD_GET(GICV5_IRS_IDR1_PRIORITY_BITS, val) > 0b100)
+ return -EINVAL;
+
+ irs_data->idr1.priority_bits = FIELD_GET(GICV5_IRS_IDR1_PRIORITY_BITS, val);
+ break;
+ case GICV5_IRS_IDR2:
+ /* We always support LPIs */
+ if (!FIELD_GET(GICV5_IRS_IDR2_LPI, val))
+ return -EINVAL;
+
+ /* We only support LPIs with linear, non-metadata guest ISTs */
+ if (FIELD_GET(GICV5_IRS_IDR2_IST_LEVELS, val))
+ return -EINVAL;
+
+ if (FIELD_GET(GICV5_IRS_IDR2_ISTMD, val) ||
+ FIELD_GET(GICV5_IRS_IDR2_ISTMD_SZ, val))
+ return -EINVAL;
+
+ /* We can't present more bits than we have support for in HW */
+ if (FIELD_GET(GICV5_IRS_IDR2_ID_BITS, val) > vgic_v5_host_caps()->ist_id_bits)
+ return -EINVAL;
+
+ /* Min LPI ID bits must be greater than or equal to the HW */
+ if (FIELD_GET(GICV5_IRS_IDR2_MIN_LPI_ID_BITS, val) <
+ vgic_v5_host_caps()->min_lpi_id_bits)
+ return -EINVAL;
+
+ if (FIELD_GET(GICV5_IRS_IDR2_MIN_LPI_ID_BITS, val) >
+ FIELD_GET(GICV5_IRS_IDR2_ID_BITS, val))
+ return -EINVAL;
+
+ irs_data->idr2.istmd_sz = FIELD_GET(GICV5_IRS_IDR2_ISTMD_SZ, val);
+ irs_data->idr2.istmd = FIELD_GET(GICV5_IRS_IDR2_ISTMD, val);
+ irs_data->idr2.ist_l2sz = FIELD_GET(GICV5_IRS_IDR2_IST_L2SZ, val);
+ irs_data->idr2.ist_levels = FIELD_GET(GICV5_IRS_IDR2_IST_LEVELS, val);
+ irs_data->idr2.min_lpi_id_bits = FIELD_GET(GICV5_IRS_IDR2_MIN_LPI_ID_BITS, val);
+ irs_data->idr2.id_bits = FIELD_GET(GICV5_IRS_IDR2_ID_BITS, val);
+ break;
+ case GICV5_IRS_IDR5:
+ if (FIELD_GET(GICV5_IRS_IDR5_SPI_RANGE, val) != irs_data->idr5.spi_range)
+ return -EINVAL;
+ break;
+ case GICV5_IRS_IDR6:
+ if (FIELD_GET(GICV5_IRS_IDR6_SPI_IRS_RANGE, val) != irs_data->idr6.spi_irs_range)
+ return -EINVAL;
+ break;
+ case GICV5_IRS_IDR7:
+ if (FIELD_GET(GICV5_IRS_IDR7_SPI_BASE, val) != irs_data->idr7.spi_base)
+ return -EINVAL;
+ break;
+ case GICV5_IRS_IST_BASER:
+ return vgic_v5_write_irs_ist_baser(vcpu, val);
+ case GICV5_IRS_SPI_CFGR:
+ fallthrough;
+ case GICV5_IRS_PE_SELR:
+ fallthrough;
+ case GICV5_IRS_IIDR:
+ fallthrough;
+ case GICV5_IRS_AIDR:
+ break;
+ case GICV5_IRS_PE_CR0:
+ /* Write without trying to take the config_lock again */
+ return vgic_v5_write_irs_pe_cr0(vcpu, val);
default:
- return;
+ return -EINVAL;
}
+
+ return 0;
}
static const struct vgic_register_region vgic_v5_irs_registers[] = {
/*
* This is the IRS_CONFIG_FRAME.
*/
- REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR0, vgic_v5_mmio_read_irs_misc,
- vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
- REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR1, vgic_v5_mmio_read_irs_misc,
- vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
- REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR2, vgic_v5_mmio_read_irs_misc,
- vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH_UACCESS(GICV5_IRS_IDR0, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, NULL,
+ vgic_v5_mmio_uaccess_write_irs, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH_UACCESS(GICV5_IRS_IDR1, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, NULL,
+ vgic_v5_mmio_uaccess_write_irs, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH_UACCESS(GICV5_IRS_IDR2, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, NULL,
+ vgic_v5_mmio_uaccess_write_irs, 4,
+ VGIC_ACCESS_32bit),
REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR3, vgic_mmio_read_raz,
vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR4, vgic_mmio_read_raz,
vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
- REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR5, vgic_v5_mmio_read_irs_misc,
- vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
- REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR6, vgic_v5_mmio_read_irs_misc,
- vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
- REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IDR7, vgic_v5_mmio_read_irs_misc,
- vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
- REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IIDR, vgic_v5_mmio_read_irs_misc,
- vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
- REGISTER_DESC_WITH_LENGTH(GICV5_IRS_AIDR, vgic_v5_mmio_read_irs_misc,
- vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH_UACCESS(GICV5_IRS_IDR5, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, NULL,
+ vgic_v5_mmio_uaccess_write_irs, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH_UACCESS(GICV5_IRS_IDR6, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, NULL,
+ vgic_v5_mmio_uaccess_write_irs, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH_UACCESS(GICV5_IRS_IDR7, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, NULL,
+ vgic_v5_mmio_uaccess_write_irs, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH_UACCESS(GICV5_IRS_IIDR, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, NULL,
+ vgic_v5_mmio_uaccess_write_irs, 4,
+ VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH_UACCESS(GICV5_IRS_AIDR, vgic_v5_mmio_read_irs_misc,
+ vgic_mmio_write_wi, NULL,
+ vgic_v5_mmio_uaccess_write_irs, 4,
+ VGIC_ACCESS_32bit),
REGISTER_DESC_WITH_LENGTH(GICV5_IRS_CR0, vgic_v5_mmio_read_irs_misc,
vgic_v5_mmio_write_irs_misc, 4,
VGIC_ACCESS_32bit),
@@ -516,30 +670,34 @@ static const struct vgic_register_region vgic_v5_irs_registers[] = {
REGISTER_DESC_WITH_LENGTH(GICV5_IRS_SPI_SELR, vgic_v5_mmio_read_irs_spi,
vgic_v5_mmio_write_irs_spi, 4,
VGIC_ACCESS_32bit),
- REGISTER_DESC_WITH_LENGTH(
- GICV5_IRS_SPI_DOMAINR, vgic_v5_mmio_read_irs_spi,
- vgic_v5_mmio_write_irs_spi, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH(GICV5_IRS_SPI_DOMAINR, vgic_v5_mmio_read_irs_spi,
+ vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
REGISTER_DESC_WITH_LENGTH(GICV5_IRS_SPI_RESAMPLER, vgic_mmio_read_raz,
vgic_mmio_write_wi, 4,
VGIC_ACCESS_32bit),
- REGISTER_DESC_WITH_LENGTH(GICV5_IRS_SPI_CFGR, vgic_v5_mmio_read_irs_spi,
- vgic_v5_mmio_write_irs_spi, 4,
- VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH_UACCESS(GICV5_IRS_SPI_CFGR,
+ vgic_v5_mmio_read_irs_spi,
+ vgic_v5_mmio_write_irs_spi, NULL,
+ vgic_v5_mmio_uaccess_write_irs, 4,
+ VGIC_ACCESS_32bit),
REGISTER_DESC_WITH_LENGTH(GICV5_IRS_SPI_STATUSR,
vgic_v5_mmio_read_irs_spi, vgic_mmio_write_wi,
4, VGIC_ACCESS_32bit),
- REGISTER_DESC_WITH_LENGTH(GICV5_IRS_PE_SELR, vgic_v5_mmio_read_irs_misc,
- vgic_v5_mmio_write_irs_misc, 4,
- VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH_UACCESS(
+ GICV5_IRS_PE_SELR, vgic_v5_mmio_read_irs_misc,
+ vgic_v5_mmio_write_irs_misc, NULL,
+ vgic_v5_mmio_uaccess_write_irs, 4, VGIC_ACCESS_32bit),
REGISTER_DESC_WITH_LENGTH(GICV5_IRS_PE_STATUSR,
vgic_v5_mmio_read_irs_misc,
vgic_mmio_write_wi, 4, VGIC_ACCESS_32bit),
- REGISTER_DESC_WITH_LENGTH(GICV5_IRS_PE_CR0, vgic_v5_mmio_read_irs_misc,
- vgic_v5_mmio_write_irs_misc, 4,
- VGIC_ACCESS_32bit),
- REGISTER_DESC_WITH_LENGTH(
+ REGISTER_DESC_WITH_LENGTH_UACCESS(
+ GICV5_IRS_PE_CR0, vgic_v5_mmio_read_irs_misc,
+ vgic_v5_mmio_write_irs_misc, vgic_v5_mmio_uaccess_read_irs,
+ vgic_v5_mmio_uaccess_write_irs, 4, VGIC_ACCESS_32bit),
+ REGISTER_DESC_WITH_LENGTH_UACCESS(
GICV5_IRS_IST_BASER, vgic_v5_mmio_read_irs_ist,
- vgic_v5_mmio_write_irs_ist, 8, VGIC_ACCESS_64bit),
+ vgic_v5_mmio_write_irs_ist, NULL,
+ vgic_v5_mmio_uaccess_write_irs, 8, VGIC_ACCESS_64bit),
REGISTER_DESC_WITH_LENGTH(GICV5_IRS_IST_CFGR, vgic_v5_mmio_read_irs_ist,
vgic_v5_mmio_write_irs_ist, 4,
VGIC_ACCESS_32bit),
@@ -825,3 +983,93 @@ int kvm_vgic_v5_irs_init(struct kvm *kvm, unsigned int nr_spis)
return 0;
}
+
+int vgic_v5_has_attr_regs(struct kvm_device *dev, struct kvm_device_attr *attr)
+{
+ const struct vgic_register_region *region;
+ struct vgic_reg_attr reg_attr;
+ struct kvm_vcpu *vcpu;
+ gpa_t addr, offset;
+ int ret, align;
+
+ ret = vgic_v5_parse_attr(dev, attr, ®_attr);
+ if (ret)
+ return ret;
+
+ vcpu = reg_attr.vcpu;
+ addr = reg_attr.addr;
+
+ if (attr->group == KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS)
+ return vgic_v5_has_cpu_sysregs_attr(vcpu, attr);
+
+ offset = attr->attr;
+
+ if (IS_VGIC_ADDR_UNDEF(dev->kvm->arch.vgic.vgic_v5_irs_data->vgic_v5_irs_base))
+ return -ENXIO;
+
+ region = vgic_find_mmio_region(vgic_v5_irs_registers,
+ ARRAY_SIZE(vgic_v5_irs_registers),
+ offset);
+ if (!region)
+ return -ENXIO;
+
+ align = region->access_flags & VGIC_ACCESS_64bit ? 0x7 : 0x3;
+ if (offset & align)
+ return -EINVAL;
+
+ return 0;
+}
+
+/*
+ * Access the IRS MMIO Regs. Relevant locks have been taken by the calling code.
+ */
+int vgic_v5_irs_attr_regs_access(struct kvm_device *dev,
+ struct kvm_device_attr *attr,
+ u64 *reg, bool is_write)
+{
+ const struct vgic_register_region *region;
+ gpa_t addr, offset;
+ unsigned int len;
+ int align, ret = 0;
+
+ offset = attr->attr;
+
+ if (IS_VGIC_ADDR_UNDEF(dev->kvm->arch.vgic.vgic_v5_irs_data->vgic_v5_irs_base))
+ return -ENXIO;
+
+ region = vgic_find_mmio_region(vgic_v5_irs_registers,
+ ARRAY_SIZE(vgic_v5_irs_registers),
+ offset);
+ if (!region)
+ return -ENXIO;
+
+ /*
+ * Although the spec supports upper/lower 32-bit accesses to
+ * 64-bit IRS registers, the userspace ABI requires 64-bit
+ * accesses to all 64-bit wide registers. We therefore only
+ * support 32-bit accesses to 32-bit-wide registers.
+ */
+ align = region->access_flags & VGIC_ACCESS_64bit ? 0x7 : 0x3;
+ len = region->access_flags & VGIC_ACCESS_64bit ? 8 : 4;
+
+ if (offset & align)
+ return -EINVAL;
+
+ addr = dev->kvm->arch.vgic.vgic_v5_irs_data->vgic_v5_irs_base + offset;
+
+ if (is_write) {
+ if (region->uaccess_write)
+ ret = region->uaccess_write(kvm_get_vcpu(dev->kvm, 0),
+ addr, len, *reg);
+ else
+ region->write(kvm_get_vcpu(dev->kvm, 0), addr, len, *reg);
+ } else {
+ if (region->uaccess_read)
+ *reg = region->uaccess_read(kvm_get_vcpu(dev->kvm, 0),
+ addr, len);
+ else
+ *reg = region->read(kvm_get_vcpu(dev->kvm, 0), addr, len);
+ }
+
+ return ret;
+}
diff --git a/arch/arm64/kvm/vgic/vgic-kvm-device.c b/arch/arm64/kvm/vgic/vgic-kvm-device.c
index fd3b9cfbc2660..944923121e080 100644
--- a/arch/arm64/kvm/vgic/vgic-kvm-device.c
+++ b/arch/arm64/kvm/vgic/vgic-kvm-device.c
@@ -764,6 +764,30 @@ static int vgic_v5_get_userspace_ppis(struct kvm_device *dev,
return ret;
}
+int vgic_v5_parse_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr,
+ struct vgic_reg_attr *reg_attr)
+{
+ unsigned long vgic_mpidr, mpidr_reg;
+
+ if (attr->group == KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS) {
+ vgic_mpidr = (attr->attr & KVM_DEV_ARM_VGIC_V3_MPIDR_MASK) >>
+ KVM_DEV_ARM_VGIC_V3_MPIDR_SHIFT;
+
+ mpidr_reg = VGIC_TO_MPIDR(vgic_mpidr);
+ reg_attr->vcpu = kvm_mpidr_to_vcpu(dev->kvm, mpidr_reg);
+ } else {
+ reg_attr->vcpu = kvm_get_vcpu(dev->kvm, 0);
+ }
+
+ if (!reg_attr->vcpu)
+ return -EINVAL;
+
+ reg_attr->addr = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK;
+
+ return 0;
+}
+
static int vgic_v5_set_attr(struct kvm_device *dev,
struct kvm_device_attr *attr)
{
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index 35eb048b5a0f7..736b538b40461 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -252,6 +252,8 @@ struct ap_list_summary {
#define irqs_active_outside_lrs(s) \
((s)->nr_act && irqs_outside_lrs(s))
+int vgic_v5_parse_attr(struct kvm_device *dev, struct kvm_device_attr *attr,
+ struct vgic_reg_attr *reg_attr);
int vgic_v3_parse_attr(struct kvm_device *dev, struct kvm_device_attr *attr,
struct vgic_reg_attr *reg_attr);
int vgic_v2_parse_attr(struct kvm_device *dev, struct kvm_device_attr *attr,
@@ -389,6 +391,10 @@ int vgic_v5_has_cpu_sysregs_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *
const struct sys_reg_desc *vgic_v5_get_sysreg_table(unsigned int *sz);
int vgic_v5_irs_save_ists(struct kvm *kvm, struct kvm_device_attr *attr);
int vgic_v5_irs_restore_ists(struct kvm *kvm, struct kvm_device_attr *attr);
+int vgic_v5_irs_attr_regs_access(struct kvm_device *dev,
+ struct kvm_device_attr *attr,
+ u64 *reg, bool is_write);
+int vgic_v5_has_attr_regs(struct kvm_device *dev, struct kvm_device_attr *attr);
#define for_each_visible_v5_ppi(__i, __k) \
for_each_set_bit(__i, (__k)->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS)
--
2.34.1
* [PATCH 38/43] KVM: arm64: gic-v5: Add VGIC_GRP_IRS_REGS/VGIC_GRP_IST to UAPI
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (36 preceding siblings ...)
2026-04-27 16:18 ` [PATCH 37/43] KVM: arm64: gic-v5: Handle userspace accesses to IRS MMIO region Sascha Bischoff
@ 2026-04-27 16:19 ` Sascha Bischoff
2026-04-27 16:19 ` [PATCH 39/43] KVM: arm64: gic-v5: Plumb in has/set/get_attr for sysregs & IRS MMIO regs Sascha Bischoff
` (4 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:19 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Add two new VGIC attribute groups for setting/getting the IRS MMIO
registers and the Interrupt State Tables (ISTs) of a guest. Both are
GICv5-specific, and are required because GICv5's IRS MMIO registers
and ISTs don't fit into any of the existing UAPI interfaces.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/include/uapi/asm/kvm.h | 2 ++
tools/arch/arm64/include/uapi/asm/kvm.h | 2 ++
2 files changed, 4 insertions(+)
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index d1b2ca317f586..e2847134df176 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -427,6 +427,8 @@ enum {
(0x3fffffULL << KVM_DEV_ARM_VGIC_LINE_LEVEL_INFO_SHIFT)
#define KVM_DEV_ARM_VGIC_LINE_LEVEL_INTID_MASK 0x3ff
#define VGIC_LEVEL_INFO_LINE_LEVEL 0
+#define KVM_DEV_ARM_VGIC_GRP_IRS_REGS 10
+#define KVM_DEV_ARM_VGIC_GRP_IST 11
#define KVM_DEV_ARM_VGIC_CTRL_INIT 0
#define KVM_DEV_ARM_ITS_SAVE_TABLES 1
diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h
index 1c13bfa2d38aa..e68d1b097ba9c 100644
--- a/tools/arch/arm64/include/uapi/asm/kvm.h
+++ b/tools/arch/arm64/include/uapi/asm/kvm.h
@@ -422,6 +422,8 @@ enum {
(0x3fffffULL << KVM_DEV_ARM_VGIC_LINE_LEVEL_INFO_SHIFT)
#define KVM_DEV_ARM_VGIC_LINE_LEVEL_INTID_MASK 0x3ff
#define VGIC_LEVEL_INFO_LINE_LEVEL 0
+#define KVM_DEV_ARM_VGIC_GRP_IRS_REGS 10
+#define KVM_DEV_ARM_VGIC_GRP_IST 11
#define KVM_DEV_ARM_VGIC_CTRL_INIT 0
#define KVM_DEV_ARM_ITS_SAVE_TABLES 1
--
2.34.1
* [PATCH 39/43] KVM: arm64: gic-v5: Plumb in has/set/get_attr for sysregs & IRS MMIO regs
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (37 preceding siblings ...)
2026-04-27 16:19 ` [PATCH 38/43] KVM: arm64: gic-v5: Add VGIC_GRP_IRS_REGS/VGIC_GRP_IST to UAPI Sascha Bischoff
@ 2026-04-27 16:19 ` Sascha Bischoff
2026-04-27 16:19 ` [PATCH 40/43] Documentation: KVM: Document KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS for VGICv5 Sascha Bischoff
` (3 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:19 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Support for saving/restoring the GICv5 IRS's MMIO registers has been
added, as has support for saving/restoring the ISTs. Neither is
reachable from userspace yet, however. Plumb both into the userspace
API handlers so that the has/set/get_attr ioctls can exercise them.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
arch/arm64/kvm/vgic/vgic-kvm-device.c | 120 ++++++++++++++++++++++++--
1 file changed, 115 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-kvm-device.c b/arch/arm64/kvm/vgic/vgic-kvm-device.c
index 944923121e080..2da93fb31769b 100644
--- a/arch/arm64/kvm/vgic/vgic-kvm-device.c
+++ b/arch/arm64/kvm/vgic/vgic-kvm-device.c
@@ -533,7 +533,7 @@ int vgic_v3_parse_attr(struct kvm_device *dev, struct kvm_device_attr *attr,
* Allow access to certain ID-like registers prior to VGIC initialization,
* thereby allowing the VMM to provision the features / sizing of the VGIC.
*/
-static bool reg_allowed_pre_init(struct kvm_device_attr *attr)
+static bool v3_reg_allowed_pre_init(struct kvm_device_attr *attr)
{
if (attr->group != KVM_DEV_ARM_VGIC_GRP_DIST_REGS)
return false;
@@ -596,7 +596,7 @@ static int vgic_v3_attr_regs_access(struct kvm_device *dev,
mutex_lock(&dev->kvm->arch.config_lock);
- if (!(vgic_initialized(dev->kvm) || reg_allowed_pre_init(attr))) {
+ if (!(vgic_initialized(dev->kvm) || v3_reg_allowed_pre_init(attr))) {
ret = -EBUSY;
goto out;
}
@@ -788,14 +788,115 @@ int vgic_v5_parse_attr(struct kvm_device *dev,
return 0;
}
+/*
+ * Some registers can potentially be read before the core GIC & IRS has been
+ * initialised. Right now, everything is required to be post-init.
+ */
+static bool v5_reg_allowed_pre_init(struct kvm_device_attr *attr)
+{
+ return false;
+}
+
+/*
+ * vgic_v5_attr_regs_access - allows user space to access VGIC v5 state
+ *
+ * @dev: kvm device handle
+ * @attr: kvm device attribute
+ * @is_write: true if userspace is writing a register
+ */
+static int vgic_v5_attr_regs_access(struct kvm_device *dev,
+ struct kvm_device_attr *attr,
+ bool is_write)
+{
+ u64 __user *uaddr = (u64 __user *)(unsigned long)attr->addr;
+ struct vgic_reg_attr reg_attr;
+ gpa_t addr;
+ struct kvm_vcpu *vcpu;
+ bool uaccess;
+ u64 val;
+ int ret;
+
+ ret = vgic_v5_parse_attr(dev, attr, ®_attr);
+ if (ret)
+ return ret;
+
+ vcpu = reg_attr.vcpu;
+ addr = reg_attr.addr;
+
+ switch (attr->group) {
+ case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
+ /* Sysregs uaccess is performed by the sysreg handling code */
+ uaccess = false;
+ break;
+ case KVM_DEV_ARM_VGIC_GRP_IRS_REGS:
+ fallthrough;
+ default:
+ uaccess = true;
+ }
+
+ if (uaccess && is_write) {
+ if (get_user(val, uaddr))
+ return -EFAULT;
+ }
+
+ mutex_lock(&dev->kvm->lock);
+
+ if (kvm_trylock_all_vcpus(dev->kvm)) {
+ mutex_unlock(&dev->kvm->lock);
+ return -EBUSY;
+ }
+
+ mutex_lock(&dev->kvm->arch.config_lock);
+
+ if (!(vgic_initialized(dev->kvm) || v5_reg_allowed_pre_init(attr))) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ switch (attr->group) {
+ case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
+ ret = vgic_v5_cpu_sysregs_uaccess(vcpu, attr, is_write);
+ break;
+ case KVM_DEV_ARM_VGIC_GRP_IRS_REGS:
+ /*
+ * The IRS registers are a mixture of 32-bit and 64-bit
+ * registers. Internally, we always perform the correctly sized
+ * access, but the UAPI is defined in such a way that we are
+ * always provided a __u64 by userspace. When userspace writes,
+ * the upper 32-bits are ignored for 32-bit accesses, and on a
+ * read any 32-bit accesses are written back to user memory
+ * using the full 64-bits.
+ */
+ ret = vgic_v5_irs_attr_regs_access(dev, attr, &val, is_write);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+out:
+ mutex_unlock(&dev->kvm->arch.config_lock);
+ kvm_unlock_all_vcpus(dev->kvm);
+ mutex_unlock(&dev->kvm->lock);
+
+ if (!ret && uaccess && !is_write)
+ ret = put_user(val, uaddr);
+
+ return ret;
+}
+
static int vgic_v5_set_attr(struct kvm_device *dev,
struct kvm_device_attr *attr)
{
switch (attr->group) {
case KVM_DEV_ARM_VGIC_GRP_ADDR:
break;
+ case KVM_DEV_ARM_VGIC_GRP_IST:
+ return vgic_v5_irs_restore_ists(dev->kvm, attr);
+ case KVM_DEV_ARM_VGIC_GRP_IRS_REGS:
+ fallthrough;
case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
- return -ENXIO;
+ return vgic_v5_attr_regs_access(dev, attr, true);
case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
break;
case KVM_DEV_ARM_VGIC_GRP_CTRL:
@@ -820,8 +921,12 @@ static int vgic_v5_get_attr(struct kvm_device *dev,
switch (attr->group) {
case KVM_DEV_ARM_VGIC_GRP_ADDR:
break;
+ case KVM_DEV_ARM_VGIC_GRP_IST:
+ return vgic_v5_irs_save_ists(dev->kvm, attr);
+ case KVM_DEV_ARM_VGIC_GRP_IRS_REGS:
+ fallthrough;
case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
- return -ENXIO;
+ return vgic_v5_attr_regs_access(dev, attr, false);
case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
break;
case KVM_DEV_ARM_VGIC_GRP_CTRL:
@@ -851,8 +956,10 @@ static int vgic_v5_has_attr(struct kvm_device *dev,
return 0;
}
return -ENXIO;
+ case KVM_DEV_ARM_VGIC_GRP_IRS_REGS:
+ fallthrough;
case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
- return -ENXIO;
+ return vgic_v5_has_attr_regs(dev, attr);
case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
return 0;
case KVM_DEV_ARM_VGIC_GRP_CTRL:
@@ -865,6 +972,9 @@ static int vgic_v5_has_attr(struct kvm_device *dev,
default:
return -ENXIO;
}
+ break;
+ case KVM_DEV_ARM_VGIC_GRP_IST:
+ return 0;
default:
return -ENXIO;
}
--
2.34.1
* [PATCH 40/43] Documentation: KVM: Document KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS for VGICv5
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (38 preceding siblings ...)
2026-04-27 16:19 ` [PATCH 39/43] KVM: arm64: gic-v5: Plumb in has/set/get_attr for sysregs & IRS MMIO regs Sascha Bischoff
@ 2026-04-27 16:19 ` Sascha Bischoff
2026-04-27 16:20 ` [PATCH 41/43] Documentation: KVM: Add KVM_DEV_ARM_VGIC_GRP_IRS_REGS to VGICv5 docs Sascha Bischoff
` (2 subsequent siblings)
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:19 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
The virtual GICv5 adopts the same mechanism as GICv3 for userspace
reads and writes of the system registers, albeit operating on a
different set of registers.
Document KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS for GICv5 in the VGICv5
documentation, explicitly calling out the registers it operates
on. The main body of the documentation is copied directly from the
VGICv3 documentation, as the operation is identical.
One key thing to note: for the two sets of GICv5 registers pertaining
to Active and Pending state, the behaviour of this interface differs
from that of the actual registers. Both registers have C and S
variants (to set and clear bits) in hardware. For this interface,
however, only the S variant is implemented, and writes to it are
treated as raw writes of the underlying state. This simplifies
reading and writing the state.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
.../virt/kvm/devices/arm-vgic-v5.rst | 66 +++++++++++++++++++
1 file changed, 66 insertions(+)
diff --git a/Documentation/virt/kvm/devices/arm-vgic-v5.rst b/Documentation/virt/kvm/devices/arm-vgic-v5.rst
index 2417eba01830c..46ede907f2b32 100644
--- a/Documentation/virt/kvm/devices/arm-vgic-v5.rst
+++ b/Documentation/virt/kvm/devices/arm-vgic-v5.rst
@@ -64,3 +64,69 @@ Groups:
-EFAULT Invalid guest ram access
-EBUSY One or more VCPUS are running
======= ========================================================
+
+ KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS
+ Attributes:
+
+ The attr field of kvm_device_attr encodes two values::
+
+ bits: | 63 .... 32 | 31 .... 16 | 15 .... 0 |
+ values: | mpidr | RES | instr |
+
+ The mpidr field encodes the CPU ID based on the affinity information in the
+ architecture defined MPIDR, and the field is encoded as follows::
+
+ | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
+ | Aff3 | Aff2 | Aff1 | Aff0 |
+
+ The instr field encodes the system register to access based on the fields
+ defined in the A64 instruction set encoding for system register access
+ (RES means the bits are reserved for future use and should be zero)::
+
+ | 15 ... 14 | 13 ... 11 | 10 ... 7 | 6 ... 3 | 2 ... 0 |
+ | Op0 | Op1 | CRn | CRm | Op2 |
+
+ All system regs accessed through this API are (rw, 64-bit) and
+ kvm_device_attr.addr points to a __u64 value.
+
+ KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS accesses the CPU interface registers for the
+ CPU specified by the mpidr field.
+
+ The available registers are:
+
+ ======================= ===================================================
+ ICC_ICSR_EL1
+ ICC_PPI_ENABLER0_EL1
+ ICC_PPI_ENABLER1_EL1
+ ICC_PPI_SACTIVER0_EL1 ICC_PPI_CACTIVER0_EL1 is not supported. Writes to
+ ICC_PPI_SACTIVER0_EL1 are treated as RAW writes of
+ the underlying state.
+ ICC_PPI_SACTIVER1_EL1 ICC_PPI_CACTIVER1_EL1 is not supported. Writes to
+ ICC_PPI_SACTIVER1_EL1 are treated as RAW writes of
+ the underlying state.
+ ICC_PPI_SPENDR0_EL1 ICC_PPI_CPENDR0_EL1 is not supported. Writes to
+ ICC_PPI_SPENDR0_EL1 are treated as RAW writes of
+ the underlying state.
+ ICC_PPI_SPENDR1_EL1 ICC_PPI_CPENDR1_EL1 is not supported. Writes to
+ ICC_PPI_SPENDR1_EL1 are treated as RAW writes of
+ the underlying state.
+ ICC_PPI_PRIORITYR0_EL1
+ ICC_PPI_PRIORITYR1_EL1
+ ICC_PPI_PRIORITYR2_EL1
+ ICC_PPI_PRIORITYR3_EL1
+ ICC_PPI_PRIORITYR4_EL1
+ ICC_PPI_PRIORITYR5_EL1
+ ICC_PPI_PRIORITYR6_EL1
+ ICC_PPI_PRIORITYR7_EL1
+ ICC_PPI_PRIORITYR8_EL1
+ ICC_PPI_PRIORITYR9_EL1
+ ICC_PPI_PRIORITYR10_EL1
+ ICC_PPI_PRIORITYR11_EL1
+ ICC_PPI_PRIORITYR12_EL1
+ ICC_PPI_PRIORITYR13_EL1
+ ICC_PPI_PRIORITYR14_EL1
+ ICC_PPI_PRIORITYR15_EL1
+ ICC_APR_EL1
+ ICC_CR0_EL1
+ ICC_PCR_EL1
+ ======================= ===================================================
--
2.34.1
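The attr encoding documented in the patch above lends itself to a small packing helper. The sketch below shows how a VMM might build the attr value for KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS; the helper name and the example Op0/Op1/CRn/CRm/Op2 values are illustrative only and are not taken from the series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack the mpidr and instr fields of a
 * KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS attr as documented above.
 * Aff3..Aff0 land in bits 63..32; the A64 sysreg encoding
 * (Op0, Op1, CRn, CRm, Op2) lands in bits 15..0; bits 31..16 are RES0.
 */
static uint64_t vgic_v5_sysreg_attr(uint8_t aff3, uint8_t aff2,
				    uint8_t aff1, uint8_t aff0,
				    uint8_t op0, uint8_t op1,
				    uint8_t crn, uint8_t crm, uint8_t op2)
{
	uint64_t mpidr = ((uint64_t)aff3 << 24) | ((uint64_t)aff2 << 16) |
			 ((uint64_t)aff1 << 8)  | (uint64_t)aff0;
	uint16_t instr = ((uint16_t)(op0 & 0x3) << 14) |
			 ((uint16_t)(op1 & 0x7) << 11) |
			 ((uint16_t)(crn & 0xf) << 7)  |
			 ((uint16_t)(crm & 0xf) << 3)  |
			 (uint16_t)(op2 & 0x7);

	return (mpidr << 32) | instr;
}
```

The resulting value goes into kvm_device_attr.attr, with kvm_device_attr.addr pointing at the __u64 to transfer.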
* [PATCH 41/43] Documentation: KVM: Add KVM_DEV_ARM_VGIC_GRP_IRS_REGS to VGICv5 docs
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (39 preceding siblings ...)
2026-04-27 16:19 ` [PATCH 40/43] Documentation: KVM: Document KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS for VGICv5 Sascha Bischoff
@ 2026-04-27 16:20 ` Sascha Bischoff
2026-04-27 16:20 ` [PATCH 42/43] Documentation: KVM: Add docs for KVM_DEV_ARM_VGIC_GRP_IST Sascha Bischoff
2026-04-27 16:20 ` [PATCH 43/43] Documentation: KVM: Add the VGICv5 IRS save/restore sequences Sascha Bischoff
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:20 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Add documentation for the KVM_DEV_ARM_VGIC_GRP_IRS_REGS ioctl used to
read/write the virtual IRS's MMIO register state. This is the same
interface as is already present for handling a VGICv3's ITS, but hidden
behind a different, more appropriately named ioctl (one that refers to
the IRS rather than the ITS).
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
.../virt/kvm/devices/arm-vgic-v5.rst | 31 +++++++++++++++++++
1 file changed, 31 insertions(+)
diff --git a/Documentation/virt/kvm/devices/arm-vgic-v5.rst b/Documentation/virt/kvm/devices/arm-vgic-v5.rst
index 46ede907f2b32..9fc6afe16154a 100644
--- a/Documentation/virt/kvm/devices/arm-vgic-v5.rst
+++ b/Documentation/virt/kvm/devices/arm-vgic-v5.rst
@@ -130,3 +130,34 @@ Groups:
ICC_CR0_EL1
ICC_PCR_EL1
======================= ===================================================
+
+ KVM_DEV_ARM_VGIC_GRP_IRS_REGS
+ Attributes:
+ The attr field of kvm_device_attr encodes the offset of the IRS register,
+ relative to the IRS CONFIG_FRAME base address. This is the address that
+ was provided via KVM_VGIC_V5_ADDR_TYPE_IRS when the VGICv5 device was
+ created.
+
+ kvm_device_attr.addr points to a __u64 value regardless of the width
+ of the addressed register (32 or 64 bits). 64-bit registers can only
+ be accessed at their full width.
+
+ Writes to read-only registers are ignored by the kernel except for:
+
+ - IRS_IDR0 - IRS_IDR2 and IRS_IDR5 - IRS_IDR7: These are checked to
+ ensure that the values written describe a valid configuration.
+ - IRS_IDR3 and IRS_IDR4: These are RAZ/WI as nested virtualization is not
+ supported.
+
+ For other registers, getting or setting a register has the same
+ effect as reading/writing the register on real hardware.
+
+ Errors:
+
+ ======= =================================================================
+ -ENXIO Offset does not correspond to any supported register
+ -EFAULT Invalid user pointer for attr->addr
+ -EINVAL Offset is not 32-bit aligned for 32-bit MMIO registers, or not
+ 64-bit aligned for 64-bit registers
+ -EBUSY One or more VCPUs are running
+ ======= =================================================================
--
2.34.1
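The -EINVAL alignment case in the error table above amounts to requiring that a register offset be naturally aligned for the width of the register it names. A minimal, illustrative check (the helper name is ours, and the width parameter is in bytes: 4 for 32-bit registers, 8 for 64-bit ones):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative natural-alignment check for an IRS register offset:
 * a width-byte register must sit at a width-aligned offset.
 * width must be a power of two (4 or 8 here).
 */
static bool irs_reg_offset_ok(uint64_t offset, unsigned int width)
{
	return (offset & (width - 1)) == 0;
}
```

A VMM might run this check before issuing KVM_GET/SET_DEVICE_ATTR on KVM_DEV_ARM_VGIC_GRP_IRS_REGS to fail fast with a clearer error than -EINVAL.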
* [PATCH 42/43] Documentation: KVM: Add docs for KVM_DEV_ARM_VGIC_GRP_IST
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (40 preceding siblings ...)
2026-04-27 16:20 ` [PATCH 41/43] Documentation: KVM: Add KVM_DEV_ARM_VGIC_GRP_IRS_REGS to VGICv5 docs Sascha Bischoff
@ 2026-04-27 16:20 ` Sascha Bischoff
2026-04-27 16:20 ` [PATCH 43/43] Documentation: KVM: Add the VGICv5 IRS save/restore sequences Sascha Bischoff
42 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:20 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Document the IST save/restore userspace interface for the VGICv5
device, KVM_DEV_ARM_VGIC_GRP_IST.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
.../virt/kvm/devices/arm-vgic-v5.rst | 40 +++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/Documentation/virt/kvm/devices/arm-vgic-v5.rst b/Documentation/virt/kvm/devices/arm-vgic-v5.rst
index 9fc6afe16154a..38eef7cc63e3e 100644
--- a/Documentation/virt/kvm/devices/arm-vgic-v5.rst
+++ b/Documentation/virt/kvm/devices/arm-vgic-v5.rst
@@ -161,3 +161,43 @@ Groups:
aligned for 64-bit registers
-EBUSY one or more VCPUS are running
======= =================================================================
+
+ KVM_DEV_ARM_VGIC_GRP_IST
+ Attributes:
+ This interface is used either to save the state of the IRS's Interrupt
+ State Tables (ISTs) or to restore it, via a get or a set operation
+ respectively. As part of saving the IST state, the system is quiesced
+ to ensure that there are no in-flight interrupts at the time of the
+ save.
+
+ The LPI IST is written to or read from guest-allocated memory: provided
+ the guest has provisioned a virtual IST, guest memory contains sufficient
+ storage to serialise/deserialise the LPI IST into/from. However, no such
+ storage exists in guest memory for the SPI IST, and hence userspace must
+ provide the storage for it.
+
+ The addr field of kvm_device_attr provides the base address of an
+ appropriately-sized userspace buffer (32 bits per SPI) into which the
+ contents of the SPI IST are written on a get, and from which they are
+ read on a set. The number of SPIs that the guest can use can be
+ determined by first inspecting the guest's IRS_IDR5, IRS_IDR6, and
+ IRS_IDR7. If the guest doesn't use SPIs at all, this address can be
+ omitted, but it is otherwise mandatory for both set and get operations
+ via this interface.
+
+ Errors:
+
+ =========== ============================================================
+ -EBUSY One or more VCPUs are running, or the save operation failed
+ to quiesce the VM
+ -EINVAL The VM is not valid, or a userspace SPI IST buffer was not
+ supplied when one is required
+ -ENOENT A userspace SPI IST buffer was supplied, but there is no SPI
+ IST to serialise/deserialise
+ -EFAULT Invalid user pointer for attr->addr, or the guest memory
+ backing the LPI IST could not be accessed
+ -ENXIO Required per-VM VGICv5/IST backing state is missing or
+ inconsistent
+ -ENOMEM Restoring IST state failed while tracking pending interrupts
+ -ETIMEDOUT An IRS save/VM operation timed out
+ =========== ============================================================
--
2.34.1
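Since the interface above stores 32 bits per SPI, the size of the userspace SPI IST buffer follows directly from the SPI count. A trivial sketch (the helper name is ours; the SPI count is assumed to have already been derived from the guest's IRS_IDR5-IRS_IDR7):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sizing helper for the SPI IST save/restore buffer:
 * the KVM_DEV_ARM_VGIC_GRP_IST interface uses one 32-bit entry per SPI.
 */
static size_t spi_ist_buf_size(uint32_t nr_spis)
{
	return (size_t)nr_spis * sizeof(uint32_t);
}
```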
* [PATCH 43/43] Documentation: KVM: Add the VGICv5 IRS save/restore sequences
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
` (41 preceding siblings ...)
2026-04-27 16:20 ` [PATCH 42/43] Documentation: KVM: Add docs for KVM_DEV_ARM_VGIC_GRP_IST Sascha Bischoff
@ 2026-04-27 16:20 ` Sascha Bischoff
2026-04-30 8:57 ` Peter Maydell
42 siblings, 1 reply; 66+ messages in thread
From: Sascha Bischoff @ 2026-04-27 16:20 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
When saving/restoring the state of the GICv5 IRS, it is important that
this happens in the correct order. Failing to do so will almost
certainly result in a restored guest that is incapable of handling
interrupts correctly.
On a save, the ISTs must be saved prior to saving the guest's memory
as the guest's LPI IST is written to guest memory. Conversely, on
restore the guest's memory must be restored prior to restoring the
ISTs.
It is important to restore the IRS MMIO registers by first restoring
the IRS_IDx registers as they define the capabilities of the IRS, and
are used as part of creating and managing ISTs and SPIs.
In order to restore the ISTs themselves, the IRS_IST_CFGR must be
restored prior to the IRS_IST_BASER. This is because KVM extracts
fields from the CFGR to determine the size and structure of the IRS
created by the guest. The IST itself is created as part of the write
to the IRS_IST_BASER. At this stage the remaining MMIO registers can
be restored.
Once the LPI IST has been created (by the aforementioned write to the
IRS_IST_BASER), the IST state can be restored using
KVM_DEV_ARM_VGIC_GRP_IST. The SPI IST gets extracted from a userspace
provided buffer, and is transferred to the host-allocated SPI IST. The
LPI IST is extracted from guest memory, and is written to the
host-allocated LPI IST.
As a general rule, the IRS_*_STATUSR registers can be ignored on
restore. They are not userspace writable.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
---
.../virt/kvm/devices/arm-vgic-v5.rst | 63 +++++++++++++++++++
1 file changed, 63 insertions(+)
diff --git a/Documentation/virt/kvm/devices/arm-vgic-v5.rst b/Documentation/virt/kvm/devices/arm-vgic-v5.rst
index 38eef7cc63e3e..1c55f5040757d 100644
--- a/Documentation/virt/kvm/devices/arm-vgic-v5.rst
+++ b/Documentation/virt/kvm/devices/arm-vgic-v5.rst
@@ -201,3 +201,66 @@ Groups:
-ENOMEM Restoring IST state failed while tracking pending interrupts
-ETIMEDOUT An IRS save/VM operation timed out
=========== ============================================================
+
+IRS Save Sequence:
+------------------
+
+The following ordering should be followed when saving the virtual GICv5 and
+IRS:
+
+a) Save the ISTs by issuing KVM_GET_DEVICE_ATTR on KVM_DEV_ARM_VGIC_GRP_IST.
+ This MUST happen before the guest's memory is serialised as the LPI IST is
+ stored directly to guest memory.
+
+b) Save the IRS MMIO register state in the following order by issuing
+ KVM_GET_DEVICE_ATTR on KVM_DEV_ARM_VGIC_GRP_IRS_REGS:
+
+ 1. Save IRS_IDR0-2 and IRS_IDR5-7 registers.
+ 2. Save IRS_IST_CFGR.
+ 3. Save IRS_IST_BASER.
+ 4. Save the remaining global IRS MMIO registers.
+ 5. For each PE:
+ - write IRS_PE_SELR
+ - save IRS_PE_CR0
+ 6. For each SPI:
+ - write IRS_SPI_SELR
+ - save IRS_SPI_CFGR
+
+IRS Restore Sequence:
+---------------------
+
+The following ordering must be followed when restoring the virtual GICv5 and
+IRS:
+
+a) restore all guest memory and create vcpus
+b) provide the IRS base address by issuing KVM_SET_DEVICE_ATTR on
+ KVM_DEV_ARM_VGIC_GRP_ADDR
+c) initialise the GIC - this sets up the default state and creates the SPI
+ IST - by issuing KVM_SET_DEVICE_ATTR on KVM_DEV_ARM_VGIC_GRP_CTRL with
+ KVM_DEV_ARM_VGIC_CTRL_INIT
+d) restore the IRS MMIO register state in the following order by issuing
+ KVM_SET_DEVICE_ATTR on KVM_DEV_ARM_VGIC_GRP_IRS_REGS:
+
+ 1. Restore IRS_IDR0-2 and IRS_IDR5-7 registers.
+ 2. Restore IRS_IST_CFGR.
+ 3. Restore IRS_IST_BASER - this triggers KVM to create the LPI IST.
+
+e) restore the ISTs by issuing KVM_SET_DEVICE_ATTR on
+ KVM_DEV_ARM_VGIC_GRP_IST.
+f) restore the remaining IRS MMIO register state in the following order by
+ issuing KVM_SET_DEVICE_ATTR on KVM_DEV_ARM_VGIC_GRP_IRS_REGS:
+
+ 1. Restore the remaining global IRS MMIO registers.
+ 2. For each PE:
+ - write IRS_PE_SELR
+ - restore IRS_PE_CR0
+ 3. For each SPI:
+ - write IRS_SPI_SELR
+ - restore IRS_SPI_CFGR
+
+The various *_STATUSR registers are observational state in the current KVM
+implementation. Userspace may save them for validation or debugging purposes,
+but they are not required as restore input and do not need to be replayed
+during restore.
+
+Then vcpus can be started.
--
2.34.1
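The restore ordering mandated above is strict enough to be worth encoding as data in a VMM. The sketch below models the top-level restore steps as an ordered table so that a VMM's own tests can assert the sequence; the enum and array names are ours, not part of the KVM UAPI.

```c
#include <assert.h>

/*
 * Illustrative model of the top-level VGICv5 IRS restore ordering
 * described in the documentation above. Step names are ours.
 */
enum v5_restore_step {
	V5_RESTORE_GUEST_MEMORY,	/* restore guest RAM, create vcpus */
	V5_RESTORE_IRS_ADDR,		/* KVM_DEV_ARM_VGIC_GRP_ADDR */
	V5_RESTORE_GIC_INIT,		/* KVM_DEV_ARM_VGIC_CTRL_INIT */
	V5_RESTORE_IDR_CFGR_BASER,	/* IRS_IDRx, then IRS_IST_CFGR, then IRS_IST_BASER */
	V5_RESTORE_ISTS,		/* KVM_DEV_ARM_VGIC_GRP_IST */
	V5_RESTORE_REMAINING_MMIO,	/* remaining globals, then per-PE/per-SPI */
	V5_RESTORE_NR_STEPS,
};

/* The order is fixed: each step depends on the state set up by the last. */
static const enum v5_restore_step v5_restore_order[] = {
	V5_RESTORE_GUEST_MEMORY,
	V5_RESTORE_IRS_ADDR,
	V5_RESTORE_GIC_INIT,
	V5_RESTORE_IDR_CFGR_BASER,
	V5_RESTORE_ISTS,
	V5_RESTORE_REMAINING_MMIO,
};
```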
* Re: [PATCH 03/43] KVM: arm64: gic-v5: Add resident/non-resident hyp calls
2026-04-27 16:07 ` [PATCH 03/43] KVM: arm64: gic-v5: Add resident/non-resident hyp calls Sascha Bischoff
@ 2026-04-28 14:28 ` Marc Zyngier
2026-05-01 16:40 ` Sascha Bischoff
0 siblings, 1 reply; 66+ messages in thread
From: Marc Zyngier @ 2026-04-28 14:28 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 17:07:03 +0100,
Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> So far the KVM GICv5 support has been limited to PPIs. These only go
> as far out as the CPU interface, and have no interaction with the
> host's IRS. Therefore, PPIs can be directly used for guests without
> host IRS involvement. However, in order to support both SPIs and LPIs
> IRS involvement is required.
Please keep this in the cover letter. It isn't required to understand
this patch.
>
> GICv5 introduces the concept of VPE residency - a VPE can be either
> resident or non-resident. When the VPE is resident, the IRS is allowed
> to select interrupts that target that VPE (or the VM) as the HPPI
> (Highest Priority Pending Interrupt). As the IRS handles both SPIs and
> LPIs, these will only be picked as the IRS's HPPI when a VPE is
> resident.
>
> A GICv5 VPE is made resident by writing to the ICH_CONTEXTR_EL2 with a
> valid VM and VPE ID, and marking it valid in the process. This informs
> the IRS that a specific VPE is running, and that it can begin HPPI
> selection for that VPE. Making a VPE non-resident (by making the
> ICH_CONTEXTR_EL2 invalid) informs the IRS that the VPE is no longer
> running, and it stops HPPI selection for it.
>
> This change introduces two new hyp calls - one to make a VPE resident
> and its counterpart to make a VPE non-resident. As part of making a
> VPE resident, the ICH_CONTEXTR_EL2.F bit is checked in order to catch
> faults, at which point the kernel will WARN. If everything is
> configured correctly, this should not happen.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/include/asm/kvm_asm.h | 2 ++
> arch/arm64/include/asm/kvm_hyp.h | 2 ++
> arch/arm64/kvm/hyp/nvhe/hyp-main.c | 15 +++++++++++++++
> arch/arm64/kvm/hyp/vgic-v5-sr.c | 25 +++++++++++++++++++++++++
> include/kvm/arm_vgic.h | 3 +++
> 5 files changed, 47 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index fa033be6141ad..8c69f1f4de534 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -79,6 +79,8 @@ enum __kvm_host_smccc_func {
> __KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff,
> __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs,
> __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_vmcr_aprs,
> + __KVM_HOST_SMCCC_FUNC___vgic_v5_make_resident,
> + __KVM_HOST_SMCCC_FUNC___vgic_v5_make_non_resident,
> __KVM_HOST_SMCCC_FUNC___vgic_v5_save_apr,
> __KVM_HOST_SMCCC_FUNC___vgic_v5_restore_vmcr_apr,
>
> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
> index 8d06b62e7188c..5f9184276b04e 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -88,6 +88,8 @@ void __vgic_v3_restore_vmcr_aprs(struct vgic_v3_cpu_if *cpu_if);
> int __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu);
>
> /* GICv5 */
> +void __vgic_v5_make_resident(struct vgic_v5_cpu_if *cpu_if);
> +void __vgic_v5_make_non_resident(struct vgic_v5_cpu_if *cpu_if);
> void __vgic_v5_save_apr(struct vgic_v5_cpu_if *cpu_if);
> void __vgic_v5_restore_vmcr_apr(struct vgic_v5_cpu_if *cpu_if);
> /* No hypercalls for the following */
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> index 9e44c05cf780e..804a9ffdc8594 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> @@ -672,6 +672,19 @@ static void handle___tracing_write_event(struct kvm_cpu_context *host_ctxt)
> trace_selftest(id);
> }
>
> +static void handle___vgic_v5_make_resident(struct kvm_cpu_context *host_ctxt)
> +{
> + DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt, 1);
> +
> + __vgic_v5_make_resident(kern_hyp_va(cpu_if));
> +}
> +static void handle___vgic_v5_make_non_resident(struct kvm_cpu_context *host_ctxt)
> +{
> + DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt, 1);
> +
> + __vgic_v5_make_non_resident(kern_hyp_va(cpu_if));
> +}
> +
> static void handle___vgic_v5_save_apr(struct kvm_cpu_context *host_ctxt)
> {
> DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt, 1);
> @@ -711,6 +724,8 @@ static const hcall_t host_hcall[] = {
> HANDLE_FUNC(__kvm_timer_set_cntvoff),
> HANDLE_FUNC(__vgic_v3_save_aprs),
> HANDLE_FUNC(__vgic_v3_restore_vmcr_aprs),
> + HANDLE_FUNC(__vgic_v5_make_resident),
> + HANDLE_FUNC(__vgic_v5_make_non_resident),
> HANDLE_FUNC(__vgic_v5_save_apr),
> HANDLE_FUNC(__vgic_v5_restore_vmcr_apr),
>
> diff --git a/arch/arm64/kvm/hyp/vgic-v5-sr.c b/arch/arm64/kvm/hyp/vgic-v5-sr.c
> index 6d69dfe89a96c..04c5846b9abac 100644
> --- a/arch/arm64/kvm/hyp/vgic-v5-sr.c
> +++ b/arch/arm64/kvm/hyp/vgic-v5-sr.c
> @@ -7,6 +7,31 @@
>
> #include <asm/kvm_hyp.h>
>
> +void __vgic_v5_make_resident(struct vgic_v5_cpu_if *cpu_if)
> +{
> + write_sysreg_s(cpu_if->vgic_contextr, SYS_ICH_CONTEXTR_EL2);
> + isb();
> +
> + /* Catch any faults */
> + cpu_if->vgic_contextr = read_sysreg_s(SYS_ICH_CONTEXTR_EL2);
> + if (WARN_ON(FIELD_GET(ICH_CONTEXTR_EL2_F, cpu_if->vgic_contextr)))
> + return;
I don't think this is particularly useful here. With non-VHE, this
results in a panic. If there is a problem, you need to fail the vcpu
entry early, and let the caller find out.
> +
> + cpu_if->gicv5_vpe.resident = true;
> +}
> +
> +void __vgic_v5_make_non_resident(struct vgic_v5_cpu_if *cpu_if)
> +{
> + /*
> + * Make as non-resident before actually making non-resident. Avoids race
> + * with doorbell arriving.
> + */
> + cpu_if->gicv5_vpe.resident = false;
Stores and sysreg accesses are not ordered without a DSB ST in between.
> +
> + write_sysreg_s(cpu_if->vgic_contextr, SYS_ICH_CONTEXTR_EL2);
Is this really making anything non-resident? Who clears the V bit?
> + isb();
> +}
> +
> void __vgic_v5_save_apr(struct vgic_v5_cpu_if *cpu_if)
> {
> cpu_if->vgic_apr = read_sysreg_s(SYS_ICH_APR_EL2);
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index fe49fb56dc3c9..d14cf4771d606 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -495,6 +495,9 @@ struct vgic_v5_cpu_if {
> */
> u64 vgic_icsr;
>
> + /* The contextr used to make VPEs resident and non-resident */
> + u64 vgic_contextr;
> +
> struct gicv5_vpe gicv5_vpe;
> };
>
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH 07/43] KVM: arm64: gic-v5: Create & manage VM and VPE tables
2026-04-27 16:08 ` [PATCH 07/43] KVM: arm64: gic-v5: Create & manage VM and VPE tables Sascha Bischoff
@ 2026-04-28 14:54 ` Vladimir Murzin
2026-05-01 16:42 ` Sascha Bischoff
2026-04-28 15:55 ` Joey Gouly
2026-04-29 10:25 ` Marc Zyngier
2 siblings, 1 reply; 66+ messages in thread
From: Vladimir Murzin @ 2026-04-28 14:54 UTC (permalink / raw)
To: Sascha Bischoff, linux-arm-kernel@lists.infradead.org,
kvmarm@lists.linux.dev, kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Hi Sascha,
On 4/27/26 17:08, Sascha Bischoff wrote:
> +
> + if (virt_to_phys(l2_table) & ~GICV5_VMTEL1E_L2_ADDR) {
> + kfree(l2_table);
> + return -EINVAL;
> + }
> +
> + vmt_info->l2.l2ptrs[l1_index] = l2_table;
> +
> + /* Alignment issue! */
> + if (virt_to_phys(l2_table) & ~GICV5_VMTEL1E_L2_ADDR) {
> + kfree(l2_table);
> + return -EFAULT;
> + }
> +
> + tmp = virt_to_phys(l2_table) & GICV5_VMTEL1E_L2_ADDR;
> + WRITE_ONCE(vmt_info->l2.vmt_base[l1_index], cpu_to_le64(tmp));
It seems the same l2_table alignment is calculated (and checked) multiple times...
Cheers
Vladimir
* Re: [PATCH 04/43] irqchip/gic-v5: Provide IRS config frame attrs to KVM
2026-04-27 16:07 ` [PATCH 04/43] irqchip/gic-v5: Provide IRS config frame attrs to KVM Sascha Bischoff
@ 2026-04-28 14:56 ` Marc Zyngier
2026-05-01 16:46 ` Sascha Bischoff
0 siblings, 1 reply; 66+ messages in thread
From: Marc Zyngier @ 2026-04-28 14:56 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 17:07:24 +0100,
Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> KVM needs to interact with the host IRS in order to, for example, make
> VMs or VPEs valid. There are two potential approaches here. Either the
> host irqchip driver can provide an interface, or KVM can interact
> directly with the host IRS. The latter of these two is chosen as the
> set of MMIO registers that KVM needs to interact with is orthogonal to
> the set used by the host irqchip driver (with the exception of some of
> the read-only IRS_IDRx registers).
>
> Pass KVM a pointer to an IRS config frame - the config frame belonging
> to ANY IRS is fine as long as one IRS's config frame is used
> consistently - in struct gic_kvm_info. Additionally, include a flag
> telling KVM whether the IRS is coherent or non-coherent in order to
> make sure that KVM can do the correct cache state management, if
> required.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> drivers/irqchip/irq-gic-v5-irs.c | 26 ++++++++++++++++++++++++++
> drivers/irqchip/irq-gic-v5.c | 3 +++
> include/linux/irqchip/arm-gic-v5.h | 2 ++
> include/linux/irqchip/arm-vgic-info.h | 5 +++++
> 4 files changed, 36 insertions(+)
>
> diff --git a/drivers/irqchip/irq-gic-v5-irs.c b/drivers/irqchip/irq-gic-v5-irs.c
> index f3fce0b1e25d9..5dfa043cf9e34 100644
> --- a/drivers/irqchip/irq-gic-v5-irs.c
> +++ b/drivers/irqchip/irq-gic-v5-irs.c
> @@ -50,6 +50,32 @@ static void irs_writeq_relaxed(struct gicv5_irs_chip_data *irs_data,
> writeq_relaxed(val, irs_data->irs_base + reg_offset);
> }
>
> +void __iomem *gicv5_irs_get_config_frame_base(void)
> +{
> + struct gicv5_irs_chip_data *irs_data = per_cpu(per_cpu_irs_data,
> + smp_processor_id());
> +
> + if (!irs_data)
> + return NULL;
> +
> + return irs_data->irs_base;
> +}
> +
> +bool gicv5_irs_is_non_coherent(void)
> +{
> + struct gicv5_irs_chip_data *irs_data = per_cpu(per_cpu_irs_data,
> + smp_processor_id());
> +
> + if (!irs_data) {
> + pr_err("Failed to look up IRS for CPU %d\n",
> + smp_processor_id());
> + return false;
> + }
> +
> + return !!(irs_data->flags & IRS_FLAGS_NON_COHERENT);
> +}
> +
Frankly, we don't need these micro-helpers. Just add *one* that
returns a pointer to the IRS data for CPU0, and do the information
cherry-picking in the caller.
> +
> /*
> * The polling wait (in gicv5_wait_for_op_s_atomic()) on a GIC register
> * provides the memory barriers (through MMIO accessors)
> diff --git a/drivers/irqchip/irq-gic-v5.c b/drivers/irqchip/irq-gic-v5.c
> index 58e457d4c1476..3329019722360 100644
> --- a/drivers/irqchip/irq-gic-v5.c
> +++ b/drivers/irqchip/irq-gic-v5.c
> @@ -1134,6 +1134,9 @@ static void __init gic_of_setup_kvm_info(struct device_node *node)
>
> gic_v5_kvm_info.type = GIC_V5;
>
> + gic_v5_kvm_info.gicv5_irs.base = gicv5_irs_get_config_frame_base();
> + gic_v5_kvm_info.gicv5_irs.non_coherent = gicv5_irs_is_non_coherent();
> +
> /* GIC Virtual CPU interface maintenance interrupt */
> gic_v5_kvm_info.no_maint_irq_mask = false;
> gic_v5_kvm_info.maint_irq = irq_of_parse_and_map(node, 0);
While you're at it, you may want to fix the check on the maintenance
interrupt, which gives up registering with KVM if no MI is found, even
in the absence of FEAT_GCIE_LEGACY.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH 05/43] KVM: arm64: gic-v5: Extract host IRS caps from IRS config frame
2026-04-27 16:07 ` [PATCH 05/43] KVM: arm64: gic-v5: Extract host IRS caps from IRS config frame Sascha Bischoff
@ 2026-04-28 15:20 ` Marc Zyngier
2026-05-01 16:44 ` Sascha Bischoff
0 siblings, 1 reply; 66+ messages in thread
From: Marc Zyngier @ 2026-04-28 15:20 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 17:07:44 +0100,
Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> The host irqchip driver provides KVM with a pointer to an IRS's config
> frame, which allows KVM to directly interact with the host's IRS. The
> MMIO registers in the config frame are used to configure VMs (in
> addition to them being used by the host). The IRS's config frame also
> includes a set of ID registers which describe the capabilities that
> the IRS has.
>
> Stash the pointer to the config frame, and extract the VM capabilities
> (from IRS_IDR3 & IRS_IDR4), as well as the IST
> capabilities/requirements (IRS_IDR2) from the IRS.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/kvm/Makefile | 2 +-
> arch/arm64/kvm/vgic/vgic-v5-tables.c | 8 +++++
> arch/arm64/kvm/vgic/vgic-v5-tables.h | 41 ++++++++++++++++++++++
> arch/arm64/kvm/vgic/vgic-v5.c | 52 ++++++++++++++++++++++++++++
> include/linux/irqchip/arm-gic-v5.h | 10 ++++++
> 5 files changed, 112 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm64/kvm/vgic/vgic-v5-tables.c
> create mode 100644 arch/arm64/kvm/vgic/vgic-v5-tables.h
>
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 59612d2f277c1..431de9b145ca1 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -24,7 +24,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> vgic/vgic-its.o vgic/vgic-debug.o vgic/vgic-v3-nested.o \
> - vgic/vgic-v5.o
> + vgic/vgic-v5.o vgic/vgic-v5-tables.o
>
> kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
> kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
> diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> new file mode 100644
> index 0000000000000..30e2b108b1aa3
> --- /dev/null
> +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> @@ -0,0 +1,8 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2025, 2026 Arm Ltd.
> + */
> +
> +#include "vgic-v5-tables.h"
> +
> +struct vgic_v5_host_ist_caps gicv5_host_ist_caps;
> diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> new file mode 100644
> index 0000000000000..cf00a248eabd5
> --- /dev/null
> +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2025, 2026 Arm Ltd.
> + */
> +
> +#ifndef __KVM_ARM_VGICV5_TABLES_H__
> +#define __KVM_ARM_VGICV5_TABLES_H__
> +
> +#include <linux/irqchip/arm-gic-v5.h>
> +
> +struct vgic_v5_host_ist_caps {
> + /* IST Capabilities */
> +
> + /* Apply to LPIs and SPIs */
> + u8 ist_id_bits;
> + bool ist_levels;
> + u8 ist_l2sz;
> + bool istmd;
> + u8 istmd_sz;
> +
> + /* LPI only */
> + u8 min_lpi_id_bits;
> +
> + /* VM Table, VPE Table */
> + bool two_level_vmt_support;
> + u32 max_vms;
> + u32 max_vpes;
> + u16 vmd_size;
> + u16 vped_size;
> +
> + /* Is the IRS coherent with us, or not? */
> + bool irs_non_coherent;
> +};
> +
> +extern struct vgic_v5_host_ist_caps gicv5_host_ist_caps;
> +static inline struct vgic_v5_host_ist_caps *vgic_v5_host_caps(void)
> +{
> + return &gicv5_host_ist_caps;
> +}
Err. No. Make gicv5_host_ist_caps static, and move the helper as
non-inline in vgic-v5-tables.c. It's not like this is anywhere near
performance-critical stuff, is it?
But also, if that's global information, we have kvm_vgic_global_state.
Isn't that where these things should live? Then the introduction of
vgic-v5-tables.[ch] can be moved to the point where it actually
matters.
> +
> +#endif
> diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
> index d4789ff3e7402..fd3d6299a2baa 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5.c
> @@ -9,6 +9,7 @@
> #include <linux/irqchip/arm-vgic-info.h>
>
> #include "vgic.h"
> +#include "vgic-v5-tables.h"
>
> #define ppi_caps kvm_vgic_global_state.vgic_v5_ppi_caps
>
> @@ -34,6 +35,54 @@ static void vgic_v5_get_implemented_ppis(void)
> __assign_bit(GICV5_ARCH_PPI_PMUIRQ, ppi_caps.impl_ppi_mask, system_supports_pmuv3());
> }
>
> +static void __iomem *irs_base;
Global state?
> +
> +static u32 irs_readl_relaxed(const u32 reg_offset)
> +{
> + return readl_relaxed(irs_base + reg_offset);
> +}
> +
> +static int gicv5_irs_extract_vm_caps(const struct gic_kvm_info *info)
> +{
> + u64 idr;
> +
> + irs_base = info->gicv5_irs.base;
> + if (!irs_base) {
> + kvm_info("No GICv5 MMIO IRS address; no GICv5 support\n");
> + return -ENODEV;
> + }
Should you instead bail out early by not registering the gic_kvm_info
when the IRS base is unknown, making this sort of check irrelevant?
Also, it's not like we can make it very far without an IRS...
> +
> + vgic_v5_host_caps()->irs_non_coherent = info->gicv5_irs.non_coherent;
> +
> + idr = irs_readl_relaxed(GICV5_IRS_IDR2);
> +
> + /* We skip the LPI field as it only applies to physical LPIs */
> + vgic_v5_host_caps()->ist_id_bits = FIELD_GET(GICV5_IRS_IDR2_ID_BITS, idr);
> + vgic_v5_host_caps()->min_lpi_id_bits = FIELD_GET(GICV5_IRS_IDR2_MIN_LPI_ID_BITS, idr);
> + vgic_v5_host_caps()->ist_levels = !!FIELD_GET(GICV5_IRS_IDR2_IST_LEVELS, idr);
> + vgic_v5_host_caps()->ist_l2sz = FIELD_GET(GICV5_IRS_IDR2_IST_L2SZ, idr);
> + vgic_v5_host_caps()->istmd = !!FIELD_GET(GICV5_IRS_IDR2_ISTMD, idr);
> + vgic_v5_host_caps()->istmd_sz = FIELD_GET(GICV5_IRS_IDR2_ISTMD_SZ, idr);
> +
> + idr = irs_readl_relaxed(GICV5_IRS_IDR3);
> +
> + vgic_v5_host_caps()->max_vms = BIT(FIELD_GET(GICV5_IRS_IDR3_VM_ID_BITS, idr));
> + vgic_v5_host_caps()->two_level_vmt_support = !!FIELD_GET(GICV5_IRS_IDR3_VMT_LEVELS, idr);
> +
> + if (FIELD_GET(GICV5_IRS_IDR3_VMD, idr))
The constant (ab)use of FIELD_GET() for fields that are a single bit
wide is very hard to read. I'd like to see:
vgic_v5_host_caps()->ist_levels = (idr & GICV5_IRS_IDR2_IST_LEVELS);
[...]
vgic_v5_host_caps()->istmd = (idr & GICV5_IRS_IDR2_ISTMD);
[...]
if (idr & GICV5_IRS_IDR3_VMD)
[...]
which is infinitely more readable.
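For a single-bit mask, nothing is lost by dropping FIELD_GET(): both forms agree as booleans, which is all a bool assignment or an if () needs. A small userspace model (the mask position here is made up for illustration, not the real IDR2 layout):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-bit field mask, for illustration only */
#define IDR2_IST_LEVELS	(UINT32_C(1) << 5)

/* What FIELD_GET() boils down to for a one-bit mask: the masked bit
 * shifted down to bit 0, i.e. a 0/1 value */
static uint32_t ist_levels_field_get(uint32_t reg)
{
	return (reg & IDR2_IST_LEVELS) >> 5;
}

/* The direct form suggested above: any non-zero result means "set" */
static bool ist_levels_direct(uint32_t reg)
{
	return reg & IDR2_IST_LEVELS;
}
```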
> + vgic_v5_host_caps()->vmd_size = BIT(FIELD_GET(GICV5_IRS_IDR3_VMD_SZ, idr));
> + else
> + vgic_v5_host_caps()->vmd_size = 0;
> +
> + idr = irs_readl_relaxed(GICV5_IRS_IDR4);
> +
> + vgic_v5_host_caps()->vped_size = BIT(FIELD_GET(GICV5_IRS_IDR4_VPED_SZ, idr));
> + /* Field stores VPE_ID_BITS - 1 */
> + vgic_v5_host_caps()->max_vpes = BIT(FIELD_GET(GICV5_IRS_IDR4_VPE_ID_BITS, idr) + 1);
> +
> + return 0;
> +}
> +
> /*
> * Probe for a vGICv5 compatible interrupt controller, returning 0 on success.
> */
> @@ -61,6 +110,9 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
> goto skip_v5;
> }
>
> + if (gicv5_irs_extract_vm_caps(info))
> + goto skip_v5;
> +
We shouldn't "skip_v5" anymore. If we can't initialise KVM with GICv5,
we're done, and we should not even try to register v3.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 07/43] KVM: arm64: gic-v5: Create & manage VM and VPE tables
2026-04-27 16:08 ` [PATCH 07/43] KVM: arm64: gic-v5: Create & manage VM and VPE tables Sascha Bischoff
2026-04-28 14:54 ` Vladimir Murzin
@ 2026-04-28 15:55 ` Joey Gouly
2026-04-29 10:25 ` Marc Zyngier
2 siblings, 0 replies; 66+ messages in thread
From: Joey Gouly @ 2026-04-28 15:55 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, maz@kernel.org, oliver.upton@linux.dev,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
A lot to look into here; just some first read-through nits.
On Mon, Apr 27, 2026 at 04:08:25PM +0000, Sascha Bischoff wrote:
> GICv5 uses a set of in-memory tables to track and manage VM
> state. These must be allocated by the hypervisor, and provided to the
> IRS to use.
>
> The VMT (Virtual Machine Table) is a linear or two level table
> comprising VMT Entries (VMTE). Each VMTE describes the state for a
> single VM. This state includes things such as the SPI and LPI IST
> configuration (coming in a future commit), an implementation-defined
> VM Descriptor, and a VPE Table (VPET).
>
> The VPET contains one entry per VPE belonging to a VM, and is used to
> mark a VPE as valid, as well as providing the address of an
> implementation-defined VPE Descriptor, which is used by the hardware
> to track and manage VPE state.
>
> This commit adds support for allocating the VMT, and managing the
> VMTEs. The VMTEs can be initialised or released for re-use. Allocation
> and tracking of unused VMTEs is handled with an IDA.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/kvm/vgic/vgic-v5-tables.c | 628 +++++++++++++++++++++++++++
> arch/arm64/kvm/vgic/vgic-v5-tables.h | 108 +++++
> include/kvm/arm_vgic.h | 2 +
> include/linux/irqchip/arm-gic-v5.h | 13 +
> 4 files changed, 751 insertions(+)
>
> diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> index 30e2b108b1aa3..502d05d46cccf 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5-tables.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> @@ -3,6 +3,634 @@
> * Copyright (C) 2025, 2026 Arm Ltd.
> */
>
> +#include <kvm/arm_vgic.h>
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <linux/sizes.h>
> +#include <linux/slab.h>
> +#include <linux/xarray.h>
> +#include <asm/kvm_mmu.h>
> +
> +#include "vgic.h"
> #include "vgic-v5-tables.h"
>
> struct vgic_v5_host_ist_caps gicv5_host_ist_caps;
> +
> +static struct vgic_v5_vmt *vmt_info;
> +DEFINE_XARRAY(vm_info);
> +
> +static bool vgic_v5_vmt_allocated(void)
> +{
> + return vmt_info != NULL;
> +}
> +
> +static int vgic_v5_check_vm_id(u16 vm_id)
> +{
> + if (vm_id >= vmt_info->num_entries)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +/*
> + * Our IRS might be coherent or non-coherent. If coherent, we can just emit a
> + * DSB to ensure that we're in sync. However, when non-coherent, we need to
> + * manage our cached data explicitly.
> + *
> + * This helper is used to handle both coherent and non-coherent IRSes, and
> + * handles all combinations of cleaning and invalidating to the PoC.
> + */
> +static void vgic_v5_clean_inval(void *va, size_t size, bool clean, bool inval)
> +{
> + unsigned long base = (unsigned long)va;
> +
> + /* Catch any accidental NOPs */
> + BUILD_BUG_ON(!(clean || inval));
> +
> + /* Coherent; emit DSB. */
> + if (!gicv5_host_ist_caps.irs_non_coherent) {
> + dsb(ishst);
> + return;
> + }
> +
> + if (clean && inval)
> + dcache_clean_inval_poc(base, base + size);
> + else if (clean)
> + dcache_clean_poc(base, base + size);
> + else if (inval)
> + dcache_inval_poc(base, base + size);
> +}
> +
> +/*
> + * Create a linear VM table, rounding up the number of entries to at least one
> + * whole page to give us nicer alignment.
> + *
> + * Note: We don't update the number of entries tracked in our tracking structure
> + * as this might be higher than the number of bits supported by the HW.
> + */
> +static int vgic_v5_alloc_vmt_linear(unsigned int num_entries)
> +{
> + unsigned int l2_entries_per_page;
> + size_t alloc_size;
> +
> + /* Potentially throw away a bit of memory for the sake of alignment! */
> + l2_entries_per_page = PAGE_SIZE / GICV5_VMTEL2E_SIZE;
> + if (num_entries < l2_entries_per_page)
> + num_entries = l2_entries_per_page;
> +
> + alloc_size = num_entries * sizeof(struct vmtl2_entry);
> +
> + vmt_info->linear.vmt_base = kzalloc(alloc_size, GFP_KERNEL);
> + if (vmt_info->linear.vmt_base == NULL)
> + return -ENOMEM;
> +
> + vgic_v5_clean_inval(vmt_info->linear.vmt_base, alloc_size, true, true);
> +
> + return 0;
> +}
> +
> +/*
> + * Allocate the first level of a two-level VM table. The second-level VM tables
> + * are allocated on demand (by vgic_v5_alloc_l2_vmt()).
> + *
> + * Note: If there are too few entries, these are rounded up to the size of an L2
> + * table (4k) to ensure sane alignment. As with the linear table, the tracked
> + * number of entries is not increased to avoid the case of going above what the
> + * hardware supports.
> + */
> +static int vgic_v5_alloc_vmt_two_level(unsigned int num_entries)
> +{
> + size_t alloc_size;
> +
> + /* Potentially throw away a bit of memory for the sake of alignment! */
> + if (num_entries < GICV5_VMT_L2_TABLE_ENTRIES)
> + num_entries = GICV5_VMT_L2_TABLE_ENTRIES;
> +
> + /*
> + * Let's make sure that we always allocate a whole power of 2
> + * of entries. Note that we need to subtract 1 from the fls()
> + * result in order to give the correct number of bits as we
> + * are operating on a whole power of 2.
This fls() seems unrelated? There is an fls() - 1 in vgic_v5_vmte_init().
> + */
> + num_entries = roundup_pow_of_two(num_entries);
> +
> + vmt_info->l2.num_l1_ents = (num_entries / GICV5_VMT_L2_TABLE_ENTRIES);
> + alloc_size = vmt_info->l2.num_l1_ents * sizeof(vmtl1_entry);
> +
> + vmt_info->l2.vmt_base = kzalloc(alloc_size, GFP_KERNEL);
> + if (vmt_info->l2.vmt_base == NULL)
> + return -ENOMEM;
> +
> + vgic_v5_clean_inval(vmt_info->l2.vmt_base, alloc_size, true, true);
> +
> + vmt_info->l2.l2ptrs = kzalloc_objs(*vmt_info->l2.l2ptrs,
> + vmt_info->l2.num_l1_ents,
> + GFP_KERNEL);
> + if (vmt_info->l2.l2ptrs == NULL) {
> + kfree(vmt_info->l2.vmt_base);
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> +
> +/*
> + * Allocate a second level VMT, if required. This can be called eagerly, and
> + * will only perform the allocation if required.
> + */
> +static int vgic_v5_alloc_l2_vmt(struct kvm *kvm)
> +{
> + unsigned int l1_index;
> + struct vmtl2_entry *l2_table;
> + vmtl1_entry tmp;
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct kvm_vcpu *vcpu0 = kvm_get_vcpu(kvm, 0);
> + struct gicv5_cmd_info cmd_info;
> + int ret;
> +
> + if (!vgic_v5_vmt_allocated())
> + return -ENXIO;
> +
> + /* Nothing to do if we have linear tables! */
> + if (!vmt_info->two_level)
> + return 0;
> +
> + ret = vgic_v5_check_vm_id(vm_id);
> + if (ret)
> + return ret;
> +
> + /*
> + * We have 4k-sized L2 tables - this is mandated by the spec for
> + * two-level VMTs. This means that we have 128 entries per L1 VMTE.
> + */
> + l1_index = vm_id / GICV5_VMT_L2_TABLE_ENTRIES;
> +
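As a sanity check on the comment above, the 4k table / 32-byte entry numbers do give 128 entries per L2 table. A quick userspace model of the vm_id index split (constants mirror the patch's header):

```c
/* Userspace model of the two-level VMT index split: L2 tables are
 * fixed at 4KiB and a VMTE is 32 bytes, giving 128 entries per L2
 * table. */
#define GICV5_VMTEL2E_SIZE		32u
#define GICV5_VMT_L2_TABLE_SIZE		4096u
#define GICV5_VMT_L2_TABLE_ENTRIES \
	(GICV5_VMT_L2_TABLE_SIZE / GICV5_VMTEL2E_SIZE)

/* Which L1 entry covers this VM ID */
static unsigned int vmt_l1_index(unsigned int vm_id)
{
	return vm_id / GICV5_VMT_L2_TABLE_ENTRIES;
}

/* Offset of the VMTE within that L2 table */
static unsigned int vmt_l2_index(unsigned int vm_id)
{
	return vm_id % GICV5_VMT_L2_TABLE_ENTRIES;
}
```

Incidentally, the bounds check just below uses `l1_index > vmt_info->l2.num_l1_ents`, where `>=` looks intended: l1_index equal to num_l1_ents is already one past the end of the L1 table.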
> + if (l1_index > vmt_info->l2.num_l1_ents)
> + return -E2BIG;
> +
> + /* Already valid? Great! */
> + if (vmt_info->l2.l2ptrs[l1_index] != NULL)
> + return 0;
> +
> + l2_table = kzalloc(GICV5_VMT_L2_TABLE_SIZE, GFP_KERNEL);
> + if (l2_table == NULL)
> + return -ENOMEM;
> +
> + if (virt_to_phys(l2_table) & ~GICV5_VMTEL1E_L2_ADDR) {
> + kfree(l2_table);
> + return -EINVAL;
> + }
Duplicated code below this. I guess move the comment up and delete the same checks below.
> +
> + vmt_info->l2.l2ptrs[l1_index] = l2_table;
> +
> + /* Alignment issue! */
> + if (virt_to_phys(l2_table) & ~GICV5_VMTEL1E_L2_ADDR) {
> + kfree(l2_table);
> + return -EFAULT;
> + }
> +
> + tmp = virt_to_phys(l2_table) & GICV5_VMTEL1E_L2_ADDR;
Can we rename tmp to vmte_l2_addr or something? (Although reading on, I see tmp
is used more often... so maybe it's consistent here.)
> + WRITE_ONCE(vmt_info->l2.vmt_base[l1_index], cpu_to_le64(tmp));
> +
> + vgic_v5_clean_inval(l2_table, GICV5_VMT_L2_TABLE_SIZE, true, true);
> + /* Skip inval for now - wait until table is made valid by HW */
> + vgic_v5_clean_inval(vmt_info->l2.vmt_base + l1_index,
> + sizeof(vmtl1_entry), true, false);
> +
> + /* VMAP in the L2 VMT via the IRS */
> + cmd_info.cmd_type = VMT_L2_MAP;
> + ret = irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu0), &cmd_info);
> +
> + /* We've failed to make the L2 VMT valid - things are very broken! */
> + if (ret) {
> + /* Remove the pointer from L1 table */
> + WRITE_ONCE(vmt_info->l2.vmt_base[l1_index], 0);
> +
> + kfree(l2_table);
> + vmt_info->l2.l2ptrs[l1_index] = NULL;
> +
> + return ret;
> + }
> +
> + /* Table updated; inval our copy */
> + vgic_v5_clean_inval(vmt_info->l2.vmt_base + l1_index,
> + sizeof(vmtl1_entry), false, true);
> +
> + return ret;
> +}
> +
> +/*
> + * Allocate the top-level VMT. This can either be linear or two-level.
> + */
> +int vgic_v5_vmt_allocate(bool two_level, unsigned int num_entries,
> + size_t vmd_size, size_t vped_size,
> + unsigned int max_vpes)
> +{
> + int ret = 0;
> +
> + if (vgic_v5_vmt_allocated())
> + return -EBUSY;
> +
> + /* VMD is optional; using 0 to signal that it is not needed. */
> + if (vmd_size != 0 &&
> + (vmd_size < VMD_MIN_SIZE || vmd_size > VMD_MAX_SIZE))
> + return -EINVAL;
> +
> + if (vped_size < VPED_MIN_SIZE || vped_size > VPED_MAX_SIZE)
> + return -EINVAL;
> +
> + /* Allocate the tracking structure */
> + vmt_info = kzalloc_obj(*vmt_info, GFP_KERNEL);
> + if (vmt_info == NULL)
> + return -ENOMEM;
> +
> + ida_init(&vmt_info->vm_id_ida);
> + vmt_info->max_vpes = max_vpes;
> + vmt_info->vmd_size = vmd_size;
> + vmt_info->vped_size = vped_size;
> + vmt_info->two_level = two_level;
> + vmt_info->num_entries = num_entries;
> +
> + if (!two_level)
> + ret = vgic_v5_alloc_vmt_linear(num_entries);
> + else
> + ret = vgic_v5_alloc_vmt_two_level(num_entries);
> +
> + /* If anything failed, free our tracking structure before returning */
> + if (ret) {
> + kfree(vmt_info);
> + vmt_info = NULL;
> + }
> +
> + return ret;
> +}
> +
> +/*
> + * Free the VMT and associated tracking structures. This isn't strictly expected
> + * to be called in general operation, but instead exists for completeness.
> + */
> +int vgic_v5_vmt_free(void)
> +{
> + if (!vgic_v5_vmt_allocated())
> + return -EINVAL;
> +
> + if (!vmt_info->two_level) {
> + kfree(vmt_info->linear.vmt_base);
> + } else {
> + /* Free the L2 tables; kfree(NULL) is safe */
> + for (int i = 0; i < vmt_info->l2.num_l1_ents; ++i)
> + kfree(vmt_info->l2.l2ptrs[i]);
> + kfree(vmt_info->l2.l2ptrs);
> +
> + /* And now free the L1 table */
> + kfree(vmt_info->l2.vmt_base);
> + }
> +
> + ida_destroy(&vmt_info->vm_id_ida);
> + kfree(vmt_info);
> + vmt_info = NULL;
> +
> + return 0;
> +}
> +
> +/*
> + * Look up a VMT Entry by VM ID.
> + */
> +static int vgic_v5_get_l2_vmte(u16 vm_id, struct vmtl2_entry **vmte)
> +{
> + unsigned int l1_index, l2_index;
> + struct vmtl2_entry *l2_table;
> + int ret;
> +
> + ret = vgic_v5_check_vm_id(vm_id);
> + if (ret)
> + return ret;
> +
> + if (!vmt_info->two_level) {
> + /* All entries always valid for Linear table */
> + *vmte = &vmt_info->linear.vmt_base[vm_id];
> + } else {
> + l1_index = vm_id / GICV5_VMT_L2_TABLE_ENTRIES;
> + l2_index = vm_id % GICV5_VMT_L2_TABLE_ENTRIES;
> +
> + if (l1_index > vmt_info->l2.num_l1_ents)
> + return -E2BIG;
> +
> + if (vmt_info->l2.l2ptrs[l1_index] == NULL)
> + return -EINVAL;
> +
> + l2_table = vmt_info->l2.l2ptrs[l1_index];
> + *vmte = &l2_table[l2_index];
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * Zero a VMT Entry, and flush & invalidate to the PoC, if required.
> + */
> +static int vgic_v5_reset_vmte(struct kvm *kvm)
> +{
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vmtl2_entry *vmte;
> + int ret;
> +
> + ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
> + if (ret)
> + return ret;
> +
> + WRITE_ONCE(vmte->val[0], 0ULL);
> + WRITE_ONCE(vmte->val[1], 0ULL);
> + WRITE_ONCE(vmte->val[2], 0ULL);
> + WRITE_ONCE(vmte->val[3], 0ULL);
> +
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
> +
> + return 0;
> +}
> +
> +/*
> + * Use the IDA to allocate a new VM ID, and track it in the gicv5_vm data
> + * structure. If we're out of VM IDs, the IDA catches that, and we return the
> + * error (-ENOSPC).
> + */
> +int vgic_v5_allocate_vm_id(struct kvm *kvm)
> +{
> + int id;
> +
> + id = ida_alloc_max(&vmt_info->vm_id_ida, vmt_info->num_entries - 1u,
> + GFP_KERNEL);
> + if (id < 0)
> + return id;
> +
> + kvm->arch.vgic.gicv5_vm.vm_id = id;
> + kvm->arch.vgic.gicv5_vm.vm_id_valid = true;
> +
> + return 0;
> +}
> +
> +/*
> + * Release the VM ID to allow it to be reallocated in the future.
> + */
> +void vgic_v5_release_vm_id(struct kvm *kvm)
> +{
> + ida_free(&vmt_info->vm_id_ida, kvm->arch.vgic.gicv5_vm.vm_id);
> + kvm->arch.vgic.gicv5_vm.vm_id_valid = false;
> +}
> +
> +/*
> + * Initialise an entry in the VMT based on the index of the VM.
> + *
> + * Note: We don't mark the VMTE as valid as this needs to be done by
> + * the hardware.
> + */
> +int vgic_v5_vmte_init(struct kvm *kvm)
> +{
> + int nr_cpus = atomic_read(&kvm->online_vcpus);
> + struct vgic_v5_vm_info *vmi = NULL;
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + void *vmd = NULL, *vpet = NULL;
> + struct vmtl2_entry *vmte;
> + void **vped_ptrs = NULL;
> + size_t vpet_alloc_size;
> + int ret;
> + u64 tmp;
> +
> + if (nr_cpus > vmt_info->max_vpes)
> + return -E2BIG;
> +
> + /*
> + * If we're using two-level VMTs, L2 is allocated on demand. For linear
> + * VMTs, this is a NOP.
> + */
> + if (vgic_v5_alloc_l2_vmt(kvm))
> + return -EIO;
> +
> + ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
> + if (ret)
> + return ret;
> +
> + /* If the entry is already valid, something went wrong */
> + if (FIELD_GET(GICV5_VMTEL2E_VALID, le64_to_cpu(READ_ONCE(vmte->val[0])))) {
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
> + return -EINVAL;
> + }
> +
> + ret = vgic_v5_reset_vmte(kvm);
> + if (ret)
> + return ret;
> +
> + vmi = kzalloc_objs(*vmi, GFP_KERNEL);
> + if (vmi == NULL) {
> + ret = -ENOMEM;
> + goto out_fail;
> + }
> +
> + ret = xa_insert(&vm_info, vm_id, vmi, GFP_KERNEL);
> + if (ret)
> + goto out_fail;
> +
> + /* Allocate and assign the VM Descriptor, if required. */
> + if (vmt_info->vmd_size != 0) {
> + vmd = kzalloc(vmt_info->vmd_size, GFP_KERNEL);
> + if (!vmd) {
> + ret = -ENOMEM;
> + goto out_fail;
> + }
> +
> + /* Stash the VA so we can free it later */
> + vmi->vmd_base = vmd;
> +
> + tmp = FIELD_PREP(GICV5_VMTEL2E_VMD_ADDR,
> + virt_to_phys(vmd) >>
> + GICV5_VMTEL2E_VMD_ADDR_SHIFT);
> + WRITE_ONCE(vmte->val[0], cpu_to_le64(tmp));
> + }
> +
> + /*
> + * Allocate and assign the VPE Table. We can only describe the number of
> + * VPE ID Bits in the VMTE, and therefore we round up the number of CPUs
> + * to a whole power of two.
This comment is a bit confusing; I would try to reword it along the lines of:
Round up the number of CPUs to a whole power of two, as the VMTE describes the ID bits as log2().
Something like that. In "We can only describe the number of VPE ID Bits in the
VMTE", the emphasis on 'Bits' is hard to make out.
> + */
> + nr_cpus = roundup_pow_of_two(nr_cpus);
> + vmi->vpe_id_bits = fls(nr_cpus) - 1;
> +
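The log2() reading is right: the two lines above round nr_cpus up to a power of two and then take its log2, since fls(x) returns the 1-based index of the most significant set bit. A userspace model, with fls() and roundup_pow_of_two() reimplemented as stand-ins for the kernel helpers:

```c
/* Stand-in for the kernel's fls(): 1-based index of the MSB, 0 for 0 */
static unsigned int fls32(unsigned int x)
{
	return x ? 32 - (unsigned int)__builtin_clz(x) : 0;
}

/* Stand-in for the kernel's roundup_pow_of_two() */
static unsigned int roundup_pow_of_two32(unsigned int x)
{
	unsigned int r = 1;

	while (r < x)
		r <<= 1;
	return r;
}

/* The vpe_id_bits computation from the patch: log2 of the rounded-up
 * vCPU count */
static unsigned int vpe_id_bits(unsigned int nr_cpus)
{
	return fls32(roundup_pow_of_two32(nr_cpus)) - 1;
}
```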
> + vpet_alloc_size = sizeof(vpe_entry) * nr_cpus;
> + vpet = kzalloc(vpet_alloc_size, GFP_KERNEL);
kzalloc_objs()? Or is there some alignment thing here?
> + if (!vpet) {
> + ret = -ENOMEM;
> + goto out_fail;
> + }
> +
> + /* Stash the VA so we can free it later */
> + vmi->vpet_base = vpet;
> +
> + tmp = FIELD_PREP(GICV5_VMTEL2E_VPET_ADDR,
> + virt_to_phys(vpet) >> GICV5_VMTEL2E_VPET_ADDR_SHIFT);
> + tmp |= FIELD_PREP(GICV5_VMTEL2E_VPE_ID_BITS, vmi->vpe_id_bits);
> + WRITE_ONCE(vmte->val[1], cpu_to_le64(tmp));
> +
> + vped_ptrs = kzalloc_objs(*vped_ptrs, nr_cpus, GFP_KERNEL);
> + if (vped_ptrs == NULL) {
> + ret = -ENOMEM;
> + goto out_fail;
> + }
> + vmi->vped_ptrs = vped_ptrs;
> +
> + if (vmd)
> + vgic_v5_clean_inval(vmd, vmt_info->vmd_size, true, true);
> + vgic_v5_clean_inval(vpet, vpet_alloc_size, true, true);
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
> +
> + kvm->arch.vgic.gicv5_vm.vmte_allocated = true;
> +
> + return 0;
> +
> +out_fail:
> + /* kfree(NULL) is safe so we can just kfree() at leisure */
> + kfree(vmd);
> + kfree(vpet);
> + kfree(vped_ptrs);
> + if (vmi)
> + xa_erase(&vm_info, vm_id);
> + kfree(vmi);
> +
> + vgic_v5_reset_vmte(kvm);
> +
> + return ret;
> +}
> +
> +/*
> + * Release the VMT Entry, freeing up any allocated data structures before
> + * zeroing the VMTE.
> + *
> + * The VMTE must be marked as invalid before it is released.
> + */
> +int vgic_v5_vmte_release(struct kvm *kvm)
> +{
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vgic_v5_vm_info *vmi;
> + struct vmtl2_entry *vmte;
> + int ret;
> +
> + ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
> + if (ret)
> + return ret;
> +
> + /* Reject if the VMTE has not been marked as invalid! */
> + if (FIELD_GET(GICV5_VMTEL2E_VALID, le64_to_cpu(READ_ONCE(vmte->val[0])))) {
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
> + return -EINVAL;
> + }
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (WARN_ON_ONCE(!vmi))
> + goto no_vmi;
> +
> + kfree(vmi->vmd_base);
> + kfree(vmi->vpet_base);
> +
> + xa_erase(&vm_info, vm_id);
> + kfree(vmi);
> +
> +no_vmi:
> + /*
> + * If we didn't get far enough into allocating a VMTE to create the VM
> + * info structure, then we just zero the VMTE and move on. There's
> + * nothing else we can realistically do here.
> + */
> + ret = vgic_v5_reset_vmte(kvm);
> + if (ret)
> + return ret;
> +
> + return 0;
> +}
> +
> +/*
> + * Allocate a VPE descriptor and provide it to the hardware via the VPE Table.
> + */
> +int vgic_v5_vmte_alloc_vpe(struct kvm_vcpu *vcpu)
> +{
> + u16 vm_id = vgic_v5_vm_id(vcpu->kvm);
> + u16 vpe_id = vgic_v5_vpe_id(vcpu);
> + struct vgic_v5_vm_info *vmi;
> + vpe_entry tmp, *vpet_base;
> + void *vped;
> +
> + /* Make sure we're not over what the hardware supports */
> + if (vpe_id >= vmt_info->max_vpes)
> + return -E2BIG;
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (WARN_ON_ONCE(!vmi))
> + return -EINVAL;
> +
> + if (vpe_id >= 1 << vmi->vpe_id_bits)
> + return -E2BIG;
> +
> + vpet_base = vmi->vpet_base;
> +
> + /* If the VPETE for this CPU is already valid we've gone wrong */
> + if (FIELD_GET(GICV5_VPE_VALID, le64_to_cpu(READ_ONCE(vpet_base[vpe_id])))) {
> + vgic_v5_clean_inval(&vpet_base[vpe_id], sizeof(*vpet_base), true, true);
> + return -EBUSY;
> + }
> +
> + /* Alloc VPE Descriptor. Only used by IRS. */
> + vped = kzalloc(vmt_info->vped_size, GFP_KERNEL);
> + if (vped == NULL)
> + return -ENOMEM;
> +
> + vmi->vped_ptrs[vpe_id] = vped;
> +
> + tmp = FIELD_PREP(GICV5_VPED_ADDR, virt_to_phys(vped) >> GICV5_VPED_ADDR_SHIFT);
> + WRITE_ONCE(vpet_base[vpe_id], cpu_to_le64(tmp));
> +
> + vgic_v5_clean_inval(vped, vmt_info->vped_size, true, true);
> + vgic_v5_clean_inval(vpet_base + vpe_id, sizeof(vpe_entry), true, true);
> +
> + return 0;
> +}
> +
> +/*
> + * Free the memory allocated for the VPE descriptor.
> + */
> +int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu)
> +{
> + u16 vm_id = vgic_v5_vm_id(vcpu->kvm);
> + u16 vpe_id = vgic_v5_vpe_id(vcpu);
> + struct vgic_v5_vm_info *vmi;
> + struct vmtl2_entry *vmte;
> + vpe_entry *vpet_base;
> + void *vped;
> + int ret;
> +
> + ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
> + if (ret)
> + return ret;
> +
> + if (FIELD_GET(GICV5_VMTEL2E_VALID, le64_to_cpu(READ_ONCE(vmte->val[0])))) {
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
> + return -EBUSY;
> + }
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (!vmi)
> + return -EINVAL;
> +
> + if (vpe_id >= 1 << vmi->vpe_id_bits)
> + return -E2BIG;
> +
> + vpet_base = vmi->vpet_base;
> + WRITE_ONCE(vpet_base[vpe_id], 0ULL);
> +
> + vgic_v5_clean_inval(vpet_base + vpe_id, sizeof(vpe_entry), true, true);
> +
> + /* Free VPE Descriptor. Only used by IRS. */
> + vped = vmi->vped_ptrs[vpe_id];
> + vmi->vped_ptrs[vpe_id] = NULL;
> + kfree(vped);
> +
> + return 0;
> +}
> diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> index cf00a248eabd5..5501a44308362 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5-tables.h
> +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> @@ -8,6 +8,86 @@
>
> #include <linux/irqchip/arm-gic-v5.h>
>
> +#define VM_ID_BITS_MIN 8
> +#define VM_ID_BITS_MAX 16
> +#define VMD_MIN_SIZE 8
> +#define VMD_MAX_SIZE 4096
> +#define VPED_MIN_SIZE 8
> +#define VPED_MAX_SIZE 4096
> +#define VPE_ID_BITS_MIN 8
> +#define VPE_ID_BITS_MAX 16
> +
> +/* Level 1 Virtual Machine Table Entry */
> +typedef __le64 vmtl1_entry;
> +#define GICV5_VMTEL1E_VALID BIT_ULL(0)
> +/* Note that there is no shift for the address by design */
> +#define GICV5_VMTEL1E_L2_ADDR GENMASK(51, 12)
> +
> +#define GICV5_VMTEL2E_SIZE 32ULL
> +/* An L2 table (two-level VMT) is ALWAYS 4kB! */
> +#define GICV5_VMT_L2_TABLE_SIZE 4096ULL
> +#define GICV5_VMT_L2_TABLE_ENTRIES (GICV5_VMT_L2_TABLE_SIZE / GICV5_VMTEL2E_SIZE)
> +
> +/* Level 2 Virtual Machine Table Entry */
> +struct vmtl2_entry {
> + __le64 val[4];
> +};
> +
> +/*
> + * As the L2 VMTE is a large data structure, we are splitting it into 4 parts.
> + * We only mask and shift WITHIN each part for simplicity.
> + */
> +/* First 64-bit chunk */
> +#define GICV5_VMTEL2E_VALID BIT_ULL(0)
> +#define GICV5_VMTEL2E_VMD_ADDR_SHIFT 3ULL
> +#define GICV5_VMTEL2E_VMD_ADDR GENMASK_ULL(55, 3)
> +/* Second 64-bit chunk */
> +#define GICV5_VMTEL2E_VPET_ADDR_SHIFT 3ULL
> +#define GICV5_VMTEL2E_VPET_ADDR GENMASK_ULL(55, 3)
> +#define GICV5_VMTEL2E_VPE_ID_BITS GENMASK_ULL(63, 59)
> +/* Third & fourth 64-bit chunks (the encodings are the same for each) */
> +#define GICV5_VMTEL2E_IST_VALID BIT_ULL(0)
> +#define GICV5_VMTEL2E_IST_L2SZ GENMASK_ULL(2, 1)
> +#define GICV5_VMTEL2E_IST_ADDR_SHIFT 6ULL
> +#define GICV5_VMTEL2E_IST_ADDR GENMASK_ULL(55, 6)
> +#define GICV5_VMTEL2E_IST_ISTSZ GENMASK_ULL(57, 56)
> +#define GICV5_VMTEL2E_IST_STRUCTURE BIT_ULL(58)
> +#define GICV5_VMTEL2E_IST_ID_BITS GENMASK_ULL(63, 59)
> +
> +/* Virtual PE Table Entry */
> +typedef __le64 vpe_entry;
> +#define GICV5_VPE_VALID BIT_ULL(0)
> +/* Note that there is no shift for the address by design. */
> +#define GICV5_VPED_ADDR_SHIFT 3ULL
> +#define GICV5_VPED_ADDR GENMASK_ULL(55, 3)
> +
> +struct vgic_v5_vm_info {
> + void __iomem *vmd_base;
> + vpe_entry __iomem *vpet_base;
> + void __iomem **vped_ptrs;
> + u8 vpe_id_bits;
> +};
> +
> +struct vgic_v5_vmt {
> + union {
> + struct {
> + struct vmtl2_entry *vmt_base;
> + unsigned int num_ents;
> + } linear;
> + struct {
> + vmtl1_entry *vmt_base;
> + struct vmtl2_entry **l2ptrs;
> + unsigned int num_l1_ents;
> + } l2;
> + };
> + bool two_level;
> + unsigned int num_entries;
> + unsigned int max_vpes;
> + size_t vmd_size;
> + size_t vped_size;
> + struct ida vm_id_ida;
> +};
> +
> struct vgic_v5_host_ist_caps {
> /* IST Capabilities */
>
> @@ -38,4 +118,32 @@ static inline struct vgic_v5_host_ist_caps *vgic_v5_host_caps(void)
> return &gicv5_host_ist_caps;
> }
>
> +static inline u16 vgic_v5_vm_id(struct kvm *kvm)
> +{
> + return kvm->arch.vgic.gicv5_vm.vm_id;
> +}
> +
> +static inline u16 vgic_v5_vpe_id(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->vcpu_id;
> +}
> +
> +static inline int vgic_v5_vpe_db(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->arch.vgic_cpu.vgic_v5.gicv5_vpe.db;
> +}
> +
> +int vgic_v5_vmt_allocate(bool two_level, unsigned int num_entries,
> + size_t vmd_size, size_t vped_size,
> + unsigned int vpe_id_bits);
> +int vgic_v5_vmt_free(void);
> +
> +int vgic_v5_allocate_vm_id(struct kvm *kvm);
> +void vgic_v5_release_vm_id(struct kvm *kvm);
> +
> +int vgic_v5_vmte_init(struct kvm *kvm);
> +int vgic_v5_vmte_release(struct kvm *kvm);
> +int vgic_v5_vmte_alloc_vpe(struct kvm_vcpu *vcpu);
> +int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu);
> +
> #endif
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index 05dbd01f6fd21..0bcbc751593cc 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -372,6 +372,8 @@ struct vgic_v5_vm {
> int vpe_db_base;
> int nr_vpes;
> u16 vm_id;
> + bool vm_id_valid;
> + bool vmte_allocated;
> };
>
> struct vgic_dist {
> diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
> index 087d94f739672..89579ee04f5d1 100644
> --- a/include/linux/irqchip/arm-gic-v5.h
> +++ b/include/linux/irqchip/arm-gic-v5.h
> @@ -182,6 +182,7 @@
> #define GICV5_IRS_MAP_L2_ISTR_ID GENMASK(23, 0)
>
> #define GICV5_ISTL1E_VALID BIT_ULL(0)
> +#define GICV5_IRS_ISTL1E_SIZE 8UL
>
> #define GICV5_ISTL1E_L2_ADDR_MASK GENMASK_ULL(55, 12)
>
> @@ -444,4 +445,16 @@ void gicv5_free_lpi(u32 lpi);
>
> void __init gicv5_its_of_probe(struct device_node *parent);
> void __init gicv5_its_acpi_probe(void);
> +
> +enum gicv5_vcpu_info_cmd_type {
> + VMT_L2_MAP, /* Map in a L2 VMT - *may* happen on VM init */
> + VMTE_MAKE_VALID, /* Make the VMTE valid */
> + VMTE_MAKE_INVALID, /* Make the VMTE (et al.) invalid */
> +};
> +
> +struct gicv5_cmd_info {
> + enum gicv5_vcpu_info_cmd_type cmd_type;
> + u64 data;
> +};
> +
> #endif
Thanks,
Joey
* Re: [PATCH 06/43] KVM: arm64: gic-v5: Add VPE doorbell domain
2026-04-27 16:08 ` [PATCH 06/43] KVM: arm64: gic-v5: Add VPE doorbell domain Sascha Bischoff
@ 2026-04-28 16:40 ` Marc Zyngier
2026-05-01 16:54 ` Sascha Bischoff
0 siblings, 1 reply; 66+ messages in thread
From: Marc Zyngier @ 2026-04-28 16:40 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 17:08:05 +0100,
Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> GICv5 supports two types of doorbell - VPE doorbells and VM
> doorbells. In KVM we only support Targeted interrupts, and do not
> support 1ofN target selection. This means that we only implement VPE
> doorbells. These doorbells are implemented as host LPIs which are
> generated when a non-resident VPE has a pending interrupt of
> sufficient priority and the doorbell has been requested as part of
> making the VPE non-resident.
This is mostly a repeat of the architecture spec. I don't think we
need to paraphrase it.
>
> VPE doorbells allow KVM to wake VPEs (so, vcpus) as soon as the
> hardware determines that sufficient conditions for the interrupt to be
> signalled have been met. This simplifies the wake-up path for vcpus
> with GICv5 for LPIs and SPIs. NOTE: PPI pending state must still be
> checked explicitly as the IRS never sees them.
Drop the note; it serves no purpose here.
>
> This change introduces support for the vgic_v5 doorbell domain. One
> doorbell domain is created per GICv5 VM, and all VPEs have their own
> doorbell within this domain. When the doorbell fires, this is tracked
> (in gicv5_vpe.db_fired) and the corresponding vcpu is kicked.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/kvm/vgic/vgic-init.c | 5 +-
> arch/arm64/kvm/vgic/vgic-v5.c | 143 +++++++++++++++++++++++++++++
> arch/arm64/kvm/vgic/vgic.h | 1 +
> include/kvm/arm_vgic.h | 6 ++
> include/linux/irqchip/arm-gic-v5.h | 2 +
> 5 files changed, 156 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
> index 907057881b26a..984908a271c8d 100644
> --- a/arch/arm64/kvm/vgic/vgic-init.c
> +++ b/arch/arm64/kvm/vgic/vgic-init.c
> @@ -500,8 +500,11 @@ static void kvm_vgic_dist_destroy(struct kvm *kvm)
> dist->vgic_cpu_base = VGIC_ADDR_UNDEF;
> }
>
> - if (vgic_supports_direct_irqs(kvm))
> + if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3 &&
> + vgic_supports_direct_irqs(kvm))
> vgic_v4_teardown(kvm);
> + else if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V5)
> + vgic_v5_teardown(kvm);
nit: switch/case instead?
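i.e. something along these lines, modelled in userspace (the device-type values and direct-IRQ predicate are stand-ins for the real kernel symbols; the function returns which teardown path would run so the shape can be checked):

```c
#include <stdbool.h>

/* Stand-ins for the KVM device-type constants */
enum vgic_model { VGIC_MODEL_V3, VGIC_MODEL_V5 };

/* Returns which teardown path would run: 0 = none, 4 = v4, 5 = v5 */
static int teardown_path(enum vgic_model model, bool direct_irqs)
{
	switch (model) {
	case VGIC_MODEL_V3:
		return direct_irqs ? 4 : 0;	/* vgic_v4_teardown() */
	case VGIC_MODEL_V5:
		return 5;			/* vgic_v5_teardown() */
	}
	return 0;
}
```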
>
> xa_destroy(&dist->lpi_xa);
> }
> diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
> index fd3d6299a2baa..4e0d52b309628 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5.c
> @@ -7,6 +7,7 @@
>
> #include <linux/bitops.h>
> #include <linux/irqchip/arm-vgic-info.h>
> +#include <linux/irqdomain.h>
>
> #include "vgic.h"
> #include "vgic-v5-tables.h"
> @@ -162,6 +163,138 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
> return 0;
> }
>
> +/*
> + * This set of irq_chip functions is specific for doorbells.
> + */
> +static struct irq_chip vgic_v5_db_irq_chip = {
const?
> + .name = "GICv5-DB",
> + .irq_mask = irq_chip_mask_parent,
> + .irq_unmask = irq_chip_unmask_parent,
> + .irq_eoi = irq_chip_eoi_parent,
> + .irq_set_affinity = irq_chip_set_affinity_parent,
> + .irq_get_irqchip_state = irq_chip_get_parent_state,
> + .irq_set_irqchip_state = irq_chip_set_parent_state,
> + .flags = IRQCHIP_SET_TYPE_MASKED | IRQCHIP_SKIP_SET_WAKE |
> + IRQCHIP_MASK_ON_SUSPEND,
> +};
> +
> +static int vgic_v5_irq_db_domain_map(struct irq_domain *d, unsigned int virq,
> + u16 vpe_id)
> +{
> + int ret;
> + u32 lpi;
> + irq_hw_number_t hwirq;
> + struct irq_chip *chip = &vgic_v5_db_irq_chip;
> + struct irq_data *irqd = irq_desc_get_irq_data(irq_to_desc(virq));
> +
> + /*
> + * For the DB domain, we don't use the same hwirq as for LPIs.
> + */
> + hwirq = vpe_id;
> +
> + ret = gicv5_alloc_lpi();
NAK. Allocating LPIs is the task of the underlying domain that manages
LPIs, and absolutely not the vgic code.
> + if (ret < 0)
> + return ret;
> + lpi = ret;
> +
> + ret = irq_domain_alloc_irqs_parent(d, virq, 1, &lpi);
Why? I'd expect to see an irq_domain_alloc_irqs() for the whole VM,
and be done with it.
The whole allocation/freeing of LPIs is upside down. You really should
not have to do this, and I'd strongly suggest you align the way the
doorbell domain is constructed with the way GICv4 does it.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 07/43] KVM: arm64: gic-v5: Create & manage VM and VPE tables
2026-04-27 16:08 ` [PATCH 07/43] KVM: arm64: gic-v5: Create & manage VM and VPE tables Sascha Bischoff
2026-04-28 14:54 ` Vladimir Murzin
2026-04-28 15:55 ` Joey Gouly
@ 2026-04-29 10:25 ` Marc Zyngier
2 siblings, 0 replies; 66+ messages in thread
From: Marc Zyngier @ 2026-04-29 10:25 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 17:08:25 +0100,
Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> GICv5 uses a set of in-memory tables to track and manage VM
> state. These must be allocated by the hypervisor, and provided to the
> IRS to use.
>
> The VMT (Virtual Machine Table) is a linear or two level table
> comprising VMT Entries (VMTE). Each VMTE describes the state for a
> single VM. This state includes things such as the SPI and LPI IST
> configuration (coming in a future commit), an implementation-defined
> VM Descriptor, and a VPE Table (VPET).
>
> The VPET contains one entry per VPE belonging to a VM, and is used to
> mark a VPE as valid, as well as providing the address of an
> implementation-defined VPE Descriptor, which is used by the hardware
> to track and manage VPE state.
>
> This commit adds support for allocating the VMT, and managing the
> VMTEs. The VMTEs can be initialised or released for re-use. Allocation
> and tracking of unused VMTEs is handled with an IDA.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/kvm/vgic/vgic-v5-tables.c | 628 +++++++++++++++++++++++++++
> arch/arm64/kvm/vgic/vgic-v5-tables.h | 108 +++++
> include/kvm/arm_vgic.h | 2 +
> include/linux/irqchip/arm-gic-v5.h | 13 +
> 4 files changed, 751 insertions(+)
>
> diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> index 30e2b108b1aa3..502d05d46cccf 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5-tables.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> @@ -3,6 +3,634 @@
> * Copyright (C) 2025, 2026 Arm Ltd.
> */
>
> +#include <kvm/arm_vgic.h>
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <linux/sizes.h>
> +#include <linux/slab.h>
> +#include <linux/xarray.h>
> +#include <asm/kvm_mmu.h>
> +
> +#include "vgic.h"
> #include "vgic-v5-tables.h"
>
> struct vgic_v5_host_ist_caps gicv5_host_ist_caps;
> +
> +static struct vgic_v5_vmt *vmt_info;
> +DEFINE_XARRAY(vm_info);
Can this be made static?
> +
> +static bool vgic_v5_vmt_allocated(void)
> +{
> + return vmt_info != NULL;
> +}
> +
> +static int vgic_v5_check_vm_id(u16 vm_id)
> +{
> + if (vm_id >= vmt_info->num_entries)
> + return -EINVAL;
> +
> + return 0;
> +}
Under what circumstance do we have to issue these checks? This looks
like debug code to me.
> +
> +/*
> + * Our IRS might be coherent or non-coherent. If coherent, we can just emit a
> + * DSB to ensure that we're in sync. However, when non-coherent, we need to
> + * manage our cached data explicitly.
> + *
> + * This helper is used to handle both coherent and non-coherent IRSes, and
> + * handles all combinations of cleaning and invalidating to the PoC.
> + */
> +static void vgic_v5_clean_inval(void *va, size_t size, bool clean, bool inval)
> +{
> + unsigned long base = (unsigned long)va;
> +
> + /* Catch any accidental NOPs */
> + BUILD_BUG_ON(!(clean || inval));
> +
> + /* Coherent; emit DSB. */
> + if (!gicv5_host_ist_caps.irs_non_coherent) {
> + dsb(ishst);
> + return;
> + }
You need a DSB in all situations. Otherwise, the subsequent CMOs are
not ordered.
> +
> + if (clean && inval)
> + dcache_clean_inval_poc(base, base + size);
> + else if (clean)
> + dcache_clean_poc(base, base + size);
> + else if (inval)
> + dcache_inval_poc(base, base + size);
You could also do yourself a favour and just use clean+invalidate,
which conveniently works everywhere. There is very little point in
spreading that complexity all over the map.
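Combining both points — an unconditional barrier so later CMOs are ordered, and a single clean+invalidate path instead of three variants — the helper could collapse to something like the following sketch (barrier and cache-maintenance calls are stubbed with counters so the control flow can be checked; the real kernel primitives are `dsb()` and `dcache_clean_inval_poc()`):

```c
#include <assert.h>
#include <stdbool.h>

/* Counters standing in for dsb(ishst) and dcache_clean_inval_poc(). */
static int barriers, cmos;
static bool irs_non_coherent;
static void dsb_ishst(void) { barriers++; }
static void dcache_clean_inval_poc(unsigned long start, unsigned long end)
{
	(void)start; (void)end;
	cmos++;
}

/*
 * Reworked per the review: the barrier is issued in all cases so that
 * any CMOs that follow are ordered against prior writes, and the
 * non-coherent path always does clean+invalidate, which is correct
 * everywhere the three separate variants were.
 */
static void vgic_v5_clean_inval(void *va, unsigned long size)
{
	unsigned long base = (unsigned long)va;

	dsb_ishst();

	if (!irs_non_coherent)
		return;

	dcache_clean_inval_poc(base, base + size);
}
```

With this shape the `clean`/`inval` flag arguments (and the `BUILD_BUG_ON`) disappear from every call site.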
> +}
> +
> +/*
> + * Create a linear VM table, rounding up the number of entries to at least one
> + * whole page to give us nicer alignment.
Define nicer. What are the constraints? Does GICv5 have a different
notion of "page size" from the CPU like on v3?
> + *
> + * Note: We don't update the number of entries tracked in our tracking structure
> + * as this might be higher than the number of bits supported by the HW.
> + */
> +static int vgic_v5_alloc_vmt_linear(unsigned int num_entries)
> +{
> + unsigned int l2_entries_per_page;
> + size_t alloc_size;
> +
> + /* Potentially throw away a bit of memory for the sake of alignment! */
> + l2_entries_per_page = PAGE_SIZE / GICV5_VMTEL2E_SIZE;
> + if (num_entries < l2_entries_per_page)
> + num_entries = l2_entries_per_page;
Don't you want to allocate full pages, irrespective of the number of
entries? I don't see the reason why you'd treat the first page
differently from the rest.
> +
> + alloc_size = num_entries * sizeof(struct vmtl2_entry);
> +
> + vmt_info->linear.vmt_base = kzalloc(alloc_size, GFP_KERNEL);
Consider using the new allocation helpers:
vmt_info->vmt_base = kzalloc_objs(struct vmtl2_entry, num_entries, GFP_KERNEL)
But if you are supposed to use full pages anyway, you might as well
switch to the page allocator.
> + if (vmt_info->linear.vmt_base == NULL)
> + return -ENOMEM;
> +
> + vgic_v5_clean_inval(vmt_info->linear.vmt_base, alloc_size, true, true);
> +
> + return 0;
> +}
> +
> +/*
> + * Allocate the first level of a two-level VM table. The second-level VM tables
> + * are allocated on demand (by vgic_v5_alloc_l2_vmt()).
> + *
> + * Note: If there are too few entries, these are rounded up to the size of an L2
> + * table (4k) to ensure sane alignment. As with the linear table, the tracked
> + * number of entries is not increased to avoid the case of going above what the
> + * hardware supports.
> + */
> +static int vgic_v5_alloc_vmt_two_level(unsigned int num_entries)
> +{
> + size_t alloc_size;
> +
> + /* Potentially throw away a bit of memory for the sake of alignment! */
> + if (num_entries < GICV5_VMT_L2_TABLE_ENTRIES)
> + num_entries = GICV5_VMT_L2_TABLE_ENTRIES;
> +
> + /*
> + * Let's make sure that we always allocate a whole power of 2
> + * of entries. Note that we need to subtract 1 from the fls()
> + * result in order to give the correct number of bits as we
> + * are operating on a whole power of 2.
> + */
> + num_entries = roundup_pow_of_two(num_entries);
> +
> + vmt_info->l2.num_l1_ents = (num_entries / GICV5_VMT_L2_TABLE_ENTRIES);
> + alloc_size = vmt_info->l2.num_l1_ents * sizeof(vmtl1_entry);
> +
> + vmt_info->l2.vmt_base = kzalloc(alloc_size, GFP_KERNEL);
> + if (vmt_info->l2.vmt_base == NULL)
> + return -ENOMEM;
> +
> + vgic_v5_clean_inval(vmt_info->l2.vmt_base, alloc_size, true, true);
> +
> + vmt_info->l2.l2ptrs = kzalloc_objs(*vmt_info->l2.l2ptrs,
> + vmt_info->l2.num_l1_ents,
> + GFP_KERNEL);
> + if (vmt_info->l2.l2ptrs == NULL) {
> + kfree(vmt_info->l2.vmt_base);
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> +
> +/*
> + * Allocate a second level VMT, if required. This can be called eagerly, and
> + * will only perform the allocation if required.
> + */
> +static int vgic_v5_alloc_l2_vmt(struct kvm *kvm)
> +{
> + unsigned int l1_index;
> + struct vmtl2_entry *l2_table;
> + vmtl1_entry tmp;
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct kvm_vcpu *vcpu0 = kvm_get_vcpu(kvm, 0);
> + struct gicv5_cmd_info cmd_info;
> + int ret;
> +
> + if (!vgic_v5_vmt_allocated())
> + return -ENXIO;
Shouldn't that be checked in the caller rather than being buried in a
helper? It is also odd to check for "vmt being allocated" in a
function that "allocates a vmt". My gut feeling is that vmt_allocated
is not exactly the correct name.
> +
> + /* Nothing to do if we have linear tables! */
> + if (!vmt_info->two_level)
> + return 0;
> +
> + ret = vgic_v5_check_vm_id(vm_id);
> + if (ret)
> + return ret;
Why should we check the vm id at this stage? Surely we should be able
to trust the value passed around.
> +
> + /*
> + * We have 4k-sized L2 tables - this is mandated by the spec for
> + * two-level VMTs. This means that we have 128 entries per L1 VMTE.
> + */
> + l1_index = vm_id / GICV5_VMT_L2_TABLE_ENTRIES;
> +
> + if (l1_index > vmt_info->l2.num_l1_ents)
> + return -E2BIG;
Same for these checks. If this is something we got from the vm_id
allocator, we should be able to trust it.
> +
> + /* Already valid? Great! */
> + if (vmt_info->l2.l2ptrs[l1_index] != NULL)
> + return 0;
> +
> + l2_table = kzalloc(GICV5_VMT_L2_TABLE_SIZE, GFP_KERNEL);
> + if (l2_table == NULL)
nit:
if (!l2_table)
is the more idiomatic way of writing this.
> + return -ENOMEM;
> +
> + if (virt_to_phys(l2_table) & ~GICV5_VMTEL1E_L2_ADDR) {
Are you second-guessing the memory allocator? We have a guarantee that
the result will be aligned to the size of the next power of 2.
> + kfree(l2_table);
> + return -EINVAL;
> + }
> +
> + vmt_info->l2.l2ptrs[l1_index] = l2_table;
> +
> + /* Alignment issue! */
> + if (virt_to_phys(l2_table) & ~GICV5_VMTEL1E_L2_ADDR) {
Doing it twice just to be sure?
> + kfree(l2_table);
> + return -EFAULT;
> + }
> +
> + tmp = virt_to_phys(l2_table) & GICV5_VMTEL1E_L2_ADDR;
You have just verified twice that the bottom bits were 0. I sense a
certain level of paranoia! ;-)
> + WRITE_ONCE(vmt_info->l2.vmt_base[l1_index], cpu_to_le64(tmp));
> +
> + vgic_v5_clean_inval(l2_table, GICV5_VMT_L2_TABLE_SIZE, true, true);
> + /* Skip inval for now - wait until table is made valid by HW */
> + vgic_v5_clean_inval(vmt_info->l2.vmt_base + l1_index,
> + sizeof(vmtl1_entry), true, false);
> +
> + /* VMAP in the L2 VMT via the IRS */
> + cmd_info.cmd_type = VMT_L2_MAP;
cmd_info.data is left uninitialised.
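A designated initializer fixes this for free: C guarantees that members not named in the initializer are zeroed, so `.data` cannot carry stack garbage into the callback. A small self-contained sketch (types copied from the patch, helper name hypothetical):

```c
#include <assert.h>
#include <stdint.h>

enum gicv5_vcpu_info_cmd_type { VMT_L2_MAP, VMTE_MAKE_VALID, VMTE_MAKE_INVALID };

struct gicv5_cmd_info {
	enum gicv5_vcpu_info_cmd_type cmd_type;
	uint64_t data;
};

/* Members not named in a designated initializer are zero-filled. */
static struct gicv5_cmd_info make_l2_map_cmd(void)
{
	struct gicv5_cmd_info cmd_info = { .cmd_type = VMT_L2_MAP };
	return cmd_info;
}
```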
> + ret = irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu0), &cmd_info);
Who implements the irq_set_vcpu_affinity() callback? I don't see it in
any of the previous patches, and I'd like to be sure that this
callback doesn't snapshot the pointer...
It is also slightly odd to do the VM init using a vcpu-specific
handle. Not necessarily something to change, but at least document it.
> +
> + /* We've failed to make the L2 VMT valid - things are very broken! */
> + if (ret) {
> + /* Remove the pointer from L1 table */
> + WRITE_ONCE(vmt_info->l2.vmt_base[l1_index], 0);
> +
> + kfree(l2_table);
> + vmt_info->l2.l2ptrs[l1_index] = NULL;
> +
> + return ret;
> + }
> +
> + /* Table updated; inval our copy */
> + vgic_v5_clean_inval(vmt_info->l2.vmt_base + l1_index,
> + sizeof(vmtl1_entry), false, true);
> +
> + return ret;
> +}
> +
> +/*
> + * Allocate the top-level VMT. This can either be linear or two-level.
> + */
> +int vgic_v5_vmt_allocate(bool two_level, unsigned int num_entries,
> + size_t vmd_size, size_t vped_size,
> + unsigned int max_vpes)
> +{
> + int ret = 0;
> +
> + if (vgic_v5_vmt_allocated())
> + return -EBUSY;
> +
> + /* VMD is optional; using 0 to signal that it not needed. */
> + if (vmd_size != 0 &&
> + (vmd_size < VMD_MIN_SIZE || vmd_size > VMD_MAX_SIZE))
> + return -EINVAL;
> +
> + if (vped_size < VPED_MIN_SIZE || vped_size > VPED_MAX_SIZE)
> + return -EINVAL;
> +
> + /* Allocate the tracking structure */
> + vmt_info = kzalloc_obj(*vmt_info, GFP_KERNEL);
> + if (vmt_info == NULL)
> + return -ENOMEM;
> +
> + ida_init(&vmt_info->vm_id_ida);
> + vmt_info->max_vpes = max_vpes;
> + vmt_info->vmd_size = vmd_size;
> + vmt_info->vped_size = vped_size;
> + vmt_info->two_level = two_level;
> + vmt_info->num_entries = num_entries;
> +
> + if (!two_level)
> + ret = vgic_v5_alloc_vmt_linear(num_entries);
> + else
> + ret = vgic_v5_alloc_vmt_two_level(num_entries);
> +
> + /* If anything failed, free our tracking structure before returning */
> + if (ret) {
> + kfree(vmt_info);
> + vmt_info = NULL;
> + }
> +
> + return ret;
> +}
> +
> +/*
> + * Free the VMT and associated tracking structures. This isn't strictly expected
> + * to be called in general operation, but instead exists for completeness.
> + */
> +int vgic_v5_vmt_free(void)
> +{
> + if (!vgic_v5_vmt_allocated())
> + return -EINVAL;
> +
> + if (!vmt_info->two_level) {
> + kfree(vmt_info->linear.vmt_base);
> + } else {
> + /* Free the L2 tables; kfree(NULL) is safe */
> + for (int i = 0; i < vmt_info->l2.num_l1_ents; ++i)
> + kfree(vmt_info->l2.l2ptrs[i]);
> + kfree(vmt_info->l2.l2ptrs);
> +
> + /* And now free the L1 table */
> + kfree(vmt_info->l2.vmt_base);
> + }
> +
> + ida_destroy(&vmt_info->vm_id_ida);
> + kfree(vmt_info);
> + vmt_info = NULL;
> +
> + return 0;
> +}
> +
> +/*
> + * Look up a VMT Entry by VM ID.
> + */
> +static int vgic_v5_get_l2_vmte(u16 vm_id, struct vmtl2_entry **vmte)
> +{
A much more idiomatic way to do that would be to just return a
pointer, and to encode the potential error with ERR_PTR(), checking
for IS_ERR()/PTR_ERR(). Or just return NULL on error, which is a
pretty standard behaviour.
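A sketch of the ERR_PTR() shape being suggested, with minimal userspace stand-ins for the kernel's `ERR_PTR()`/`IS_ERR()`/`PTR_ERR()` helpers so the pattern is self-contained (the table size and error value are placeholders):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel's err.h helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

#ifndef EINVAL
#define EINVAL 22
#endif

struct vmtl2_entry { uint64_t val[4]; };
static struct vmtl2_entry table[4];

/* Lookup returning the entry directly; errors encoded with ERR_PTR(). */
static struct vmtl2_entry *vgic_v5_get_l2_vmte(uint16_t vm_id)
{
	if (vm_id >= 4)
		return ERR_PTR(-EINVAL);
	return &table[vm_id];
}
```

Callers then test `IS_ERR(vmte)` instead of threading a separate `int ret` and output pointer through every lookup.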
> + unsigned int l1_index, l2_index;
> + struct vmtl2_entry *l2_table;
> + int ret;
> +
> + ret = vgic_v5_check_vm_id(vm_id);
> + if (ret)
> + return ret;
> +
> + if (!vmt_info->two_level) {
> + /* All entries always valid for Linear table */
> + *vmte = &vmt_info->linear.vmt_base[vm_id];
> + } else {
> + l1_index = vm_id / GICV5_VMT_L2_TABLE_ENTRIES;
> + l2_index = vm_id % GICV5_VMT_L2_TABLE_ENTRIES;
> +
> + if (l1_index > vmt_info->l2.num_l1_ents)
> + return -E2BIG;
> +
> + if (vmt_info->l2.l2ptrs[l1_index] == NULL)
> + return -EINVAL;
> +
> + l2_table = vmt_info->l2.l2ptrs[l1_index];
> + *vmte = &l2_table[l2_index];
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * Zero a VMT Entry, and flush & invalidate to the PoC, if required.
> + */
> +static int vgic_v5_reset_vmte(struct kvm *kvm)
> +{
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vmtl2_entry *vmte;
> + int ret;
> +
> + ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
> + if (ret)
> + return ret;
> +
> + WRITE_ONCE(vmte->val[0], 0ULL);
> + WRITE_ONCE(vmte->val[1], 0ULL);
> + WRITE_ONCE(vmte->val[2], 0ULL);
> + WRITE_ONCE(vmte->val[3], 0ULL);
Why the WRITE_ONCE()? Is there anything observing these writes
concurrently? Also, missing endianness conversions.
> +
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
> +
> + return 0;
> +}
> +
> +/*
> + * Use the IDA to allocate a new VM ID, and track it in the gicv5_vm data
> + * structure. If we're out of VM IDs, the IDA catches that, and we return the
> + * error (-ENOSPC).
> + */
> +int vgic_v5_allocate_vm_id(struct kvm *kvm)
> +{
> + int id;
> +
> + id = ida_alloc_max(&vmt_info->vm_id_ida, vmt_info->num_entries - 1u,
> + GFP_KERNEL);
> + if (id < 0)
> + return id;
> +
> + kvm->arch.vgic.gicv5_vm.vm_id = id;
> + kvm->arch.vgic.gicv5_vm.vm_id_valid = true;
Do we need this extra flag? Why can't that be a specific id value?
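One way to drop the flag is a sentinel value, e.g. widening the stored ID and reserving a negative value to mean "unallocated" (sentinel name and field layout here are hypothetical, not the patch's):

```c
#include <assert.h>

#define VM_ID_NONE (-1)	/* hypothetical sentinel replacing vm_id_valid */

struct vgic_v5_vm { int vm_id; };

static void vm_init(struct vgic_v5_vm *vm)       { vm->vm_id = VM_ID_NONE; }
static int  vm_id_valid(const struct vgic_v5_vm *vm) { return vm->vm_id != VM_ID_NONE; }
static void vm_assign_id(struct vgic_v5_vm *vm, int id) { vm->vm_id = id; }
static void vm_release_id(struct vgic_v5_vm *vm) { vm->vm_id = VM_ID_NONE; }
```

This keeps the ID and its validity in one word, so they can never disagree. Note the trade-off: if all 16 VM ID bits can be in use, a u16 sentinel would collide with a legal ID, hence the wider type.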
> +
> + return 0;
> +}
> +
> +/*
> + * Release the VM ID to allow it to be reallocated in the future.
> + */
> +void vgic_v5_release_vm_id(struct kvm *kvm)
> +{
> + ida_free(&vmt_info->vm_id_ida, kvm->arch.vgic.gicv5_vm.vm_id);
> + kvm->arch.vgic.gicv5_vm.vm_id_valid = false;
> +}
> +
> +/*
> + * Initialise an entry in the VMT based on the index of the VM.
> + *
> + * Note: We don't mark the VMTE as valid as this needs to be done by
> + * the hardware.
> + */
> +int vgic_v5_vmte_init(struct kvm *kvm)
> +{
> + int nr_cpus = atomic_read(&kvm->online_vcpus);
> + struct vgic_v5_vm_info *vmi = NULL;
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + void *vmd = NULL, *vpet = NULL;
> + struct vmtl2_entry *vmte;
> + void **vped_ptrs = NULL;
> + size_t vpet_alloc_size;
> + int ret;
> + u64 tmp;
> +
> + if (nr_cpus > vmt_info->max_vpes)
> + return -E2BIG;
> +
> + /*
> + * If we're using two-level VMTs, L2 is allocated on demand. For linear
> + * VMTs, this is a NOP.
> + */
> + if (vgic_v5_alloc_l2_vmt(kvm))
> + return -EIO;
> +
> + ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
> + if (ret)
> + return ret;
> +
> + /* If the entry is already valid, something went wrong */
> + if (FIELD_GET(GICV5_VMTEL2E_VALID, le64_to_cpu(READ_ONCE(vmte->val[0])))) {
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
> + return -EINVAL;
> + }
> +
> + ret = vgic_v5_reset_vmte(kvm);
> + if (ret)
> + return ret;
> +
> + vmi = kzalloc_objs(*vmi, GFP_KERNEL);
Errr. No. For a single object, this is *kzalloc_obj*, no trailing 's'.
Here, you are allocating GFP_KERNEL objects, which may or may not be
sensible...
Also, GFP_KERNEL is the default, so you can write this as:
vmi = kzalloc_obj(*vmi);
> + if (vmi == NULL) {
> + ret = -ENOMEM;
> + goto out_fail;
> + }
> +
> + ret = xa_insert(&vm_info, vm_id, vmi, GFP_KERNEL);
> + if (ret)
> + goto out_fail;
> +
> + /* Allocate and assign the VM Descriptor, if required. */
> + if (vmt_info->vmd_size != 0) {
> + vmd = kzalloc(vmt_info->vmd_size, GFP_KERNEL);
kzalloc_obj()
> + if (!vmd) {
> + ret = -ENOMEM;
> + goto out_fail;
> + }
> +
> + /* Stash the VA so we can free it later */
> + vmi->vmd_base = vmd;
> +
> + tmp = FIELD_PREP(GICV5_VMTEL2E_VMD_ADDR,
> + virt_to_phys(vmd) >>
> + GICV5_VMTEL2E_VMD_ADDR_SHIFT);
> + WRITE_ONCE(vmte->val[0], cpu_to_le64(tmp));
> + }
> +
> + /*
> + * Allocate and assign the VPE Table. We can only describe the number of
> + * VPE ID Bits in the VMTE, and therefore we round up the number of CPUs
> + * to a whole power of two.
> + */
> + nr_cpus = roundup_pow_of_two(nr_cpus);
> + vmi->vpe_id_bits = fls(nr_cpus) - 1;
> +
> + vpet_alloc_size = sizeof(vpe_entry) * nr_cpus;
> + vpet = kzalloc(vpet_alloc_size, GFP_KERNEL);
> + if (!vpet) {
> + ret = -ENOMEM;
> + goto out_fail;
> + }
> +
> + /* Stash the VA so we can free it later */
> + vmi->vpet_base = vpet;
> +
> + tmp = FIELD_PREP(GICV5_VMTEL2E_VPET_ADDR,
> + virt_to_phys(vpet) >> GICV5_VMTEL2E_VPET_ADDR_SHIFT);
> + tmp |= FIELD_PREP(GICV5_VMTEL2E_VPE_ID_BITS, vmi->vpe_id_bits);
> + WRITE_ONCE(vmte->val[1], cpu_to_le64(tmp));
> +
> + vped_ptrs = kzalloc_objs(*vped_ptrs, nr_cpus, GFP_KERNEL);
> + if (vped_ptrs == NULL) {
> + ret = -ENOMEM;
> + goto out_fail;
> + }
> + vmi->vped_ptrs = vped_ptrs;
> +
> + if (vmd)
> + vgic_v5_clean_inval(vmd, vmt_info->vmd_size, true, true);
> + vgic_v5_clean_inval(vpet, vpet_alloc_size, true, true);
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
> +
> + kvm->arch.vgic.gicv5_vm.vmte_allocated = true;
> +
> + return 0;
> +
> +out_fail:
> + /* kfree(NULL) is safe so we can just kfree() at leisure */
> + kfree(vmd);
> + kfree(vpet);
> + kfree(vped_ptrs);
> + if (vmi)
> + xa_erase(&vm_info, vm_id);
> + kfree(vmi);
> +
> + vgic_v5_reset_vmte(kvm);
> +
> + return ret;
> +}
> +
> +/*
> + * Release the VMT Entry, freeing up any allocated data structures before
> + * zeroing the VMTE.
> + *
> + * The VMTE must be marked as invalid before it is released.
> + */
> +int vgic_v5_vmte_release(struct kvm *kvm)
> +{
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vgic_v5_vm_info *vmi;
> + struct vmtl2_entry *vmte;
> + int ret;
> +
> + ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
> + if (ret)
> + return ret;
> +
> + /* Reject if the VMTE has not been marked as invalid! */
> + if (FIELD_GET(GICV5_VMTEL2E_VALID, le64_to_cpu(READ_ONCE(vmte->val[0])))) {
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
What is this invalidation achieving?
> + return -EINVAL;
> + }
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (WARN_ON_ONCE(!vmi))
> + goto no_vmi;
> +
> + kfree(vmi->vmd_base);
> + kfree(vmi->vpet_base);
> +
> + xa_erase(&vm_info, vm_id);
> + kfree(vmi);
> +
> +no_vmi:
> + /*
> + * If we didn't get far enough into allocating a VMTE to create the VM
> + * info structure, then we just zero the VMTE and move on. There's
> + * nothing else we can realistically do here.
> + */
> + ret = vgic_v5_reset_vmte(kvm);
> + if (ret)
> + return ret;
> +
> + return 0;
> +}
> +
> +/*
> + * Allocate a VPE descriptor and provide it to the hardware via the VPE Table.
> + */
> +int vgic_v5_vmte_alloc_vpe(struct kvm_vcpu *vcpu)
> +{
> + u16 vm_id = vgic_v5_vm_id(vcpu->kvm);
> + u16 vpe_id = vgic_v5_vpe_id(vcpu);
> + struct vgic_v5_vm_info *vmi;
> + vpe_entry tmp, *vpet_base;
> + void *vped;
> +
> + /* Make sure we're not over what the hardware supports */
> + if (vpe_id >= vmt_info->max_vpes)
> + return -E2BIG;
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (WARN_ON_ONCE(!vmi))
> + return -EINVAL;
> +
> + if (vpe_id >= 1 << vmi->vpe_id_bits)
> + return -E2BIG;
> +
> + vpet_base = vmi->vpet_base;
> +
> + /* If the VPETE for this CPU is already valid we've gone wrong */
> + if (FIELD_GET(GICV5_VPE_VALID, le64_to_cpu(READ_ONCE(vpet_base[vpe_id])))) {
> + vgic_v5_clean_inval(&vpet_base[vpe_id], sizeof(*vpet_base), true, true);
Same question as above.
> + return -EBUSY;
> + }
> +
> + /* Alloc VPE Descriptor. Only used by IRS. */
> + vped = kzalloc(vmt_info->vped_size, GFP_KERNEL);
> + if (vped == NULL)
> + return -ENOMEM;
> +
> + vmi->vped_ptrs[vpe_id] = vped;
> +
> + tmp = FIELD_PREP(GICV5_VPED_ADDR, virt_to_phys(vped) >> GICV5_VPED_ADDR_SHIFT);
> + WRITE_ONCE(vpet_base[vpe_id], cpu_to_le64(tmp));
> +
> + vgic_v5_clean_inval(vped, vmt_info->vped_size, true, true);
> + vgic_v5_clean_inval(vpet_base + vpe_id, sizeof(vpe_entry), true, true);
> +
> + return 0;
> +}
> +
> +/*
> + * Free the memory allocated for the VPE descriptor.
> + */
> +int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu)
> +{
> + u16 vm_id = vgic_v5_vm_id(vcpu->kvm);
> + u16 vpe_id = vgic_v5_vpe_id(vcpu);
> + struct vgic_v5_vm_info *vmi;
> + struct vmtl2_entry *vmte;
> + vpe_entry *vpet_base;
> + void *vped;
> + int ret;
> +
> + ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
> + if (ret)
> + return ret;
> +
> + if (FIELD_GET(GICV5_VMTEL2E_VALID, le64_to_cpu(READ_ONCE(vmte->val[0])))) {
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
Again.
> + return -EBUSY;
> + }
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (!vmi)
> + return -EINVAL;
> +
> + if (vpe_id >= 1 << vmi->vpe_id_bits)
> + return -E2BIG;
> +
> + vpet_base = vmi->vpet_base;
> + WRITE_ONCE(vpet_base[vpe_id], 0ULL);
> +
> + vgic_v5_clean_inval(vpet_base + vpe_id, sizeof(vpe_entry), true, true);
> +
> + /* Free VPE Descriptor. Only used by IRS. */
> + vped = vmi->vped_ptrs[vpe_id];
> + vmi->vped_ptrs[vpe_id] = NULL;
> + kfree(vped);
> +
> + return 0;
> +}
I've only very lightly gone through the rest of the allocations, but
what strikes me is the distinct lack of locking and interaction with
the HW. I guess this is taken care of somewhere else?
> diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> index cf00a248eabd5..5501a44308362 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5-tables.h
> +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> @@ -8,6 +8,86 @@
>
> #include <linux/irqchip/arm-gic-v5.h>
>
> +#define VM_ID_BITS_MIN 8
> +#define VM_ID_BITS_MAX 16
> +#define VMD_MIN_SIZE 8
> +#define VMD_MAX_SIZE 4096
> +#define VPED_MIN_SIZE 8
> +#define VPED_MAX_SIZE 4096
> +#define VPE_ID_BITS_MIN 8
> +#define VPE_ID_BITS_MAX 16
> +
> +/* Level 1 Virtual Machine Table Entry */
> +typedef __le64 vmtl1_entry;
> +#define GICV5_VMTEL1E_VALID BIT_ULL(0)
> +/* Note that there is no shift for the address by design */
> +#define GICV5_VMTEL1E_L2_ADDR GENMASK(51, 12)
> +
> +#define GICV5_VMTEL2E_SIZE 32ULL
> +/* An L2 table (two-level VMT) is ALWAYS 4kB! */
> +#define GICV5_VMT_L2_TABLE_SIZE 4096ULL
> +#define GICV5_VMT_L2_TABLE_ENTRIES (GICV5_VMT_L2_TABLE_SIZE / GICV5_VMTEL2E_SIZE)
> +
> +/* Level 2 Virtual Machine Table Entry */
> +struct vmtl2_entry {
> + __le64 val[4];
> +};
> +
> +/*
> + * As the L2 VMTE is a large data structure, we are splitting it into 4 parts.
> + * We only mask and shift WITHIN each part for simplicity.
> + */
> +/* First 64-bit chunk */
> +#define GICV5_VMTEL2E_VALID BIT_ULL(0)
> +#define GICV5_VMTEL2E_VMD_ADDR_SHIFT 3ULL
> +#define GICV5_VMTEL2E_VMD_ADDR GENMASK_ULL(55, 3)
> +/* Second 64-bit chunk */
> +#define GICV5_VMTEL2E_VPET_ADDR_SHIFT 3ULL
> +#define GICV5_VMTEL2E_VPET_ADDR GENMASK_ULL(55, 3)
> +#define GICV5_VMTEL2E_VPE_ID_BITS GENMASK_ULL(63, 59)
> +/* Third & fourth 64-bit chunks (the encodings are the same for each) */
> +#define GICV5_VMTEL2E_IST_VALID BIT_ULL(0)
> +#define GICV5_VMTEL2E_IST_L2SZ GENMASK_ULL(2, 1)
> +#define GICV5_VMTEL2E_IST_ADDR_SHIFT 6ULL
> +#define GICV5_VMTEL2E_IST_ADDR GENMASK_ULL(55, 6)
> +#define GICV5_VMTEL2E_IST_ISTSZ GENMASK_ULL(57, 56)
> +#define GICV5_VMTEL2E_IST_STRUCTURE BIT_ULL(58)
> +#define GICV5_VMTEL2E_IST_ID_BITS GENMASK_ULL(63, 59)
> +
> +/* Virtual PE Table Entry */
> +typedef __le64 vpe_entry;
> +#define GICV5_VPE_VALID BIT_ULL(0)
> +/* Note that there is no shift for the address by design. */
> +#define GICV5_VPED_ADDR_SHIFT 3ULL
> +#define GICV5_VPED_ADDR GENMASK_ULL(55, 3)
Are these definitions used outside of the .c file? If not, consider
moving them into it.
> +
> +struct vgic_v5_vm_info {
> + void __iomem *vmd_base;
> + vpe_entry __iomem *vpet_base;
> + void __iomem **vped_ptrs;
> + u8 vpe_id_bits;
> +};
> +
> +struct vgic_v5_vmt {
> + union {
> + struct {
> + struct vmtl2_entry *vmt_base;
> + unsigned int num_ents;
> + } linear;
> + struct {
> + vmtl1_entry *vmt_base;
> + struct vmtl2_entry **l2ptrs;
> + unsigned int num_l1_ents;
> + } l2;
> + };
> + bool two_level;
> + unsigned int num_entries;
> + unsigned int max_vpes;
> + size_t vmd_size;
> + size_t vped_size;
> + struct ida vm_id_ida;
> +};
> +
> struct vgic_v5_host_ist_caps {
> /* IST Capabilities */
>
> @@ -38,4 +118,32 @@ static inline struct vgic_v5_host_ist_caps *vgic_v5_host_caps(void)
> return &gicv5_host_ist_caps;
> }
>
> +static inline u16 vgic_v5_vm_id(struct kvm *kvm)
> +{
> + return kvm->arch.vgic.gicv5_vm.vm_id;
> +}
> +
> +static inline u16 vgic_v5_vpe_id(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->vcpu_id;
> +}
> +
> +static inline int vgic_v5_vpe_db(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->arch.vgic_cpu.vgic_v5.gicv5_vpe.db;
> +}
> +
> +int vgic_v5_vmt_allocate(bool two_level, unsigned int num_entries,
> + size_t vmd_size, size_t vped_size,
> + unsigned int vpe_id_bits);
> +int vgic_v5_vmt_free(void);
> +
> +int vgic_v5_allocate_vm_id(struct kvm *kvm);
> +void vgic_v5_release_vm_id(struct kvm *kvm);
> +
> +int vgic_v5_vmte_init(struct kvm *kvm);
> +int vgic_v5_vmte_release(struct kvm *kvm);
> +int vgic_v5_vmte_alloc_vpe(struct kvm_vcpu *vcpu);
> +int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu);
> +
> #endif
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index 05dbd01f6fd21..0bcbc751593cc 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -372,6 +372,8 @@ struct vgic_v5_vm {
> int vpe_db_base;
> int nr_vpes;
> u16 vm_id;
> + bool vm_id_valid;
> + bool vmte_allocated;
> };
>
> struct vgic_dist {
> diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
> index 087d94f739672..89579ee04f5d1 100644
> --- a/include/linux/irqchip/arm-gic-v5.h
> +++ b/include/linux/irqchip/arm-gic-v5.h
> @@ -182,6 +182,7 @@
> #define GICV5_IRS_MAP_L2_ISTR_ID GENMASK(23, 0)
>
> #define GICV5_ISTL1E_VALID BIT_ULL(0)
> +#define GICV5_IRS_ISTL1E_SIZE 8UL
>
> #define GICV5_ISTL1E_L2_ADDR_MASK GENMASK_ULL(55, 12)
>
> @@ -444,4 +445,16 @@ void gicv5_free_lpi(u32 lpi);
>
> void __init gicv5_its_of_probe(struct device_node *parent);
> void __init gicv5_its_acpi_probe(void);
> +
> +enum gicv5_vcpu_info_cmd_type {
> + VMT_L2_MAP, /* Map in a L2 VMT - *may* happen on VM init */
> + VMTE_MAKE_VALID, /* Make the VMTE valid */
> + VMTE_MAKE_INVALID, /* Make the VMTE (et al.) invalid */
> +};
> +
> +struct gicv5_cmd_info {
> + enum gicv5_vcpu_info_cmd_type cmd_type;
> + u64 data;
> +};
> +
> #endif
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 09/43] KVM: arm64: gic-v5: Implement VMT/vIST IRS MMIO Ops
2026-04-27 16:09 ` [PATCH 09/43] KVM: arm64: gic-v5: Implement VMT/vIST IRS MMIO Ops Sascha Bischoff
@ 2026-04-29 12:50 ` Joey Gouly
2026-04-29 16:04 ` Marc Zyngier
1 sibling, 0 replies; 66+ messages in thread
From: Joey Gouly @ 2026-04-29 12:50 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, maz@kernel.org, oliver.upton@linux.dev,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, Apr 27, 2026 at 04:09:06PM +0000, Sascha Bischoff wrote:
> GICv5 has rules about which fields of a VMTE (or L1 VMT) may be
> directly written by the host once the table is valid. This ensures
> that no stale state is cached by the hardware, and provides a clear
> interface for making VMs, ISTs, etc, valid.
>
> The hypervisor is responsible for populating the VMTE for a
> VM. However, it is not permitted to write the Valid bit (as the VM
> table is already valid). Instead, the VM is made valid via an IRS MMIO
> Op. The same applies to the ISTs - they must be made valid via the
> host IRS.
>
> This commit adds support for:
>
> * Making level 2 VMTs valid (only), allowing for dynamic level 2 table
> allocation.
> * Making VMTEs (VMs) valid or invalid
> * Making SPI/LPI ISTs valid or invalid for a specific VM
>
> When (successfully) probing for a GICv5, the VMT is allocated, and is
> made valid via the IRS's MMIO interface.
>
> This commit also extends the doorbell domain to allow the doorbells
> themselves to act as a conduit for issuing commands - this is similar
> to what exists for GICv4 support. Effectively, irq_set_vcpu_affinity()
> becomes an ioctl-like interface for issuing commands specific to
> either a VM or the particular VPE that the doorbell belongs to. This
> change adds support for the following via the VPE doorbells:
>
> VMT_L2_MAP - Make a second level VM table valid
> VMTE_MAKE_VALID - Make a single VMTE (and hence VM) valid
> VMTE_MAKE_INVALID - Make a single VMTE (and hence VM) invalid
> SPI_VIST_MAKE_VALID - Make the SPI IST valid
> LPI_VIST_MAKE_VALID - Make the LPI IST valid
> LPI_VIST_MAKE_INVALID - Make the LPI IST invalid
>
> Note: It is intentional that there is no SPI_VIST_MAKE_INVALID - this
> cannot happen while the VM is live, and given that the SPI is
> allocated as part of VM creation, there is no need to make it invalid
> again until the VM is destroyed, at which point the VMTE is
> invalid. Therefore, there's no need to do this via the host's IRS MMIO
> interface, as it can be directly marked as invalid and freed. LPIs, on
> the other hand, are driven by the guest itself, and the guest is
> theoretically free to invalidate and free the LPI IST at any point.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/kvm/vgic/vgic-v5-tables.c | 25 +++
> arch/arm64/kvm/vgic/vgic-v5-tables.h | 2 +
> arch/arm64/kvm/vgic/vgic-v5.c | 236 ++++++++++++++++++++++++++-
> include/linux/irqchip/arm-gic-v5.h | 30 ++++
> 4 files changed, 290 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> index de905f37b61a5..0120c3205dea6 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5-tables.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> @@ -666,6 +666,26 @@ int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu)
> return 0;
> }
>
> +phys_addr_t vgic_v5_get_vmt_base(void)
> +{
> + phys_addr_t vmt_base;
> +
> + if (!vgic_v5_vmt_allocated())
> + return -ENXIO;
> +
> + if (!vmt_info->two_level)
> + vmt_base = virt_to_phys(vmt_info->linear.vmt_base);
> + else
> + vmt_base = virt_to_phys(vmt_info->l2.vmt_base);
> +
> + return vmt_base;
> +}
> +
> +u8 vgic_v5_vmt_vpe_id_bits(void)
> +{
> + return fls(vmt_info->max_vpes) - 1;
> +}
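One thing I'd double-check here: if max_vpes isn't guaranteed to be a
power of two, fls(max_vpes) - 1 rounds *down*, so the resulting ID space
can't address the top VPEs; ilog2(roundup_pow_of_two(max_vpes)) or
order_base_2() may be what's wanted. A standalone sketch of the rounding
behaviour (stub_fls() is just a portable stand-in for the kernel's fls(),
not from the patch):

```c
/* Portable stand-in for the kernel's fls(): 1-based index of the highest
 * set bit, 0 for an input of 0. Used only to illustrate the rounding. */
static unsigned int stub_fls(unsigned int x)
{
	unsigned int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/* id_bits as computed in the patch: fls(max_vpes) - 1 */
static unsigned int vpe_id_bits(unsigned int max_vpes)
{
	return stub_fls(max_vpes) - 1;
}
```

For max_vpes = 16 this yields 4 bits (IDs 0-15, fine), but for
max_vpes = 12 it yields 3 bits, which can only address IDs 0-7. If the
IRS always reports a power of two, this is a non-issue.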
> +
> /*
> * Assign an already allocated IST to the VM by populating the fields in the
> * corresponding VMTE. We re-use this code for both an SPI IST and LPI IST, even
> @@ -715,6 +735,11 @@ int vgic_v5_vmte_assign_ist(struct kvm *kvm, phys_addr_t ist_base,
> /* Finally, mark the entry as valid */
> cmd_info.cmd_type = spi_ist ? SPI_VIST_MAKE_VALID : LPI_VIST_MAKE_VALID;
> ret = irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu0), &cmd_info);
> + if (ret) {
> + WRITE_ONCE(vmte->val[section], 0ULL);
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, false);
> + return ret;
> + }
>
> /* Any cached entries we now have are stale! */
> vgic_v5_clean_inval(vmte, sizeof(*vmte), false, true);
> diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> index 37e220cda1987..6a024337eba79 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5-tables.h
> +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> @@ -150,6 +150,8 @@ int vgic_v5_vmt_allocate(bool two_level, unsigned int num_entries,
> size_t vmd_size, size_t vped_size,
> unsigned int vpe_id_bits);
> int vgic_v5_vmt_free(void);
> +phys_addr_t vgic_v5_get_vmt_base(void);
> +u8 vgic_v5_vmt_vpe_id_bits(void);
>
> int vgic_v5_allocate_vm_id(struct kvm *kvm);
> void vgic_v5_release_vm_id(struct kvm *kvm);
> diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
> index 4e0d52b309628..49eb01ca07961 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5.c
> @@ -36,6 +36,12 @@ static void vgic_v5_get_implemented_ppis(void)
> __assign_bit(GICV5_ARCH_PPI_PMUIRQ, ppi_caps.impl_ppi_mask, system_supports_pmuv3());
> }
>
> +/*
> + * The IRS MMIO interface is shared between all VMs, so make sure we don't do
> + * anything stupid!
> + */
> +static DEFINE_RAW_SPINLOCK(vm_config_lock);
> +
> static void __iomem *irs_base;
>
> static u32 irs_readl_relaxed(const u32 reg_offset)
> @@ -43,6 +49,21 @@ static u32 irs_readl_relaxed(const u32 reg_offset)
> return readl_relaxed(irs_base + reg_offset);
> }
>
> +static void irs_writel_relaxed(const u32 val, const u32 reg_offset)
> +{
> + writel_relaxed(val, irs_base + reg_offset);
> +}
> +
> +static u64 irs_readq_relaxed(const u32 reg_offset)
> +{
> + return readq_relaxed(irs_base + reg_offset);
> +}
> +
> +static void irs_writeq_relaxed(const u64 val, const u32 reg_offset)
> +{
> + writeq_relaxed(val, irs_base + reg_offset);
> +}
> +
> static int gicv5_irs_extract_vm_caps(const struct gic_kvm_info *info)
> {
> u64 idr;
> @@ -84,16 +105,22 @@ static int gicv5_irs_extract_vm_caps(const struct gic_kvm_info *info)
> return 0;
> }
>
> +/* Forward decl for cleaner code layout */
> +static int vgic_v5_irs_assign_vmt(bool two_level, u8 vm_id_bits, phys_addr_t vmt_base);
> +static int vgic_v5_irs_clear_vmt(void);
> +
> /*
> * Probe for a vGICv5 compatible interrupt controller, returning 0 on success.
> */
> int vgic_v5_probe(const struct gic_kvm_info *info)
> {
> + struct vgic_v5_host_ist_caps *ist_caps;
> bool v5_registered = false;
> u64 ich_vtr_el2;
> int ret;
>
> kvm_vgic_global_state.type = VGIC_V5;
> + kvm_vgic_global_state.max_gic_vcpus = VGIC_V5_MAX_CPUS;
>
> kvm_vgic_global_state.vcpu_base = 0;
> kvm_vgic_global_state.vctrl_base = NULL;
> @@ -114,13 +141,53 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
> if (gicv5_irs_extract_vm_caps(info))
> goto skip_v5;
>
> - kvm_vgic_global_state.max_gic_vcpus = VGIC_V5_MAX_CPUS;
> + ist_caps = vgic_v5_host_caps();
> +
> + /*
> +	 * Even if the HW supports more per-VM vCPUs, artificially cap as we
> + * can't use them all.
> + */
> + kvm_vgic_global_state.max_gic_vcpus = min(ist_caps->max_vpes,
> + VGIC_V5_MAX_CPUS);
> +
> + /*
> + * GICv5 requires a set of tables to be allocated in order to manage
> + * VMs. We allocate them in advance here, which alas means that we
> +	 * already have to make a decision regarding the maximum number of VMs
> + * we want to run. For now, we match the maximum number offered by the
> + * hardware, but this might not be a wise choice in the long term.
> + */
> + ret = vgic_v5_vmt_allocate(ist_caps->two_level_vmt_support,
> + ist_caps->max_vms, ist_caps->vmd_size,
> + ist_caps->vped_size,
> + kvm_vgic_global_state.max_gic_vcpus);
> + if (ret) {
> + kvm_err("Failed to allocate the GICv5 VM tables; no GICv5 support\n");
> + goto skip_v5;
> + }
> +
> + /*
> + * We've now allocated the VM table, but the host's IRS doesn't know
> + * about it yet. Provide the base address of the VMT to the IRS, as well
> + * as the number of ID bits that it covers and the structure used
> + * (linear/two-level).
> + */
> + ret = vgic_v5_irs_assign_vmt(ist_caps->two_level_vmt_support,
> + vgic_v5_vmt_vpe_id_bits(),
You're passing vpe_id_bits to vm_id_bits. Should this be vgic_v5_host_caps()->max_vms?
> + vgic_v5_get_vmt_base());
> + if (ret) {
> + kvm_err("Failed to assign the GICv5 VM tables to the IRS; no GICv5 support\n");
> + vgic_v5_vmt_free();
> + goto skip_v5;
> + }
>
> vgic_v5_get_implemented_ppis();
>
> ret = kvm_register_vgic_device(KVM_DEV_TYPE_ARM_VGIC_V5);
> if (ret) {
> kvm_err("Cannot register GICv5 KVM device.\n");
> + vgic_v5_irs_clear_vmt();
> + vgic_v5_vmt_free();
> goto skip_v5;
> }
>
> @@ -148,12 +215,13 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
> ret = kvm_register_vgic_device(KVM_DEV_TYPE_ARM_VGIC_V3);
> if (ret) {
> kvm_err("Cannot register GICv3-legacy KVM device.\n");
> - return ret;
> + /* vGICv5 should still work */
> + return v5_registered ? 0 : ret;
> }
>
> /* We potentially limit the max VCPUs further than we need to here */
> kvm_vgic_global_state.max_gic_vcpus = min(VGIC_V3_MAX_CPUS,
> - VGIC_V5_MAX_CPUS);
> + kvm_vgic_global_state.max_gic_vcpus);
>
> static_branch_enable(&kvm_vgic_global_state.gicv3_cpuif);
> kvm_info("GCIE legacy system register CPU interface\n");
[...]
Thanks,
Joey
* Re: [PATCH 08/43] KVM: arm64: gic-v5: Introduce guest IST alloc and management
2026-04-27 16:08 ` [PATCH 08/43] KVM: arm64: gic-v5: Introduce guest IST alloc and management Sascha Bischoff
@ 2026-04-29 14:29 ` Marc Zyngier
0 siblings, 0 replies; 66+ messages in thread
From: Marc Zyngier @ 2026-04-29 14:29 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 17:08:46 +0100,
Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> GICv5 guests use Interrupt State Tables (ISTs) to track and manage the
> interrupt state for SPIs and LPIs. These ISTs are provided to the
> host's IRS via the VMTE.
>
> On a host GICv5 system, SPIs do not require any up-front memory
> allocation prior to their use, unlike LPIs which require the OS to
> allocate an IST. For a GICv5 guest, the same holds from the guest's
> point of view - the SPIs should require no explicit memory allocation
> by the guest. This means that the hypervisor must provision the memory
> which it passes to the IRS for managing a guest's SPI state.
>
> In light of the above, the hypervisor allocates the SPI IST prior to
> running the guest for the first time. As only a small number of SPIs
> are expected, this is always allocated as a linear IST. The host is
> responsible for freeing this memory on guest teardown.
>
> For LPIs, the OS needs to provision memory for state tracking. This
> applies to both hosts and guests, and so the guest will provision some
> memory for the LPI IST. However, this is not directly used by
> KVM. Instead, KVM allocates a shadow LPI IST which is passed to the
> IRS (in the VMTE). Again, on guest teardown, the hypervisor must free
> this memory again. The LPI IST is allocated as a two-level structure,
> as many more LPIs are expected than SPIs.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/kvm/vgic/vgic-v5-tables.c | 531 +++++++++++++++++++++++++++
> arch/arm64/kvm/vgic/vgic-v5-tables.h | 22 ++
> include/linux/irqchip/arm-gic-v5.h | 3 +
> 3 files changed, 556 insertions(+)
>
> diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> index 502d05d46cccf..de905f37b61a5 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5-tables.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> @@ -501,6 +501,25 @@ int vgic_v5_vmte_init(struct kvm *kvm)
> return ret;
> }
>
> +/*
> + * The following set of forward declarations makes the code layout a *little*
> + * clearer as it lets us keep the IST-related code together.
> + */
> +static int vgic_v5_alloc_linear_ist(struct kvm *kvm, bool spi_ist,
> + unsigned int id_bits,
> + unsigned int istsz);
> +static int vgic_v5_alloc_l1_ist(struct kvm *kvm, unsigned int id_bits,
> + unsigned int istsz, unsigned int l2_split);
> +static int vgic_v5_alloc_l2_ists(struct kvm *kvm, unsigned int id_bits,
> + unsigned int istsz, unsigned int l2_split);
> +static int vgic_v5_alloc_two_level_lpi_ist(struct kvm *kvm,
> + unsigned int id_bits,
> + unsigned int istsz,
> + unsigned int l2_split);
> +static int vgic_v5_linear_ist_free(struct kvm *kvm, bool spi);
> +static int vgic_v5_two_level_ist_free(struct kvm *kvm, bool spi);
> +static int vgic_v5_spi_ist_free(struct kvm *kvm);
> +
> /*
> * Release the VMT Entry, freeing up any allocated data structures before
> * zeroing the VMTE.
> @@ -531,6 +550,18 @@ int vgic_v5_vmte_release(struct kvm *kvm)
> kfree(vmi->vmd_base);
> kfree(vmi->vpet_base);
>
> + /* If we have an LPI IST, free it */
> + if (vmi->h_lpi_ist)
> + ret = vgic_v5_lpi_ist_free(kvm);
> + if (ret)
> + return ret;
> +
> + /* If we have an SPI IST, free it */
> + if (vmi->h_spi_ist)
> + ret = vgic_v5_spi_ist_free(kvm);
> + if (ret)
> + return ret;
> +
> xa_erase(&vm_info, vm_id);
> kfree(vmi);
>
> @@ -634,3 +665,503 @@ int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu)
>
> return 0;
> }
> +
> +/*
> + * Assign an already allocated IST to the VM by populating the fields in the
> + * corresponding VMTE. We re-use this code for both an SPI IST and LPI IST, even
> + * if the paths to reach it might be vastly different.
> + */
> +int vgic_v5_vmte_assign_ist(struct kvm *kvm, phys_addr_t ist_base,
> + bool two_level, unsigned int id_bits,
> + unsigned int l2sz, unsigned int istsz,
> + bool spi_ist)
> +{
> + struct kvm_vcpu *vcpu0 = kvm_get_vcpu(kvm, 0);
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct gicv5_cmd_info cmd_info;
> + struct vmtl2_entry *vmte;
> + unsigned int section;
> + u64 tmp;
> + int ret;
> +
> + section = spi_ist ? GICV5_VMTEL2_SPI_SECTION : GICV5_VMTEL2_LPI_SECTION;
Section? What is a section? This needs documentation (11.2.2 in the
EAC0 version of the spec) so that people can understand that you are
talking about the 64-bit word number in the Level-2 VM Table Entry.
> +
> + if (ist_base & ~GICV5_VMTEL2E_IST_ADDR) {
> + kvm_err("IST alignment issue! Address: 0x%llx, Mask 0x%llx\n",
> + ist_base, GICV5_VMTEL2E_IST_ADDR);
> + return -EINVAL;
> + }
> +
> + ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
> + if (ret)
> + return ret;
> +
> + /* Bail if already allocated - something is broken! */
> + if (FIELD_GET(GICV5_VMTEL2E_IST_VALID, vmte->val[section])) {
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
Still this odd construct. I'm starting to wonder whether I'm really
missing something.
> + return -EINVAL;
> + }
> +
> + tmp = FIELD_PREP(GICV5_VMTEL2E_IST_L2SZ, l2sz);
> + tmp |= FIELD_PREP(GICV5_VMTEL2E_IST_ADDR,
> + ist_base >> GICV5_VMTEL2E_IST_ADDR_SHIFT);
> + tmp |= FIELD_PREP(GICV5_VMTEL2E_IST_ISTSZ, istsz);
> + tmp |= FIELD_PREP(GICV5_VMTEL2E_IST_ID_BITS, id_bits);
> + tmp |= FIELD_PREP(GICV5_VMTEL2E_IST_STRUCTURE, two_level);
> +
> + WRITE_ONCE(vmte->val[section], cpu_to_le64(tmp));
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, false);
> +
> + /* Finally, mark the entry as valid */
> + cmd_info.cmd_type = spi_ist ? SPI_VIST_MAKE_VALID : LPI_VIST_MAKE_VALID;
> + ret = irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu0), &cmd_info);
> +
> + /* Any cached entries we now have are stale! */
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), false, true);
Shouldn't the clean operation happen *before* you call into the IRQ
stack? It feels dangerous to do so, even if the callback doesn't do
much.
> +
> + return ret;
> +}
> +
> +/*
> + * Helper to determine the correct l2sz to use based on the combination of
> + * PAGE_SIZE and whatever hardware supports.
> + */
> +static unsigned int vgic_v5_ist_l2sz(void)
> +{
> + switch (PAGE_SIZE) {
> + case SZ_64K:
> + if (gicv5_host_ist_caps.ist_l2sz & 0x4)
Please add definitions for IRS_IDR2.IST_L2SZ.
> + return GICV5_IRS_IST_CFGR_L2SZ_64K;
> + fallthrough;
> + case SZ_4K:
> + if (gicv5_host_ist_caps.ist_l2sz & 0x1)
> + return GICV5_IRS_IST_CFGR_L2SZ_4K;
> + fallthrough;
> + case SZ_16K:
> + if (gicv5_host_ist_caps.ist_l2sz & 0x2)
> + return GICV5_IRS_IST_CFGR_L2SZ_16K;
> + break;
> + }
> +
> + if (gicv5_host_ist_caps.ist_l2sz & 0x1)
> + return GICV5_IRS_IST_CFGR_L2SZ_4K;
> +
> + return GICV5_IRS_IST_CFGR_L2SZ_64K;
> +}
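Following on the request above: the 0x1/0x2/0x4 values look like the
IRS_IDR2.IST_L2SZ capability bits. A rough standalone sketch of what
named definitions could look like (the IDR2 macro names are my guesses,
not from the patch or the spec; the selection logic mirrors
vgic_v5_ist_l2sz() above, with the L2SZ field encodings parameterised):

```c
/* Hypothetical names for the IRS_IDR2.IST_L2SZ capability bits */
#define IRS_IDR2_IST_L2SZ_4K	(1U << 0)
#define IRS_IDR2_IST_L2SZ_16K	(1U << 1)
#define IRS_IDR2_IST_L2SZ_64K	(1U << 2)

/* IRS_IST_CFGR.L2SZ field encodings, matching the patch's existing macros */
enum { L2SZ_4K = 0, L2SZ_16K = 1, L2SZ_64K = 2 };

/* Same preference order as vgic_v5_ist_l2sz(): try the native page size
 * first, then fall through to the next supported size. */
static unsigned int ist_l2sz(unsigned int supported, unsigned long page_size)
{
	switch (page_size) {
	case 65536:
		if (supported & IRS_IDR2_IST_L2SZ_64K)
			return L2SZ_64K;
		/* fall through */
	case 4096:
		if (supported & IRS_IDR2_IST_L2SZ_4K)
			return L2SZ_4K;
		/* fall through */
	case 16384:
		if (supported & IRS_IDR2_IST_L2SZ_16K)
			return L2SZ_16K;
		break;
	}

	if (supported & IRS_IDR2_IST_L2SZ_4K)
		return L2SZ_4K;

	return L2SZ_64K;
}
```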
> +
> +/* Helper to determine ISTE size based on metadata requirements */
> +static unsigned int vgic_v5_ist_istsz(unsigned int id_bits)
> +{
> + if (!gicv5_host_ist_caps.istmd)
> + return GICV5_IRS_IST_CFGR_ISTSZ_4;
> +
> + if (id_bits >= gicv5_host_ist_caps.istmd_sz)
> + return GICV5_IRS_IST_CFGR_ISTSZ_16;
> +
> + return GICV5_IRS_IST_CFGR_ISTSZ_8;
> +}
> +
> +/*
> + * Allocate a Linear IST - always used for SPIs and potentially LPIs.
> + *
> + * The calculation for n has been taken from the GICv5 spec.
Bonus points if you add a reference to the relevant part of the spec.
> + *
> + * NOTE: istsz is the FIELD used by GICv5, not the actual size (or log2() of the
> + * size).
> + */
> +static int vgic_v5_alloc_linear_ist(struct kvm *kvm, bool spi_ist,
> + unsigned int id_bits, unsigned int istsz)
> +{
> + const size_t n = id_bits + 1 + istsz;
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vgic_v5_vm_info *vmi;
> + __le64 *ist;
> + u32 l1sz;
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (WARN_ON_ONCE(!vmi))
> + return -EINVAL;
> +
> + /*
> + * Allocate the IST. We only have one level, so we just use the L2 ISTE.
> + */
> + l1sz = BIT(n + 1);
> + ist = kzalloc(l1sz, GFP_KERNEL);
> + if (!ist)
> + return -ENOMEM;
> +
> + if (spi_ist) {
> + vmi->h_spi_ist = ist;
> + } else {
> + vmi->h_lpi_ist_structure = false;
> + vmi->h_lpi_ist = ist;
> + }
> +
> + vgic_v5_clean_inval(ist, l1sz, true, true);
> +
> + return 0;
> +}
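For anyone else checking the sizing maths: with istsz being the CFGR
field encoding (0/1/2 for 4/8/16-byte ISTEs), l1sz = BIT(n + 1) with
n = id_bits + 1 + istsz works out to 2^id_bits entries times
2^(2 + istsz) bytes per entry. A standalone restatement (my reading of
the formula, not authoritative):

```c
/* Linear IST size in bytes, restated from the patch:
 * n = id_bits + 1 + istsz; size = BIT(n + 1). istsz is the
 * IRS_IST_CFGR.ISTSZ field encoding, so an ISTE is 1 << (2 + istsz)
 * bytes (0 -> 4, 1 -> 8, 2 -> 16). */
static unsigned long linear_ist_bytes(unsigned int id_bits, unsigned int istsz)
{
	return 1UL << (id_bits + 1 + istsz + 1);
}

/* Equivalent formulation: number of ISTEs times bytes per ISTE */
static unsigned long linear_ist_bytes_alt(unsigned int id_bits,
					  unsigned int istsz)
{
	return (1UL << id_bits) * (1UL << (2 + istsz));
}
```

e.g. 13 ID bits with 4-byte ISTEs gives 8192 * 4 = 32K.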
> +
> +/*
> + * Allocate the first level of a two-level IST - LPI, only.
> + *
> + * The calculations for n, l1_size have been taken from the GICv5 spec.
> + *
> + * NOTE: istsz and l2sz are the FIELDS used by GICv5, not the actual sizes (or
> + * log2() of the sizes).
> + */
> +static int vgic_v5_alloc_l1_ist(struct kvm *kvm, unsigned int id_bits,
> + unsigned int istsz, unsigned int l2sz)
> +{
> + const size_t n = max(5, id_bits - ((10 - istsz) + (2 * l2sz)) + 3 - 1);
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + const u32 l1_size = BIT(n + 1);
> + struct vgic_v5_vm_info *vmi;
> + __le64 *ist;
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (!vmi)
> + return -EINVAL;
> +
> + ist = kzalloc(l1_size, GFP_KERNEL);
> + if (!ist)
> + return -ENOMEM;
> +
> + vmi->h_lpi_ist_structure = true;
> + vmi->h_lpi_ist = ist;
> +
> + vgic_v5_clean_inval(ist, l1_size, true, true);
> +
> + return 0;
> +}
> +
> +/*
> + * Allocate ALL of the second level ISTs for a two-level IST - LPI, only.
> + *
> + * The calculations for n, l1_entries, l2_size have been taken from the GICv5
> + * spec.
> + *
> + * NOTE: istsz and l2sz are the FIELDS used by GICv5, not the actual sizes (or
> + * log2() of the sizes).
> + */
> +static int vgic_v5_alloc_l2_ists(struct kvm *kvm, unsigned int id_bits,
> + unsigned int istsz, unsigned int l2sz)
> +{
> + const size_t n = max(5, id_bits - ((10 - istsz) + (2 * l2sz)) + 3 - 1);
> + const int l1_entries = BIT(n + 1) / GICV5_IRS_ISTL1E_SIZE;
> + const size_t l2_size = BIT(11 + (2 * l2sz) + 1);
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vgic_v5_vm_info *vmi;
> + __le64 *l2ist;
> + __le64 *l1ist;
> + int index;
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (WARN_ON_ONCE(!vmi))
> + return -EINVAL;
> +
> + l1ist = vmi->h_lpi_ist;
> +
> + /*
> + * Allocate the storage for the pointers to the L2 ISTs (used when
> + * freeing later).
> + */
> + vmi->h_lpi_l2_ists = kzalloc_objs(*vmi->h_lpi_l2_ists, l1_entries,
> + GFP_KERNEL);
> + if (!vmi->h_lpi_l2_ists)
> + return -ENOMEM;
> +
> + /* Allocate the L2 IST for each L1 IST entry */
> + for (index = 0; index < l1_entries; ++index) {
> + l2ist = kzalloc(l2_size, GFP_KERNEL);
> + if (!l2ist) {
> + while (--index >= 0)
> + kfree(vmi->h_lpi_l2_ists[index]);
> +
> + kfree(vmi->h_lpi_l2_ists);
> + vmi->h_lpi_l2_ists = NULL;
> +
> + return -ENOMEM;
> + }
> +
> + /*
> + * We are not doing on-demand allocation of the L2 ISTs, and are
> + * instead provisioning the whole IST up front. This means that
> +		 * we can mark the L2 ISTs as valid in the L1 ISTEs, because
> +		 * the overall IST is not yet valid.
> + */
> + l1ist[index] = cpu_to_le64(
> + virt_to_phys(l2ist) & GICV5_ISTL1E_L2_ADDR_MASK) |
> + GICV5_ISTL1E_VALID;
> +
> + vmi->h_lpi_l2_ists[index] = l2ist;
> +
> + vgic_v5_clean_inval(l2ist, l2_size, true, true);
> + }
> +
> + /* Handle CMOs for the whole L1 IST in one go */
> + vgic_v5_clean_inval(l1ist, l1_entries * sizeof(*l1ist), true, false);
> +
> + return 0;
> +}
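One more observation on the L1 ISTE write a few lines up:
GICV5_ISTL1E_VALID is OR'd in *after* the cpu_to_le64() conversion. On a
little-endian host the two orderings happen to agree, but on a big-endian
one the valid bit lands in the wrong byte;
cpu_to_le64((virt_to_phys(l2ist) & GICV5_ISTL1E_L2_ADDR_MASK) |
GICV5_ISTL1E_VALID) seems like what was intended. A standalone
demonstration, with a byte swap standing in for cpu_to_le64() on a
big-endian host (assuming the valid bit is bit 0):

```c
#include <stdint.h>

/* Stand-in for what cpu_to_le64() does on a big-endian host */
static uint64_t be_cpu_to_le64(uint64_t x)
{
	return __builtin_bswap64(x);
}

#define VALID_BIT	1ULL

/* As written in the patch: valid bit OR'd in after the conversion */
static uint64_t iste_as_written(uint64_t pa)
{
	return be_cpu_to_le64(pa) | VALID_BIT;
}

/* Valid bit folded in before the conversion */
static uint64_t iste_fixed(uint64_t pa)
{
	return be_cpu_to_le64(pa | VALID_BIT);
}
```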
> +
> +/* Allocate a two-level IST - LPIs, only */
> +static int vgic_v5_alloc_two_level_lpi_ist(struct kvm *kvm, unsigned int id_bits,
> + unsigned int istsz, unsigned int l2sz)
> +{
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vgic_v5_vm_info *vmi;
> + int ret;
> +
> + /*
> + * Allocate the L1 IST first, then all of the L2s. Everything
> + * is preallocated and we do no on-demand IST allocation. This
> + * is to avoid needing to track if and when the guest is doing
> + * on-demand IST allocation.
> + */
> + ret = vgic_v5_alloc_l1_ist(kvm, id_bits, istsz, l2sz);
> + if (ret)
> + return ret;
> +
> + ret = vgic_v5_alloc_l2_ists(kvm, id_bits, istsz, l2sz);
> + if (ret) {
> + /* Free the L1 IST again */
> + vmi = xa_load(&vm_info, vm_id);
> + kfree(vmi->h_lpi_ist);
> + vmi->h_lpi_ist = 0;
> +
> + return ret;
> + }
> +
> + return 0;
> +}
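As a sanity check of the two-level sizing used above (my reading of the
formulas; assumes GICV5_IRS_ISTL1E_SIZE is 8 bytes): each L2 IST is
2^(12 + 2*l2sz) bytes, and with 4-byte ISTEs (istsz = 0) a 4K L2 covers
1024 IDs, so for id_bits = 16 the L1 needs 64 entries — which is exactly
what the n formula produces:

```c
/* L1 size per the patch:
 * n = max(5, id_bits - ((10 - istsz) + 2*l2sz) + 3 - 1); l1_size = BIT(n + 1) */
static unsigned long l1_ist_bytes(unsigned int id_bits, unsigned int istsz,
				  unsigned int l2sz)
{
	unsigned int n = id_bits - ((10 - istsz) + 2 * l2sz) + 3 - 1;

	if (n < 5)
		n = 5;
	return 1UL << (n + 1);
}

/* L2 size per the patch: BIT(11 + 2*l2sz + 1); field 0/1/2 -> 4K/16K/64K */
static unsigned long l2_ist_bytes(unsigned int l2sz)
{
	return 1UL << (11 + 2 * l2sz + 1);
}
```

With id_bits = 16, istsz = 0, l2sz = 0: the L1 is 512 bytes = 64
eight-byte entries, each pointing at a 4K L2 holding 1024 four-byte
ISTEs, 64 * 1024 = 2^16 IDs total.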
> +
> +static void vgic_v5_free_allocated_lpi_ist(struct vgic_v5_vm_info *vmi,
> + unsigned int id_bits,
> + unsigned int istsz,
> + unsigned int l2sz)
> +{
> + if (!vmi->h_lpi_ist_structure) {
> + kfree(vmi->h_lpi_ist);
> + vmi->h_lpi_ist = NULL;
> + return;
> + }
> +
> + if (vmi->h_lpi_l2_ists) {
> + const size_t n = max(2, id_bits - ((10 - istsz) + (2 * l2sz)) + 3 - 1);
> + const int l1_entries = BIT(n + 1) / GICV5_IRS_ISTL1E_SIZE;
> + int index;
> +
> + for (index = 0; index < l1_entries; ++index)
> + kfree(vmi->h_lpi_l2_ists[index]);
> +
> + kfree(vmi->h_lpi_l2_ists);
> + vmi->h_lpi_l2_ists = NULL;
> + }
> +
> + kfree(vmi->h_lpi_ist);
> + vmi->h_lpi_ist = NULL;
> +}
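Also: the clamp in this free path is max(2, ...), while the allocation
paths (vgic_v5_alloc_l1_ist() / vgic_v5_alloc_l2_ists()) use max(5, ...).
For small id_bits the two produce different L1 entry counts, so the free
side would walk fewer entries than were allocated and leak the rest —
worth unifying. Quick illustration of the divergence (standalone sketch,
assuming 8-byte L1 ISTEs):

```c
/* n = max(clamp, id_bits - ((10 - istsz) + 2*l2sz) + 3 - 1), then the
 * L1 holds BIT(n + 1) / 8 entries. The alloc path clamps at 5, the
 * free path at 2. */
static unsigned int l1_entries(unsigned int clamp, unsigned int id_bits,
			       unsigned int istsz, unsigned int l2sz)
{
	unsigned int n = id_bits - ((10 - istsz) + 2 * l2sz) + 3 - 1;

	if (n < clamp)
		n = clamp;
	return (1U << (n + 1)) / 8;	/* 8-byte L1 ISTEs assumed */
}
```

For id_bits = 12, istsz = 0, l2sz = 0 the raw n is 4, so the alloc side
(clamp 5) produces 8 L1 entries while the free side (clamp 2) only walks 4.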
> +
> +void vgic_v5_free_allocated_spi_ist(struct kvm *kvm)
> +{
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vgic_v5_vm_info *vmi;
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (WARN_ON_ONCE(!vmi))
> + return;
> +
> + kfree(vmi->h_spi_ist);
> + vmi->h_spi_ist = NULL;
> +}
> +
> +/*
> + * Free a Linear IST. Can only happen once the VM is dead.
> + */
> +static int vgic_v5_linear_ist_free(struct kvm *kvm, bool spi)
> +{
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vmtl2_entry *vmte;
> + struct vgic_v5_vm_info *vmi;
> + int section, ret;
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (!vmi)
> + return -EINVAL;
> +
> + ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
> + if (ret)
> + return ret;
> +
> + if (spi) {
> + section = GICV5_VMTEL2_SPI_SECTION;
> + vgic_v5_free_allocated_spi_ist(kvm);
> + } else {
> + section = GICV5_VMTEL2_LPI_SECTION;
> + vgic_v5_free_allocated_lpi_ist(vmi, 0, 0, 0);
> + }
> +
> + /* The VM should be dead here, so we can just zero the VMT section */
> + WRITE_ONCE(vmte->val[section], 0ULL);
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
> +
> + return 0;
> +}
> +
> +/*
> + * Free a Two-Level IST. Can only happen once the VM is dead.
> + */
> +static int vgic_v5_two_level_ist_free(struct kvm *kvm, bool spi)
> +{
> + unsigned int id_bits, istsz, l2sz;
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vgic_v5_vm_info *vmi;
> + __le64 *l1ist, tmp;
> + struct vmtl2_entry *vmte;
> + int section, l1_entries;
> + size_t n;
> + int ret;
> +
> + /* We don't create two-level SPI ISTs, so freeing is a bad idea! */
> + if (spi)
> + return -EINVAL;
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (!vmi)
> + return -EINVAL;
> +
> + section = GICV5_VMTEL2_LPI_SECTION;
> + l1ist = vmi->h_lpi_ist;
> +
> + if (!vmi->h_lpi_ist_structure)
> + return -EINVAL;
> +
> + ret = vgic_v5_get_l2_vmte(vm_id, &vmte);
> + if (ret)
> + return ret;
> +
> + tmp = le64_to_cpu(READ_ONCE(vmte->val[section]));
> +
> + id_bits = FIELD_GET(GICV5_VMTEL2E_IST_ID_BITS, tmp);
> + istsz = FIELD_GET(GICV5_VMTEL2E_IST_ISTSZ, tmp);
> + l2sz = FIELD_GET(GICV5_VMTEL2E_IST_L2SZ, tmp);
> +
> + /* Calculation for n taken from the GICv5 specification */
> + n = max(2, id_bits - ((10 - istsz) + (2 * l2sz)) + 3 - 1);
> + l1_entries = BIT(n + 1) / GICV5_IRS_ISTL1E_SIZE;
> +
> + vgic_v5_free_allocated_lpi_ist(vmi, id_bits, istsz, l2sz);
> +
> + /* The VM must be dead, so we can just zero the VMT section */
> + WRITE_ONCE(vmte->val[section], 0ULL);
> +
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, true);
> +
> + return 0;
> +}
> +
> +/*
> + * Allocate an IST for SPIs.
> + *
> + * We don't anticipate a large number of SPIs being allocated. Therefore, we
> + * always allocate a Linear IST for SPIs. This will need to be revisited should
> + * that assumption no longer hold.
> + */
> +int vgic_v5_spi_ist_allocate(struct kvm *kvm, phys_addr_t *base_addr,
> + unsigned int id_bits, unsigned int istsz)
> +{
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vgic_v5_vm_info *vmi;
> + int ret;
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (WARN_ON_ONCE(!vmi))
> + return -EINVAL;
> +
> + ret = vgic_v5_alloc_linear_ist(kvm, true, id_bits, istsz);
> + if (ret)
> + return ret;
> +
> + *base_addr = virt_to_phys(vmi->h_spi_ist);
> +
> + return 0;
> +}
> +
> +/*
> + * Free the IST for SPIs. Should only happen once the VM is dead.
> + */
> +static int vgic_v5_spi_ist_free(struct kvm *kvm)
> +{
> + return vgic_v5_linear_ist_free(kvm, true);
> +}
> +
> +/*
> + * Allocate an IST for LPIs.
> + *
> + * Unlike with SPIs, we anticipate that the guest will allocate a relatively
> + * large number of LPIs. Therefore, while we support doing a linear LPI IST, it
> + * is expected that LPI ISTs will be two-level.
> + */
> +int vgic_v5_lpi_ist_alloc(struct kvm *kvm, unsigned int id_bits)
> +{
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vgic_v5_vm_info *vmi;
> + unsigned int istsz, l2sz;
> + phys_addr_t phys_addr;
> + bool two_level;
> + int ret;
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (WARN_ON_ONCE(!vmi))
> + return -EINVAL;
> +
> + istsz = vgic_v5_ist_istsz(id_bits);
> + l2sz = vgic_v5_ist_l2sz();
> +
> + /*
> + * Determine if we want to create a Linear or a Two-Level IST.
> + *
> + * If we require more than one page for the IST, create a Two-Level IST
> + * (if the host supports it, which is likely).
> + *
> + * Note: GICv5's istsz is not the size of the ISTEs in log2(bytes). It
> + * is 2 less, hence the +2 below.
> + */
> + two_level = gicv5_host_ist_caps.ist_levels &&
> + id_bits > PAGE_SHIFT - (2 + istsz);
> +
> + if (!two_level)
> + ret = vgic_v5_alloc_linear_ist(kvm, false /* LPIs, not SPIs */,
> + id_bits, istsz);
> + else
> + ret = vgic_v5_alloc_two_level_lpi_ist(kvm, id_bits, istsz,
> + l2sz);
> +
> + if (ret)
> + return ret;
> +
> + phys_addr = virt_to_phys(vmi->h_lpi_ist);
> + ret = vgic_v5_vmte_assign_ist(kvm, phys_addr, two_level, id_bits, l2sz,
> + istsz, false);
> + if (ret)
> + vgic_v5_free_allocated_lpi_ist(vmi, id_bits, istsz, l2sz);
> +
> + return ret;
> +}
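The linear-vs-two-level threshold above follows from the linear sizing:
2^id_bits ISTEs of 2^(2 + istsz) bytes fit in one page exactly when
id_bits + 2 + istsz <= PAGE_SHIFT. A restatement of the size part of the
predicate (my reading, not authoritative; the patch additionally
requires two-level support in the IRS via ist_levels):

```c
/* Go two-level when a linear IST would exceed one page:
 * id_bits > page_shift - (2 + istsz), mirroring the patch's check. */
static int needs_two_level(unsigned int id_bits, unsigned int istsz,
			   unsigned int page_shift)
{
	return id_bits > page_shift - (2 + istsz);
}
```

With 4K pages and 4-byte ISTEs (istsz = 0) a linear IST works up to
id_bits = 10; with 16-byte ISTEs (istsz = 2) only up to id_bits = 8.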
> +
> +/* Free the LPI IST again */
> +int vgic_v5_lpi_ist_free(struct kvm *kvm)
> +{
> + u16 vm_id = vgic_v5_vm_id(kvm);
> + struct vgic_v5_vm_info *vmi;
> +
> + vmi = xa_load(&vm_info, vm_id);
> + if (!vmi)
> + return -ENXIO;
> +
> + if (!vmi->h_lpi_ist_structure)
> + return vgic_v5_linear_ist_free(kvm, false);
> + else
> + return vgic_v5_two_level_ist_free(kvm, false);
> +}
> diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> index 5501a44308362..37e220cda1987 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5-tables.h
> +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> @@ -54,6 +54,13 @@ struct vmtl2_entry {
> #define GICV5_VMTEL2E_IST_STRUCTURE BIT_ULL(58)
> #define GICV5_VMTEL2E_IST_ID_BITS GENMASK_ULL(63, 59)
>
> +/*
> + * The LPI and SPI configuration live in 64-bit words 2 and 3 of the
> + * VMTE, respectively (0-based indexing).
> + */
> +#define GICV5_VMTEL2_LPI_SECTION 2
> +#define GICV5_VMTEL2_SPI_SECTION 3
> +
> /* Virtual PE Table Entry */
> typedef __le64 vpe_entry;
> #define GICV5_VPE_VALID BIT_ULL(0)
> @@ -66,6 +73,12 @@ struct vgic_v5_vm_info {
> vpe_entry __iomem *vpet_base;
> void __iomem **vped_ptrs;
> u8 vpe_id_bits;
> +
> + /* Tracking for the hyp-owned ISTs */
> + bool h_lpi_ist_structure;
> + __le64 *h_lpi_ist;
> + __le64 **h_lpi_l2_ists;
> + __le64 *h_spi_ist;
Can you please document what these individual fields represent? I'm
not sure what hyp-owned means here...
> };
>
> struct vgic_v5_vmt {
> @@ -146,4 +159,13 @@ int vgic_v5_vmte_release(struct kvm *kvm);
> int vgic_v5_vmte_alloc_vpe(struct kvm_vcpu *vcpu);
> int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu);
>
> +int vgic_v5_vmte_assign_ist(struct kvm *kvm, phys_addr_t ist_base,
> + bool two_level, unsigned int id_bits,
> + unsigned int l2sz, unsigned int istsz, bool spi_ist);
> +int vgic_v5_spi_ist_allocate(struct kvm *kvm, phys_addr_t *base_addr,
> + unsigned int id_bits, unsigned int istsz);
> +void vgic_v5_free_allocated_spi_ist(struct kvm *kvm);
> +int vgic_v5_lpi_ist_alloc(struct kvm *kvm, unsigned int id_bits);
> +int vgic_v5_lpi_ist_free(struct kvm *kvm);
> +
> #endif
> diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
> index 89579ee04f5d1..ccec0a045927c 100644
> --- a/include/linux/irqchip/arm-gic-v5.h
> +++ b/include/linux/irqchip/arm-gic-v5.h
> @@ -450,6 +450,9 @@ enum gicv5_vcpu_info_cmd_type {
> VMT_L2_MAP, /* Map in a L2 VMT - *may* happen on VM init */
> VMTE_MAKE_VALID, /* Make the VMTE valid */
> VMTE_MAKE_INVALID, /* Make the VMTE (et al.) invalid */
> + SPI_VIST_MAKE_VALID, /* No corresponding invalid */
> + LPI_VIST_MAKE_VALID, /* Triggered by a guest */
> + LPI_VIST_MAKE_INVALID, /* Triggered by a guest */
> };
>
> struct gicv5_cmd_info {
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH 09/43] KVM: arm64: gic-v5: Implement VMT/vIST IRS MMIO Ops
2026-04-27 16:09 ` [PATCH 09/43] KVM: arm64: gic-v5: Implement VMT/vIST IRS MMIO Ops Sascha Bischoff
2026-04-29 12:50 ` Joey Gouly
@ 2026-04-29 16:04 ` Marc Zyngier
1 sibling, 0 replies; 66+ messages in thread
From: Marc Zyngier @ 2026-04-29 16:04 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 17:09:06 +0100,
Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> GICv5 has rules about which fields of a VMTE (or L1 VMT) may be
> directly written by the host once the table is valid. This ensures
> that no stale state is cached by the hardware, and provides a clear
> interface for making VMs, ISTs, etc, valid.
>
> The hypervisor is responsible for populating the VMTE for a
> VM. However, it is not permitted to write the Valid bit (as the VM
> table is already valid). Instead, the VM is made valid via an IRS MMIO
> Op. The same applies to the ISTs - they must be made valid via the
> host IRS.
>
> This commit adds support for:
>
> * Making level 2 VMTs valid (only), allowing for dynamic level 2 table
> allocation.
Isn't it level 1 instead, if L2 is supposed to be dynamic?
> * Making VMTEs (VMs) valid or invalid
> * Making SPI/LPI ISTs valid or invalid for a specific VM
>
> When (successfully) probing for a GICv5, the VMT is allocated, and is
> made valid via the IRS's MMIO interface.
>
> This commit also extends the doorbell domain to allow the doorbells
> themselves to act as a conduit for issuing commands - this is similar
> to what exists for GICv4 support. Effectively, irq_set_vcpu_affinity()
> becomes an ioctl-like interface for issuing commands specific to
> either a VM or the particular VPE that the doorbell belongs to. This
> change adds support for the following via the VPE doorbells:
>
> VMT_L2_MAP - Make a second level VM table valid
> VMTE_MAKE_VALID - Make a single VMTE (and hence VM) valid
> VMTE_MAKE_INVALID - Make a single VMTE (and hence VM) invalid
> SPI_VIST_MAKE_VALID - Make the SPI IST valid
> LPI_VIST_MAKE_VALID - Make the LPI IST valid
> LPI_VIST_MAKE_INVALID - Make the LPI IST invalid
>
> Note: It is intentional that there is no SPI_VIST_MAKE_INVALID - this
> cannot happen while the VM is live, and given that the SPI is
This SPI_VIST_MAKE_VALID was introduced in the previous patch. It feels
weird to only explain the lack of INVALID here...
> allocated as part of VM creation, there is no need to make it invalid
> again until the VM is destroyed, at which point the VMTE is
> invalid. Therefore, there's no need to do this via the host's IRS MMIO
> interface, as it can be directly marked as invalid and freed. LPIs, on
> the other hand, are driven by the guest itself, and the guest is
> theoretically free to invalidate and free the LPI IST at any point.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/kvm/vgic/vgic-v5-tables.c | 25 +++
> arch/arm64/kvm/vgic/vgic-v5-tables.h | 2 +
> arch/arm64/kvm/vgic/vgic-v5.c | 236 ++++++++++++++++++++++++++-
> include/linux/irqchip/arm-gic-v5.h | 30 ++++
> 4 files changed, 290 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> index de905f37b61a5..0120c3205dea6 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5-tables.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> @@ -666,6 +666,26 @@ int vgic_v5_vmte_free_vpe(struct kvm_vcpu *vcpu)
> return 0;
> }
>
> +phys_addr_t vgic_v5_get_vmt_base(void)
> +{
> + phys_addr_t vmt_base;
> +
> + if (!vgic_v5_vmt_allocated())
> + return -ENXIO;
> +
> + if (!vmt_info->two_level)
> + vmt_base = virt_to_phys(vmt_info->linear.vmt_base);
> + else
> + vmt_base = virt_to_phys(vmt_info->l2.vmt_base);
> +
> + return vmt_base;
> +}
> +
> +u8 vgic_v5_vmt_vpe_id_bits(void)
> +{
> + return fls(vmt_info->max_vpes) - 1;
> +}
> +
> /*
> * Assign an already allocated IST to the VM by populating the fields in the
> * corresponding VMTE. We re-use this code for both an SPI IST and LPI IST, even
> @@ -715,6 +735,11 @@ int vgic_v5_vmte_assign_ist(struct kvm *kvm, phys_addr_t ist_base,
> /* Finally, mark the entry as valid */
> cmd_info.cmd_type = spi_ist ? SPI_VIST_MAKE_VALID : LPI_VIST_MAKE_VALID;
> ret = irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu0), &cmd_info);
> + if (ret) {
> + WRITE_ONCE(vmte->val[section], 0ULL);
> + vgic_v5_clean_inval(vmte, sizeof(*vmte), true, false);
> + return ret;
> + }
>
> /* Any cached entries we now have are stale! */
> vgic_v5_clean_inval(vmte, sizeof(*vmte), false, true);
> diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> index 37e220cda1987..6a024337eba79 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5-tables.h
> +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> @@ -150,6 +150,8 @@ int vgic_v5_vmt_allocate(bool two_level, unsigned int num_entries,
> size_t vmd_size, size_t vped_size,
> unsigned int vpe_id_bits);
> int vgic_v5_vmt_free(void);
> +phys_addr_t vgic_v5_get_vmt_base(void);
> +u8 vgic_v5_vmt_vpe_id_bits(void);
>
> int vgic_v5_allocate_vm_id(struct kvm *kvm);
> void vgic_v5_release_vm_id(struct kvm *kvm);
> diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
> index 4e0d52b309628..49eb01ca07961 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5.c
> @@ -36,6 +36,12 @@ static void vgic_v5_get_implemented_ppis(void)
> __assign_bit(GICV5_ARCH_PPI_PMUIRQ, ppi_caps.impl_ppi_mask, system_supports_pmuv3());
> }
>
> +/*
> + * The IRS MMIO interface is shared between all VMs, so make sure we don't do
> + * anything stupid!
> + */
> +static DEFINE_RAW_SPINLOCK(vm_config_lock);
> +
I don't think you could have picked a worse name for this lock. It has
nothing to do with a VM. It really is a global IRS lock.
> static void __iomem *irs_base;
>
> static u32 irs_readl_relaxed(const u32 reg_offset)
> @@ -43,6 +49,21 @@ static u32 irs_readl_relaxed(const u32 reg_offset)
> return readl_relaxed(irs_base + reg_offset);
> }
>
> +static void irs_writel_relaxed(const u32 val, const u32 reg_offset)
> +{
> + writel_relaxed(val, irs_base + reg_offset);
> +}
> +
> +static u64 irs_readq_relaxed(const u32 reg_offset)
> +{
> + return readq_relaxed(irs_base + reg_offset);
> +}
> +
> +static void irs_writeq_relaxed(const u64 val, const u32 reg_offset)
> +{
> + writeq_relaxed(val, irs_base + reg_offset);
> +}
> +
> static int gicv5_irs_extract_vm_caps(const struct gic_kvm_info *info)
> {
> u64 idr;
> @@ -84,16 +105,22 @@ static int gicv5_irs_extract_vm_caps(const struct gic_kvm_info *info)
> return 0;
> }
>
> +/* Forward decl for cleaner code layout */
Drop this comment. The intent is pretty obvious. And maybe move them
to the top, so that all forward declarations are grouped together.
> +static int vgic_v5_irs_assign_vmt(bool two_level, u8 vm_id_bits, phys_addr_t vmt_base);
> +static int vgic_v5_irs_clear_vmt(void);
> +
> /*
> * Probe for a vGICv5 compatible interrupt controller, returning 0 on success.
> */
> int vgic_v5_probe(const struct gic_kvm_info *info)
> {
> + struct vgic_v5_host_ist_caps *ist_caps;
> bool v5_registered = false;
> u64 ich_vtr_el2;
> int ret;
>
> kvm_vgic_global_state.type = VGIC_V5;
> + kvm_vgic_global_state.max_gic_vcpus = VGIC_V5_MAX_CPUS;
>
> kvm_vgic_global_state.vcpu_base = 0;
> kvm_vgic_global_state.vctrl_base = NULL;
> @@ -114,13 +141,53 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
> if (gicv5_irs_extract_vm_caps(info))
> goto skip_v5;
>
> - kvm_vgic_global_state.max_gic_vcpus = VGIC_V5_MAX_CPUS;
> + ist_caps = vgic_v5_host_caps();
> +
> + /*
> + * Even if the HW supports more per-VM vCPUs, artificially cap as we
> + * can't use them all.
> + */
> + kvm_vgic_global_state.max_gic_vcpus = min(ist_caps->max_vpes,
> + VGIC_V5_MAX_CPUS);
Can this be less than 512, which we still want to support for GICv3?
> +
> + /*
> + * GICv5 requires a set of tables to be allocated in order to manage
> + * VMs. We allocate them in advance here, which alas means that we
> + * already have to make a decision regarding the maximum number of VMs
> + * we want to run. For now, we match the maximum number offered by the
> + * hardware, but this might not be a wise choice in the long term.
> + */
> + ret = vgic_v5_vmt_allocate(ist_caps->two_level_vmt_support,
> + ist_caps->max_vms, ist_caps->vmd_size,
> + ist_caps->vped_size,
Why don't you just pass irs_caps to the allocator instead of teasing
out individual fields?
> + kvm_vgic_global_state.max_gic_vcpus);
> + if (ret) {
> + kvm_err("Failed to allocate the GICv5 VM tables; no GICv5 support\n");
> + goto skip_v5;
Turn this into a hard fail.
> + }
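Marc's suggestion above — passing the capability struct wholesale rather than teasing out individual fields at the call site — could look like the following userspace sketch. The struct layout and function name are assumptions mirroring identifiers used in the patch, not the real kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative only: a reshaped allocator that takes the whole
 * capability struct.  Field names mirror the ones used in the patch;
 * the function name is hypothetical.
 */
struct vgic_v5_host_ist_caps {
	bool two_level_vmt_support;
	unsigned int max_vms;
	size_t vmd_size;
	size_t vped_size;
};

static int vgic_v5_vmt_allocate_caps(const struct vgic_v5_host_ist_caps *caps,
				     unsigned int vpe_id_bits)
{
	if (!caps || !caps->max_vms || !vpe_id_bits)
		return -EINVAL;

	/* ...table allocation driven by caps->two_level_vmt_support,
	 * caps->vmd_size and caps->vped_size would go here... */
	return 0;
}
```

The call site in vgic_v5_probe() then shrinks to a single `ist_caps` argument plus the vCPU-derived ID width.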
> +
> + /*
> + * We've now allocated the VM table, but the host's IRS doesn't know
> + * about it yet. Provide the base address of the VMT to the IRS, as well
> + * as the number of ID bits that it covers and the structure used
> + * (linear/two-level).
> + */
> + ret = vgic_v5_irs_assign_vmt(ist_caps->two_level_vmt_support,
> + vgic_v5_vmt_vpe_id_bits(),
> + vgic_v5_get_vmt_base());
> + if (ret) {
> + kvm_err("Failed to assign the GICv5 VM tables to the IRS; no GICv5 support\n");
> + vgic_v5_vmt_free();
> + goto skip_v5;
> + }
>
> vgic_v5_get_implemented_ppis();
>
> ret = kvm_register_vgic_device(KVM_DEV_TYPE_ARM_VGIC_V5);
> if (ret) {
> kvm_err("Cannot register GICv5 KVM device.\n");
> + vgic_v5_irs_clear_vmt();
> + vgic_v5_vmt_free();
> goto skip_v5;
> }
>
> @@ -148,12 +215,13 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
> ret = kvm_register_vgic_device(KVM_DEV_TYPE_ARM_VGIC_V3);
> if (ret) {
> kvm_err("Cannot register GICv3-legacy KVM device.\n");
> - return ret;
> + /* vGICv5 should still work */
> + return v5_registered ? 0 : ret;
> }
>
> /* We potentially limit the max VCPUs further than we need to here */
> kvm_vgic_global_state.max_gic_vcpus = min(VGIC_V3_MAX_CPUS,
> - VGIC_V5_MAX_CPUS);
> + kvm_vgic_global_state.max_gic_vcpus);
>
> static_branch_enable(&kvm_vgic_global_state.gicv3_cpuif);
> kvm_info("GCIE legacy system register CPU interface\n");
> @@ -163,6 +231,167 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
> return 0;
> }
>
> +/*
> + * Wait for completion of a change in any of IRS_VMT_BASER, IRS_VMAP_L2_VMTR,
> + * IRS_VMAP_VMR, IRS_VMAP_VPER, IRS_VMAP_VISTR, IRS_VMAP_L2_VISTR.
> + */
> +static int vgic_v5_irs_wait_for_vm_op(void)
> +{
> + u32 statusr;
> + int ret;
> +
> + ret = readl_relaxed_poll_timeout_atomic(
> + irs_base + GICV5_IRS_VMT_STATUSR, statusr,
> + FIELD_GET(GICV5_IRS_VMT_STATUSR_IDLE, statusr), 1,
> + USEC_PER_SEC);
nit: please don't split this line before the first parameter of the
function.
> +
> + if (ret == -ETIMEDOUT) {
> + pr_err_ratelimited("Timed out waiting for IRS VM Op\n");
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static int vgic_v5_irs_assign_vmt(bool two_level, u8 vm_id_bits, phys_addr_t vmt_base)
> +{
> + u64 vmt_baser;
> + u32 vmt_cfgr;
> +
> + vmt_baser = irs_readq_relaxed(GICV5_IRS_VMT_BASER);
> + if (!!FIELD_GET(GICV5_IRS_VMT_BASER_VALID, vmt_baser))
> + return -EBUSY;
> +
> + vmt_cfgr = FIELD_PREP(GICV5_IRS_VMT_CFGR_VM_ID_BITS, vm_id_bits);
> + if (two_level)
> + vmt_cfgr |= FIELD_PREP(GICV5_IRS_VMT_CFGR_STRUCTURE,
> + GICV5_IRS_VMT_CFGR_STRUCTURE_TWO_LEVEL);
> +
> + irs_writel_relaxed(vmt_cfgr, GICV5_IRS_VMT_CFGR);
> +
> + /* The base address is intentionally only masked and not shifted */
> + vmt_baser = FIELD_PREP(GICV5_IRS_VMT_BASER_VALID, true) |
> + (vmt_base & GICV5_IRS_VMT_BASER_ADDR);
> + irs_writeq_relaxed(vmt_baser, GICV5_IRS_VMT_BASER);
> +
> + return vgic_v5_irs_wait_for_vm_op();
> +}
> +
> +static int vgic_v5_irs_clear_vmt(void)
> +{
> + irs_writeq_relaxed(0ULL, GICV5_IRS_VMT_BASER);
> +
> + return vgic_v5_irs_wait_for_vm_op();
> +}
> +
> +static int vgic_v5_irs_vmap_l2_vmt(int vm_id)
> +{
> + u64 vmap_l2_vmtr;
> + int ret = 0;
> +
> + guard(raw_spinlock)(&vm_config_lock);
> +
> + /* Make sure that we are idle to begin with */
> + ret = vgic_v5_irs_wait_for_vm_op();
> + if (ret)
> + return ret;
> +
> + /* Mark the VM as valid */
> + vmap_l2_vmtr = FIELD_PREP(GICV5_IRS_VMAP_L2_VMTR_VM_ID, vm_id) |
> + FIELD_PREP(GICV5_IRS_VMAP_L2_VMTR_M, true);
> + irs_writeq_relaxed(vmap_l2_vmtr, GICV5_IRS_VMAP_L2_VMTR);
> +
> + return vgic_v5_irs_wait_for_vm_op();
> +}
> +
> +static int __vgic_v5_irs_vmap_vm(int vm_id, bool unmap)
> +{
> + u64 vmap_vmr;
> + int ret;
> +
> + guard(raw_spinlock)(&vm_config_lock);
> +
> + /* Make sure that we are idle to begin with */
> + ret = vgic_v5_irs_wait_for_vm_op();
> + if (ret)
> + return ret;
> +
> + /* Mark the VM as valid */
> + vmap_vmr = FIELD_PREP(GICV5_IRS_VMAP_VMR_VM_ID, vm_id) |
> + FIELD_PREP(GICV5_IRS_VMAP_VMR_U, unmap) |
> + FIELD_PREP(GICV5_IRS_VMAP_VMR_M, true);
> + irs_writeq_relaxed(vmap_vmr, GICV5_IRS_VMAP_VMR);
> +
> + return vgic_v5_irs_wait_for_vm_op();
> +}
There is a pattern here:
static int do_something(...)
{
	int ret;

	guard(raw_spinlock)(&vm_config_lock);

	/* Make sure that we are idle to begin with */
	ret = vgic_v5_irs_wait_for_vm_op();
	if (ret)
		return ret;

	[do the something we came here for]

	return vgic_v5_irs_wait_for_vm_op();
}
Surely this can be turned into a helper that avoids having that
boilerplate code in each and every function.
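As a userspace sketch, the "wait until idle, poke a register, wait again" boilerplate can be factored into one helper taking a callback. Here wait_for_idle() stands in for vgic_v5_irs_wait_for_vm_op(), and the real version would also take vm_config_lock via guard(raw_spinlock); all names are illustrative, not kernel APIs.

```c
#include <assert.h>

static int idle_waits;

static int wait_for_idle(void)
{
	idle_waits++;
	return 0;	/* pretend the IDLE bit was observed */
}

/*
 * Generic wait-op-wait helper: ensure the IRS is idle, run the register
 * write we came here for, then wait for completion again.
 */
static int irs_vm_op(void (*op)(void *ctx), void *ctx)
{
	int ret;

	/* Make sure that we are idle to begin with */
	ret = wait_for_idle();
	if (ret)
		return ret;

	op(ctx);	/* the register write */

	return wait_for_idle();
}

static void fake_vmap_write(void *ctx)
{
	*(int *)ctx = 1;	/* stands in for irs_writeq_relaxed(...) */
}
```

Each caller (vgic_v5_irs_vmap_l2_vmt() and friends) then reduces to building its register value and passing a write callback to irs_vm_op().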
> +
> +static int vgic_v5_irs_set_vm_valid(int vm_id)
> +{
> + return __vgic_v5_irs_vmap_vm(vm_id, false);
> +}
> +
> +static int vgic_v5_irs_set_vm_invalid(int vm_id)
> +{
> + return __vgic_v5_irs_vmap_vm(vm_id, true);
> +}
> +
> +static int __vgic_v5_irs_update_vist_validity(int vm_id, bool spi_ist, bool unmap)
> +{
> + u8 type = spi_ist ? 0b011 : 0b010;
> + u64 vmap_vistr;
> + int ret;
> +
> + guard(raw_spinlock)(&vm_config_lock);
> +
> + /* Make sure that we are idle to begin with */
> + ret = vgic_v5_irs_wait_for_vm_op();
> + if (ret)
> + return ret;
> +
> + /* Mark the IST as valid */
> + vmap_vistr = FIELD_PREP(GICV5_IRS_VMAP_VISTR_TYPE, type) |
> + FIELD_PREP(GICV5_IRS_VMAP_VISTR_VM_ID, vm_id) |
> + FIELD_PREP(GICV5_IRS_VMAP_VISTR_U, unmap) |
> + FIELD_PREP(GICV5_IRS_VMAP_VISTR_M, true);
> + irs_writeq_relaxed(vmap_vistr, GICV5_IRS_VMAP_VISTR);
> +
> + return vgic_v5_irs_wait_for_vm_op();
> +}
> +
> +static int vgic_v5_irs_set_vist_valid(int vm_id, bool spi_ist)
> +{
> + return __vgic_v5_irs_update_vist_validity(vm_id, spi_ist, false);
> +}
> +
> +/* Note: We currently do not use this as we rely on the VM becoming invalid. */
> +static int vgic_v5_irs_set_vist_invalid(int vm_id, bool spi_ist)
> +{
> + return __vgic_v5_irs_update_vist_validity(vm_id, spi_ist, true);
> +}
> +
> +static int vgic_v5_db_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
> +{
> + struct vgic_v5_vm *vm = data->domain->host_data;
> + struct gicv5_cmd_info *cmd_info = vcpu_info;
> +
> + switch (cmd_info->cmd_type) {
> + case VMT_L2_MAP:
> + return vgic_v5_irs_vmap_l2_vmt(vm->vm_id);
> + case VMTE_MAKE_VALID:
> + return vgic_v5_irs_set_vm_valid(vm->vm_id);
> + case VMTE_MAKE_INVALID:
> + return vgic_v5_irs_set_vm_invalid(vm->vm_id);
> + case SPI_VIST_MAKE_VALID:
> + return vgic_v5_irs_set_vist_valid(vm->vm_id, true);
> + case LPI_VIST_MAKE_VALID:
> + return vgic_v5_irs_set_vist_valid(vm->vm_id, false);
> + case LPI_VIST_MAKE_INVALID:
> + return vgic_v5_irs_set_vist_invalid(vm->vm_id, false);
> + default:
> + return -EINVAL;
> + }
> +}
This function should have been introduced much earlier in the series,
as soon as you started issuing vcpu_set_affinity() calls.
> +
> /*
> * This set of irq_chip functions is specific for doorbells.
> */
> @@ -174,6 +403,7 @@ static struct irq_chip vgic_v5_db_irq_chip = {
> .irq_set_affinity = irq_chip_set_affinity_parent,
> .irq_get_irqchip_state = irq_chip_get_parent_state,
> .irq_set_irqchip_state = irq_chip_set_parent_state,
> + .irq_set_vcpu_affinity = vgic_v5_db_set_vcpu_affinity,
> .flags = IRQCHIP_SET_TYPE_MASKED | IRQCHIP_SKIP_SET_WAKE |
> IRQCHIP_MASK_ON_SUSPEND,
> };
> diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
> index ccec0a045927c..ff5ad653252d2 100644
> --- a/include/linux/irqchip/arm-gic-v5.h
> +++ b/include/linux/irqchip/arm-gic-v5.h
> @@ -87,6 +87,12 @@
> #define GICV5_IRS_IST_CFGR 0x0190
> #define GICV5_IRS_IST_STATUSR 0x0194
> #define GICV5_IRS_MAP_L2_ISTR 0x01c0
> +#define GICV5_IRS_VMT_BASER 0x0200
> +#define GICV5_IRS_VMT_CFGR 0x0210
> +#define GICV5_IRS_VMT_STATUSR 0x0214
> +#define GICV5_IRS_VMAP_L2_VMTR 0x02c0
> +#define GICV5_IRS_VMAP_VMR 0x02c8
> +#define GICV5_IRS_VMAP_VISTR 0x02d0
>
> #define GICV5_IRS_IDR0_VIRT BIT(6)
>
> @@ -181,6 +187,30 @@
>
> #define GICV5_IRS_MAP_L2_ISTR_ID GENMASK(23, 0)
>
> +#define GICV5_IRS_VMT_BASER_ADDR GENMASK_ULL(51, 3)
> +#define GICV5_IRS_VMT_BASER_ADDR_SHIFT 3ULL
> +#define GICV5_IRS_VMT_BASER_VALID BIT_ULL(0)
> +
> +#define GICV5_IRS_VMT_CFGR_STRUCTURE_TWO_LEVEL 0b1
> +#define GICV5_IRS_VMT_CFGR_STRUCTURE_LINEAR 0b0
> +
> +#define GICV5_IRS_VMT_CFGR_STRUCTURE BIT(16)
> +#define GICV5_IRS_VMT_CFGR_VM_ID_BITS GENMASK(4, 0)
> +
> +#define GICV5_IRS_VMT_STATUSR_IDLE BIT(0)
> +
> +#define GICV5_IRS_VMAP_L2_VMTR_M BIT_ULL(63)
> +#define GICV5_IRS_VMAP_L2_VMTR_VM_ID GENMASK_ULL(15, 0)
> +
> +#define GICV5_IRS_VMAP_VMR_M BIT_ULL(63)
> +#define GICV5_IRS_VMAP_VMR_U BIT_ULL(62)
> +#define GICV5_IRS_VMAP_VMR_VM_ID GENMASK_ULL(15, 0)
> +
> +#define GICV5_IRS_VMAP_VISTR_M BIT_ULL(63)
> +#define GICV5_IRS_VMAP_VISTR_U BIT_ULL(62)
> +#define GICV5_IRS_VMAP_VISTR_VM_ID GENMASK_ULL(47, 32)
> +#define GICV5_IRS_VMAP_VISTR_TYPE GENMASK_ULL(31, 29)
> +
> #define GICV5_ISTL1E_VALID BIT_ULL(0)
> #define GICV5_IRS_ISTL1E_SIZE 8UL
>
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 10/43] KVM: arm64: gic-v5: Implement VPE IRS MMIO Ops
2026-04-27 16:09 ` [PATCH 10/43] KVM: arm64: gic-v5: Implement VPE " Sascha Bischoff
@ 2026-04-30 8:46 ` Marc Zyngier
0 siblings, 0 replies; 66+ messages in thread
From: Marc Zyngier @ 2026-04-30 8:46 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 17:09:27 +0100,
Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> Introduce interfaces to make VPEs valid, and to configure them, via
> the host's IRS. As with the other valid bits in the GICv5 VM tables,
> VPEs cannot be made valid directly, and instead are made valid via an
> IRS MMIO Op.
>
> Additionally, some of the VPE configuration takes place via the IRS
> MMIO interface too (via the IRS_VPE_CR0, IRS_VPE_DBR). VPE doorbells
> are, for example, configured via this interface.
>
> The existing VPE-doorbell-based commands are extended with:
>
> VPE_MAKE_VALID - Make the VPE valid in the VPET
> VPE_CR0_READ - Handle a guest read from IRS_PE_CR0
> VPE_CR0_WRITE - Handle a guest write to IRS_PE_CR0
>
> Note: There is no VPE_MAKE_INVALID as VPEs are only made invalid on
> teardown, at which point the whole VMTE is marked as invalid. Hence,
> it is not required.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/kvm/vgic/vgic-v5.c | 164 +++++++++++++++++++++++++++++
> include/linux/irqchip/arm-gic-v5.h | 27 +++++
> 2 files changed, 191 insertions(+)
>
> diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
> index 49eb01ca07961..0649729f6b834 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5.c
> @@ -253,6 +253,25 @@ static int vgic_v5_irs_wait_for_vm_op(void)
> return 0;
> }
>
> +/* Wait for completion of a VPE_STATUSR change */
> +static int vgic_v5_irs_wait_for_vpe_op(void)
> +{
> + int ret;
> + u32 statusr;
> +
> + ret = readl_relaxed_poll_timeout_atomic(
> + irs_base + GICV5_IRS_VPE_STATUSR, statusr,
> + FIELD_GET(GICV5_IRS_VPE_STATUSR_IDLE, statusr), 1,
> + USEC_PER_SEC);
Formatting.
> +
> + if (ret == -ETIMEDOUT) {
> + pr_err_ratelimited("Timed out waiting for IRS VPE Op\n");
> + return ret;
> + }
> +
> + return 0;
You seem to have a number of these primitives. Consider having a
generic helper that takes the required parameters, including a partial
string in case of error.
> +}
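The generic poll helper suggested above could be parameterised by the status register and a short name for the error message, collapsing the near-identical vm/vpe variants into one. In this userspace sketch, reg_read() and a bounded loop stand in for readl_relaxed_poll_timeout_atomic(); the names are illustrative, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define STATUSR_IDLE	0x1u

static uint32_t fake_regs[2];	/* [0] = VMT_STATUSR, [1] = VPE_STATUSR */

static uint32_t reg_read(unsigned int idx)
{
	return fake_regs[idx];
}

/*
 * One helper for all IDLE polls: the caller says which status register
 * to watch and what to call it in the timeout message.
 */
static int irs_wait_idle(unsigned int statusr_idx, const char *what)
{
	for (int tries = 0; tries < 1000; tries++) {
		if (reg_read(statusr_idx) & STATUSR_IDLE)
			return 0;
	}

	fprintf(stderr, "Timed out waiting for IRS %s Op\n", what);
	return -1;	/* -ETIMEDOUT in the kernel version */
}
```

vgic_v5_irs_wait_for_vm_op() and vgic_v5_irs_wait_for_vpe_op() then become one-line wrappers passing their register offset and "VM"/"VPE" string.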
> +
> static int vgic_v5_irs_assign_vmt(bool two_level, u8 vm_id_bits, phys_addr_t vmt_base)
> {
> u64 vmt_baser;
> @@ -369,10 +388,142 @@ static int vgic_v5_irs_set_vist_invalid(int vm_id, bool spi_ist)
> return __vgic_v5_irs_update_vist_validity(vm_id, spi_ist, true);
> }
>
> +static int vgic_v5_irs_set_up_vpe(int vm_id, int vpe_id, irq_hw_number_t db_hwirq)
> +{
> + u64 vmap_vper, dbr, selr;
> + u32 statusr, cr0;
> + int ret;
> +
> + guard(raw_spinlock)(&vm_config_lock);
> +
> + /* Make sure that we are idle to begin with */
> + ret = vgic_v5_irs_wait_for_vm_op();
> + if (ret)
> + return ret;
> +
> + /* Mark the VPE as valid */
> + vmap_vper = FIELD_PREP(GICV5_IRS_VMAP_VPER_VPE_ID, vpe_id) |
> + FIELD_PREP(GICV5_IRS_VMAP_VPER_VM_ID, vm_id) |
> + FIELD_PREP(GICV5_IRS_VMAP_VPER_M, true);
That's another of these single bit mask used with FIELD_PREP. Consider
rewriting it as:
vmap_vper = FIELD_PREP(GICV5_IRS_VMAP_VPER_VPE_ID, vpe_id) |
FIELD_PREP(GICV5_IRS_VMAP_VPER_VM_ID, vm_id) |
GICV5_IRS_VMAP_VPER_M;
> + irs_writeq_relaxed(vmap_vper, GICV5_IRS_VMAP_VPER);
> +
> + /* Wait for the VPE to be marked valid in the VPET */
> + ret = vgic_v5_irs_wait_for_vm_op();
> + if (ret)
> + return ret;
> +
> + selr = FIELD_PREP(GICV5_IRS_VPE_SELR_VPE_ID, vpe_id) |
> + FIELD_PREP(GICV5_IRS_VPE_SELR_VM_ID, vm_id) |
> + FIELD_PREP(GICV5_IRS_VPE_SELR_S, true);
> + irs_writeq_relaxed(selr, GICV5_IRS_VPE_SELR);
> +
> + ret = vgic_v5_irs_wait_for_vpe_op();
> + if (ret)
> + return ret;
> +
> + statusr = irs_readl_relaxed(GICV5_IRS_VPE_STATUSR);
> + if (!FIELD_GET(GICV5_IRS_VPE_STATUSR_V, statusr))
> + return -EINVAL;
> +
> + /* Set targeted only routing (disable 1ofN vPE selection) */
> + cr0 = FIELD_PREP(GICV5_IRS_VPE_CR0_DPS, true);
> + irs_writel_relaxed(cr0, GICV5_IRS_VPE_CR0);
> +
> + ret = vgic_v5_irs_wait_for_vpe_op();
> + if (ret)
> + return ret;
> +
> + statusr = irs_readl_relaxed(GICV5_IRS_VPE_STATUSR);
> + if (FIELD_GET(GICV5_IRS_VPE_STATUSR_F, statusr))
> + ret = -EINVAL;
> +
> + /*
> + * The VPE has not yet run. Therefore, make sure that all interrupts
> + * will generate a doorbell.
> + */
> + dbr = FIELD_PREP(GICV5_IRS_VPE_DBR_LPI_ID, db_hwirq) |
> + FIELD_PREP(GICV5_IRS_VPE_DBR_DBPM, 0b11111) |
> + FIELD_PREP(GICV5_IRS_VPE_DBR_REQ_DB, false) |
And anything that is set to false can be removed altogether.
> + FIELD_PREP(GICV5_IRS_VPE_DBR_DBV, true);
> + irs_writeq_relaxed(dbr, GICV5_IRS_VPE_DBR);
> +
> + ret = vgic_v5_irs_wait_for_vpe_op();
> + if (ret)
> + return ret;
> +
> + statusr = irs_readl_relaxed(GICV5_IRS_VPE_STATUSR);
> + if (FIELD_GET(GICV5_IRS_VPE_STATUSR_F, statusr))
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static int vgic_v5_irs_vpe_cr0_read(int vm_id, int vpe_id, u64 *cr0)
> +{
> + u32 statusr;
> + u64 selr;
> + int ret;
> +
> + guard(raw_spinlock)(&vm_config_lock);
> +
> + selr = FIELD_PREP(GICV5_IRS_VPE_SELR_VPE_ID, vpe_id) |
> + FIELD_PREP(GICV5_IRS_VPE_SELR_VM_ID, vm_id) |
> + FIELD_PREP(GICV5_IRS_VPE_SELR_S, true);
> + irs_writeq_relaxed(selr, GICV5_IRS_VPE_SELR);
> +
> + ret = vgic_v5_irs_wait_for_vpe_op();
> + if (ret)
> + return ret;
> +
> + statusr = irs_readl_relaxed(GICV5_IRS_VPE_STATUSR);
> + if (!FIELD_GET(GICV5_IRS_VPE_STATUSR_V, statusr))
> + return -EINVAL;
> +
> + *cr0 = irs_readl_relaxed(GICV5_IRS_VPE_CR0);
> +
> + return 0;
I'd rather this function returned the CR0 value directly, even if the
IDLE bit isn't set. You can have a WARN_ONCE() if you want.
> +}
> +
> +static int vgic_v5_irs_vpe_cr0_update(int vm_id, int vpe_id, u32 cr0)
*_write() would be better than *_update() when you already have
*_read(). Especially as a consequence of VPE_CR0_WRITE.
> +{
> + u32 statusr;
> + u64 selr;
> + int ret;
> +
> + guard(raw_spinlock)(&vm_config_lock);
> +
> + selr = FIELD_PREP(GICV5_IRS_VPE_SELR_VPE_ID, vpe_id) |
> + FIELD_PREP(GICV5_IRS_VPE_SELR_VM_ID, vm_id) |
> + FIELD_PREP(GICV5_IRS_VPE_SELR_S, true);
> + irs_writeq_relaxed(selr, GICV5_IRS_VPE_SELR);
> +
> + ret = vgic_v5_irs_wait_for_vpe_op();
> + if (ret)
> + return ret;
> +
> + statusr = irs_readl_relaxed(GICV5_IRS_VPE_STATUSR);
> + if (!FIELD_GET(GICV5_IRS_VPE_STATUSR_V, statusr))
> + return ret;
return 0? But you have set SELR to something. Surely reading V==0
here is an indication of a bug. So should you report the error? Warn?
> +
> + irs_writel_relaxed(cr0, GICV5_IRS_VPE_CR0);
> +
> + ret = vgic_v5_irs_wait_for_vpe_op();
> + if (ret)
> + return ret;
> +
> + statusr = irs_readl_relaxed(GICV5_IRS_VPE_STATUSR);
> + if (FIELD_GET(GICV5_IRS_VPE_STATUSR_F, statusr))
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> static int vgic_v5_db_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
> {
> struct vgic_v5_vm *vm = data->domain->host_data;
> struct gicv5_cmd_info *cmd_info = vcpu_info;
> + /* Our VPE ID is the index within the doorbell domain */
> + u16 vpe_id = data->hwirq;
>
> switch (cmd_info->cmd_type) {
> case VMT_L2_MAP:
> @@ -381,6 +532,19 @@ static int vgic_v5_db_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
> return vgic_v5_irs_set_vm_valid(vm->vm_id);
> case VMTE_MAKE_INVALID:
> return vgic_v5_irs_set_vm_invalid(vm->vm_id);
> + case VPE_MAKE_VALID:
> + /*
> + * We need the actual LPI ID which lives in the top-most parent
> + * domain. This hwirq won't include the type (LPI) but that's
> + * not required for the IRS_VPE_DBR.
> + */
> + while (data->parent_data != NULL)
> + data = data->parent_data;
> + return vgic_v5_irs_set_up_vpe(vm->vm_id, vpe_id, data->hwirq);
> + case VPE_CR0_READ:
> + return vgic_v5_irs_vpe_cr0_read(vm->vm_id, vpe_id, &cmd_info->data);
> + case VPE_CR0_WRITE:
> + return vgic_v5_irs_vpe_cr0_update(vm->vm_id, vpe_id, cmd_info->data);
> case SPI_VIST_MAKE_VALID:
> return vgic_v5_irs_set_vist_valid(vm->vm_id, true);
> case LPI_VIST_MAKE_VALID:
> diff --git a/include/linux/irqchip/arm-gic-v5.h b/include/linux/irqchip/arm-gic-v5.h
> index ff5ad653252d2..54b573783cd75 100644
> --- a/include/linux/irqchip/arm-gic-v5.h
> +++ b/include/linux/irqchip/arm-gic-v5.h
> @@ -90,9 +90,14 @@
> #define GICV5_IRS_VMT_BASER 0x0200
> #define GICV5_IRS_VMT_CFGR 0x0210
> #define GICV5_IRS_VMT_STATUSR 0x0214
> +#define GICV5_IRS_VPE_SELR 0x0240
> +#define GICV5_IRS_VPE_DBR 0x0248
> +#define GICV5_IRS_VPE_CR0 0x0258
> +#define GICV5_IRS_VPE_STATUSR 0x025c
> #define GICV5_IRS_VMAP_L2_VMTR 0x02c0
> #define GICV5_IRS_VMAP_VMR 0x02c8
> #define GICV5_IRS_VMAP_VISTR 0x02d0
> +#define GICV5_IRS_VMAP_VPER 0x02e0
>
> #define GICV5_IRS_IDR0_VIRT BIT(6)
>
> @@ -199,6 +204,21 @@
>
> #define GICV5_IRS_VMT_STATUSR_IDLE BIT(0)
>
> +#define GICV5_IRS_VPE_SELR_S BIT_ULL(63)
> +#define GICV5_IRS_VPE_SELR_VPE_ID GENMASK_ULL(47, 32)
> +#define GICV5_IRS_VPE_SELR_VM_ID GENMASK_ULL(15, 0)
> +
> +#define GICV5_IRS_VPE_DBR_DBV BIT_ULL(63)
> +#define GICV5_IRS_VPE_DBR_REQ_DB BIT_ULL(62)
> +#define GICV5_IRS_VPE_DBR_DBPM GENMASK_ULL(36, 32)
> +#define GICV5_IRS_VPE_DBR_LPI_ID GENMASK_ULL(23, 0)
> +
> +#define GICV5_IRS_VPE_CR0_DPS BIT(0)
> +
> +#define GICV5_IRS_VPE_STATUSR_F BIT(2)
> +#define GICV5_IRS_VPE_STATUSR_V BIT(1)
> +#define GICV5_IRS_VPE_STATUSR_IDLE BIT(0)
> +
> #define GICV5_IRS_VMAP_L2_VMTR_M BIT_ULL(63)
> #define GICV5_IRS_VMAP_L2_VMTR_VM_ID GENMASK_ULL(15, 0)
>
> @@ -211,6 +231,10 @@
> #define GICV5_IRS_VMAP_VISTR_VM_ID GENMASK_ULL(47, 32)
> #define GICV5_IRS_VMAP_VISTR_TYPE GENMASK_ULL(31, 29)
>
> +#define GICV5_IRS_VMAP_VPER_M BIT_ULL(63)
> +#define GICV5_IRS_VMAP_VPER_VM_ID GENMASK_ULL(47, 32)
> +#define GICV5_IRS_VMAP_VPER_VPE_ID GENMASK_ULL(15, 0)
> +
> #define GICV5_ISTL1E_VALID BIT_ULL(0)
> #define GICV5_IRS_ISTL1E_SIZE 8UL
>
> @@ -480,6 +504,9 @@ enum gicv5_vcpu_info_cmd_type {
> VMT_L2_MAP, /* Map in a L2 VMT - *may* happen on VM init */
> VMTE_MAKE_VALID, /* Make the VMTE valid */
> VMTE_MAKE_INVALID, /* Make the VMTE (et al.) invalid */
> + VPE_MAKE_VALID, /* No corresponding invalid */
> + VPE_CR0_READ, /* Read of VPE_CR0 (guest read from PE_CR0) */
> + VPE_CR0_WRITE, /* Write to VPE_CR0 (guest write to PE_CR0) */
> SPI_VIST_MAKE_VALID, /* No corresponding invalid */
> LPI_VIST_MAKE_VALID, /* Triggered by a guest */
> LPI_VIST_MAKE_INVALID, /* Triggered by a guest */
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH 43/43] Documentation: KVM: Add the VGICv5 IRS save/restore sequences
2026-04-27 16:20 ` [PATCH 43/43] Documentation: KVM: Add the VGICv5 IRS save/restore sequences Sascha Bischoff
@ 2026-04-30 8:57 ` Peter Maydell
0 siblings, 0 replies; 66+ messages in thread
From: Peter Maydell @ 2026-04-30 8:57 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, maz@kernel.org, oliver.upton@linux.dev,
Joey Gouly, Suzuki Poulose, yuzenghui@huawei.com,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 at 17:22, Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> When saving/restoring the state of the GICv5 IRS, it is important that
> it happens in the correct order. Failure to do so will almost
> certainly result in a restored guest that is incapable of handling
> interrupts correctly.
>
> On a save, the ISTs must be saved prior to saving the guest's memory
> as the guest's LPI IST is written to guest memory. Conversely, on
> restore the guest's memory must be restored prior to restoring the
> ISTs.
>
> It is important to restore the IRS MMIO registers by first restoring
> the IRS_IDx registers as they define the capabilities of the IRS, and
> are used as part of creating and managing ISTs and SPIs.
>
> In order to restore the ISTs themselves, the IRS_IST_CFGR must be
> restored prior to the IRS_IST_BASER. This is because KVM extracts
> fields from the CFGR to determine the size and structure of the IST
> created by the guest. The IST itself is created as part of the write
> to the IRS_IST_BASER. At this stage the remaining MMIO registers can
> be restored.
>
> Once the LPI IST has been created (by the aforementioned write to the
> IRS_IST_BASER), the IST state can be restored using
> KVM_DEV_ARM_VGIC_GRP_IST. The SPI IST gets extracted from a userspace
> provided buffer, and is transferred to the host-allocated SPI IST. The
> LPI IST is extracted from guest memory, and is written to the
> host-allocated LPI IST.
>
> As a general rule, the IRS_*_STATUSR registers can be ignored on
> restore. They are not userspace writable.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> .../virt/kvm/devices/arm-vgic-v5.rst | 63 +++++++++++++++++++
> 1 file changed, 63 insertions(+)
>
> diff --git a/Documentation/virt/kvm/devices/arm-vgic-v5.rst b/Documentation/virt/kvm/devices/arm-vgic-v5.rst
> index 38eef7cc63e3e..1c55f5040757d 100644
> --- a/Documentation/virt/kvm/devices/arm-vgic-v5.rst
> +++ b/Documentation/virt/kvm/devices/arm-vgic-v5.rst
> @@ -201,3 +201,66 @@ Groups:
> -ENOMEM Restoring IST state failed while tracking pending interrupts
> -ETIMEDOUT An IRS save/VM operation timed out
> =========== ============================================================
> +
> +IRS Save Sequence:
> +------------------
> +
> +The following ordering should be followed when saving the virtual GICv5 and
> +IRS:
> +
> +a) Save the ISTs by issuing KVM_GET_DEVICE_ATTR on KVM_DEV_ARM_VGIC_GRP_IST.
> + This MUST happen before the guest's memory is serialised as the LPI IST is
> + stored directly to guest memory.
> +
> +b) Save the IRS MMIO register state in the following order by issuing
> + KVM_GET_DEVICE_ATTR on KVM_DEV_ARM_VGIC_GRP_IRS_REGS:
> +
> + 1. Save IRS_IDR0-2 and IRS_IDR5-7 registers.
> + 2. Save IRS_IST_CFGR.
> + 3. Save IRS_IST_BASER.
> + 4. Save the remaining global IRS MMIO registers.
> + 5. For each PE:
> + - write IRS_PE_SELR
> + - save IRS_PE_CR0
> + 6. For each SPI:
> + - write IRS_SPI_SELR
> + - save IRS_SPI_CFGR
> +
> +IRS Restore Sequence:
> +---------------------
> +
> +The following ordering must be followed when restoring the virtual GICv5 and
> +IRS:
> +
> +a) restore all guest memory and create vcpus
> +b) provide the IRS base address by issuing KVM_SET_DEVICE_ATTR on
> + KVM_DEV_ARM_VGIC_GRP_ADDR
> +c) initialise the GIC - this sets up the default state and creates the SPI
> + IST - by issuing KVM_SET_DEVICE_ATTR on KVM_DEV_ARM_VGIC_GRP_CTRL with
> + KVM_DEV_ARM_VGIC_CTRL_INIT
This isn't going to work for QEMU, if I understand it correctly.
QEMU always creates the whole VM first, including creating the
VCPUs and GIC, telling KVM what its base address is, initializing it,
etc, before it starts an inbound migration. So the memory read
is going to come in after step (c), not right at the start.
> +d) restore the IRS MMIO register state in the following order by issuing
> + KVM_SET_DEVICE_ATTR on KVM_DEV_ARM_VGIC_GRP_IRS_REGS:
> +
> + 1. Restore IRS_IDR0-2 and IRS_IDR5-7 registers.
> + 2. Restore IRS_IST_CFGR.
> + 3. Restore IRS_IST_BASER - this triggers KVM to create the LPI IST.
> +
> +e) restore the ISTs by issuing KVM_SET_DEVICE_ATTR on
> + KVM_DEV_ARM_VGIC_GRP_IST.
> +f) restore the remaining IRS MMIO register state in the following order by
> + issuing KVM_SET_DEVICE_ATTR on KVM_DEV_ARM_VGIC_GRP_IRS_REGS:
> +
> + 1. Restore the remaining global IRS MMIO registers.
> + 2. For each PE:
> + - write IRS_PE_SELR
> + - restore IRS_PE_CR0
> + 3. For each SPI:
> + - write IRS_SPI_SELR
> + - restore IRS_SPI_CFGR
More generally, if your API involves this much in the way
of complicated ordering dependencies, it's going to be
very bug prone. From userspace's perspective, this is
not a very helpful way to design the interface :-)
thanks
-- PMM
* Re: [PATCH 11/43] KVM: arm64: gic-v5: Make VPEs valid in vgic_v5_reset()
2026-04-27 16:09 ` [PATCH 11/43] KVM: arm64: gic-v5: Make VPEs valid in vgic_v5_reset() Sascha Bischoff
@ 2026-04-30 9:37 ` Marc Zyngier
0 siblings, 0 replies; 66+ messages in thread
From: Marc Zyngier @ 2026-04-30 9:37 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 17:09:47 +0100,
Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> When resetting VPEs, also mark them as valid in the VM VPE Table. This
> is required as it informs the IRS that a specific VPE may be made
> resident, and without this the IRS will treat the VPE as invalid.
>
> As part of this change, we also introduce a wrapper around the VPE
> doorbells - vgic_v5_send_command(). This takes a struct kvm_vcpu
> pointer, and the command to run, and triggers the function bound to
> the command via that vcpu's doorbell. This is a convenience function
> to simplify the code.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/kvm/vgic/vgic-v5.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
> index 0649729f6b834..92bb63b6dd6bb 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5.c
> @@ -518,6 +518,18 @@ static int vgic_v5_irs_vpe_cr0_update(int vm_id, int vpe_id, u32 cr0)
> return 0;
> }
>
> +static int vgic_v5_send_command(struct kvm_vcpu *vcpu,
> + enum gicv5_vcpu_info_cmd_type type)
> +{
> + struct gicv5_cmd_info cmd_info;
> +
> + if (!vcpu)
> + return -EINVAL;
Drop this. If we must crash, let's crash early, in the most
spectacular way, and leaving a backtrace. Trying to gracefully handle
this stuff is making it harder to track and debug such problems.
Also, if we made it that far, it is very likely that vcpu isn't NULL.
> +
> + cmd_info.cmd_type = type;
> + return irq_set_vcpu_affinity(vgic_v5_vpe_db(vcpu), &cmd_info);
> +}
> +
> static int vgic_v5_db_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
> {
> struct vgic_v5_vm *vm = data->domain->host_data;
> @@ -691,6 +703,8 @@ static void vgic_v5_teardown_per_vm_domain(struct vgic_v5_vm *vm)
>
> void vgic_v5_reset(struct kvm_vcpu *vcpu)
> {
> + int rc;
> +
> /*
> * We always present 16-bits of ID space to the guest, irrespective of
> * the host allowing more.
> @@ -702,6 +716,14 @@ void vgic_v5_reset(struct kvm_vcpu *vcpu)
> * CPUIF (but potentially fewer in the IRS).
> */
> vcpu->arch.vgic_cpu.num_pri_bits = 5;
> +
> + /* Make the VPE valid in the VPET */
> + rc = vgic_v5_send_command(vcpu, VPE_MAKE_VALID);
> + if (rc) {
nit: rc serves no purpose here. Just write it as:
if (vgic_v5_send_command(...))
> + /* We can't continue, so mark the VM as dead */
> + kvm_vm_dead(vcpu->kvm);
> + return;
But we probably should dump some traces here, rather than silently
mark the VM as fsck'd. This is a sign that something has gone very
wrong.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 13/43] KVM: arm64: gic-v5: Make VPEs (non-)resident in vgic_load/put
2026-04-27 16:10 ` [PATCH 13/43] KVM: arm64: gic-v5: Make VPEs (non-)resident in vgic_load/put Sascha Bischoff
@ 2026-04-30 10:26 ` Marc Zyngier
0 siblings, 0 replies; 66+ messages in thread
From: Marc Zyngier @ 2026-04-30 10:26 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 17:10:28 +0100,
Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> Extend vgic_v5_load and vgic_v5_put to make the VPEs resident and
> non-resident, respectively. This makes the IRS aware of which VPE is
> currently resident, and therefore allows it to perform HPPI selection
> for LPIs and SPIs, which would otherwise never be signalled to the
> VPE.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/kvm/vgic/vgic-v5.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
> index 92bb63b6dd6bb..11a1a491b7e0a 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5.c
> @@ -1053,6 +1053,8 @@ void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu)
> void vgic_v5_load(struct kvm_vcpu *vcpu)
> {
> struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
> + u16 vm = vgic_v5_vm_id(vcpu->kvm);
> + u16 vpe = vgic_v5_vpe_id(vcpu);
>
> /*
> * On the WFI path, vgic_load is called a second time. The first is when
> @@ -1065,7 +1067,11 @@ void vgic_v5_load(struct kvm_vcpu *vcpu)
>
> kvm_call_hyp(__vgic_v5_restore_vmcr_apr, cpu_if);
>
> - cpu_if->gicv5_vpe.resident = true;
> + cpu_if->vgic_contextr = FIELD_PREP(ICH_CONTEXTR_EL2_V, true) |
> + FIELD_PREP(ICH_CONTEXTR_EL2_VPE, vpe) |
> + FIELD_PREP(ICH_CONTEXTR_EL2_VM, vm);
> +
> + kvm_call_hyp(__vgic_v5_make_resident, cpu_if);
> }
>
> void vgic_v5_put(struct kvm_vcpu *vcpu)
> @@ -1083,7 +1089,9 @@ void vgic_v5_put(struct kvm_vcpu *vcpu)
>
> kvm_call_hyp(__vgic_v5_save_apr, cpu_if);
>
> - cpu_if->gicv5_vpe.resident = false;
> + cpu_if->vgic_contextr = 0;
> +
> + kvm_call_hyp(__vgic_v5_make_non_resident, cpu_if);
>
> /* The shadow priority is only updated on entering WFI */
> if (vcpu_get_flag(vcpu, IN_WFI))
Should this patch be folded into patch #3? They seem to overlap in major
ways.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH 14/43] KVM: arm64: gic-v5: Request VPE doorbells when going non-resident
2026-04-27 16:10 ` [PATCH 14/43] KVM: arm64: gic-v5: Request VPE doorbells when going non-resident Sascha Bischoff
@ 2026-04-30 10:37 ` Marc Zyngier
0 siblings, 0 replies; 66+ messages in thread
From: Marc Zyngier @ 2026-04-30 10:37 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 17:10:49 +0100,
Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> When a VPE is made non-resident and is entering WFI, a doorbell should
> be requested for the VPE. This allows the VPE to be easily woken once
> an SPI/LPI interrupt is pending for it. This is tracked by the IRS,
> which will signal the specific VPE doorbell for the VPE once such an
> interrupt arrives.
>
> Requesting a doorbell involves calculating the DBPM - DoorBell
> Priority Mask - which ensures that the DB is only signalled by the
> hardware if the pending interrupt is of sufficient priority. This
> avoids waking a VPE that can't process the incoming interrupt.
>
> Doorbells are NOT requested if a VPE is not entering WFI, as we expect
> it to become resident again imminently.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/kvm/vgic/vgic-v5.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
> index 11a1a491b7e0a..2fc6fa4df034f 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5.c
> @@ -1077,6 +1077,9 @@ void vgic_v5_load(struct kvm_vcpu *vcpu)
> void vgic_v5_put(struct kvm_vcpu *vcpu)
> {
> struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
> + bool req_db = !!vcpu_get_flag(vcpu, IN_WFI);
Drop the spurious variable and move the check into the if () statement.
This is way more readable than declaring a variable.
> + u32 priority_mask;
> + int dbpm;
Move these into the inner block.
>
> /*
> * Do nothing if we're not resident. This can happen in the WFI path
> @@ -1090,6 +1093,31 @@ void vgic_v5_put(struct kvm_vcpu *vcpu)
> kvm_call_hyp(__vgic_v5_save_apr, cpu_if);
>
> cpu_if->vgic_contextr = 0;
> + if (req_db) {
> + /*
> + * Find the virtual running priority and use this to calculate
> + * the doorbell priority mask. We combine the highest active
> + * priority and the CPU's priority mask. The guest can't handle
> + * interrupts with priorities less than or equal to the virtual
> + * running priority, so there's literally no point in waking the
> + * guest for these.
> + *
> + * The priority needs to be higher than the mask to signal, so
> + * pick the next higher priority (subtract 1).
> + */
> + priority_mask = vgic_v5_get_effective_priority_mask(vcpu);
> +
> + /* Don't request a doorbell if the max priority is masked */
This comment reads badly. I'd suggest something like "Request a
doorbell *unless* the priority is 0, indicating that no interrupt can
wake the vcpu up".
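The mask-to-DBPM arithmetic quoted above can be sketched in isolation.
This is a minimal user-space model, not the kernel code: the struct,
names, and values are made up and do not reflect the real
ICH_CONTEXTR_EL2 field layout. It only shows the two rules under
discussion: skip the doorbell when the effective mask is 0, otherwise
program the mask one step higher in priority (numerically lower).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of a doorbell request (not the real register). */
struct db_cfg {
	bool request;	/* should a doorbell be requested at all? */
	uint8_t dbpm;	/* doorbell priority mask to program */
};

static struct db_cfg compute_db_cfg(uint8_t priority_mask)
{
	struct db_cfg cfg = { .request = false, .dbpm = 0 };

	/*
	 * A mask of 0 means no interrupt can wake the vCPU, so there is
	 * no point in requesting a doorbell.
	 */
	if (priority_mask) {
		cfg.request = true;
		/* Only strictly higher-priority interrupts should ring it */
		cfg.dbpm = priority_mask - 1;
	}
	return cfg;
}
```

A mask of 0x20, for instance, yields a DBPM of 0x1f, so only interrupts
the guest could actually take will fire the doorbell.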
> + if (priority_mask) {
> + dbpm = priority_mask - 1;
> + cpu_if->vgic_contextr = FIELD_PREP(ICH_CONTEXTR_EL2_DB, 1) |
> + FIELD_PREP(ICH_CONTEXTR_EL2_DBPM, dbpm);
> + }
> +
> + /* Make the doorbell affine to this CPU */
> + WARN_ON(irq_set_affinity(vgic_v5_vpe_db(vcpu),
> + cpumask_of(smp_processor_id())));
Repeatedly setting the affinity is likely to be costly. It may be
worth comparing with the current affinity somehow.
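One shape the suggested caching could take, modelled in plain C with
counters standing in for the kernel machinery (none of these names are
real KVM or genirq APIs): remember the doorbell's current affinity and
only take the expensive path when the target CPU actually changes.

```c
#include <assert.h>

static int db_cpu = -1;		/* cached doorbell affinity; -1 = unknown */
static int expensive_calls;	/* counts stand-in irq_set_affinity() calls */

static void db_make_affine(int cpu)
{
	/* Already affine to this CPU: nothing to do */
	if (db_cpu == cpu)
		return;

	expensive_calls++;	/* stands in for the costly affinity update */
	db_cpu = cpu;
}
```

With this, repeated vgic_put on the same physical CPU would pay the
affinity cost only once, at the price of keeping the cache coherent if
anything else can move the interrupt.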
> + }
>
> kvm_call_hyp(__vgic_v5_make_non_resident, cpu_if);
>
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 16/43] KVM: arm64: gic-v5: Initialise and teardown VMTEs & doorbells
2026-04-27 16:11 ` [PATCH 16/43] KVM: arm64: gic-v5: Initialise and teardown VMTEs & doorbells Sascha Bischoff
@ 2026-04-30 12:23 ` Marc Zyngier
0 siblings, 0 replies; 66+ messages in thread
From: Marc Zyngier @ 2026-04-30 12:23 UTC (permalink / raw)
To: Sascha Bischoff
Cc: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, nd, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
On Mon, 27 Apr 2026 17:11:30 +0100,
Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
>
> Each GICv5 VM requires a valid VM Table Entry (VMTE). The VM Table
> itself is allocated during probe time, but a VM needs to provision a
> VMTE before it is able to properly run (PPIs will work, but nothing
> else will - and PPIs alone are not useful!).
>
> The correct time for setting up the VMTE is during VM
> initialisation. For GICv5, this is vgic_v5_init(). Each VM needs a VM
> ID - this is actually the index into the VM Table so it is how a
> specific VMTE is selected too. As part of vgic_v5_init, get a VM ID via
> vgic_v5_allocate_vm_id(), which internally uses an IDA to select an
> unused VM ID (and hence VMTE) within the range of allowed VM IDs.
>
> Once the VM ID has been allocated, the doorbell domain for the VM is
> allocated, and each doorbell is allocated and assigned to a vcpu.
>
> Assuming everything up until this point has succeeded, initialise the
> VMTE. Internally this allocates the additional data structures
> required by the hardware - the VM Descriptor, VPE Table, etc. This
> VMTE is then made valid via the IRS's MMIO interface. Finally, all
> VPEs are allocated within the VPET.
>
> On teardown, this process is reversed again. The VMTE is made invalid,
> the VPEs are freed, the doorbells are released and the domain torn
> down, and finally the VM ID is released. The latter allows the VM ID
> and VMTE to be reused for a future VM.
>
> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> ---
> arch/arm64/kvm/vgic/vgic-v5.c | 146 +++++++++++++++++++++++++++++-----
> 1 file changed, 128 insertions(+), 18 deletions(-)
>
> diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
> index 2fc6fa4df034f..9347bc6895223 100644
> --- a/arch/arm64/kvm/vgic/vgic-v5.c
> +++ b/arch/arm64/kvm/vgic/vgic-v5.c
> @@ -518,6 +518,18 @@ static int vgic_v5_irs_vpe_cr0_update(int vm_id, int vpe_id, u32 cr0)
> return 0;
> }
>
> +static irqreturn_t db_handler(int irq, void *data)
> +{
> + struct kvm_vcpu *vcpu = data;
> +
> + WRITE_ONCE(vcpu->arch.vgic_cpu.vgic_v5.gicv5_vpe.db_fired, true);
> +
> + kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu);
> + kvm_vcpu_kick(vcpu);
> +
> + return IRQ_HANDLED;
> +}
> +
I think it'd make more sense if the doorbell
handling/requesting/freeing was one patch, or at least a set of
consecutive patches in the series.
As it is now, it is very hard to keep track of things. You have part
of it in the previous patch, the requesting and handling here, and
probably the freeing in some other patch I haven't seen.
> static int vgic_v5_send_command(struct kvm_vcpu *vcpu,
> enum gicv5_vcpu_info_cmd_type type)
> {
> @@ -726,26 +738,46 @@ void vgic_v5_reset(struct kvm_vcpu *vcpu)
> }
> }
>
> -int vgic_v5_init(struct kvm *kvm)
> +int vgic_v5_map_resources(struct kvm *kvm)
> {
> - struct kvm_vcpu *vcpu;
> - unsigned long idx;
> - int ret;
> + if (!vgic_initialized(kvm))
> + return -EBUSY;
>
> - if (vgic_initialized(kvm))
> - return 0;
> + return 0;
> +}
Pointless code movement?
>
> - ret = vgic_v5_create_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
> - if (ret)
> - return ret;
> +/*
> + * Claim and populate a VMTE (optionally making a new L2 VMT valid), create VPE
> + * doorbells, allocate VPET and populate for each VPE. Finally, we also init the
> + * vIRS, which means allocating and making the virtual SPI IST valid.
> + *
> + * Note: We do need to put the cart before the horse here. The VPE doorbells are
> + * our conduit for communication with the IRS, which means we need to have those
> + * before making the VMTE valid.
> + *
> + * On failure, we clean up in the teardown path (vgic_v5_teardown()).
> + */
> +int vgic_v5_init(struct kvm *kvm)
> +{
> + int nr_vcpus, ret = 0;
> + struct kvm_vcpu *vcpu, *vcpu0;
> + unsigned long i;
> + struct irq_data *d;
> + unsigned int db_virq;
> +
> + nr_vcpus = atomic_read(&kvm->online_vcpus);
> + if (nr_vcpus == 0)
> + return -ENODEV;
>
> - kvm_for_each_vcpu(idx, vcpu, kvm) {
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> if (vcpu_has_nv(vcpu)) {
> kvm_err("Nested GICv5 VMs are currently unsupported\n");
> return -EINVAL;
> }
> }
>
> + kvm->arch.vgic.gicv5_vm.nr_vpes = nr_vcpus;
Why do we need to track the number of vcpus separately from what KVM
already does? GICv4 does it because a lot of the state is managed by
the irqchip driver, but that's not the case here. I hope we can come
up with a slightly simpler model with GICv5.
> +
> /* We only allow userspace to drive the SW_PPI, if it is implemented. */
> bitmap_zero(kvm->arch.vgic.gicv5_vm.userspace_ppis,
> VGIC_V5_NR_PRIVATE_IRQS);
> @@ -754,20 +786,98 @@ int vgic_v5_init(struct kvm *kvm)
> kvm->arch.vgic.gicv5_vm.userspace_ppis,
> ppi_caps.impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS);
>
> - return 0;
> + ret = vgic_v5_allocate_vm_id(kvm);
> + if (ret) {
> + kvm_err("Maximum number of GICv5 VMs reached!\n");
> + return ret;
> + }
I'd rather we don't scream on the console when running out of
VMIDs. If we're at capacity, so be it. That's not an error worth
spamming the console over.
> +
> + ret = vgic_v5_create_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
> + if (ret)
> + return ret;
Who is freeing the VMID?
> +
> + /*
> + * Allocate VPE doorbells first - these are our conduit for
> + * communicating with the host irqchip driver.
> + */
> + db_virq = irq_domain_alloc_irqs(kvm->arch.vgic.gicv5_vm.domain,
> + nr_vcpus, NUMA_NO_NODE,
> + &kvm->arch.vgic.gicv5_vm);
> + if (db_virq < 0) {
> + /* Simplify teardown by doing this early! */
> + vgic_v5_teardown_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
> + return db_virq;
> + }
> +
> + kvm->arch.vgic.gicv5_vm.vpe_db_base = db_virq;
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + d = irq_domain_get_irq_data(kvm->arch.vgic.gicv5_vm.domain,
> + db_virq + i);
> + irq_set_status_flags(db_virq + i, IRQ_NOAUTOEN);
> +
> + ret = request_irq(db_virq + i, db_handler, 0, "vcpu", vcpu);
> + if (ret)
> + return ret;
> +
> + /* Stash it with the VCPU for easy retrieval */
> + vcpu->arch.vgic_cpu.vgic_v5.gicv5_vpe.db = db_virq + i;
> + }
> +
> + /* Populate VMTE (with VPET and VM descriptor) */
> + ret = vgic_v5_vmte_init(kvm);
> + if (ret)
> + return ret;
> +
> + /* We pick the first vcpu to make the VMTE valid - any would do */
> + vcpu0 = kvm_get_vcpu(kvm, 0);
> + ret = vgic_v5_send_command(vcpu0, VMTE_MAKE_VALID);
> + if (ret)
> + return ret;
> +
> + /* Loop over all VPEs, allocate/populate their data structures */
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + ret = vgic_v5_vmte_alloc_vpe(vcpu);
> + if (ret)
> + return ret;
> + }
> +
> + return ret;
I'm very worried about the error handling of that function. Who is
responsible for cleaning up the mess when this fails?
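The concern above is the classic goto-unwind discipline: every init step
that succeeded needs a matching teardown, run in reverse order, when a
later step fails. A self-contained sketch of that pattern follows; the
step names are illustrative placeholders with a test hook, not the real
KVM functions.

```c
#include <assert.h>

static int vm_id_allocated, domain_created, doorbells_allocated;
static int fail_at;	/* test hook: which step (1..3) should fail; 0 = none */

static int allocate_vm_id(void)
{
	if (fail_at == 1)
		return -1;
	vm_id_allocated = 1;
	return 0;
}

static int create_domain(void)
{
	if (fail_at == 2)
		return -1;
	domain_created = 1;
	return 0;
}

static int allocate_doorbells(void)
{
	if (fail_at == 3)
		return -1;
	doorbells_allocated = 1;
	return 0;
}

static void teardown_domain(void) { domain_created = 0; }
static void release_vm_id(void) { vm_id_allocated = 0; }

static int vm_init(void)
{
	int ret;

	ret = allocate_vm_id();
	if (ret)
		return ret;

	ret = create_domain();
	if (ret)
		goto out_release_id;

	ret = allocate_doorbells();
	if (ret)
		goto out_teardown_domain;

	return 0;

	/* Unwind in strict reverse order of the successful steps */
out_teardown_domain:
	teardown_domain();
out_release_id:
	release_vm_id();
	return ret;
}
```

With this structure, a failure at any step leaves nothing (such as the
VM ID) leaked, and the error-handling ownership is visible in the
function itself rather than deferred to a separate teardown path.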
> }
>
> void vgic_v5_teardown(struct kvm *kvm)
> {
> - vgic_v5_teardown_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
> -}
> + struct kvm_vcpu *vcpu, *vcpu0;
> + struct vgic_dist *dist = &kvm->arch.vgic;
> + unsigned long i;
> + int rc;
>
> -int vgic_v5_map_resources(struct kvm *kvm)
> -{
> - if (!vgic_initialized(kvm))
> - return -EBUSY;
> + /*
> + * If the VM's ID isn't valid, then we failed init very early. Nothing
> + * to do here.
> + */
> + if (!kvm->arch.vgic.gicv5_vm.vm_id_valid)
> + return;
>
> - return 0;
> + if (kvm->arch.vgic.gicv5_vm.vmte_allocated) {
> + /* Make the VM invalid */
> + vcpu0 = kvm_get_vcpu(kvm, 0);
> + rc = vgic_v5_send_command(vcpu0, VMTE_MAKE_INVALID);
> + if (rc)
> + kvm_err("could not make VMTE invalid\n");
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (vgic_v5_vmte_free_vpe(vcpu))
> + kvm_err("Failed to free VPE\n");
> + }
> +
> + if (vgic_v5_vmte_release(kvm))
> + kvm_err("Failed to release VM 0x%x\n", dist->gicv5_vm.vm_id);
> + }
> +
> + vgic_v5_teardown_per_vm_domain(&kvm->arch.vgic.gicv5_vm);
> +
> + vgic_v5_release_vm_id(kvm);
> }
>
> int vgic_v5_finalize_ppi_state(struct kvm *kvm)
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH 03/43] KVM: arm64: gic-v5: Add resident/non-resident hyp calls
2026-04-28 14:28 ` Marc Zyngier
@ 2026-05-01 16:40 ` Sascha Bischoff
0 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-05-01 16:40 UTC (permalink / raw)
To: maz@kernel.org
Cc: yuzenghui@huawei.com, Timothy Hayes, Suzuki Poulose, nd,
peter.maydell@linaro.org, kvmarm@lists.linux.dev,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
Joey Gouly, lpieralisi@kernel.org, oliver.upton@linux.dev
On Tue, 2026-04-28 at 15:28 +0100, Marc Zyngier wrote:
> On Mon, 27 Apr 2026 17:07:03 +0100,
> Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
> >
> > So far the KVM GICv5 support has been limited to PPIs. These only
> > go
> > as far out as the CPU interface, and have no interaction with the
> > host's IRS. Therefore, PPIs can be directly used for guests without
> > host IRS involvement. However, in order to support both SPIs and
> > LPIs
> > IRS involvement is required.
>
> Please keep this in the cover letter. It isn't required to understand
> this patch.
Done.
>
> >
> > GICv5 introduces the concept of VPE residency - a VPE can be either
> > resident or non-resident. When the VPE is resident, the IRS is
> > allowed
> > to select interrupts that target that VPE (or the VM) as the HPPI
> > (Highest Priority Pending Interrupt). As the IRS handles both SPIs
> > and
> > LPIs, these will only be picked as the IRS's HPPI when a VPE is
> > resident.
> >
> > A GICv5 VPE is made resident by writing to the ICH_CONTEXTR_EL2
> > with a
> > valid VM and VPE ID, and marking it valid in the process. This
> > informs
> > the IRS that a specific VPE is running, and that it can begin HPPI
> > selection for that VPE. Making a VPE non-resident (by making the
> > ICH_CONTEXTR_EL2 invalid) informs the IRS that the VPE is no longer
> > running, and it stops HPPI selection for it.
> >
> > This change introduces two new hyp calls - one to make a VPE
> > resident
> > and its counterpart to make a VPE non-resident. As part of making a
> > VPE resident, the ICH_CONTEXTR_EL2.F bit is checked in order to
> > catch
> > faults, at which point the kernel will WARN. If everything is
> > configured correctly, this should not happen.
> >
> > Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> > ---
> > arch/arm64/include/asm/kvm_asm.h | 2 ++
> > arch/arm64/include/asm/kvm_hyp.h | 2 ++
> > arch/arm64/kvm/hyp/nvhe/hyp-main.c | 15 +++++++++++++++
> > arch/arm64/kvm/hyp/vgic-v5-sr.c | 25 +++++++++++++++++++++++++
> > include/kvm/arm_vgic.h | 3 +++
> > 5 files changed, 47 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_asm.h
> > b/arch/arm64/include/asm/kvm_asm.h
> > index fa033be6141ad..8c69f1f4de534 100644
> > --- a/arch/arm64/include/asm/kvm_asm.h
> > +++ b/arch/arm64/include/asm/kvm_asm.h
> > @@ -79,6 +79,8 @@ enum __kvm_host_smccc_func {
> > __KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff,
> > __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs,
> > __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_vmcr_aprs,
> > + __KVM_HOST_SMCCC_FUNC___vgic_v5_make_resident,
> > + __KVM_HOST_SMCCC_FUNC___vgic_v5_make_non_resident,
> > __KVM_HOST_SMCCC_FUNC___vgic_v5_save_apr,
> > __KVM_HOST_SMCCC_FUNC___vgic_v5_restore_vmcr_apr,
> >
> > diff --git a/arch/arm64/include/asm/kvm_hyp.h
> > b/arch/arm64/include/asm/kvm_hyp.h
> > index 8d06b62e7188c..5f9184276b04e 100644
> > --- a/arch/arm64/include/asm/kvm_hyp.h
> > +++ b/arch/arm64/include/asm/kvm_hyp.h
> > @@ -88,6 +88,8 @@ void __vgic_v3_restore_vmcr_aprs(struct
> > vgic_v3_cpu_if *cpu_if);
> > int __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu);
> >
> > /* GICv5 */
> > +void __vgic_v5_make_resident(struct vgic_v5_cpu_if *cpu_if);
> > +void __vgic_v5_make_non_resident(struct vgic_v5_cpu_if *cpu_if);
> > void __vgic_v5_save_apr(struct vgic_v5_cpu_if *cpu_if);
> > void __vgic_v5_restore_vmcr_apr(struct vgic_v5_cpu_if *cpu_if);
> > /* No hypercalls for the following */
> > diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > index 9e44c05cf780e..804a9ffdc8594 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > @@ -672,6 +672,19 @@ static void
> > handle___tracing_write_event(struct kvm_cpu_context *host_ctxt)
> > trace_selftest(id);
> > }
> >
> > +static void handle___vgic_v5_make_resident(struct kvm_cpu_context
> > *host_ctxt)
> > +{
> > + DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt,
> > 1);
> > +
> > + __vgic_v5_make_resident(kern_hyp_va(cpu_if));
> > +}
> > +static void handle___vgic_v5_make_non_resident(struct
> > kvm_cpu_context *host_ctxt)
> > +{
> > + DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt,
> > 1);
> > +
> > + __vgic_v5_make_non_resident(kern_hyp_va(cpu_if));
> > +}
> > +
> > static void handle___vgic_v5_save_apr(struct kvm_cpu_context
> > *host_ctxt)
> > {
> > DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt,
> > 1);
> > @@ -711,6 +724,8 @@ static const hcall_t host_hcall[] = {
> > HANDLE_FUNC(__kvm_timer_set_cntvoff),
> > HANDLE_FUNC(__vgic_v3_save_aprs),
> > HANDLE_FUNC(__vgic_v3_restore_vmcr_aprs),
> > + HANDLE_FUNC(__vgic_v5_make_resident),
> > + HANDLE_FUNC(__vgic_v5_make_non_resident),
> > HANDLE_FUNC(__vgic_v5_save_apr),
> > HANDLE_FUNC(__vgic_v5_restore_vmcr_apr),
> >
> > diff --git a/arch/arm64/kvm/hyp/vgic-v5-sr.c
> > b/arch/arm64/kvm/hyp/vgic-v5-sr.c
> > index 6d69dfe89a96c..04c5846b9abac 100644
> > --- a/arch/arm64/kvm/hyp/vgic-v5-sr.c
> > +++ b/arch/arm64/kvm/hyp/vgic-v5-sr.c
> > @@ -7,6 +7,31 @@
> >
> > #include <asm/kvm_hyp.h>
> >
> > +void __vgic_v5_make_resident(struct vgic_v5_cpu_if *cpu_if)
> > +{
> > + write_sysreg_s(cpu_if->vgic_contextr,
> > SYS_ICH_CONTEXTR_EL2);
> > + isb();
> > +
> > + /* Catch any faults */
> > + cpu_if->vgic_contextr =
> > read_sysreg_s(SYS_ICH_CONTEXTR_EL2);
> > + if (WARN_ON(FIELD_GET(ICH_CONTEXTR_EL2_F, cpu_if-
> > >vgic_contextr)))
> > + return;
>
> I don't think this is particularly useful here. With non-VHE, this
> results in a panic. If there is a problem, you need to fail the vcpu
> entry early, and let the caller find out.
OK, noted. I've dropped the WARN_ON() here, have reworked this to catch
the error in the calling function (vgic_v5_load()). Given that we
really are quite dead if we can't make the VPE resident, I catch the
error and mark the VM as dead.
>
> > +
> > + cpu_if->gicv5_vpe.resident = true;
> > +}
> > +
> > +void __vgic_v5_make_non_resident(struct vgic_v5_cpu_if *cpu_if)
> > +{
> > + /*
> > + * Make as non-resident before actually making non-
> > resident. Avoids race
> > + * with doorbell arriving.
> > + */
> > + cpu_if->gicv5_vpe.resident = false;
>
> Stores and sysreg accesses are not ordered without a DSB ST in
> between.
Argh, yes. Have added that in.
>
> > +
> > + write_sysreg_s(cpu_if->vgic_contextr,
> > SYS_ICH_CONTEXTR_EL2);
>
> Is this really making anything non-resident? Who clears the V bit?
A vcpu is made resident as part of vgic_v5_load() and non-resident as
part of vgic_v5_put(). In put, the new value of the contextr is
computed as it isn't just a case of zeroing the valid bit. We need to
decide whether we need to request a doorbell or not, and what the priority
mask for that is. Therefore, as part of putting (and hence making the
vCPU non-resident) we recalculate a new contextr with V=0, and DB/DBPM
set appropriately.
The commit that plumbs these into load/put is "KVM: arm64: gic-v5:
Make VPEs (non-)resident in vgic_load/put".
Thanks,
Sascha
>
> > + isb();
> > +}
> > +
> > void __vgic_v5_save_apr(struct vgic_v5_cpu_if *cpu_if)
> > {
> > cpu_if->vgic_apr = read_sysreg_s(SYS_ICH_APR_EL2);
> > diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> > index fe49fb56dc3c9..d14cf4771d606 100644
> > --- a/include/kvm/arm_vgic.h
> > +++ b/include/kvm/arm_vgic.h
> > @@ -495,6 +495,9 @@ struct vgic_v5_cpu_if {
> > */
> > u64 vgic_icsr;
> >
> > + /* The contextr used to make VPEs resident and non-
> > resident */
> > + u64 vgic_contextr;
> > +
> > struct gicv5_vpe gicv5_vpe;
> > };
> >
>
> Thanks,
>
> M.
>
* Re: [PATCH 07/43] KVM: arm64: gic-v5: Create & manage VM and VPE tables
2026-04-28 14:54 ` Vladimir Murzin
@ 2026-05-01 16:42 ` Sascha Bischoff
0 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-05-01 16:42 UTC (permalink / raw)
To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, Vladimir Murzin
Cc: yuzenghui@huawei.com, Timothy Hayes, Suzuki Poulose,
oliver.upton@linux.dev, peter.maydell@linaro.org, nd,
maz@kernel.org, Joey Gouly, lpieralisi@kernel.org
On Tue, 2026-04-28 at 15:54 +0100, Vladimir Murzin wrote:
> Hi Sascha,
>
> On 4/27/26 17:08, Sascha Bischoff wrote:
> > +
> > + if (virt_to_phys(l2_table) & ~GICV5_VMTEL1E_L2_ADDR) {
> > + kfree(l2_table);
> > + return -EINVAL;
> > + }
> > +
> > + vmt_info->l2.l2ptrs[l1_index] = l2_table;
> > +
> > + /* Alignment issue! */
> > + if (virt_to_phys(l2_table) & ~GICV5_VMTEL1E_L2_ADDR) {
> > + kfree(l2_table);
> > + return -EFAULT;
> > + }
> > +
> > + tmp = virt_to_phys(l2_table) & GICV5_VMTEL1E_L2_ADDR;
> > + WRITE_ONCE(vmt_info->l2.vmt_base[l1_index],
> > cpu_to_le64(tmp));
>
> It seems the same l2_table alignment is calculated (and checked)
> multiple times...
Hi Vladimir,
Yeah, you are correct. I've dropped that. Given the comments on this
change in general, I'm going through and dropping much of the paranoia.
We should trust the allocator to give us sane alignments, so there's no
point in checking, masking, then writing. Same goes for the CMO side of
this - will go and re-work that to be more sane.
Thanks,
Sascha
>
> Cheers
> Vladimir
* Re: [PATCH 05/43] KVM: arm64: gic-v5: Extract host IRS caps from IRS config frame
2026-04-28 15:20 ` Marc Zyngier
@ 2026-05-01 16:44 ` Sascha Bischoff
0 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-05-01 16:44 UTC (permalink / raw)
To: maz@kernel.org
Cc: yuzenghui@huawei.com, Timothy Hayes, Suzuki Poulose, nd,
peter.maydell@linaro.org, kvmarm@lists.linux.dev,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
Joey Gouly, lpieralisi@kernel.org, oliver.upton@linux.dev
On Tue, 2026-04-28 at 16:20 +0100, Marc Zyngier wrote:
> On Mon, 27 Apr 2026 17:07:44 +0100,
> Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
> >
> > The host irqchip driver provides KVM with a pointer to an IRS's
> > config
> > frame, which allows KVM to directly interact with the host's IRS.
> > The
> > MMIO registers in the config frame are used to configure VMs (in
> > addition to them being used by the host). The IRS's config frame
> > also
> > includes a set of ID registers which describe the capabilities that
> > the IRS has.
> >
> > Stash the pointer to the config frame, and extract the VM
> > capabilities
> > (from IRS_IDR3 & IRS_IDR4), as well as the IST
> > capabilities/requirements (IRS_IDR2) from the IRS.
> >
> > Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> > ---
> > arch/arm64/kvm/Makefile | 2 +-
> > arch/arm64/kvm/vgic/vgic-v5-tables.c | 8 +++++
> > arch/arm64/kvm/vgic/vgic-v5-tables.h | 41 ++++++++++++++++++++++
> > arch/arm64/kvm/vgic/vgic-v5.c | 52
> > ++++++++++++++++++++++++++++
> > include/linux/irqchip/arm-gic-v5.h | 10 ++++++
> > 5 files changed, 112 insertions(+), 1 deletion(-)
> > create mode 100644 arch/arm64/kvm/vgic/vgic-v5-tables.c
> > create mode 100644 arch/arm64/kvm/vgic/vgic-v5-tables.h
> >
> > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> > index 59612d2f277c1..431de9b145ca1 100644
> > --- a/arch/arm64/kvm/Makefile
> > +++ b/arch/arm64/kvm/Makefile
> > @@ -24,7 +24,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o
> > pvtime.o \
> > vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> > vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> > vgic/vgic-its.o vgic/vgic-debug.o vgic/vgic-v3-nested.o \
> > - vgic/vgic-v5.o
> > + vgic/vgic-v5.o vgic/vgic-v5-tables.o
> >
> > kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
> > kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
> > diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.c
> > b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> > new file mode 100644
> > index 0000000000000..30e2b108b1aa3
> > --- /dev/null
> > +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.c
> > @@ -0,0 +1,8 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2025, 2026 Arm Ltd.
> > + */
> > +
> > +#include "vgic-v5-tables.h"
> > +
> > +struct vgic_v5_host_ist_caps gicv5_host_ist_caps;
> > diff --git a/arch/arm64/kvm/vgic/vgic-v5-tables.h
> > b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> > new file mode 100644
> > index 0000000000000..cf00a248eabd5
> > --- /dev/null
> > +++ b/arch/arm64/kvm/vgic/vgic-v5-tables.h
> > @@ -0,0 +1,41 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (C) 2025, 2026 Arm Ltd.
> > + */
> > +
> > +#ifndef __KVM_ARM_VGICV5_TABLES_H__
> > +#define __KVM_ARM_VGICV5_TABLES_H__
> > +
> > +#include <linux/irqchip/arm-gic-v5.h>
> > +
> > +struct vgic_v5_host_ist_caps {
> > + /* IST Capabilities */
> > +
> > + /* Apply to LPIs and SPIs */
> > + u8 ist_id_bits;
> > + bool ist_levels;
> > + u8 ist_l2sz;
> > + bool istmd;
> > + u8 istmd_sz;
> > +
> > + /* LPI only */
> > + u8 min_lpi_id_bits;
> > +
> > + /* VM Table, VPE Table */
> > + bool two_level_vmt_support;
> > + u32 max_vms;
> > + u32 max_vpes;
> > + u16 vmd_size;
> > + u16 vped_size;
> > +
> > + /* Is the IRS coherent with us, or not? */
> > + bool irs_non_coherent;
> > +};
> > +
> > +extern struct vgic_v5_host_ist_caps gicv5_host_ist_caps;
> > +static inline struct vgic_v5_host_ist_caps
> > *vgic_v5_host_caps(void)
> > +{
> > + return &gicv5_host_ist_caps;
> > +}
>
> Err. No. Make gicv5_host_ist_caps static, and move the helper as
> non-inline in vgic-v5-tables.c. It's not like this is anywhere near
> performance-critical stuff, is it?
>
> But also, if that's global information, we have
> kvm_vgic_global_state.
> Isn't that where these things should live? Then the introduction of
> vgic-v5-tables.[ch] can be moved to the point where it actually
> matters.
I've gone with this latter approach. vgic-v5-tables.[ch] are now
introduced at a later stage in the series. All of these caps are now
part of the global state. Much cleaner for sure!
>
> > +
> > +#endif
> > diff --git a/arch/arm64/kvm/vgic/vgic-v5.c
> > b/arch/arm64/kvm/vgic/vgic-v5.c
> > index d4789ff3e7402..fd3d6299a2baa 100644
> > --- a/arch/arm64/kvm/vgic/vgic-v5.c
> > +++ b/arch/arm64/kvm/vgic/vgic-v5.c
> > @@ -9,6 +9,7 @@
> > #include <linux/irqchip/arm-vgic-info.h>
> >
> > #include "vgic.h"
> > +#include "vgic-v5-tables.h"
> >
> > #define ppi_caps kvm_vgic_global_state.vgic_v5_ppi_caps
> >
> > @@ -34,6 +35,54 @@ static void vgic_v5_get_implemented_ppis(void)
> > __assign_bit(GICV5_ARCH_PPI_PMUIRQ,
> > ppi_caps.impl_ppi_mask, system_supports_pmuv3());
> > }
> >
> > +static void __iomem *irs_base;
>
> Global state?
Yup, done.
>
> > +
> > +static u32 irs_readl_relaxed(const u32 reg_offset)
> > +{
> > + return readl_relaxed(irs_base + reg_offset);
> > +}
> > +
> > +static int gicv5_irs_extract_vm_caps(const struct gic_kvm_info
> > *info)
> > +{
> > + u64 idr;
> > +
> > + irs_base = info->gicv5_irs.base;
> > + if (!irs_base) {
> > + kvm_info("No GICv5 MMIO IRS address; no GICv5
> > support\n");
> > + return -ENODEV;
> > + }
>
> Should you instead bail out early by not registering the gic_kvm_info
> when the IRS base is unknown, making this sort of checks irrelevant?
>
> Also, it's not like we can make it very far without an IRS...
In fact, we won't even make it this far (or even as far as registering
gic_kvm_info) if we fail to init the host's IRS. Have dropped this.
>
> > +
> > + vgic_v5_host_caps()->irs_non_coherent = info-
> > >gicv5_irs.non_coherent;
> > +
> > + idr = irs_readl_relaxed(GICV5_IRS_IDR2);
> > +
> > + /* We skip the LPI field as it only applies to physical
> > LPIs */
> > + vgic_v5_host_caps()->ist_id_bits =
> > FIELD_GET(GICV5_IRS_IDR2_ID_BITS, idr);
> > + vgic_v5_host_caps()->min_lpi_id_bits =
> > FIELD_GET(GICV5_IRS_IDR2_MIN_LPI_ID_BITS, idr);
> > + vgic_v5_host_caps()->ist_levels =
> > !!FIELD_GET(GICV5_IRS_IDR2_IST_LEVELS, idr);
> > + vgic_v5_host_caps()->ist_l2sz =
> > FIELD_GET(GICV5_IRS_IDR2_IST_L2SZ, idr);
> > + vgic_v5_host_caps()->istmd =
> > !!FIELD_GET(GICV5_IRS_IDR2_ISTMD, idr);
> > + vgic_v5_host_caps()->istmd_sz =
> > FIELD_GET(GICV5_IRS_IDR2_ISTMD_SZ, idr);
> > +
> > + idr = irs_readl_relaxed(GICV5_IRS_IDR3);
> > +
> > + vgic_v5_host_caps()->max_vms = BIT(FIELD_GET(GICV5_IRS_IDR3_VM_ID_BITS, idr));
> > + vgic_v5_host_caps()->two_level_vmt_support = !!FIELD_GET(GICV5_IRS_IDR3_VMT_LEVELS, idr);
> > +
> > + if (FIELD_GET(GICV5_IRS_IDR3_VMD, idr))
>
> The constant (ab)use of FIELD_GET() for fields that are single bit
> wide is very hard to read. I'd like to see:
>
> vgic_v5_host_caps()->ist_levels = (idr & GICV5_IRS_IDR2_IST_LEVELS);
> [...]
> vgic_v5_host_caps()->istmd = (idr & GICV5_IRS_IDR2_ISTMD);
> [...]
> if (idr & GICV5_IRS_IDR3_VMD)
> [...]
>
> which is infinitely more readable.
Done.
>
> > + vgic_v5_host_caps()->vmd_size = BIT(FIELD_GET(GICV5_IRS_IDR3_VMD_SZ, idr));
> > + else
> > + vgic_v5_host_caps()->vmd_size = 0;
> > +
> > + idr = irs_readl_relaxed(GICV5_IRS_IDR4);
> > +
> > + vgic_v5_host_caps()->vped_size = BIT(FIELD_GET(GICV5_IRS_IDR4_VPED_SZ, idr));
> > + /* Field stores VPE_ID_BITS - 1 */
> > + vgic_v5_host_caps()->max_vpes = BIT(FIELD_GET(GICV5_IRS_IDR4_VPE_ID_BITS, idr) + 1);
> > +
> > + return 0;
> > +}
> > +
> > /*
> > * Probe for a vGICv5 compatible interrupt controller, returning 0 on success.
> > */
> > @@ -61,6 +110,9 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
> > goto skip_v5;
> > }
> >
> > + if (gicv5_irs_extract_vm_caps(info))
> > + goto skip_v5;
> > +
>
> We shouldn't "skip_v5" anymore. If we can't initialise KVM with
> GICv5, we're done, and we should not even try to register v3.
Agreed, with the exception of pKVM. In that case, we only support
vGICv3 on (compatible) GICv5 hardware. In all other cases, it makes
sense to fail outright if we can't correctly init vGICv5.
I'll update it to error out in all cases aside from that one.
>
> Thanks,
>
> M.
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 04/43] irqchip/gic-v5: Provide IRS config frame attrs to KVM
2026-04-28 14:56 ` Marc Zyngier
@ 2026-05-01 16:46 ` Sascha Bischoff
0 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-05-01 16:46 UTC (permalink / raw)
To: maz@kernel.org
Cc: yuzenghui@huawei.com, Timothy Hayes, Suzuki Poulose, nd,
peter.maydell@linaro.org, kvmarm@lists.linux.dev,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
Joey Gouly, lpieralisi@kernel.org, oliver.upton@linux.dev
On Tue, 2026-04-28 at 15:56 +0100, Marc Zyngier wrote:
> On Mon, 27 Apr 2026 17:07:24 +0100,
> Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
> >
> > KVM needs to interact with the host IRS in order to, for example,
> > make VMs or VPEs valid. There are two potential approaches here.
> > Either the host irqchip driver can provide an interface, or KVM can
> > interact directly with the host IRS. The latter of these two is
> > chosen as the set of MMIO registers that KVM needs to interact with
> > is orthogonal to the set used by the host irqchip driver (with the
> > exception of some of the read-only IRS_IDRx registers).
> >
> > Pass KVM a pointer to an IRS config frame - the config frame
> > belonging to ANY IRS is fine as long as one IRS's config frame is used
> > consistently - in struct gic_kvm_info. Additionally, include a flag
> > telling KVM whether the IRS is coherent or non-coherent in order to
> > make sure that KVM can do the correct cache state management, if
> > required.
> >
> > Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> > ---
> > drivers/irqchip/irq-gic-v5-irs.c | 26 ++++++++++++++++++++++++++
> > drivers/irqchip/irq-gic-v5.c | 3 +++
> > include/linux/irqchip/arm-gic-v5.h | 2 ++
> > include/linux/irqchip/arm-vgic-info.h | 5 +++++
> > 4 files changed, 36 insertions(+)
> >
> > diff --git a/drivers/irqchip/irq-gic-v5-irs.c b/drivers/irqchip/irq-gic-v5-irs.c
> > index f3fce0b1e25d9..5dfa043cf9e34 100644
> > --- a/drivers/irqchip/irq-gic-v5-irs.c
> > +++ b/drivers/irqchip/irq-gic-v5-irs.c
> > @@ -50,6 +50,32 @@ static void irs_writeq_relaxed(struct gicv5_irs_chip_data *irs_data,
> > writeq_relaxed(val, irs_data->irs_base + reg_offset);
> > }
> >
> > +void __iomem *gicv5_irs_get_config_frame_base(void)
> > +{
> > + struct gicv5_irs_chip_data *irs_data = per_cpu(per_cpu_irs_data,
> > + smp_processor_id());
> > +
> > + if (!irs_data)
> > + return NULL;
> > +
> > + return irs_data->irs_base;
> > +}
> > +
> > +bool gicv5_irs_is_non_coherent(void)
> > +{
> > + struct gicv5_irs_chip_data *irs_data = per_cpu(per_cpu_irs_data,
> > + smp_processor_id());
> > +
> > + if (!irs_data) {
> > + pr_err("Failed to look up IRS for CPU %d\n", smp_processor_id());
> > + return false;
> > + }
> > +
> > + return !!(irs_data->flags & IRS_FLAGS_NON_COHERENT);
> > +}
> > +
>
> Frankly, we don't need these micro-helpers. Just add *one* that
> returns a pointer to the IRS data for CPU0, and do the information
> cherry-picking in the caller.
I've dropped these and have done what you suggested.
>
> > +
> > /*
> > * The polling wait (in gicv5_wait_for_op_s_atomic()) on a GIC
> > register
> > * provides the memory barriers (through MMIO accessors)
> > diff --git a/drivers/irqchip/irq-gic-v5.c b/drivers/irqchip/irq-gic-v5.c
> > index 58e457d4c1476..3329019722360 100644
> > --- a/drivers/irqchip/irq-gic-v5.c
> > +++ b/drivers/irqchip/irq-gic-v5.c
> > @@ -1134,6 +1134,9 @@ static void __init gic_of_setup_kvm_info(struct device_node *node)
> >
> > gic_v5_kvm_info.type = GIC_V5;
> >
> > + gic_v5_kvm_info.gicv5_irs.base = gicv5_irs_get_config_frame_base();
> > + gic_v5_kvm_info.gicv5_irs.non_coherent = gicv5_irs_is_non_coherent();
> > +
> > /* GIC Virtual CPU interface maintenance interrupt */
> > gic_v5_kvm_info.no_maint_irq_mask = false;
> > gic_v5_kvm_info.maint_irq = irq_of_parse_and_map(node, 0);
>
> While you're at it, you may want to fix the check on the maintenance
> interrupt, which gives up registering with KVM if no MI is found,
> even in the absence of FEAT_GCIE_LEGACY.
I've done the re-work here. GICv5 systems without FEAT_GCIE_LEGACY can
now omit the MI. If we have the feature, then the MI remains mandatory,
and we bail out at probe-time in kvm_vgic_hyp_init(). Will include this
in my next version of the series.
>
> Thanks,
>
> M.
>
* Re: [PATCH 06/43] KVM: arm64: gic-v5: Add VPE doorbell domain
2026-04-28 16:40 ` Marc Zyngier
@ 2026-05-01 16:54 ` Sascha Bischoff
0 siblings, 0 replies; 66+ messages in thread
From: Sascha Bischoff @ 2026-05-01 16:54 UTC (permalink / raw)
To: maz@kernel.org
Cc: yuzenghui@huawei.com, Timothy Hayes, Suzuki Poulose, nd,
peter.maydell@linaro.org, kvmarm@lists.linux.dev,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
Joey Gouly, lpieralisi@kernel.org, oliver.upton@linux.dev
On Tue, 2026-04-28 at 17:40 +0100, Marc Zyngier wrote:
> On Mon, 27 Apr 2026 17:08:05 +0100,
> Sascha Bischoff <Sascha.Bischoff@arm.com> wrote:
> >
> > GICv5 supports two types of doorbell - VPE doorbells and VM
> > doorbells. In KVM we only support Targeted interrupts, and do not
> > support 1ofN target selection. This means that we only implement
> > VPE doorbells. These doorbells are implemented as host LPIs which
> > are generated when a non-resident VPE has a pending interrupt of
> > generated when a non-resident VPE has a pending interrupt of
> > sufficient priority and the doorbell has been requested as part of
> > making the VPE non-resident.
>
> This is mostly a repeat of the architecture spec. I don't think we
> need to paraphrase it.
Dropped.
>
> >
> > VPE doorbells allow KVM to wake VPEs (so, vcpus) as soon as the
> > hardware determines that sufficient conditions for the interrupt
> > to be signalled have been met. This simplifies the wake-up path for vcpus
> > with GICv5 for LPIs and SPIs. NOTE: PPI pending state must still be
> > checked explicitly as the IRS never sees them.
>
> Drop the note, it serves no purpose here.
Done.
>
> >
> > This change introduces support for the vgic_v5 doorbell domain. One
> > doorbell domain is created per GICv5 VM, and all VPEs have their
> > own doorbell within this domain. When the doorbell fires, this is
> > tracked (in gicv5_vpe.db_fired) and the corresponding vcpu is kicked.
> >
> > Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
> > ---
> > arch/arm64/kvm/vgic/vgic-init.c | 5 +-
> > arch/arm64/kvm/vgic/vgic-v5.c | 143 +++++++++++++++++++++++++++++
> > arch/arm64/kvm/vgic/vgic.h | 1 +
> > include/kvm/arm_vgic.h | 6 ++
> > include/linux/irqchip/arm-gic-v5.h | 2 +
> > 5 files changed, 156 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
> > index 907057881b26a..984908a271c8d 100644
> > --- a/arch/arm64/kvm/vgic/vgic-init.c
> > +++ b/arch/arm64/kvm/vgic/vgic-init.c
> > @@ -500,8 +500,11 @@ static void kvm_vgic_dist_destroy(struct kvm *kvm)
> > dist->vgic_cpu_base = VGIC_ADDR_UNDEF;
> > }
> >
> > - if (vgic_supports_direct_irqs(kvm))
> > + if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3 &&
> > + vgic_supports_direct_irqs(kvm))
> > vgic_v4_teardown(kvm);
> > + else if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V5)
> > + vgic_v5_teardown(kvm);
>
> nit: switch/case instead?
Ack. Done.
>
> >
> > xa_destroy(&dist->lpi_xa);
> > }
> > diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
> > index fd3d6299a2baa..4e0d52b309628 100644
> > --- a/arch/arm64/kvm/vgic/vgic-v5.c
> > +++ b/arch/arm64/kvm/vgic/vgic-v5.c
> > @@ -7,6 +7,7 @@
> >
> > #include <linux/bitops.h>
> > #include <linux/irqchip/arm-vgic-info.h>
> > +#include <linux/irqdomain.h>
> >
> > #include "vgic.h"
> > #include "vgic-v5-tables.h"
> > @@ -162,6 +163,138 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
> > return 0;
> > }
> >
> > +/*
> > + * This set of irq_chip functions is specific for doorbells.
> > + */
> > +static struct irq_chip vgic_v5_db_irq_chip = {
>
> const?
Oops, yes. Done!
>
> > + .name = "GICv5-DB",
> > + .irq_mask = irq_chip_mask_parent,
> > + .irq_unmask = irq_chip_unmask_parent,
> > + .irq_eoi = irq_chip_eoi_parent,
> > + .irq_set_affinity = irq_chip_set_affinity_parent,
> > + .irq_get_irqchip_state = irq_chip_get_parent_state,
> > + .irq_set_irqchip_state = irq_chip_set_parent_state,
> > + .flags = IRQCHIP_SET_TYPE_MASKED | IRQCHIP_SKIP_SET_WAKE |
> > + IRQCHIP_MASK_ON_SUSPEND,
> > +};
> > +
> > +static int vgic_v5_irq_db_domain_map(struct irq_domain *d, unsigned int virq,
> > + u16 vpe_id)
> > +{
> > + int ret;
> > + u32 lpi;
> > + irq_hw_number_t hwirq;
> > + struct irq_chip *chip = &vgic_v5_db_irq_chip;
> > + struct irq_data *irqd = irq_desc_get_irq_data(irq_to_desc(virq));
> > +
> > + /*
> > + * For the DB domain, we don't use the same hwirq as for LPIs.
> > + */
> > + hwirq = vpe_id;
> > +
> > + ret = gicv5_alloc_lpi();
>
> NAK. Allocating LPIs is the task of the underlying domain that
> manages LPIs, and absolutely not the vgic code.
Yeah, that's rather wonky! Alas that is (hopefully soon was - see
below) how the other domains using the GICv5 LPI domain work today.
>
> > + if (ret < 0)
> > + return ret;
> > + lpi = ret;
> > +
> > + ret = irq_domain_alloc_irqs_parent(d, virq, 1, &lpi);
>
> Why? I'd expect to see an irq_domain_alloc_irqs() for the whole VM,
> and be done with it.
>
> The whole allocation/freeing of LPIs is upside down. You really
> should not have to do this, and I'd strongly suggest you align the way the
> doorbell domain is constructed with the way GICv4 does it.
Following our offline discussion around the LPI wackiness, I've posted
a small series to fix the underlying cause of this mess - the fact that
the LPI allocation for GICv5 was not owned by the LPI domain, but
rather was managed by other domains in the hierarchy (IPI & ITS's MSI
domains).
This series is here:
https://lore.kernel.org/all/20260430153352.3654325-1-sascha.bischoff@arm.com/
Based on these changes, I've re-worked this to do what you suggested,
and now alloc the domain and all DB virqs in one go.
Thanks,
Sascha
>
> Thanks,
>
> M.
>
* Re: [PATCH 36/43] KVM: arm64: gic-v5: Implement save/restore mechanisms for ISTs
2026-04-27 16:18 ` [PATCH 36/43] KVM: arm64: gic-v5: Implement save/restore mechanisms for ISTs Sascha Bischoff
@ 2026-05-01 18:54 ` Vladimir Murzin
0 siblings, 0 replies; 66+ messages in thread
From: Vladimir Murzin @ 2026-05-01 18:54 UTC (permalink / raw)
To: Sascha Bischoff, linux-arm-kernel@lists.infradead.org,
kvmarm@lists.linux.dev, kvm@vger.kernel.org
Cc: nd, maz@kernel.org, oliver.upton@linux.dev, Joey Gouly,
Suzuki Poulose, yuzenghui@huawei.com, peter.maydell@linaro.org,
lpieralisi@kernel.org, Timothy Hayes
Hi Sascha,
On 4/27/26 17:18, Sascha Bischoff wrote:
> + tmp = le64_to_cpu(READ_ONCE(vmte->val[3]));
> + host_id_bits = FIELD_GET(GICV5_VMTEL2E_IST_ID_BITS, tmp);
> + host_istsz = FIELD_GET(GICV5_VMTEL2E_IST_ISTSZ, tmp);
> + host_l2sz = FIELD_GET(GICV5_VMTEL2E_IST_L2SZ, tmp);
> +
> + /* We always use a Linear SPI IST on the host */
> + for (int i = 0; i < BIT(host_id_bits); ++i) {
IIUC here we iterate over the host's SPIs and save them into a buffer provided
by the user. One thing I noticed is that if the user sets the number of SPIs via
KVM_DEV_ARM_VGIC_GRP_NR_IRQS below 16, say 1 (btw, it seems there is no
way to set SPIs to 0...), it is rounded up to 16 in
kvm_vgic_v5_irs_init(). So from the user's perspective it needs to save just a
single SPI, and thus it could provide a buffer for just that single entry. With such a
setup we can end up corrupting the user's memory. We can argue that it is the user's
problem... perhaps we need to spell out the requirements for the buffer explicitly,
provide an interface for the user to query the required size, respect the nr_spis set
by the user, or some combination of all :)
Cheers
Vladimir
Thread overview: 66+ messages
2026-04-27 16:06 [PATCH 00/43] KVM: arm64: Add GICv5 IRS support Sascha Bischoff
2026-04-27 16:06 ` [PATCH 01/43] arm64/sysreg: Add GICv5 GIC VDPEND and VDRCFG encodings Sascha Bischoff
2026-04-27 16:06 ` [PATCH 02/43] arm64/sysreg: Update ICC_CR0_EL1 with LINK and LINK_IDLE fields Sascha Bischoff
2026-04-27 16:07 ` [PATCH 03/43] KVM: arm64: gic-v5: Add resident/non-resident hyp calls Sascha Bischoff
2026-04-28 14:28 ` Marc Zyngier
2026-05-01 16:40 ` Sascha Bischoff
2026-04-27 16:07 ` [PATCH 04/43] irqchip/gic-v5: Provide IRS config frame attrs to KVM Sascha Bischoff
2026-04-28 14:56 ` Marc Zyngier
2026-05-01 16:46 ` Sascha Bischoff
2026-04-27 16:07 ` [PATCH 05/43] KVM: arm64: gic-v5: Extract host IRS caps from IRS config frame Sascha Bischoff
2026-04-28 15:20 ` Marc Zyngier
2026-05-01 16:44 ` Sascha Bischoff
2026-04-27 16:08 ` [PATCH 06/43] KVM: arm64: gic-v5: Add VPE doorbell domain Sascha Bischoff
2026-04-28 16:40 ` Marc Zyngier
2026-05-01 16:54 ` Sascha Bischoff
2026-04-27 16:08 ` [PATCH 07/43] KVM: arm64: gic-v5: Create & manage VM and VPE tables Sascha Bischoff
2026-04-28 14:54 ` Vladimir Murzin
2026-05-01 16:42 ` Sascha Bischoff
2026-04-28 15:55 ` Joey Gouly
2026-04-29 10:25 ` Marc Zyngier
2026-04-27 16:08 ` [PATCH 08/43] KVM: arm64: gic-v5: Introduce guest IST alloc and management Sascha Bischoff
2026-04-29 14:29 ` Marc Zyngier
2026-04-27 16:09 ` [PATCH 09/43] KVM: arm64: gic-v5: Implement VMT/vIST IRS MMIO Ops Sascha Bischoff
2026-04-29 12:50 ` Joey Gouly
2026-04-29 16:04 ` Marc Zyngier
2026-04-27 16:09 ` [PATCH 10/43] KVM: arm64: gic-v5: Implement VPE " Sascha Bischoff
2026-04-30 8:46 ` Marc Zyngier
2026-04-27 16:09 ` [PATCH 11/43] KVM: arm64: gic-v5: Make VPEs valid in vgic_v5_reset() Sascha Bischoff
2026-04-30 9:37 ` Marc Zyngier
2026-04-27 16:10 ` [PATCH 12/43] KVM: arm64: gic-v5: Clear db_fired flag before making VPE non-resident Sascha Bischoff
2026-04-27 16:10 ` [PATCH 13/43] KVM: arm64: gic-v5: Make VPEs (non-)resident in vgic_load/put Sascha Bischoff
2026-04-30 10:26 ` Marc Zyngier
2026-04-27 16:10 ` [PATCH 14/43] KVM: arm64: gic-v5: Request VPE doorbells when going non-resident Sascha Bischoff
2026-04-30 10:37 ` Marc Zyngier
2026-04-27 16:11 ` [PATCH 15/43] KVM: arm64: gic-v5: Handle doorbells in kvm_vgic_vcpu_pending_irq() Sascha Bischoff
2026-04-27 16:11 ` [PATCH 16/43] KVM: arm64: gic-v5: Initialise and teardown VMTEs & doorbells Sascha Bischoff
2026-04-30 12:23 ` Marc Zyngier
2026-04-27 16:11 ` [PATCH 17/43] KVM: arm64: gic-v5: Enable VPE DBs on VPE reset and disable on teardown Sascha Bischoff
2026-04-27 16:12 ` [PATCH 18/43] KVM: arm64: gic-v5: Define remaining IRS MMIO registers Sascha Bischoff
2026-04-27 16:12 ` [PATCH 19/43] KVM: arm64: gic-v5: Introduce struct vgic_v5_irs and IRS base address Sascha Bischoff
2026-04-27 16:12 ` [PATCH 20/43] KVM: arm64: gic-v5: Add IRS IODEV to iodev_types and generic MMIO handlers Sascha Bischoff
2026-04-27 16:13 ` [PATCH 21/43] KVM: arm64: gic-v5: Add KVM_VGIC_V5_ADDR_TYPE_IRS to UAPI Sascha Bischoff
2026-04-27 16:13 ` [PATCH 22/43] KVM: arm64: gic-v5: Add GICv5 IRS IODEV and MMIO emulation Sascha Bischoff
2026-04-27 16:13 ` [PATCH 23/43] KVM: arm64: gic-v5: Set IRICHPPIDIS based on IRS enable state Sascha Bischoff
2026-04-27 16:14 ` [PATCH 24/43] KVM: arm64: gic-v5: Call IRS init/teardown from vgic_v5 init/teardown Sascha Bischoff
2026-04-27 16:14 ` [PATCH 25/43] KVM: arm64: gic-v5: Register the IRS IODEV Sascha Bischoff
2026-04-27 16:14 ` [PATCH 26/43] Documentation: KVM: Extend VGICv5 docs for KVM_VGIC_V5_ADDR_TYPE_IRS Sascha Bischoff
2026-04-27 16:15 ` [PATCH 27/43] KVM: arm64: selftests: Update vGICv5 selftest to set IRS address Sascha Bischoff
2026-04-27 16:15 ` [PATCH 28/43] KVM: arm64: gic-v5: Introduce SPI AP list Sascha Bischoff
2026-04-27 16:15 ` [PATCH 29/43] KVM: arm64: gic-v5: Add GIC VDPEND and GIC VDRCFG hyp calls Sascha Bischoff
2026-04-27 16:16 ` [PATCH 30/43] KVM: arm64: gic-v5: Track SPI state for in-flight SPIs Sascha Bischoff
2026-04-27 16:16 ` [PATCH 31/43] KVM: arm64: gic: Introduce set_pending_state() to irq_op Sascha Bischoff
2026-04-27 16:16 ` [PATCH 32/43] KVM: arm64: gic-v5: Support SPI injection Sascha Bischoff
2026-04-27 16:17 ` [PATCH 33/43] KVM: arm64: gic-v5: Add GICv5 SPI injection to irqfd Sascha Bischoff
2026-04-27 16:17 ` [PATCH 34/43] KVM: arm64: gic-v5: Mask per-vcpu PPI state in vgic_v5_finalize_ppi_state() Sascha Bischoff
2026-04-27 16:17 ` [PATCH 35/43] KVM: arm64: gic-v5: Add GICv5 EL1 sysreg userspace set/get interface Sascha Bischoff
2026-04-27 16:18 ` [PATCH 36/43] KVM: arm64: gic-v5: Implement save/restore mechanisms for ISTs Sascha Bischoff
2026-05-01 18:54 ` Vladimir Murzin
2026-04-27 16:18 ` [PATCH 37/43] KVM: arm64: gic-v5: Handle userspace accesses to IRS MMIO region Sascha Bischoff
2026-04-27 16:19 ` [PATCH 38/43] KVM: arm64: gic-v5: Add VGIC_GRP_IRS_REGS/VGIC_GRP_IST to UAPI Sascha Bischoff
2026-04-27 16:19 ` [PATCH 39/43] KVM: arm64: gic-v5: Plumb in has/set/get_attr for sysregs & IRS MMIO regs Sascha Bischoff
2026-04-27 16:19 ` [PATCH 40/43] Documentation: KVM: Document KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS for VGICv5 Sascha Bischoff
2026-04-27 16:20 ` [PATCH 41/43] Documentation: KVM: Add KVM_DEV_ARM_VGIC_GRP_IRS_REGS to VGICv5 docs Sascha Bischoff
2026-04-27 16:20 ` [PATCH 42/43] Documentation: KVM: Add docs for KVM_DEV_ARM_VGIC_GRP_IST Sascha Bischoff
2026-04-27 16:20 ` [PATCH 43/43] Documentation: KVM: Add the VGICv5 IRS save/restore sequences Sascha Bischoff
2026-04-30 8:57 ` Peter Maydell