* [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA
@ 2025-10-13 18:59 Jiaqi Yan
2025-10-13 18:59 ` [PATCH v4 1/3] KVM: arm64: VM exit to userspace to handle SEA Jiaqi Yan
` (4 more replies)
0 siblings, 5 replies; 19+ messages in thread
From: Jiaqi Yan @ 2025-10-13 18:59 UTC
To: maz, oliver.upton
Cc: duenwen, rananta, jthoughton, vsethi, jgg, joey.gouly,
suzuki.poulose, yuzenghui, catalin.marinas, will, pbonzini,
corbet, shuah, kvm, kvmarm, linux-arm-kernel, linux-kernel,
linux-doc, linux-kselftest, Jiaqi Yan
Problem
=======
When host APEI is unable to claim a synchronous external abort (SEA)
during guest abort, today KVM directly injects an asynchronous SError
into the VCPU then resumes it. The injected SError usually results in
an unpleasant guest kernel panic.

One major source of guest SEAs is a VCPU consuming a recoverable
uncorrected memory error (UER), which is not uncommon in modern
datacenter servers with large amounts of physical memory. Although SError
and guest panic are sufficient to stop the propagation of corrupted memory,
there is room to recover from a UER in a more graceful manner.
Proposed Solution
=================
The idea is, we can replay the SEA to the faulting VCPU. If the memory
error consumption or the fault that causes the SEA is not from guest kernel,
the blast radius can be limited to the poison-consuming guest process,
while the VM can keep running.

In addition, instead of doing this under the hood without involving
userspace, there are benefits to redirecting the SEA to VMM:
- VM customers care about the disruptions caused by memory errors, and
  VMM usually has the responsibility to start the process of notifying
  the customers of memory error events in their VMs. For example, some
  cloud providers emit a critical log in their observability UI [1] and
  provide a playbook for customers on how to mitigate disruptions to
  their workloads.
- VMM can protect against future memory error consumption by unmapping
  the poisoned pages from stage-2 page table with KVM userfault [2], or
  by splitting the memslot that contains the poisoned pages.
- VMM can keep track of SEA events in the VM. When VMM thinks the status
  on the host or the VM is bad enough, e.g. number of distinct SEAs
  exceeds a threshold, it can restart the VM on another healthy host.
- Behavior parity with x86 architecture. When a machine check exception
  (MCE) is caused by a VCPU, the kernel or KVM sends SIGBUS to userspace
  to let VMM either recover from the MCE or terminate itself along with
  the VM. The prior RFC proposed to implement SIGBUS on arm64 as well,
  but Marc preferred KVM exit over signal [3]. However, implementation
  aside, returning SEA to VMM is on par with returning MCE to VMM.
Once SEA is redirected to VMM, among other actions, VMM is encouraged
to inject external aborts into the faulting VCPU.
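
The "number of distinct SEAs exceeds a threshold" policy mentioned in the list above could be tracked with something as small as the sketch below. It is only an illustration and not part of this series: the names sea_tracker and sea_tracker_record, the fixed capacity, and the 4 KiB page granularity are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SEA_TRACKER_CAP 64         /* hypothetical bound on tracked pages */
#define SEA_PAGE_MASK (~0xfffULL)  /* assumes 4 KiB guest pages */

/* Hypothetical per-VM bookkeeping kept by the VMM. */
struct sea_tracker {
	uint64_t pages[SEA_TRACKER_CAP]; /* page-aligned GPAs seen so far */
	size_t count;                    /* number of distinct pages */
	size_t threshold;                /* act when count exceeds this */
};

/*
 * Record the faulting GPA of one SEA. Returns true once the number of
 * distinct poisoned pages exceeds the threshold, i.e. when the VMM may
 * want to move the VM to a healthier host.
 */
static bool sea_tracker_record(struct sea_tracker *t, uint64_t gpa)
{
	uint64_t page = gpa & SEA_PAGE_MASK;
	size_t i;

	assert(t != NULL);

	/* A repeated hit on a known page is not a new distinct event. */
	for (i = 0; i < t->count; i++)
		if (t->pages[i] == page)
			return t->count > t->threshold;

	/* New page: remember it if there is still room. */
	if (t->count < SEA_TRACKER_CAP)
		t->pages[t->count++] = page;

	return t->count > t->threshold;
}
```

A VMM would call sea_tracker_record() with the 'gpa' from the exit only when the GPA-valid flag is set. Because the table is bounded, the threshold in this sketch has to stay below SEA_TRACKER_CAP for the function to ever return true.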
New UAPIs
=========
This patchset introduces the following userspace-visible changes to empower
VMM to control what happens for SEA on guest memory:
- KVM_CAP_ARM_SEA_TO_USER. While taking SEA, if userspace has enabled
  this new capability at VM creation, and the SEA is not caused by
  kernel-allocated memory, instead of injecting SError, KVM returns
  KVM_EXIT_ARM_SEA to userspace.
- KVM_EXIT_ARM_SEA. This is the VM exit reason VMM gets. As many details
  about the SEA as possible are provided in arm_sea, including the
  sanitized ESR value at EL2 and the faulting guest virtual and physical
  addresses if available.
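
As a rough illustration of how a VMM might interpret the exit payload described above, the sketch below mirrors the four arm_sea fields in a local struct and derives which addresses are usable. The struct name sea_info, the helper names, and the local constants are assumptions made for this example; a real VMM would read kvm_run.arm_sea and KVM_EXIT_ARM_SEA_FLAG_GPA_VALID from the patched <linux/kvm.h> instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local mirror of the kvm_run.arm_sea layout in this series. */
struct sea_info {
	uint64_t flags;
	uint64_t esr;
	uint64_t gva;
	uint64_t gpa;
};

/* Same value as KVM_EXIT_ARM_SEA_FLAG_GPA_VALID in the series. */
#define SEA_FLAG_GPA_VALID (1ULL << 0)
/* ESR_ELx.FnV ("FAR not Valid") is bit 10 of the syndrome. */
#define SEA_ESR_FNV (1ULL << 10)

/* The 'gva' field is meaningful only when FnV is clear in 'esr'. */
static bool sea_gva_valid(const struct sea_info *sea)
{
	assert(sea != NULL);
	return (sea->esr & SEA_ESR_FNV) == 0;
}

/* The 'gpa' field is meaningful only when the GPA_VALID flag is set. */
static bool sea_gpa_valid(const struct sea_info *sea)
{
	assert(sea != NULL);
	return (sea->flags & SEA_FLAG_GPA_VALID) != 0;
}
```

After these checks a VMM could, for example, log the valid addresses and then inject an external abort into the faulting VCPU with KVM_SET_VCPU_EVENTS, as the selftest in this series does.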
* From v3 [4]:
- Rebased on commit 3a8660878839 ("Linux 6.18-rc1").
- In selftest, print a message if GVA or GPA is expected to be valid.
* From v2 [5]:
- Rebased on "[PATCH] KVM: arm64: nv: Handle SEAs due to VNCR redirection" [6]
and kvmarm/next commit 7b8346bd9fce6 ("KVM: arm64: Don't attempt vLPI
mappings when vPE allocation is disabled")
- Took the host_owns_sea implementation from Oliver [7, 8].
- Excluded the guest SEA injection patches.
- Updated selftest.
* From v1 [9]:
- Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid
dereferencing NULL ITE pointer").
- Sanitize ESR_EL2 before reporting it to userspace.
- Do not do KVM_EXIT_ARM_SEA when SEA is caused by memory allocated to
stage-2 translation table.
[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/kvmarm/20250731205844.1346839-1-jiaqiyan@google.com
[5] https://lore.kernel.org/kvm/20250604050902.3944054-1-jiaqiyan@google.com
[6] https://lore.kernel.org/kvmarm/20250729182342.3281742-1-oliver.upton@linux.dev
[7] https://lore.kernel.org/kvm/aHFohmTb9qR_JG1E@linux.dev
[8] https://lore.kernel.org/kvm/aHK-DPufhLy5Dtuk@linux.dev
[9] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com
Jiaqi Yan (3):
KVM: arm64: VM exit to userspace to handle SEA
KVM: selftests: Test for KVM_EXIT_ARM_SEA
Documentation: kvm: new UAPI for handling SEA
Documentation/virt/kvm/api.rst | 61 ++++
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/kvm/arm.c | 5 +
arch/arm64/kvm/mmu.c | 68 +++-
include/uapi/linux/kvm.h | 10 +
tools/arch/arm64/include/asm/esr.h | 2 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/arm64/sea_to_user.c | 331 ++++++++++++++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 1 +
9 files changed, 480 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
--
2.51.0.760.g7b8bcc2412-goog
* [PATCH v4 1/3] KVM: arm64: VM exit to userspace to handle SEA
2025-10-13 18:59 [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA Jiaqi Yan
@ 2025-10-13 18:59 ` Jiaqi Yan
2025-11-03 18:17 ` Jose Marinho
2025-10-13 18:59 ` [PATCH v4 2/3] KVM: selftests: Test for KVM_EXIT_ARM_SEA Jiaqi Yan
` (3 subsequent siblings)
4 siblings, 1 reply; 19+ messages in thread
From: Jiaqi Yan @ 2025-10-13 18:59 UTC
To: maz, oliver.upton
Cc: duenwen, rananta, jthoughton, vsethi, jgg, joey.gouly,
suzuki.poulose, yuzenghui, catalin.marinas, will, pbonzini,
corbet, shuah, kvm, kvmarm, linux-arm-kernel, linux-kernel,
linux-doc, linux-kselftest, Jiaqi Yan
When APEI fails to handle a stage-2 synchronous external abort (SEA),
today KVM injects an asynchronous SError to the VCPU then resumes it,
which usually results in an unpleasant guest kernel panic.
One major source of guest SEAs is a vCPU consuming a recoverable
uncorrected memory error (UER). Although SError and guest kernel panic
effectively stop the propagation of corrupted memory, guest may
re-use the corrupted memory if auto-rebooted; in the worst case, guest
boot may run into poisoned memory. So there is room to recover from
a UER in a more graceful manner.
Alternatively KVM can redirect the synchronous SEA event to VMM to:
- Reduce blast radius if possible. VMM can inject a SEA to VCPU via
  KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison
  consumption or fault is not from guest kernel, blast radius can be
  limited to the triggering thread in guest userspace, so VM can
  keep running.
- Allow VMM to protect from future memory poison consumption by
  unmapping the page from stage-2, or to interrupt the guest about the
  poisoned page so guest kernel can unmap it from stage-1 page table.
- Allow VMM to track SEA events that VM customers care about, to restart
  VM when a certain number of distinct poison events have happened, and
  to provide observability to customers in log management UI.
Introduce a userspace-visible feature to enable VMM to handle SEA:
- KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior
  when host APEI fails to claim a SEA, userspace can opt in to this new
  capability to let KVM exit to userspace during SEA if it is not
  owned by host.
- KVM_EXIT_ARM_SEA. A new exit reason is introduced for this.
  KVM fills kvm_run.arm_sea with as much information as possible about
  the SEA, enabling VMM to emulate SEA to guest by itself.
  - Sanitized ESR_EL2. The general rule is to keep only the bits
    useful for userspace and relevant to guest memory.
  - Flags indicating if faulting guest physical address is valid.
  - Faulting guest physical and virtual addresses if valid.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/kvm/arm.c | 5 +++
arch/arm64/kvm/mmu.c | 68 ++++++++++++++++++++++++++++++-
include/uapi/linux/kvm.h | 10 +++++
4 files changed, 84 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index b763293281c88..e2c65b14e60c4 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -350,6 +350,8 @@ struct kvm_arch {
#define KVM_ARCH_FLAG_GUEST_HAS_SVE 9
/* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */
#define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS 10
+ /* Unhandled SEAs are taken to userspace */
+#define KVM_ARCH_FLAG_EXIT_SEA 11
unsigned long flags;
/* VM-wide vCPU feature set */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index f21d1b7f20f8e..888600df79c40 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -132,6 +132,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
}
mutex_unlock(&kvm->lock);
break;
+ case KVM_CAP_ARM_SEA_TO_USER:
+ r = 0;
+ set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
+ break;
default:
break;
}
@@ -327,6 +331,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_IRQFD_RESAMPLE:
case KVM_CAP_COUNTER_OFFSET:
case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
+ case KVM_CAP_ARM_SEA_TO_USER:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7cc964af8d305..09210b6ab3907 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1899,8 +1899,48 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
read_unlock(&vcpu->kvm->mmu_lock);
}
+/*
+ * Returns true if the SEA should be handled locally within KVM if the abort
+ * is caused by a kernel memory allocation (e.g. stage-2 table memory).
+ */
+static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
+{
+ /*
+ * Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external abort
+ * taken from a guest EL to EL2 is due to a host-imposed access (e.g.
+ * stage-2 PTW).
+ */
+ if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
+ return true;
+
+ /* KVM owns the VNCR when the vCPU isn't in a nested context. */
+ if (is_hyp_ctxt(vcpu) && (esr & ESR_ELx_VNCR))
+ return true;
+
+ /*
+ * Determining whether an external abort during a table walk happened
+ * at stage-2 is only possible when S1PTW is set. Otherwise, since KVM
+ * sets HCR_EL2.TEA, SEAs due to a stage-1 walk (i.e. accessing the
+ * PA of the stage-1 descriptor) can reach here and are reported
+ * with a TTW ESR value.
+ */
+ return (esr_fsc_is_sea_ttw(esr) && (esr & ESR_ELx_S1PTW));
+}
+
int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
{
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_run *run = vcpu->run;
+ u64 esr = kvm_vcpu_get_esr(vcpu);
+ u64 esr_mask = ESR_ELx_EC_MASK |
+ ESR_ELx_IL |
+ ESR_ELx_FnV |
+ ESR_ELx_EA |
+ ESR_ELx_CM |
+ ESR_ELx_WNR |
+ ESR_ELx_FSC;
+ u64 ipa;
+
/*
* Give APEI the opportunity to claim the abort before handling it
* within KVM. apei_claim_sea() expects to be called with IRQs enabled.
@@ -1909,7 +1949,33 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
if (apei_claim_sea(NULL) == 0)
return 1;
- return kvm_inject_serror(vcpu);
+ if (host_owns_sea(vcpu, esr) ||
+ !test_bit(KVM_ARCH_FLAG_EXIT_SEA, &vcpu->kvm->arch.flags))
+ return kvm_inject_serror(vcpu);
+
+ /* ESR_ELx.SET is RES0 when FEAT_RAS isn't implemented. */
+ if (kvm_has_ras(kvm))
+ esr_mask |= ESR_ELx_SET_MASK;
+
+ /*
+ * Exit to userspace, and provide faulting guest virtual and physical
+ * addresses in case userspace wants to emulate SEA to guest by
+ * writing to FAR_ELx and HPFAR_ELx registers.
+ */
+ memset(&run->arm_sea, 0, sizeof(run->arm_sea));
+ run->exit_reason = KVM_EXIT_ARM_SEA;
+ run->arm_sea.esr = esr & esr_mask;
+
+ if (!(esr & ESR_ELx_FnV))
+ run->arm_sea.gva = kvm_vcpu_get_hfar(vcpu);
+
+ ipa = kvm_vcpu_get_fault_ipa(vcpu);
+ if (ipa != INVALID_GPA) {
+ run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GPA_VALID;
+ run->arm_sea.gpa = ipa;
+ }
+
+ return 0;
}
/**
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6efa98a57ec11..acc7b3a346992 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -179,6 +179,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_LOONGARCH_IOCSR 38
#define KVM_EXIT_MEMORY_FAULT 39
#define KVM_EXIT_TDX 40
+#define KVM_EXIT_ARM_SEA 41
/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -473,6 +474,14 @@ struct kvm_run {
} setup_event_notify;
};
} tdx;
+ /* KVM_EXIT_ARM_SEA */
+ struct {
+#define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 0)
+ __u64 flags;
+ __u64 esr;
+ __u64 gva;
+ __u64 gpa;
+ } arm_sea;
/* Fix the size of the union. */
char padding[256];
};
@@ -963,6 +972,7 @@ struct kvm_enable_cap {
#define KVM_CAP_RISCV_MP_STATE_RESET 242
#define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
#define KVM_CAP_GUEST_MEMFD_MMAP 244
+#define KVM_CAP_ARM_SEA_TO_USER 245
struct kvm_irq_routing_irqchip {
__u32 irqchip;
--
2.51.0.760.g7b8bcc2412-goog
* [PATCH v4 2/3] KVM: selftests: Test for KVM_EXIT_ARM_SEA
2025-10-13 18:59 [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA Jiaqi Yan
2025-10-13 18:59 ` [PATCH v4 1/3] KVM: arm64: VM exit to userspace to handle SEA Jiaqi Yan
@ 2025-10-13 18:59 ` Jiaqi Yan
2025-10-13 18:59 ` [PATCH v4 3/3] Documentation: kvm: new UAPI for handling SEA Jiaqi Yan
` (2 subsequent siblings)
4 siblings, 0 replies; 19+ messages in thread
From: Jiaqi Yan @ 2025-10-13 18:59 UTC
To: maz, oliver.upton
Cc: duenwen, rananta, jthoughton, vsethi, jgg, joey.gouly,
suzuki.poulose, yuzenghui, catalin.marinas, will, pbonzini,
corbet, shuah, kvm, kvmarm, linux-arm-kernel, linux-kernel,
linux-doc, linux-kselftest, Jiaqi Yan
Test how KVM handles guest SEA when APEI is unable to claim it, and
KVM_CAP_ARM_SEA_TO_USER is enabled.
The behavior is triggered by consuming recoverable memory error (UER)
injected via EINJ. The test asserts two major things:
1. KVM returns to userspace with KVM_EXIT_ARM_SEA exit reason, and
has provided expected fault information, e.g. esr, flags, gva, gpa.
2. Userspace is able to handle KVM_EXIT_ARM_SEA by injecting SEA to
guest and KVM injects expected SEA into the VCPU.
Tested on a data center server running Siryn AmpereOne processor
that has RAS support.
Several things to notice before attempting to run this selftest:
- The test relies on EINJ support in both firmware and kernel to
inject UER. Otherwise the test will be skipped.
- The under-test platform's APEI should be unable to claim the SEA.
Otherwise the test will be skipped.
- Some platforms don't support notrigger in EINJ, which may cause
APEI and GHES to offline the memory before the guest can consume the
injected UER, making the test unable to trigger SEA.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
tools/arch/arm64/include/asm/esr.h | 2 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/arm64/sea_to_user.c | 331 ++++++++++++++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 1 +
4 files changed, 335 insertions(+)
create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
diff --git a/tools/arch/arm64/include/asm/esr.h b/tools/arch/arm64/include/asm/esr.h
index bd592ca815711..0fa17b3af1f78 100644
--- a/tools/arch/arm64/include/asm/esr.h
+++ b/tools/arch/arm64/include/asm/esr.h
@@ -141,6 +141,8 @@
#define ESR_ELx_SF (UL(1) << ESR_ELx_SF_SHIFT)
#define ESR_ELx_AR_SHIFT (14)
#define ESR_ELx_AR (UL(1) << ESR_ELx_AR_SHIFT)
+#define ESR_ELx_VNCR_SHIFT (13)
+#define ESR_ELx_VNCR (UL(1) << ESR_ELx_VNCR_SHIFT)
#define ESR_ELx_CM_SHIFT (8)
#define ESR_ELx_CM (UL(1) << ESR_ELx_CM_SHIFT)
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 148d427ff24be..02a7663c097b5 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -163,6 +163,7 @@ TEST_GEN_PROGS_arm64 += arm64/hypercalls
TEST_GEN_PROGS_arm64 += arm64/external_aborts
TEST_GEN_PROGS_arm64 += arm64/page_fault_test
TEST_GEN_PROGS_arm64 += arm64/psci_test
+TEST_GEN_PROGS_arm64 += arm64/sea_to_user
TEST_GEN_PROGS_arm64 += arm64/set_id_regs
TEST_GEN_PROGS_arm64 += arm64/smccc_filter
TEST_GEN_PROGS_arm64 += arm64/vcpu_width_config
diff --git a/tools/testing/selftests/kvm/arm64/sea_to_user.c b/tools/testing/selftests/kvm/arm64/sea_to_user.c
new file mode 100644
index 0000000000000..573dd790aeb8e
--- /dev/null
+++ b/tools/testing/selftests/kvm/arm64/sea_to_user.c
@@ -0,0 +1,331 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Test KVM returns to userspace with KVM_EXIT_ARM_SEA if host APEI fails
+ * to handle SEA and userspace has opted in to KVM_CAP_ARM_SEA_TO_USER.
+ *
+ * After reaching userspace with expected arm_sea info, also test userspace
+ * injecting a synchronous external data abort into the guest.
+ *
+ * This test utilizes EINJ to generate a REAL synchronous external data
+ * abort by consuming a recoverable uncorrectable memory error. Therefore
+ * the device under test must support EINJ in both firmware and host kernel,
+ * including the notrigger feature. Otherwise the test will be skipped.
+ * The under-test platform's APEI should be unable to claim SEA. Otherwise
+ * the test will also be skipped.
+ */
+
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "guest_modes.h"
+
+#define PAGE_PRESENT (1ULL << 63)
+#define PAGE_PHYSICAL 0x007fffffffffffffULL
+#define PAGE_ADDR_MASK (~(0xfffULL))
+
+/* Group ISV and ISS[23:14]. */
+#define ESR_ELx_INST_SYNDROME ((ESR_ELx_ISV) | (ESR_ELx_SAS) | \
+ (ESR_ELx_SSE) | (ESR_ELx_SRT_MASK) | \
+ (ESR_ELx_SF) | (ESR_ELx_AR))
+
+#define EINJ_ETYPE "/sys/kernel/debug/apei/einj/error_type"
+#define EINJ_ADDR "/sys/kernel/debug/apei/einj/param1"
+#define EINJ_MASK "/sys/kernel/debug/apei/einj/param2"
+#define EINJ_FLAGS "/sys/kernel/debug/apei/einj/flags"
+#define EINJ_NOTRIGGER "/sys/kernel/debug/apei/einj/notrigger"
+#define EINJ_DOIT "/sys/kernel/debug/apei/einj/error_inject"
+/* Memory Uncorrectable non-fatal. */
+#define ERROR_TYPE_MEMORY_UER 0x10
+/* Memory address and mask valid (param1 and param2). */
+#define MASK_MEMORY_UER 0b10
+
+/* Guest virtual address region = [2G, 3G). */
+#define START_GVA 0x80000000UL
+#define VM_MEM_SIZE 0x40000000UL
+/* Note: EINJ_OFFSET must be < VM_MEM_SIZE. */
+#define EINJ_OFFSET 0x01234badUL
+#define EINJ_GVA ((START_GVA) + (EINJ_OFFSET))
+
+static vm_paddr_t einj_gpa;
+static void *einj_hva;
+static uint64_t einj_hpa;
+static bool far_invalid;
+
+static uint64_t translate_to_host_paddr(unsigned long vaddr)
+{
+ uint64_t pinfo;
+ int64_t offset = vaddr / getpagesize() * sizeof(pinfo);
+ int fd;
+ uint64_t page_addr;
+ uint64_t paddr;
+
+ fd = open("/proc/self/pagemap", O_RDONLY);
+ if (fd < 0)
+ ksft_exit_fail_perror("Failed to open /proc/self/pagemap");
+ if (pread(fd, &pinfo, sizeof(pinfo), offset) != sizeof(pinfo)) {
+ close(fd);
+ ksft_exit_fail_perror("Failed to read /proc/self/pagemap");
+ }
+
+ close(fd);
+
+ if ((pinfo & PAGE_PRESENT) == 0)
+ ksft_exit_fail_msg("Page not present\n");
+
+ page_addr = (pinfo & PAGE_PHYSICAL) * getpagesize();
+ paddr = page_addr + (vaddr & (getpagesize() - 1));
+ return paddr;
+}
+
+static void write_einj_entry(const char *einj_path, uint64_t val)
+{
+ char cmd[256] = {0};
+ FILE *cmdfile = NULL;
+
+ sprintf(cmd, "echo %#lx > %s", val, einj_path);
+ cmdfile = popen(cmd, "r");
+
+ if (pclose(cmdfile) == 0)
+ ksft_print_msg("echo %#lx > %s - done\n", val, einj_path);
+ else
+ ksft_exit_fail_perror("Failed to write EINJ entry");
+}
+
+static void inject_uer(uint64_t paddr)
+{
+ if (access("/sys/firmware/acpi/tables/EINJ", R_OK) == -1)
+ ksft_test_result_skip("EINJ table not available in firmware");
+
+ if (access(EINJ_ETYPE, R_OK | W_OK) == -1)
+ ksft_test_result_skip("EINJ module probably not loaded?");
+
+ write_einj_entry(EINJ_ETYPE, ERROR_TYPE_MEMORY_UER);
+ write_einj_entry(EINJ_FLAGS, MASK_MEMORY_UER);
+ write_einj_entry(EINJ_ADDR, paddr);
+ write_einj_entry(EINJ_MASK, ~0x0UL);
+ write_einj_entry(EINJ_NOTRIGGER, 1);
+ write_einj_entry(EINJ_DOIT, 1);
+}
+
+/*
+ * When host APEI successfully claims the SEA caused by guest_code, kernel
+ * will send SIGBUS signal with BUS_MCEERR_AR to test thread.
+ *
+ * We set up this SIGBUS handler to skip the test for that case.
+ */
+static void sigbus_signal_handler(int sig, siginfo_t *si, void *v)
+{
+ ksft_print_msg("SIGBUS (%d) received, dumping siginfo...\n", sig);
+ ksft_print_msg("si_signo=%d, si_errno=%d, si_code=%d, si_addr=%p\n",
+ si->si_signo, si->si_errno, si->si_code, si->si_addr);
+ if (si->si_code == BUS_MCEERR_AR)
+ ksft_test_result_skip("SEA is claimed by host APEI\n");
+ else
+ ksft_test_result_fail("Exit with signal unhandled\n");
+
+ exit(0);
+}
+
+static void setup_sigbus_handler(void)
+{
+ struct sigaction act;
+
+ memset(&act, 0, sizeof(act));
+ sigemptyset(&act.sa_mask);
+ act.sa_sigaction = sigbus_signal_handler;
+ act.sa_flags = SA_SIGINFO;
+ TEST_ASSERT(sigaction(SIGBUS, &act, NULL) == 0,
+ "Failed to setup SIGBUS handler");
+}
+
+static void guest_code(void)
+{
+ uint64_t guest_data;
+
+ /* Consuming the error will cause a SEA. */
+ guest_data = *(uint64_t *)EINJ_GVA;
+
+ GUEST_FAIL("Poison not protected by SEA: gva=%#lx, guest_data=%#lx\n",
+ EINJ_GVA, guest_data);
+}
+
+static void expect_sea_handler(struct ex_regs *regs)
+{
+ u64 esr = read_sysreg(esr_el1);
+ u64 far = read_sysreg(far_el1);
+ bool expect_far_invalid = far_invalid;
+
+ GUEST_PRINTF("Handling Guest SEA\n");
+ GUEST_PRINTF("ESR_EL1=%#lx, FAR_EL1=%#lx\n", esr, far);
+
+ GUEST_ASSERT_EQ(ESR_ELx_EC(esr), ESR_ELx_EC_DABT_CUR);
+ GUEST_ASSERT_EQ(esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT);
+
+ if (expect_far_invalid) {
+ GUEST_ASSERT_EQ(esr & ESR_ELx_FnV, ESR_ELx_FnV);
+ GUEST_PRINTF("Guest observed garbage value in FAR\n");
+ } else {
+ GUEST_ASSERT_EQ(esr & ESR_ELx_FnV, 0);
+ GUEST_ASSERT_EQ(far, EINJ_GVA);
+ }
+
+ GUEST_DONE();
+}
+
+static void vcpu_inject_sea(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_events events = {};
+
+ events.exception.ext_dabt_pending = true;
+ vcpu_events_set(vcpu, &events);
+}
+
+static void run_vm(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
+{
+ struct ucall uc;
+ bool guest_done = false;
+ struct kvm_run *run = vcpu->run;
+ u64 esr;
+
+ /* Resume the vCPU after error injection to consume the error. */
+ vcpu_run(vcpu);
+
+ ksft_print_msg("Dump kvm_run info about KVM_EXIT_%s\n",
+ exit_reason_str(run->exit_reason));
+ ksft_print_msg("kvm_run.arm_sea: esr=%#llx, flags=%#llx\n",
+ run->arm_sea.esr, run->arm_sea.flags);
+ ksft_print_msg("kvm_run.arm_sea: gva=%#llx, gpa=%#llx\n",
+ run->arm_sea.gva, run->arm_sea.gpa);
+
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_ARM_SEA);
+
+ esr = run->arm_sea.esr;
+ TEST_ASSERT_EQ(ESR_ELx_EC(esr), ESR_ELx_EC_DABT_LOW);
+ TEST_ASSERT_EQ(esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT);
+ TEST_ASSERT_EQ(ESR_ELx_ISS2(esr), 0);
+ TEST_ASSERT_EQ((esr & ESR_ELx_INST_SYNDROME), 0);
+ TEST_ASSERT_EQ(esr & ESR_ELx_VNCR, 0);
+
+ if (!(esr & ESR_ELx_FnV)) {
+ ksft_print_msg("Expect gva to match given FnV bit is 0\n");
+ TEST_ASSERT_EQ(run->arm_sea.gva, EINJ_GVA);
+ }
+
+ if (run->arm_sea.flags & KVM_EXIT_ARM_SEA_FLAG_GPA_VALID) {
+ ksft_print_msg("Expect gpa to match given KVM_EXIT_ARM_SEA_FLAG_GPA_VALID is set\n");
+ TEST_ASSERT_EQ(run->arm_sea.gpa, einj_gpa & PAGE_ADDR_MASK);
+ }
+
+ far_invalid = esr & ESR_ELx_FnV;
+
+ /* Inject a SEA into guest and expect handled in SEA handler. */
+ vcpu_inject_sea(vcpu);
+
+ /* Expect the guest to reach GUEST_DONE gracefully. */
+ do {
+ vcpu_run(vcpu);
+ switch (get_ucall(vcpu, &uc)) {
+ case UCALL_PRINTF:
+ ksft_print_msg("From guest: %s", uc.buffer);
+ break;
+ case UCALL_DONE:
+ ksft_print_msg("Guest done gracefully!\n");
+ guest_done = 1;
+ break;
+ case UCALL_ABORT:
+ ksft_print_msg("Guest aborted!\n");
+ guest_done = 1;
+ REPORT_GUEST_ASSERT(uc);
+ break;
+ default:
+ TEST_FAIL("Unexpected ucall: %lu\n", uc.cmd);
+ }
+ } while (!guest_done);
+}
+
+static struct kvm_vm *vm_create_with_sea_handler(struct kvm_vcpu **vcpu)
+{
+ size_t backing_page_size;
+ size_t guest_page_size;
+ size_t alignment;
+ uint64_t num_guest_pages;
+ vm_paddr_t start_gpa;
+ enum vm_mem_backing_src_type src_type = VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB;
+ struct kvm_vm *vm;
+
+ backing_page_size = get_backing_src_pagesz(src_type);
+ guest_page_size = vm_guest_mode_params[VM_MODE_DEFAULT].page_size;
+ alignment = max(backing_page_size, guest_page_size);
+ num_guest_pages = VM_MEM_SIZE / guest_page_size;
+
+ vm = __vm_create_with_one_vcpu(vcpu, num_guest_pages, guest_code);
+ vm_init_descriptor_tables(vm);
+ vcpu_init_descriptor_tables(*vcpu);
+
+ vm_install_sync_handler(vm,
+ /*vector=*/VECTOR_SYNC_CURRENT,
+ /*ec=*/ESR_ELx_EC_DABT_CUR,
+ /*handler=*/expect_sea_handler);
+
+ start_gpa = (vm->max_gfn - num_guest_pages) * guest_page_size;
+ start_gpa = align_down(start_gpa, alignment);
+
+ vm_userspace_mem_region_add(
+ /*vm=*/vm,
+ /*src_type=*/src_type,
+ /*guest_paddr=*/start_gpa,
+ /*slot=*/1,
+ /*npages=*/num_guest_pages,
+ /*flags=*/0);
+
+ virt_map(vm, START_GVA, start_gpa, num_guest_pages);
+
+ ksft_print_msg("Mapped %#lx pages: gva=%#lx to gpa=%#lx\n",
+ num_guest_pages, START_GVA, start_gpa);
+ return vm;
+}
+
+static void vm_inject_memory_uer(struct kvm_vm *vm)
+{
+ uint64_t guest_data;
+
+ einj_gpa = addr_gva2gpa(vm, EINJ_GVA);
+ einj_hva = addr_gva2hva(vm, EINJ_GVA);
+
+ /* Populate certain data before injecting UER. */
+ *(uint64_t *)einj_hva = 0xBAADCAFE;
+ guest_data = *(uint64_t *)einj_hva;
+ ksft_print_msg("Before EINJect: data=%#lx\n",
+ guest_data);
+
+ einj_hpa = translate_to_host_paddr((unsigned long)einj_hva);
+
+ ksft_print_msg("EINJ_GVA=%#lx, einj_gpa=%#lx, einj_hva=%p, einj_hpa=%#lx\n",
+ EINJ_GVA, einj_gpa, einj_hva, einj_hpa);
+
+ inject_uer(einj_hpa);
+ ksft_print_msg("Memory UER EINJected\n");
+}
+
+int main(int argc, char *argv[])
+{
+ struct kvm_vm *vm;
+ struct kvm_vcpu *vcpu;
+
+ TEST_REQUIRE(kvm_has_cap(KVM_CAP_ARM_SEA_TO_USER));
+
+ setup_sigbus_handler();
+
+ vm = vm_create_with_sea_handler(&vcpu);
+ vm_enable_cap(vm, KVM_CAP_ARM_SEA_TO_USER, 0);
+ vm_inject_memory_uer(vm);
+ run_vm(vm, vcpu);
+ kvm_vm_free(vm);
+
+ return 0;
+}
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 6743fbd9bd671..3e2f249e88271 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2039,6 +2039,7 @@ static struct exit_reason {
KVM_EXIT_STRING(NOTIFY),
KVM_EXIT_STRING(LOONGARCH_IOCSR),
KVM_EXIT_STRING(MEMORY_FAULT),
+ KVM_EXIT_STRING(ARM_SEA),
};
/*
--
2.51.0.760.g7b8bcc2412-goog
* [PATCH v4 3/3] Documentation: kvm: new UAPI for handling SEA
2025-10-13 18:59 [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA Jiaqi Yan
2025-10-13 18:59 ` [PATCH v4 1/3] KVM: arm64: VM exit to userspace to handle SEA Jiaqi Yan
2025-10-13 18:59 ` [PATCH v4 2/3] KVM: selftests: Test for KVM_EXIT_ARM_SEA Jiaqi Yan
@ 2025-10-13 18:59 ` Jiaqi Yan
2025-10-14 1:51 ` Randy Dunlap
2025-10-20 14:46 ` [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA Jason Gunthorpe
2025-11-13 21:06 ` Oliver Upton
4 siblings, 1 reply; 19+ messages in thread
From: Jiaqi Yan @ 2025-10-13 18:59 UTC
To: maz, oliver.upton
Cc: duenwen, rananta, jthoughton, vsethi, jgg, joey.gouly,
suzuki.poulose, yuzenghui, catalin.marinas, will, pbonzini,
corbet, shuah, kvm, kvmarm, linux-arm-kernel, linux-kernel,
linux-doc, linux-kselftest, Jiaqi Yan
Document the new userspace-visible features and APIs for handling
synchronous external abort (SEA)
- KVM_CAP_ARM_SEA_TO_USER: How userspace enables the new feature.
- KVM_EXIT_ARM_SEA: exit userspace gets when it needs to handle SEA
and what userspace gets while taking the SEA.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
Documentation/virt/kvm/api.rst | 61 ++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 6ae24c5ca5598..43bc2a1d78e01 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7272,6 +7272,55 @@ exit, even without calls to ``KVM_ENABLE_CAP`` or similar. In this case,
it will enter with output fields already valid; in the common case, the
``unknown.ret`` field of the union will be ``TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED``.
Userspace need not do anything if it does not wish to support a TDVMCALL.
+
+::
+ /* KVM_EXIT_ARM_SEA */
+ struct {
+ #define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 0)
+ __u64 flags;
+ __u64 esr;
+ __u64 gva;
+ __u64 gpa;
+ } arm_sea;
+
+Used on arm64 systems. When the VM capability KVM_CAP_ARM_SEA_TO_USER is
+enabled, a VM exit is generated if the guest causes a synchronous external
+abort (SEA) and the host APEI fails to handle the SEA.
+
+Historically KVM handles SEA by first delegating the SEA to host APEI, as there
+is a high chance that the SEA is caused by consuming an uncorrected memory
+error. However, not all platforms support SEA handling in APEI, and KVM's
+fallback is to inject an asynchronous SError into the guest, which usually
+panics the guest kernel. As an alternative, userspace can participate in
+the SEA handling by enabling KVM_CAP_ARM_SEA_TO_USER at VM creation, after
+querying the capability. Once enabled, when KVM has to handle a guest-caused
+SEA, it returns to userspace with KVM_EXIT_ARM_SEA, with details
+about the SEA available in 'arm_sea'.
+
+The 'esr' field holds the sanitized value of the exception syndrome register
+(ESR_EL2) at the time KVM took the SEA, which tells userspace the character of
+the SEA, such as its Exception Class, Synchronous Error Type, Fault Status Code
+and so on. For more details on ESR, check the Arm Architecture Registers
+documentation.
+
+The following values are defined for the 'flags' field:
+
+ - KVM_EXIT_ARM_SEA_FLAG_GPA_VALID -- the faulting guest physical address
+ is valid and userspace can get its value in the 'gpa' field.
+
+Note userspace can tell whether the faulting guest virtual address is valid
+from the FnV bit in the 'esr' field. If the FnV bit in 'esr' is not set, the
+'gva' field holds the valid faulting guest virtual address.
+
+Userspace needs to take actions to handle guest SEA synchronously, namely in
+the same thread that runs KVM_RUN and receives KVM_EXIT_ARM_SEA. One of the
+encouraged approaches is to use the KVM_SET_VCPU_EVENTS API to inject the SEA
+into the faulting VCPU. This way, the guest has the opportunity to keep running
+and limit the blast radius of the SEA to the particular guest application that
+caused the SEA. Userspace may also emulate the SEA to the VM by itself using
+the KVM_SET_ONE_REG API. In this case, it can use the valid values from 'gva'
+and 'gpa' fields to manipulate VCPU's registers (e.g. FAR_EL1, HPFAR_EL2).
+
::
/* Fix the size of the union. */
@@ -8689,6 +8738,18 @@ This capability indicate to the userspace whether a PFNMAP memory region
can be safely mapped as cacheable. This relies on the presence of
force write back (FWB) feature support on the hardware.
+7.45 KVM_CAP_ARM_SEA_TO_USER
+----------------------------
+
+:Architecture: arm64
+:Target: VM
+:Parameters: none
+:Returns: 0 on success, -EINVAL if unsupported.
+
+This capability, if KVM_CHECK_EXTENSION indicates that it is available, means
+that KVM has an implementation that allows userspace to participate in handling
+synchronous external aborts caused by the VM, via a KVM_EXIT_ARM_SEA exit.
+
8. Other capabilities.
======================
--
2.51.0.760.g7b8bcc2412-goog
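The validity rules documented above for 'gva' and 'gpa' can be sketched
from the VMM side roughly as follows. This is only an illustration, not
part of the patch: struct sea_exit is a local mirror of the documented
arm_sea layout, and the helper names and SEA_* macros are hypothetical.
The FnV position (bit 10) is taken from the architectural ESR_ELx layout
for aborts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the documented arm_sea layout (illustrative only). */
struct sea_exit {
	uint64_t flags;
	uint64_t esr;
	uint64_t gva;
	uint64_t gpa;
};

/* Same value as KVM_EXIT_ARM_SEA_FLAG_GPA_VALID in the patch. */
#define SEA_FLAG_GPA_VALID	(1ULL << 0)
/* ESR_ELx.FnV: when set, the faulting virtual address is not valid. */
#define SEA_ESR_FNV		(1ULL << 10)

/* Hypothetical helper: 'gva' is meaningful only when FnV is clear. */
static bool sea_gva_valid(const struct sea_exit *sea)
{
	return (sea->esr & SEA_ESR_FNV) == 0;
}

/* Hypothetical helper: 'gpa' is meaningful only when the flag is set. */
static bool sea_gpa_valid(const struct sea_exit *sea)
{
	return (sea->flags & SEA_FLAG_GPA_VALID) != 0;
}
```

A VMM would presumably run checks like these on kvm_run.arm_sea before
using either address, for example before writing FAR_EL1 or HPFAR_EL1.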
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/3] Documentation: kvm: new UAPI for handling SEA
2025-10-13 18:59 ` [PATCH v4 3/3] Documentation: kvm: new UAPI for handling SEA Jiaqi Yan
@ 2025-10-14 1:51 ` Randy Dunlap
2025-10-21 16:13 ` Jiaqi Yan
0 siblings, 1 reply; 19+ messages in thread
From: Randy Dunlap @ 2025-10-14 1:51 UTC (permalink / raw)
To: Jiaqi Yan, maz, oliver.upton
Cc: duenwen, rananta, jthoughton, vsethi, jgg, joey.gouly,
suzuki.poulose, yuzenghui, catalin.marinas, will, pbonzini,
corbet, shuah, kvm, kvmarm, linux-arm-kernel, linux-kernel,
linux-doc, linux-kselftest
On 10/13/25 11:59 AM, Jiaqi Yan wrote:
> Document the new userspace-visible features and APIs for handling
> synchronous external abort (SEA)
> - KVM_CAP_ARM_SEA_TO_USER: How userspace enables the new feature.
> - KVM_EXIT_ARM_SEA: exit userspace gets when it needs to handle SEA
> and what userspace gets while taking the SEA.
>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
> Documentation/virt/kvm/api.rst | 61 ++++++++++++++++++++++++++++++++++
> 1 file changed, 61 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 6ae24c5ca5598..43bc2a1d78e01 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -7272,6 +7272,55 @@ exit, even without calls to ``KVM_ENABLE_CAP`` or similar. In this case,
> it will enter with output fields already valid; in the common case, the
> ``unknown.ret`` field of the union will be ``TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED``.
> Userspace need not do anything if it does not wish to support a TDVMCALL.
> +
> +::
> + /* KVM_EXIT_ARM_SEA */
> + struct {
> + #define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 0)
> + __u64 flags;
> + __u64 esr;
> + __u64 gva;
> + __u64 gpa;
> + } arm_sea;
> +
> +Used on arm64 systems. When the VM capability KVM_CAP_ARM_SEA_TO_USER is
> +enabled, a VM exit is generated if guest causes a synchronous external abort
> +(SEA) and the host APEI fails to handle the SEA.
> +
> +Historically KVM handles SEA by first delegating the SEA to host APEI as there
> +is high chance that the SEA is caused by consuming uncorrected memory error.
> +However, not all platforms support SEA handling in APEI, and KVM's fallback
> +is to inject an asynchronous SError into the guest, which usually panics
> +guest kernel unpleasantly. As an alternative, userspace can participate into
in
> +the SEA handling by enabling KVM_CAP_ARM_SEA_TO_USER at VM creation, after
> +querying the capability. Once enabled, when KVM has to handle the guest
guest-
> +caused SEA, it returns to userspace with KVM_EXIT_ARM_SEA, with details
> +about the SEA available in 'arm_sea'.
> +
> +The 'esr' field holds the value of the exception syndrome register (ESR) while
> +KVM taking the SEA, which tells userspace the character of the current SEA,
KVM takes
> +such as its Exception Class, Synchronous Error Type, Fault Specific Code and
> +so on. For more details on ESR, check the Arm Architecture Registers
> +documentation.
> +
> +The following values are defined for the 'flags' field
Above needs an ending like '.' or ':'.
(or maybe "::" depending how it is processed by Sphinx)
> +
> + - KVM_EXIT_ARM_SEA_FLAG_GPA_VALID -- the faulting guest physical address
> + is valid and userspace can get its value in the 'gpa' field.
> +
> +Note userspace can tell whether the faulting guest virtual address is valid
> +from the FnV bit in 'esr' field. If FnV bit in 'esr' field is not set, the
> +'gva' field hols the valid faulting guest virtual address.
holds (or contains)
> +
> +Userspace needs to take actions to handle guest SEA synchronously, namely in
> +the same thread that runs KVM_RUN and receives KVM_EXIT_ARM_SEA. One of the
> +encouraged approaches is to utilize the KVM_SET_VCPU_EVENTS to inject the SEA
> +to the faulting VCPU. This way, the guest has the opportunity to keep running
> +and limit the blast radius of the SEA to the particular guest application that
> +caused the SEA. Userspace may also emulate the SEA to VM by itself using the
> +KVM_SET_ONE_REG API. In this case, it can use the valid values from 'gva' and
> +'gpa' fields to manipulate VCPU's registers (e.g. FAR_EL1, HPFAR_EL1).
> +
> ::
>
> /* Fix the size of the union. */
> @@ -8689,6 +8738,18 @@ This capability indicate to the userspace whether a PFNMAP memory region
> can be safely mapped as cacheable. This relies on the presence of
> force write back (FWB) feature support on the hardware.
>
> +7.45 KVM_CAP_ARM_SEA_TO_USER
> +----------------------------
> +
> +:Architecture: arm64
> +:Target: VM
> +:Parameters: none
> +:Returns: 0 on success, -EINVAL if unsupported.
> +
> +This capability, if KVM_CHECK_EXTENSION indicates that it is available, means
> +that KVM has an implementation that allows userspace to participate in handling
> +synchronous external abort caused by VM, by an exit of KVM_EXIT_ARM_SEA.
> +
> 8. Other capabilities.
> ======================
>
--
~Randy
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA
2025-10-13 18:59 [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA Jiaqi Yan
` (2 preceding siblings ...)
2025-10-13 18:59 ` [PATCH v4 3/3] Documentation: kvm: new UAPI for handling SEA Jiaqi Yan
@ 2025-10-20 14:46 ` Jason Gunthorpe
2025-11-10 17:41 ` Jiaqi Yan
2025-11-13 21:06 ` Oliver Upton
4 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2025-10-20 14:46 UTC (permalink / raw)
To: Jiaqi Yan
Cc: maz, oliver.upton, duenwen, rananta, jthoughton, vsethi,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
pbonzini, corbet, shuah, kvm, kvmarm, linux-arm-kernel,
linux-kernel, linux-doc, linux-kselftest
On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
> Problem
> =======
>
> When host APEI is unable to claim a synchronous external abort (SEA)
> during guest abort, today KVM directly injects an asynchronous SError
> into the VCPU then resumes it. The injected SError usually results in
> unpleasant guest kernel panic.
>
> One of the major situation of guest SEA is when VCPU consumes recoverable
> uncorrected memory error (UER), which is not uncommon at all in modern
> datacenter servers with large amounts of physical memory. Although SError
> and guest panic is sufficient to stop the propagation of corrupted memory,
> there is room to recover from an UER in a more graceful manner.
>
> Proposed Solution
> =================
>
> The idea is, we can replay the SEA to the faulting VCPU. If the memory
> error consumption or the fault that cause SEA is not from guest kernel,
> the blast radius can be limited to the poison-consuming guest process,
> while the VM can keep running.
>
> In addition, instead of doing under the hood without involving userspace,
> there are benefits to redirect the SEA to VMM:
>
> - VM customers care about the disruptions caused by memory errors, and
> VMM usually has the responsibility to start the process of notifying
> the customers of memory error events in their VMs. For example some
> cloud provider emits a critical log in their observability UI [1], and
> provides a playbook for customers on how to mitigate disruptions to
> their workloads.
>
> - VMM can protect future memory error consumption by unmapping the poisoned
> pages from stage-2 page table with KVM userfault [2], or by splitting the
> memslot that contains the poisoned pages.
>
> - VMM can keep track of SEA events in the VM. When VMM thinks the status
> on the host or the VM is bad enough, e.g. number of distinct SEAs
> exceeds a threshold, it can restart the VM on another healthy host.
>
> - Behavior parity with x86 architecture. When machine check exception
> (MCE) is caused by VCPU, kernel or KVM signals userspace SIGBUS to
> let VMM either recover from the MCE, or terminate itself with VM.
> The prior RFC proposes to implement SIGBUS on arm64 as well, but
> Marc preferred KVM exit over signal [3]. However, implementation
> aside, returning SEA to VMM is on par with returning MCE to VMM.
>
> Once SEA is redirected to VMM, among other actions, VMM is encouraged
> to inject external aborts into the faulting VCPU.
I don't know much about the KVM details but this explanation makes
sense to me and we also have use cases for all of what is written
here.
Thanks,
Jason
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/3] Documentation: kvm: new UAPI for handling SEA
2025-10-14 1:51 ` Randy Dunlap
@ 2025-10-21 16:13 ` Jiaqi Yan
0 siblings, 0 replies; 19+ messages in thread
From: Jiaqi Yan @ 2025-10-21 16:13 UTC (permalink / raw)
To: Randy Dunlap
Cc: maz, oliver.upton, duenwen, rananta, jthoughton, vsethi, jgg,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
pbonzini, corbet, shuah, kvm, kvmarm, linux-arm-kernel,
linux-kernel, linux-doc, linux-kselftest
On Mon, Oct 13, 2025 at 6:51 PM Randy Dunlap <rdunlap@infradead.org> wrote:
>
>
>
> On 10/13/25 11:59 AM, Jiaqi Yan wrote:
> > Document the new userspace-visible features and APIs for handling
> > synchronous external abort (SEA)
> > - KVM_CAP_ARM_SEA_TO_USER: How userspace enables the new feature.
> > - KVM_EXIT_ARM_SEA: exit userspace gets when it needs to handle SEA
> > and what userspace gets while taking the SEA.
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > ---
> > Documentation/virt/kvm/api.rst | 61 ++++++++++++++++++++++++++++++++++
> > 1 file changed, 61 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 6ae24c5ca5598..43bc2a1d78e01 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -7272,6 +7272,55 @@ exit, even without calls to ``KVM_ENABLE_CAP`` or similar. In this case,
> > it will enter with output fields already valid; in the common case, the
> > ``unknown.ret`` field of the union will be ``TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED``.
> > Userspace need not do anything if it does not wish to support a TDVMCALL.
> > +
> > +::
> > + /* KVM_EXIT_ARM_SEA */
> > + struct {
> > + #define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 0)
> > + __u64 flags;
> > + __u64 esr;
> > + __u64 gva;
> > + __u64 gpa;
> > + } arm_sea;
> > +
> > +Used on arm64 systems. When the VM capability KVM_CAP_ARM_SEA_TO_USER is
> > +enabled, a VM exit is generated if guest causes a synchronous external abort
> > +(SEA) and the host APEI fails to handle the SEA.
> > +
> > +Historically KVM handles SEA by first delegating the SEA to host APEI as there
> > +is high chance that the SEA is caused by consuming uncorrected memory error.
> > +However, not all platforms support SEA handling in APEI, and KVM's fallback
> > +is to inject an asynchronous SError into the guest, which usually panics
> > +guest kernel unpleasantly. As an alternative, userspace can participate into
>
> in
>
> > +the SEA handling by enabling KVM_CAP_ARM_SEA_TO_USER at VM creation, after
> > +querying the capability. Once enabled, when KVM has to handle the guest
>
> guest-
> > +caused SEA, it returns to userspace with KVM_EXIT_ARM_SEA, with details
> > +about the SEA available in 'arm_sea'.
> > +
> > +The 'esr' field holds the value of the exception syndrome register (ESR) while
> > +KVM taking the SEA, which tells userspace the character of the current SEA,
> KVM takes
>
> > +such as its Exception Class, Synchronous Error Type, Fault Specific Code and
> > +so on. For more details on ESR, check the Arm Architecture Registers
> > +documentation.
> > +
> > +The following values are defined for the 'flags' field
>
> Above needs an ending like '.' or ':'.
> (or maybe "::" depending how it is processed by Sphinx)
>
> > +
> > + - KVM_EXIT_ARM_SEA_FLAG_GPA_VALID -- the faulting guest physical address
> > + is valid and userspace can get its value in the 'gpa' field.
> > +
> > +Note userspace can tell whether the faulting guest virtual address is valid
> > +from the FnV bit in 'esr' field. If FnV bit in 'esr' field is not set, the
> > +'gva' field hols the valid faulting guest virtual address.
>
> holds (or contains)
> > +
> > +Userspace needs to take actions to handle guest SEA synchronously, namely in
> > +the same thread that runs KVM_RUN and receives KVM_EXIT_ARM_SEA. One of the
> > +encouraged approaches is to utilize the KVM_SET_VCPU_EVENTS to inject the SEA
> > +to the faulting VCPU. This way, the guest has the opportunity to keep running
> > +and limit the blast radius of the SEA to the particular guest application that
> > +caused the SEA. Userspace may also emulate the SEA to VM by itself using the
> > +KVM_SET_ONE_REG API. In this case, it can use the valid values from 'gva' and
> > +'gpa' fields to manipulate VCPU's registers (e.g. FAR_EL1, HPFAR_EL1).
> > +
> > ::
> >
> > /* Fix the size of the union. */
> > @@ -8689,6 +8738,18 @@ This capability indicate to the userspace whether a PFNMAP memory region
> > can be safely mapped as cacheable. This relies on the presence of
> > force write back (FWB) feature support on the hardware.
> >
> > +7.45 KVM_CAP_ARM_SEA_TO_USER
> > +----------------------------
> > +
> > +:Architecture: arm64
> > +:Target: VM
> > +:Parameters: none
> > +:Returns: 0 on success, -EINVAL if unsupported.
> > +
> > +This capability, if KVM_CHECK_EXTENSION indicates that it is available, means
> > +that KVM has an implementation that allows userspace to participate in handling
> > +synchronous external abort caused by VM, by an exit of KVM_EXIT_ARM_SEA.
> > +
> > 8. Other capabilities.
> > ======================
> >
>
> --
> ~Randy
>
Thanks for your quick review, Randy. I have queued fixes and am
waiting for reviews on the other patches in this series.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 1/3] KVM: arm64: VM exit to userspace to handle SEA
2025-10-13 18:59 ` [PATCH v4 1/3] KVM: arm64: VM exit to userspace to handle SEA Jiaqi Yan
@ 2025-11-03 18:17 ` Jose Marinho
2025-11-03 20:45 ` Jiaqi Yan
2025-11-03 22:22 ` Marc Zyngier
0 siblings, 2 replies; 19+ messages in thread
From: Jose Marinho @ 2025-11-03 18:17 UTC (permalink / raw)
To: Jiaqi Yan, maz, oliver.upton
Cc: duenwen, rananta, jthoughton, vsethi, jgg, joey.gouly,
suzuki.poulose, yuzenghui, catalin.marinas, will, pbonzini,
corbet, shuah, kvm, kvmarm, linux-arm-kernel, linux-kernel,
linux-doc, linux-kselftest
Thank you for these patches.
On 10/13/2025 7:59 PM, Jiaqi Yan wrote:
> When APEI fails to handle a stage-2 synchronous external abort (SEA),
> today KVM injects an asynchronous SError to the VCPU then resumes it,
> which usually results in unpleasant guest kernel panic.
>
> One major situation of guest SEA is when vCPU consumes recoverable
> uncorrected memory error (UER). Although SError and guest kernel panic
> effectively stops the propagation of corrupted memory, guest may
> re-use the corrupted memory if auto-rebooted; in worse case, guest
> boot may run into poisoned memory. So there is room to recover from
> an UER in a more graceful manner.
>
> Alternatively KVM can redirect the synchronous SEA event to VMM to
> - Reduce blast radius if possible. VMM can inject a SEA to VCPU via
> KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison
> consumption or fault is not from guest kernel, blast radius can be
> limited to the triggering thread in guest userspace, so VM can
> keep running.
> - Allow VMM to protect from future memory poison consumption by
> unmapping the page from stage-2, or to interrupt guest of the
> poisoned page so guest kernel can unmap it from stage-1 page table.
> - Allow VMM to track SEA events that VM customers care about, to restart
> VM when certain number of distinct poison events have happened,
> to provide observability to customers in log management UI.
>
> Introduce an userspace-visible feature to enable VMM handle SEA:
> - KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior
> when host APEI fails to claim a SEA, userspace can opt in this new
> capability to let KVM exit to userspace during SEA if it is not
> owned by host.
> - KVM_EXIT_ARM_SEA. A new exit reason is introduced for this.
> KVM fills kvm_run.arm_sea with as much as possible information about
> the SEA, enabling VMM to emulate SEA to guest by itself.
> - Sanitized ESR_EL2. The general rule is to keep only the bits
> useful for userspace and relevant to guest memory.
> - Flags indicating if faulting guest physical address is valid.
> - Faulting guest physical and virtual addresses if valid.
>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
> Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> ---
> arch/arm64/include/asm/kvm_host.h | 2 +
> arch/arm64/kvm/arm.c | 5 +++
> arch/arm64/kvm/mmu.c | 68 ++++++++++++++++++++++++++++++-
> include/uapi/linux/kvm.h | 10 +++++
> 4 files changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index b763293281c88..e2c65b14e60c4 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -350,6 +350,8 @@ struct kvm_arch {
> #define KVM_ARCH_FLAG_GUEST_HAS_SVE 9
> /* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */
> #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS 10
> + /* Unhandled SEAs are taken to userspace */
> +#define KVM_ARCH_FLAG_EXIT_SEA 11
> unsigned long flags;
>
> /* VM-wide vCPU feature set */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index f21d1b7f20f8e..888600df79c40 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -132,6 +132,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> }
> mutex_unlock(&kvm->lock);
> break;
> + case KVM_CAP_ARM_SEA_TO_USER:
> + r = 0;
> + set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> + break;
> default:
> break;
> }
> @@ -327,6 +331,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_IRQFD_RESAMPLE:
> case KVM_CAP_COUNTER_OFFSET:
> case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
> + case KVM_CAP_ARM_SEA_TO_USER:
> r = 1;
> break;
> case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 7cc964af8d305..09210b6ab3907 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1899,8 +1899,48 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> read_unlock(&vcpu->kvm->mmu_lock);
> }
>
> +/*
> + * Returns true if the SEA should be handled locally within KVM if the abort
> + * is caused by a kernel memory allocation (e.g. stage-2 table memory).
> + */
> +static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
> +{
> + /*
> + * Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external abort
> + * taken from a guest EL to EL2 is due to a host-imposed access (e.g.
> + * stage-2 PTW).
> + */
> + if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
> + return true;
> +
> + /* KVM owns the VNCR when the vCPU isn't in a nested context. */
> + if (is_hyp_ctxt(vcpu) && (esr & ESR_ELx_VNCR))
Is this check valid only for a "Data Abort"?
> + return true;
> +
> + /*
> + * Determine if an external abort during a table walk happened at
> + * stage-2 is only possible with S1PTW is set. Otherwise, since KVM
nit: Is the first sentence correct?
> + * sets HCR_EL2.TEA, SEAs due to a stage-1 walk (i.e. accessing the
> + * PA of the stage-1 descriptor) can reach here and are reported
> + * with a TTW ESR value.
> + */
> + return (esr_fsc_is_sea_ttw(esr) && (esr & ESR_ELx_S1PTW));
> +}
> +
> int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> {
> + struct kvm *kvm = vcpu->kvm;
> + struct kvm_run *run = vcpu->run;
> + u64 esr = kvm_vcpu_get_esr(vcpu);
> + u64 esr_mask = ESR_ELx_EC_MASK |
> + ESR_ELx_IL |
> + ESR_ELx_FnV |
> + ESR_ELx_EA |
> + ESR_ELx_CM |
> + ESR_ELx_WNR |
> + ESR_ELx_FSC;
> + u64 ipa;
> +
> /*
> * Give APEI the opportunity to claim the abort before handling it
> * within KVM. apei_claim_sea() expects to be called with IRQs enabled.
> @@ -1909,7 +1949,33 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> if (apei_claim_sea(NULL) == 0)
> return 1;
>
> - return kvm_inject_serror(vcpu);
> + if (host_owns_sea(vcpu, esr) ||
> + !test_bit(KVM_ARCH_FLAG_EXIT_SEA, &vcpu->kvm->arch.flags))
> + return kvm_inject_serror(vcpu);
> +
> + /* ESR_ELx.SET is RES0 when FEAT_RAS isn't implemented. */
> + if (kvm_has_ras(kvm))
> + esr_mask |= ESR_ELx_SET_MASK;
> +
> + /*
> + * Exit to userspace, and provide faulting guest virtual and physical
> + * addresses in case userspace wants to emulate SEA to guest by
> + * writing to FAR_ELx and HPFAR_ELx registers.
> + */
> + memset(&run->arm_sea, 0, sizeof(run->arm_sea));
> + run->exit_reason = KVM_EXIT_ARM_SEA;
> + run->arm_sea.esr = esr & esr_mask;
> +
> + if (!(esr & ESR_ELx_FnV))
> + run->arm_sea.gva = kvm_vcpu_get_hfar(vcpu);
> +
> + ipa = kvm_vcpu_get_fault_ipa(vcpu);
> + if (ipa != INVALID_GPA) {
> + run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GPA_VALID;
> + run->arm_sea.gpa = ipa;
Are we interested in the value of PFAR_EL2 (if FEAT_PFAR implemented)?
I believe kvm_vcpu_get_fault_ipa gets the HPFAR_EL2, which is valid for
S2 translation and GPC faults, but unknown for other cases.
Jose
> + }
> +
> + return 0;
> }
>
> /**
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6efa98a57ec11..acc7b3a346992 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -179,6 +179,7 @@ struct kvm_xen_exit {
> #define KVM_EXIT_LOONGARCH_IOCSR 38
> #define KVM_EXIT_MEMORY_FAULT 39
> #define KVM_EXIT_TDX 40
> +#define KVM_EXIT_ARM_SEA 41
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> /* Emulate instruction failed. */
> @@ -473,6 +474,14 @@ struct kvm_run {
> } setup_event_notify;
> };
> } tdx;
> + /* KVM_EXIT_ARM_SEA */
> + struct {
> +#define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 0)
> + __u64 flags;
> + __u64 esr;
> + __u64 gva;
> + __u64 gpa;
> + } arm_sea;
> /* Fix the size of the union. */
> char padding[256];
> };
> @@ -963,6 +972,7 @@ struct kvm_enable_cap {
> #define KVM_CAP_RISCV_MP_STATE_RESET 242
> #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
> #define KVM_CAP_GUEST_MEMFD_MMAP 244
> +#define KVM_CAP_ARM_SEA_TO_USER 245
>
> struct kvm_irq_routing_irqchip {
> __u32 irqchip;
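For readers following the ESR sanitization in kvm_handle_guest_sea(), the
mask can be reproduced in a standalone sketch. The macro and function
names below are hypothetical stand-ins for the kernel's ESR_ELx_*
definitions; the bit positions are assumed from the architectural
ESR_ELx layout for aborts (EC in [31:26], IL at 25, SET in [12:11],
FnV at 10, EA at 9, CM at 8, WnR at 6, FSC in [5:0]).

```c
#include <stdint.h>

/* Hypothetical local names; positions follow the ESR_ELx abort layout. */
#define ESR_EC_SHIFT	26
#define ESR_EC_MASK	(0x3FULL << ESR_EC_SHIFT)
#define ESR_IL		(1ULL << 25)
#define ESR_SET_MASK	(3ULL << 11)
#define ESR_FNV		(1ULL << 10)
#define ESR_EA		(1ULL << 9)
#define ESR_CM		(1ULL << 8)
#define ESR_WNR		(1ULL << 6)
#define ESR_FSC		0x3FULL

/* Build the mask the way the patch does; SET is kept only with RAS. */
static uint64_t sea_esr_mask(int has_ras)
{
	uint64_t mask = ESR_EC_MASK | ESR_IL | ESR_FNV | ESR_EA |
			ESR_CM | ESR_WNR | ESR_FSC;

	if (has_ras)
		mask |= ESR_SET_MASK;
	return mask;
}

/* Extract the Exception Class from a (sanitized) ESR value. */
static unsigned int sea_esr_ec(uint64_t esr)
{
	return (unsigned int)((esr & ESR_EC_MASK) >> ESR_EC_SHIFT);
}

/* Extract the Fault Status Code from a (sanitized) ESR value. */
static unsigned int sea_esr_fsc(uint64_t esr)
{
	return (unsigned int)(esr & ESR_FSC);
}
```

Under these assumptions, bits such as ISV (24) and S1PTW (7) never reach
userspace, which matches the stated rule of keeping only the bits that
are useful to the VMM and relevant to guest memory.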
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 1/3] KVM: arm64: VM exit to userspace to handle SEA
2025-11-03 18:17 ` Jose Marinho
@ 2025-11-03 20:45 ` Jiaqi Yan
2025-11-11 9:53 ` Oliver Upton
2025-11-03 22:22 ` Marc Zyngier
1 sibling, 1 reply; 19+ messages in thread
From: Jiaqi Yan @ 2025-11-03 20:45 UTC (permalink / raw)
To: Jose Marinho
Cc: maz, oliver.upton, duenwen, rananta, jthoughton, vsethi, jgg,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
pbonzini, corbet, shuah, kvm, kvmarm, linux-arm-kernel,
linux-kernel, linux-doc, linux-kselftest
On Mon, Nov 3, 2025 at 10:17 AM Jose Marinho <jose.marinho@arm.com> wrote:
>
> Thank you for these patches.
Thanks for your comments, Jose!
>
> On 10/13/2025 7:59 PM, Jiaqi Yan wrote:
> > When APEI fails to handle a stage-2 synchronous external abort (SEA),
> > today KVM injects an asynchronous SError to the VCPU then resumes it,
> > which usually results in unpleasant guest kernel panic.
> >
> > One major situation of guest SEA is when vCPU consumes recoverable
> > uncorrected memory error (UER). Although SError and guest kernel panic
> > effectively stops the propagation of corrupted memory, guest may
> > re-use the corrupted memory if auto-rebooted; in worse case, guest
> > boot may run into poisoned memory. So there is room to recover from
> > an UER in a more graceful manner.
> >
> > Alternatively KVM can redirect the synchronous SEA event to VMM to
> > - Reduce blast radius if possible. VMM can inject a SEA to VCPU via
> > KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison
> > consumption or fault is not from guest kernel, blast radius can be
> > limited to the triggering thread in guest userspace, so VM can
> > keep running.
> > - Allow VMM to protect from future memory poison consumption by
> > unmapping the page from stage-2, or to interrupt guest of the
> > poisoned page so guest kernel can unmap it from stage-1 page table.
> > - Allow VMM to track SEA events that VM customers care about, to restart
> > VM when certain number of distinct poison events have happened,
> > to provide observability to customers in log management UI.
> >
> > Introduce an userspace-visible feature to enable VMM handle SEA:
> > - KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior
> > when host APEI fails to claim a SEA, userspace can opt in this new
> > capability to let KVM exit to userspace during SEA if it is not
> > owned by host.
> > - KVM_EXIT_ARM_SEA. A new exit reason is introduced for this.
> > KVM fills kvm_run.arm_sea with as much as possible information about
> > the SEA, enabling VMM to emulate SEA to guest by itself.
> > - Sanitized ESR_EL2. The general rule is to keep only the bits
> > useful for userspace and relevant to guest memory.
> > - Flags indicating if faulting guest physical address is valid.
> > - Faulting guest physical and virtual addresses if valid.
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
> > Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> > ---
> > arch/arm64/include/asm/kvm_host.h | 2 +
> > arch/arm64/kvm/arm.c | 5 +++
> > arch/arm64/kvm/mmu.c | 68 ++++++++++++++++++++++++++++++-
> > include/uapi/linux/kvm.h | 10 +++++
> > 4 files changed, 84 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index b763293281c88..e2c65b14e60c4 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -350,6 +350,8 @@ struct kvm_arch {
> > #define KVM_ARCH_FLAG_GUEST_HAS_SVE 9
> > /* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */
> > #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS 10
> > + /* Unhandled SEAs are taken to userspace */
> > +#define KVM_ARCH_FLAG_EXIT_SEA 11
> > unsigned long flags;
> >
> > /* VM-wide vCPU feature set */
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index f21d1b7f20f8e..888600df79c40 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -132,6 +132,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > }
> > mutex_unlock(&kvm->lock);
> > break;
> > + case KVM_CAP_ARM_SEA_TO_USER:
> > + r = 0;
> > + set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> > + break;
> > default:
> > break;
> > }
> > @@ -327,6 +331,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > case KVM_CAP_IRQFD_RESAMPLE:
> > case KVM_CAP_COUNTER_OFFSET:
> > case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
> > + case KVM_CAP_ARM_SEA_TO_USER:
> > r = 1;
> > break;
> > case KVM_CAP_SET_GUEST_DEBUG2:
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 7cc964af8d305..09210b6ab3907 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1899,8 +1899,48 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> > read_unlock(&vcpu->kvm->mmu_lock);
> > }
> >
> > +/*
> > + * Returns true if the SEA should be handled locally within KVM if the abort
> > + * is caused by a kernel memory allocation (e.g. stage-2 table memory).
> > + */
> > +static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
> > +{
> > + /*
> > + * Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external abort
> > + * taken from a guest EL to EL2 is due to a host-imposed access (e.g.
> > + * stage-2 PTW).
> > + */
> > + if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
> > + return true;
> > +
> > + /* KVM owns the VNCR when the vCPU isn't in a nested context. */
> > + if (is_hyp_ctxt(vcpu) && (esr & ESR_ELx_VNCR))
> Is this check valid only for a "Data Abort"?
Yes, the VNCR bit is specific to a Data Abort (provided we can only
reach host_owns_sea if kvm_vcpu_abt_issea).
I don't think we need to explicitly exclude the check here for
Instruction Abort.
> > + return true;
> > +
> > + /*
> > + * Determine if an external abort during a table walk happened at
> > + * stage-2 is only possible with S1PTW is set. Otherwise, since KVM
> nit: Is the first sentence correct?
Oh, it should be "Determining ...".
>
> > + * sets HCR_EL2.TEA, SEAs due to a stage-1 walk (i.e. accessing the
> > + * PA of the stage-1 descriptor) can reach here and are reported
> > + * with a TTW ESR value.
> > + */
> > + return (esr_fsc_is_sea_ttw(esr) && (esr & ESR_ELx_S1PTW));
> > +}
> > +
> > int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > {
> > + struct kvm *kvm = vcpu->kvm;
> > + struct kvm_run *run = vcpu->run;
> > + u64 esr = kvm_vcpu_get_esr(vcpu);
> > + u64 esr_mask = ESR_ELx_EC_MASK |
> > + ESR_ELx_IL |
> > + ESR_ELx_FnV |
> > + ESR_ELx_EA |
> > + ESR_ELx_CM |
> > + ESR_ELx_WNR |
> > + ESR_ELx_FSC;
> > + u64 ipa;
> > +
> > /*
> > * Give APEI the opportunity to claim the abort before handling it
> > * within KVM. apei_claim_sea() expects to be called with IRQs enabled.
> > @@ -1909,7 +1949,33 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > if (apei_claim_sea(NULL) == 0)
> > return 1;
> >
> > - return kvm_inject_serror(vcpu);
> > + if (host_owns_sea(vcpu, esr) ||
> > + !test_bit(KVM_ARCH_FLAG_EXIT_SEA, &vcpu->kvm->arch.flags))
> > + return kvm_inject_serror(vcpu);
> > +
> > + /* ESR_ELx.SET is RES0 when FEAT_RAS isn't implemented. */
> > + if (kvm_has_ras(kvm))
> > + esr_mask |= ESR_ELx_SET_MASK;
> > +
> > + /*
> > + * Exit to userspace, and provide faulting guest virtual and physical
> > + * addresses in case userspace wants to emulate SEA to guest by
> > + * writing to FAR_ELx and HPFAR_ELx registers.
> > + */
> > + memset(&run->arm_sea, 0, sizeof(run->arm_sea));
> > + run->exit_reason = KVM_EXIT_ARM_SEA;
> > + run->arm_sea.esr = esr & esr_mask;
> > +
> > + if (!(esr & ESR_ELx_FnV))
> > + run->arm_sea.gva = kvm_vcpu_get_hfar(vcpu);
> > +
> > + ipa = kvm_vcpu_get_fault_ipa(vcpu);
> > + if (ipa != INVALID_GPA) {
> > + run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GPA_VALID;
> > + run->arm_sea.gpa = ipa;
>
> Are we interested in the value of PFAR_EL2 (if FEAT_PFAR implemented)?
I don't think userspace (VMM) or the guest would need or make any use
of the physical memory address. I believe host physical addresses
should in general be hidden from userspace processes.
Completely off-topic: if FEAT_PFAR is implemented, I would propose that
EL3 RAS implement something like the following so that host APEI can
claim the SEA:
1. Triage the SEA to determine whether it has to be handled in place or
should be redirected to the lower EL2.
2. If the SEA should be redirected to EL2, craft a memory error CPER that
contains a valid physical memory address.
3. When redirecting the SEA to EL2, also populate it in the host APEI GHES.
This way, an SEA caused by a memory error can properly trigger the normal
memory_failure routine provided by the host kernel, instead of being
handled by KVM as an exception without memory error context.
> I believe kvm_vcpu_get_fault_ipa gets the HPFAR_EL2, which is valid for
> S2 translation and GPC faults, but unknown for other cases.
You are absolutely right that HPFAR_EL2 register is unknown for SEA.
However, thanks to Oliver [1] KVM now performs a FAR to HPFAR address
translation (__translate_far_to_hpfar) for certain SEA cases (see
__fault_safe_to_translate), and stores the translation status +
results in vcpu->arch.fault. These SEA cases are returned to userspace
in this patchset.
[1] https://lore.kernel.org/all/20250402201725.2963645-4-oliver.upton@linux.dev.
>
> Jose
>
> > + }
> > +
> > + return 0;
> > }
> >
> > /**
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 6efa98a57ec11..acc7b3a346992 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -179,6 +179,7 @@ struct kvm_xen_exit {
> > #define KVM_EXIT_LOONGARCH_IOCSR 38
> > #define KVM_EXIT_MEMORY_FAULT 39
> > #define KVM_EXIT_TDX 40
> > +#define KVM_EXIT_ARM_SEA 41
> >
> > /* For KVM_EXIT_INTERNAL_ERROR */
> > /* Emulate instruction failed. */
> > @@ -473,6 +474,14 @@ struct kvm_run {
> > } setup_event_notify;
> > };
> > } tdx;
> > + /* KVM_EXIT_ARM_SEA */
> > + struct {
> > +#define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 0)
> > + __u64 flags;
> > + __u64 esr;
> > + __u64 gva;
> > + __u64 gpa;
> > + } arm_sea;
> > /* Fix the size of the union. */
> > char padding[256];
> > };
> > @@ -963,6 +972,7 @@ struct kvm_enable_cap {
> > #define KVM_CAP_RISCV_MP_STATE_RESET 242
> > #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
> > #define KVM_CAP_GUEST_MEMFD_MMAP 244
> > +#define KVM_CAP_ARM_SEA_TO_USER 245
> >
> > struct kvm_irq_routing_irqchip {
> > __u32 irqchip;
* Re: [PATCH v4 1/3] KVM: arm64: VM exit to userspace to handle SEA
2025-11-03 18:17 ` Jose Marinho
2025-11-03 20:45 ` Jiaqi Yan
@ 2025-11-03 22:22 ` Marc Zyngier
1 sibling, 0 replies; 19+ messages in thread
From: Marc Zyngier @ 2025-11-03 22:22 UTC (permalink / raw)
To: Jose Marinho
Cc: Jiaqi Yan, oliver.upton, duenwen, rananta, jthoughton, vsethi,
jgg, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
pbonzini, corbet, shuah, kvm, kvmarm, linux-arm-kernel,
linux-kernel, linux-doc, linux-kselftest
On Mon, 03 Nov 2025 18:17:00 +0000,
Jose Marinho <jose.marinho@arm.com> wrote:
>
> > + /*
> > + * Exit to userspace, and provide faulting guest virtual and physical
> > + * addresses in case userspace wants to emulate SEA to guest by
> > + * writing to FAR_ELx and HPFAR_ELx registers.
> > + */
> > + memset(&run->arm_sea, 0, sizeof(run->arm_sea));
> > + run->exit_reason = KVM_EXIT_ARM_SEA;
> > + run->arm_sea.esr = esr & esr_mask;
> > +
> > + if (!(esr & ESR_ELx_FnV))
> > + run->arm_sea.gva = kvm_vcpu_get_hfar(vcpu);
> > +
> > + ipa = kvm_vcpu_get_fault_ipa(vcpu);
> > + if (ipa != INVALID_GPA) {
> > + run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GPA_VALID;
> > + run->arm_sea.gpa = ipa;
>
> Are we interested in the value of PFAR_EL2 (if FEAT_PFAR implemented)?
We don't have any support for PFAR, and I don't think we have any plan
to support it in the near future. If anything, the rest of the kernel
should start by growing support for it before we start dragging it
into KVM.
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA
2025-10-20 14:46 ` [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA Jason Gunthorpe
@ 2025-11-10 17:41 ` Jiaqi Yan
2025-11-13 13:54 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 19+ messages in thread
From: Jiaqi Yan @ 2025-11-10 17:41 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: maz, oliver.upton, duenwen, rananta, jthoughton, vsethi,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
pbonzini, corbet, shuah, kvm, kvmarm, linux-arm-kernel,
linux-kernel, linux-doc, linux-kselftest
On Mon, Oct 20, 2025 at 7:46 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
> > Problem
> > =======
> >
> > When host APEI is unable to claim a synchronous external abort (SEA)
> > during guest abort, today KVM directly injects an asynchronous SError
> > into the VCPU then resumes it. The injected SError usually results in
> > unpleasant guest kernel panic.
> >
> > One of the major situation of guest SEA is when VCPU consumes recoverable
> > uncorrected memory error (UER), which is not uncommon at all in modern
> > datacenter servers with large amounts of physical memory. Although SError
> > and guest panic is sufficient to stop the propagation of corrupted memory,
> > there is room to recover from an UER in a more graceful manner.
> >
> > Proposed Solution
> > =================
> >
> > The idea is, we can replay the SEA to the faulting VCPU. If the memory
> > error consumption or the fault that cause SEA is not from guest kernel,
> > the blast radius can be limited to the poison-consuming guest process,
> > while the VM can keep running.
> >
> > In addition, instead of doing under the hood without involving userspace,
> > there are benefits to redirect the SEA to VMM:
> >
> > - VM customers care about the disruptions caused by memory errors, and
> > VMM usually has the responsibility to start the process of notifying
> > the customers of memory error events in their VMs. For example some
> > cloud provider emits a critical log in their observability UI [1], and
> > provides a playbook for customers on how to mitigate disruptions to
> > their workloads.
> >
> > - VMM can protect future memory error consumption by unmapping the poisoned
> > pages from stage-2 page table with KVM userfault [2], or by splitting the
> > memslot that contains the poisoned pages.
> >
> > - VMM can keep track of SEA events in the VM. When VMM thinks the status
> > on the host or the VM is bad enough, e.g. number of distinct SEAs
> > exceeds a threshold, it can restart the VM on another healthy host.
> >
> > - Behavior parity with x86 architecture. When machine check exception
> > (MCE) is caused by VCPU, kernel or KVM signals userspace SIGBUS to
> > let VMM either recover from the MCE, or terminate itself with VM.
> > The prior RFC proposes to implement SIGBUS on arm64 as well, but
> > Marc preferred KVM exit over signal [3]. However, implementation
> > aside, returning SEA to VMM is on par with returning MCE to VMM.
> >
> > Once SEA is redirected to VMM, among other actions, VMM is encouraged
> > to inject external aborts into the faulting VCPU.
>
> I don't know much about the KVM details but this explanation makes
> sense to me and we also have use cases for all of what is written
> here.
>
> Thanks,
> Jason
Thanks for your feedback, Jason. And thanks for the comments from Jose,
Randy, and Marc.
Just wondering if there are any concerns or comments on the API and
implementation? If not, I will fix the typos in 1/3 and 3/3, then send
out v5.
Thanks,
Jiaqi
* Re: [PATCH v4 1/3] KVM: arm64: VM exit to userspace to handle SEA
2025-11-03 20:45 ` Jiaqi Yan
@ 2025-11-11 9:53 ` Oliver Upton
2025-11-11 23:32 ` Jiaqi Yan
0 siblings, 1 reply; 19+ messages in thread
From: Oliver Upton @ 2025-11-11 9:53 UTC (permalink / raw)
To: Jiaqi Yan
Cc: Jose Marinho, maz, oliver.upton, duenwen, rananta, jthoughton,
vsethi, jgg, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, pbonzini, corbet, shuah, kvm, kvmarm,
linux-arm-kernel, linux-kernel, linux-doc, linux-kselftest
Hi Jiaqi,
On Mon, Nov 03, 2025 at 12:45:50PM -0800, Jiaqi Yan wrote:
> On Mon, Nov 3, 2025 at 10:17 AM Jose Marinho <jose.marinho@arm.com> wrote:
> >
> > Thank you for these patches.
>
> Thanks for your comments, Jose!
>
> >
> > On 10/13/2025 7:59 PM, Jiaqi Yan wrote:
> > > When APEI fails to handle a stage-2 synchronous external abort (SEA),
> > > today KVM injects an asynchronous SError to the VCPU then resumes it,
> > > which usually results in unpleasant guest kernel panic.
> > >
> > > One major situation of guest SEA is when vCPU consumes recoverable
> > > uncorrected memory error (UER). Although SError and guest kernel panic
> > > effectively stops the propagation of corrupted memory, guest may
> > > re-use the corrupted memory if auto-rebooted; in worse case, guest
> > > boot may run into poisoned memory. So there is room to recover from
> > > an UER in a more graceful manner.
> > >
> > > Alternatively KVM can redirect the synchronous SEA event to VMM to
> > > - Reduce blast radius if possible. VMM can inject a SEA to VCPU via
> > > KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison
> > > consumption or fault is not from guest kernel, blast radius can be
> > > limited to the triggering thread in guest userspace, so VM can
> > > keep running.
> > > - Allow VMM to protect from future memory poison consumption by
> > > unmapping the page from stage-2, or to interrupt guest of the
> > > poisoned page so guest kernel can unmap it from stage-1 page table.
> > > - Allow VMM to track SEA events that VM customers care about, to restart
> > > VM when certain number of distinct poison events have happened,
> > > to provide observability to customers in log management UI.
> > >
> > > Introduce an userspace-visible feature to enable VMM handle SEA:
> > > - KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior
> > > when host APEI fails to claim a SEA, userspace can opt in this new
> > > capability to let KVM exit to userspace during SEA if it is not
> > > owned by host.
> > > - KVM_EXIT_ARM_SEA. A new exit reason is introduced for this.
> > > KVM fills kvm_run.arm_sea with as much as possible information about
> > > the SEA, enabling VMM to emulate SEA to guest by itself.
> > > - Sanitized ESR_EL2. The general rule is to keep only the bits
> > > useful for userspace and relevant to guest memory.
> > > - Flags indicating if faulting guest physical address is valid.
> > > - Faulting guest physical and virtual addresses if valid.
> > >
> > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
> > > Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> > > ---
> > > arch/arm64/include/asm/kvm_host.h | 2 +
> > > arch/arm64/kvm/arm.c | 5 +++
> > > arch/arm64/kvm/mmu.c | 68 ++++++++++++++++++++++++++++++-
> > > include/uapi/linux/kvm.h | 10 +++++
> > > 4 files changed, 84 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index b763293281c88..e2c65b14e60c4 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -350,6 +350,8 @@ struct kvm_arch {
> > > #define KVM_ARCH_FLAG_GUEST_HAS_SVE 9
> > > /* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */
> > > #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS 10
> > > + /* Unhandled SEAs are taken to userspace */
> > > +#define KVM_ARCH_FLAG_EXIT_SEA 11
> > > unsigned long flags;
> > >
> > > /* VM-wide vCPU feature set */
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index f21d1b7f20f8e..888600df79c40 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -132,6 +132,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > }
> > > mutex_unlock(&kvm->lock);
> > > break;
> > > + case KVM_CAP_ARM_SEA_TO_USER:
> > > + r = 0;
> > > + set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> > > + break;
> > > default:
> > > break;
> > > }
> > > @@ -327,6 +331,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > case KVM_CAP_IRQFD_RESAMPLE:
> > > case KVM_CAP_COUNTER_OFFSET:
> > > case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
> > > + case KVM_CAP_ARM_SEA_TO_USER:
> > > r = 1;
> > > break;
> > > case KVM_CAP_SET_GUEST_DEBUG2:
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 7cc964af8d305..09210b6ab3907 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -1899,8 +1899,48 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> > > read_unlock(&vcpu->kvm->mmu_lock);
> > > }
> > >
> > > +/*
> > > + * Returns true if the SEA should be handled locally within KVM if the abort
> > > + * is caused by a kernel memory allocation (e.g. stage-2 table memory).
> > > + */
> > > +static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
> > > +{
> > > + /*
> > > + * Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external abort
> > > + * taken from a guest EL to EL2 is due to a host-imposed access (e.g.
> > > + * stage-2 PTW).
> > > + */
> > > + if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
> > > + return true;
> > > +
> > > + /* KVM owns the VNCR when the vCPU isn't in a nested context. */
> > > + if (is_hyp_ctxt(vcpu) && (esr & ESR_ELx_VNCR))
> > Is this check valid only for a "Data Abort"?
>
> Yes, the VNCR bit is specific to a Data Abort (provided we can only
> reach host_owns_sea if kvm_vcpu_abt_issea).
> I don't think we need to explicitly exclude the check here for
> Instruction Abort.
You can take an external abort on an instruction fetch, in which case
bit 13 of the ISS (VNCR bit for data abort) is RES0. So this does need
to check for a data abort.
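As a rough standalone illustration of that gating (a sketch only, not the fix that was applied): the constants are redefined locally from the architectural ESR_ELx encoding, and esr_is_data_abort and esr_vncr_abort are hypothetical names, not kernel helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* ESR_ELx fields, redefined locally for this example. */
#define ESR_EC_SHIFT	26
#define ESR_EC_MASK	(0x3FULL << ESR_EC_SHIFT)
#define ESR_EC_IABT_LOW	0x20ULL		/* Instruction Abort from a lower EL */
#define ESR_EC_DABT_LOW	0x24ULL		/* Data Abort from a lower EL */
#define ESR_VNCR	(1ULL << 13)	/* ISS bit 13, defined for Data Aborts only */

static bool esr_is_data_abort(uint64_t esr)
{
	return ((esr & ESR_EC_MASK) >> ESR_EC_SHIFT) == ESR_EC_DABT_LOW;
}

/* Only interpret bit 13 as VNCR once the exception class says Data Abort. */
static bool esr_vncr_abort(uint64_t esr)
{
	return esr_is_data_abort(esr) && (esr & ESR_VNCR);
}
```

With this shape, an instruction abort that happens to carry a set bit 13 is never mistaken for a VNCR access.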
Thanks,
Oliver
* Re: [PATCH v4 1/3] KVM: arm64: VM exit to userspace to handle SEA
2025-11-11 9:53 ` Oliver Upton
@ 2025-11-11 23:32 ` Jiaqi Yan
0 siblings, 0 replies; 19+ messages in thread
From: Jiaqi Yan @ 2025-11-11 23:32 UTC (permalink / raw)
To: Oliver Upton, Jose Marinho
Cc: maz, oliver.upton, duenwen, rananta, jthoughton, vsethi, jgg,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
pbonzini, corbet, shuah, kvm, kvmarm, linux-arm-kernel,
linux-kernel, linux-doc, linux-kselftest
On Tue, Nov 11, 2025 at 1:53 AM Oliver Upton <oupton@kernel.org> wrote:
>
> Hi Jiaqi,
>
> On Mon, Nov 03, 2025 at 12:45:50PM -0800, Jiaqi Yan wrote:
> > On Mon, Nov 3, 2025 at 10:17 AM Jose Marinho <jose.marinho@arm.com> wrote:
> > >
> > > Thank you for these patches.
> >
> > Thanks for your comments, Jose!
> >
> > >
> > > On 10/13/2025 7:59 PM, Jiaqi Yan wrote:
> > > > When APEI fails to handle a stage-2 synchronous external abort (SEA),
> > > > today KVM injects an asynchronous SError to the VCPU then resumes it,
> > > > which usually results in unpleasant guest kernel panic.
> > > >
> > > > One major situation of guest SEA is when vCPU consumes recoverable
> > > > uncorrected memory error (UER). Although SError and guest kernel panic
> > > > effectively stops the propagation of corrupted memory, guest may
> > > > re-use the corrupted memory if auto-rebooted; in worse case, guest
> > > > boot may run into poisoned memory. So there is room to recover from
> > > > an UER in a more graceful manner.
> > > >
> > > > Alternatively KVM can redirect the synchronous SEA event to VMM to
> > > > - Reduce blast radius if possible. VMM can inject a SEA to VCPU via
> > > > KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison
> > > > consumption or fault is not from guest kernel, blast radius can be
> > > > limited to the triggering thread in guest userspace, so VM can
> > > > keep running.
> > > > - Allow VMM to protect from future memory poison consumption by
> > > > unmapping the page from stage-2, or to interrupt guest of the
> > > > poisoned page so guest kernel can unmap it from stage-1 page table.
> > > > - Allow VMM to track SEA events that VM customers care about, to restart
> > > > VM when certain number of distinct poison events have happened,
> > > > to provide observability to customers in log management UI.
> > > >
> > > > Introduce an userspace-visible feature to enable VMM handle SEA:
> > > > - KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior
> > > > when host APEI fails to claim a SEA, userspace can opt in this new
> > > > capability to let KVM exit to userspace during SEA if it is not
> > > > owned by host.
> > > > - KVM_EXIT_ARM_SEA. A new exit reason is introduced for this.
> > > > KVM fills kvm_run.arm_sea with as much as possible information about
> > > > the SEA, enabling VMM to emulate SEA to guest by itself.
> > > > - Sanitized ESR_EL2. The general rule is to keep only the bits
> > > > useful for userspace and relevant to guest memory.
> > > > - Flags indicating if faulting guest physical address is valid.
> > > > - Faulting guest physical and virtual addresses if valid.
> > > >
> > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > > Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
> > > > Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> > > > ---
> > > > arch/arm64/include/asm/kvm_host.h | 2 +
> > > > arch/arm64/kvm/arm.c | 5 +++
> > > > arch/arm64/kvm/mmu.c | 68 ++++++++++++++++++++++++++++++-
> > > > include/uapi/linux/kvm.h | 10 +++++
> > > > 4 files changed, 84 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > > index b763293281c88..e2c65b14e60c4 100644
> > > > --- a/arch/arm64/include/asm/kvm_host.h
> > > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > > @@ -350,6 +350,8 @@ struct kvm_arch {
> > > > #define KVM_ARCH_FLAG_GUEST_HAS_SVE 9
> > > > /* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */
> > > > #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS 10
> > > > + /* Unhandled SEAs are taken to userspace */
> > > > +#define KVM_ARCH_FLAG_EXIT_SEA 11
> > > > unsigned long flags;
> > > >
> > > > /* VM-wide vCPU feature set */
> > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > index f21d1b7f20f8e..888600df79c40 100644
> > > > --- a/arch/arm64/kvm/arm.c
> > > > +++ b/arch/arm64/kvm/arm.c
> > > > @@ -132,6 +132,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > > }
> > > > mutex_unlock(&kvm->lock);
> > > > break;
> > > > + case KVM_CAP_ARM_SEA_TO_USER:
> > > > + r = 0;
> > > > + set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> > > > + break;
> > > > default:
> > > > break;
> > > > }
> > > > @@ -327,6 +331,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > > case KVM_CAP_IRQFD_RESAMPLE:
> > > > case KVM_CAP_COUNTER_OFFSET:
> > > > case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
> > > > + case KVM_CAP_ARM_SEA_TO_USER:
> > > > r = 1;
> > > > break;
> > > > case KVM_CAP_SET_GUEST_DEBUG2:
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index 7cc964af8d305..09210b6ab3907 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -1899,8 +1899,48 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> > > > read_unlock(&vcpu->kvm->mmu_lock);
> > > > }
> > > >
> > > > +/*
> > > > + * Returns true if the SEA should be handled locally within KVM if the abort
> > > > + * is caused by a kernel memory allocation (e.g. stage-2 table memory).
> > > > + */
> > > > +static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
> > > > +{
> > > > + /*
> > > > + * Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external abort
> > > > + * taken from a guest EL to EL2 is due to a host-imposed access (e.g.
> > > > + * stage-2 PTW).
> > > > + */
> > > > + if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
> > > > + return true;
> > > > +
> > > > + /* KVM owns the VNCR when the vCPU isn't in a nested context. */
> > > > + if (is_hyp_ctxt(vcpu) && (esr & ESR_ELx_VNCR))
> > > Is this check valid only for a "Data Abort"?
> >
> > Yes, the VNCR bit is specific to a Data Abort (provided we can only
> > reach host_owns_sea if kvm_vcpu_abt_issea).
> > I don't think we need to explicitly exclude the check here for
> > Instruction Abort.
>
> You can take an external abort on an instruction fetch, in which case
> bit 13 of the ISS (VNCR bit for data abort) is RES0. So this does need
> to check for a data abort.
Agreed and thanks for correcting me, Oliver! I will fix this in v5.
>
> Thanks,
> Oliver
* Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA
2025-11-10 17:41 ` Jiaqi Yan
@ 2025-11-13 13:54 ` Mauro Carvalho Chehab
2025-11-13 18:21 ` Oliver Upton
0 siblings, 1 reply; 19+ messages in thread
From: Mauro Carvalho Chehab @ 2025-11-13 13:54 UTC (permalink / raw)
To: Jiaqi Yan
Cc: Jason Gunthorpe, maz, oliver.upton, duenwen, rananta, jthoughton,
vsethi, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
will, pbonzini, corbet, shuah, kvm, kvmarm, linux-arm-kernel,
linux-kernel, linux-doc, linux-kselftest
Hi,
On Mon, Nov 10, 2025 at 09:41:33AM -0800, Jiaqi Yan wrote:
> On Mon, Oct 20, 2025 at 7:46 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
> > > Problem
> > > =======
> > >
> > > When host APEI is unable to claim a synchronous external abort (SEA)
> > > during guest abort, today KVM directly injects an asynchronous SError
> > > into the VCPU then resumes it. The injected SError usually results in
> > > unpleasant guest kernel panic.
> > >
> > > One of the major situation of guest SEA is when VCPU consumes recoverable
> > > uncorrected memory error (UER), which is not uncommon at all in modern
> > > datacenter servers with large amounts of physical memory. Although SError
> > > and guest panic is sufficient to stop the propagation of corrupted memory,
> > > there is room to recover from an UER in a more graceful manner.
> > >
> > > Proposed Solution
> > > =================
> > >
> > > The idea is, we can replay the SEA to the faulting VCPU. If the memory
> > > error consumption or the fault that cause SEA is not from guest kernel,
> > > the blast radius can be limited to the poison-consuming guest process,
> > > while the VM can keep running.
I like the idea of having a "guest-first"/"host-first" approach for APEI,
letting userspace (likely rasdaemon) decide whether to handle hardware errors
at the guest or at the host. Yet, it sounds wrong to have a flag
called KVM_EXIT_ARM_SEA, as:
1. This is not exclusive to ARM;
2. There are other notification mechanisms that can raise APEI
errors. For instance, QEMU code defines:
ACPI_GHES_NOTIFY_POLLED = 0,
ACPI_GHES_NOTIFY_EXTERNAL = 1,
ACPI_GHES_NOTIFY_LOCAL = 2,
ACPI_GHES_NOTIFY_SCI = 3,
ACPI_GHES_NOTIFY_NMI = 4,
ACPI_GHES_NOTIFY_CMCI = 5,
ACPI_GHES_NOTIFY_MCE = 6,
ACPI_GHES_NOTIFY_GPIO = 7,
ACPI_GHES_NOTIFY_SEA = 8,
ACPI_GHES_NOTIFY_SEI = 9,
ACPI_GHES_NOTIFY_GSIV = 10,
ACPI_GHES_NOTIFY_SDEI = 11,
ACPI_GHES_NOTIFY_RESERVED = 12
- even on arm, QEMU currently implements two mechanisms (SEA and GPIO);
- once we implement the same feature on Intel, it will likely use
NMI, MCE and/or SCI.
So, IMO, the best would be to use a more generic name like
KVM_EXIT_APEI or KVM_EXIT_GHES - or maybe even name it the way it really
is meant: KVM_EXIT_ACPI_GUEST_FIRST.
That said, I'd say that we need an implementation in a real userspace
application to be able to test it (rasdaemon being the obvious candidate).
In order to test, the best option is to use the new QEMU code (for 10.2)
that allows injecting hardware errors via QMP.
Regards,
Mauro
* Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA
2025-11-13 13:54 ` Mauro Carvalho Chehab
@ 2025-11-13 18:21 ` Oliver Upton
0 siblings, 0 replies; 19+ messages in thread
From: Oliver Upton @ 2025-11-13 18:21 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jiaqi Yan, Jason Gunthorpe, maz, duenwen, rananta, jthoughton,
vsethi, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
will, pbonzini, corbet, shuah, kvm, kvmarm, linux-arm-kernel,
linux-kernel, linux-doc, linux-kselftest
On Thu, Nov 13, 2025 at 02:54:33PM +0100, Mauro Carvalho Chehab wrote:
> Hi,
>
> On Mon, Nov 10, 2025 at 09:41:33AM -0800, Jiaqi Yan wrote:
> > On Mon, Oct 20, 2025 at 7:46 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
> > > > Problem
> > > > =======
> > > >
> > > > When host APEI is unable to claim a synchronous external abort (SEA)
> > > > during guest abort, today KVM directly injects an asynchronous SError
> > > > into the VCPU then resumes it. The injected SError usually results in
> > > > unpleasant guest kernel panic.
> > > >
> > > > One of the major situation of guest SEA is when VCPU consumes recoverable
> > > > uncorrected memory error (UER), which is not uncommon at all in modern
> > > > datacenter servers with large amounts of physical memory. Although SError
> > > > and guest panic is sufficient to stop the propagation of corrupted memory,
> > > > there is room to recover from an UER in a more graceful manner.
> > > >
> > > > Proposed Solution
> > > > =================
> > > >
> > > > The idea is, we can replay the SEA to the faulting VCPU. If the memory
> > > > error consumption or the fault that cause SEA is not from guest kernel,
> > > > the blast radius can be limited to the poison-consuming guest process,
> > > > while the VM can keep running.
>
> I like the idea of having a "guest-first"/"host-first" approach for APEI,
> letting userspace (likely rasdaemon) to decide to handle hardware errors
> either at the guest or at the host. Yet, it sounds wrong to have a flag
> called KVM_EXIT_ARM_SEA, as:
>
> 1. This is not exclusive to ARM;
> 2. There are other notification mechanisms that can rise an APEI
> errors. For instance QEMU code defines:
>
> ACPI_GHES_NOTIFY_POLLED = 0,
> ACPI_GHES_NOTIFY_EXTERNAL = 1,
> ACPI_GHES_NOTIFY_LOCAL = 2,
> ACPI_GHES_NOTIFY_SCI = 3,
> ACPI_GHES_NOTIFY_NMI = 4,
> ACPI_GHES_NOTIFY_CMCI = 5,
> ACPI_GHES_NOTIFY_MCE = 6,
> ACPI_GHES_NOTIFY_GPIO = 7,
> ACPI_GHES_NOTIFY_SEA = 8,
> ACPI_GHES_NOTIFY_SEI = 9,
> ACPI_GHES_NOTIFY_GSIV = 10,
> ACPI_GHES_NOTIFY_SDEI = 11,
> ACPI_GHES_NOTIFY_RESERVED = 12
>
> - even on arm. QEMU currently implements two mechanisms (SEA and GPIO);
> - once we implement the same feature on Intel, it will likely use
> NMI, MCE and/or SCI.
>
> So, IMO, the best would be to use a more generic name like
> KVM_EXIT_APEI or KVM_EXIT_GHES - or maybe even name it the way it really
> is meant: KVM_EXIT_ACPI_GUEST_FIRST.
This is not the sort of thing that I'd like to see dressed up as an
arch-generic interface.
What Jiaqi is dealing with is the very sorry state of RAS on arm64,
giving userspace the opportunity to decide how an SEA is handled when a
platform's firmware couldn't be bothered to do so. The SEA is an
architecture-specific event so we provide the hardware context to
the VMM to sort things out.
If the APEI driver actually registers to handle the SEA then it will
continue to handle the SEA before ever involving the VMM. I'm not
aware of any system that does this. If you're lucky you'll take an
*asynchronous* vector after to process a CPER and still have to deal
with a 'bare' SEA.
And of course, none of this even matters for the several billion
DT-based hosts out in the wild.
Thanks,
Oliver
* Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA
2025-10-13 18:59 [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA Jiaqi Yan
` (3 preceding siblings ...)
2025-10-20 14:46 ` [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA Jason Gunthorpe
@ 2025-11-13 21:06 ` Oliver Upton
2025-11-13 22:14 ` Jiaqi Yan
4 siblings, 1 reply; 19+ messages in thread
From: Oliver Upton @ 2025-11-13 21:06 UTC (permalink / raw)
To: maz, oliver.upton, Jiaqi Yan
Cc: Oliver Upton, duenwen, rananta, jthoughton, vsethi, joey.gouly,
suzuki.poulose, yuzenghui, catalin.marinas, will, pbonzini,
corbet, shuah, kvm, kvmarm, linux-arm-kernel, linux-kernel,
linux-doc, linux-kselftest, Jason Gunthorpe
On Mon, 13 Oct 2025 18:59:00 +0000, Jiaqi Yan wrote:
> Problem
> =======
>
> When host APEI is unable to claim a synchronous external abort (SEA)
> during guest abort, today KVM directly injects an asynchronous SError
> into the VCPU then resumes it. The injected SError usually results in
> unpleasant guest kernel panic.
>
> [...]
I've gone ahead and done some cleanups, especially around documentation.
Applied to next, thanks!
[1/3] KVM: arm64: VM exit to userspace to handle SEA
https://git.kernel.org/kvmarm/kvmarm/c/ad9c62bd8946
[2/3] KVM: selftests: Test for KVM_EXIT_ARM_SEA
https://git.kernel.org/kvmarm/kvmarm/c/feee9ef7ac16
[3/3] Documentation: kvm: new UAPI for handling SEA
https://git.kernel.org/kvmarm/kvmarm/c/4debb5e8952e
--
Best,
Oliver
* Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA
2025-11-13 21:06 ` Oliver Upton
@ 2025-11-13 22:14 ` Jiaqi Yan
2025-11-13 22:33 ` Oliver Upton
0 siblings, 1 reply; 19+ messages in thread
From: Jiaqi Yan @ 2025-11-13 22:14 UTC (permalink / raw)
To: Oliver Upton, oliver.upton
Cc: maz, duenwen, rananta, jthoughton, vsethi, joey.gouly,
suzuki.poulose, yuzenghui, catalin.marinas, will, pbonzini,
corbet, shuah, kvm, kvmarm, linux-arm-kernel, linux-kernel,
linux-doc, linux-kselftest, Jason Gunthorpe
On Thu, Nov 13, 2025 at 1:06 PM Oliver Upton <oupton@kernel.org> wrote:
>
> On Mon, 13 Oct 2025 18:59:00 +0000, Jiaqi Yan wrote:
> > Problem
> > =======
> >
> > When host APEI is unable to claim a synchronous external abort (SEA)
> > during guest abort, today KVM directly injects an asynchronous SError
> > into the VCPU then resumes it. The injected SError usually results in
> > unpleasant guest kernel panic.
> >
> > [...]
>
> I've gone ahead and done some cleanups, especially around documentation.
>
> Applied to next, thanks!
Many thanks, Oliver!
I assume I still need to send out v5 with the typos fixed, comments
addressed, and your cleanups applied? If so, what specific tag/release
do you want me to rebase v5 onto?
>
> [1/3] KVM: arm64: VM exit to userspace to handle SEA
> https://git.kernel.org/kvmarm/kvmarm/c/ad9c62bd8946
> [2/3] KVM: selftests: Test for KVM_EXIT_ARM_SEA
> https://git.kernel.org/kvmarm/kvmarm/c/feee9ef7ac16
> [3/3] Documentation: kvm: new UAPI for handling SEA
> https://git.kernel.org/kvmarm/kvmarm/c/4debb5e8952e
>
> --
> Best,
> Oliver
* Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA
2025-11-13 22:14 ` Jiaqi Yan
@ 2025-11-13 22:33 ` Oliver Upton
2025-11-14 0:53 ` Jiaqi Yan
0 siblings, 1 reply; 19+ messages in thread
From: Oliver Upton @ 2025-11-13 22:33 UTC (permalink / raw)
To: Jiaqi Yan
Cc: oliver.upton, maz, duenwen, rananta, jthoughton, vsethi,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
pbonzini, corbet, shuah, kvm, kvmarm, linux-arm-kernel,
linux-kernel, linux-doc, linux-kselftest, Jason Gunthorpe
On Thu, Nov 13, 2025 at 02:14:08PM -0800, Jiaqi Yan wrote:
> On Thu, Nov 13, 2025 at 1:06 PM Oliver Upton <oupton@kernel.org> wrote:
> >
> > On Mon, 13 Oct 2025 18:59:00 +0000, Jiaqi Yan wrote:
> > > Problem
> > > =======
> > >
> > > When host APEI is unable to claim a synchronous external abort (SEA)
> > > during guest abort, today KVM directly injects an asynchronous SError
> > > into the VCPU then resumes it. The injected SError usually results in
> > > unpleasant guest kernel panic.
> > >
> > > [...]
> >
> > I've gone ahead and done some cleanups, especially around documentation.
> >
> > Applied to next, thanks!
>
> Many thanks, Oliver!
>
> I assume I still need to send out v5 with typo fixed, comments
> addressed, and your cleanups applied? If so, what specific tag/release
> you want me to rebase v5 onto?
No need -- I took care of the issues I spotted when applying, LMK if
anything looks amiss on kvmarm/next.
Thanks,
Oliver
* Re: [PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA
2025-11-13 22:33 ` Oliver Upton
@ 2025-11-14 0:53 ` Jiaqi Yan
0 siblings, 0 replies; 19+ messages in thread
From: Jiaqi Yan @ 2025-11-14 0:53 UTC (permalink / raw)
To: Oliver Upton
Cc: oliver.upton, maz, duenwen, rananta, jthoughton, vsethi,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
pbonzini, corbet, shuah, kvm, kvmarm, linux-arm-kernel,
linux-kernel, linux-doc, linux-kselftest, Jason Gunthorpe
On Thu, Nov 13, 2025 at 2:34 PM Oliver Upton <oupton@kernel.org> wrote:
>
> On Thu, Nov 13, 2025 at 02:14:08PM -0800, Jiaqi Yan wrote:
> > On Thu, Nov 13, 2025 at 1:06 PM Oliver Upton <oupton@kernel.org> wrote:
> > >
> > > On Mon, 13 Oct 2025 18:59:00 +0000, Jiaqi Yan wrote:
> > > > Problem
> > > > =======
> > > >
> > > > When host APEI is unable to claim a synchronous external abort (SEA)
> > > > during guest abort, today KVM directly injects an asynchronous SError
> > > > into the VCPU then resumes it. The injected SError usually results in
> > > > unpleasant guest kernel panic.
> > > >
> > > > [...]
> > >
> > > I've gone ahead and done some cleanups, especially around documentation.
> > >
> > > Applied to next, thanks!
> >
> > Many thanks, Oliver!
> >
> > I assume I still need to send out v5 with typo fixed, comments
> > addressed, and your cleanups applied? If so, what specific tag/release
> > you want me to rebase v5 onto?
>
> No need -- I took care of the issues I spotted when applying, LMK if
> anything looks amiss on kvmarm/next.
I took a look and everything looks fixed, and thanks for nearly
rewriting the documentation!
>
> Thanks,
> Oliver