From: Oliver Upton <oliver.upton@linux.dev>
To: Jiaqi Yan <jiaqiyan@google.com>
Cc: maz@kernel.org, joey.gouly@arm.com, suzuki.poulose@arm.com,
yuzenghui@huawei.com, catalin.marinas@arm.com, will@kernel.org,
pbonzini@redhat.com, corbet@lwn.net, shuah@kernel.org,
kvm@vger.kernel.org, kvmarm@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kselftest@vger.kernel.org, duenwen@google.com,
rananta@google.com, jthoughton@google.com
Subject: Re: [PATCH v2 1/6] KVM: arm64: VM exit to userspace to handle SEA
Date: Fri, 11 Jul 2025 12:39:50 -0700
Message-ID: <aHFohmTb9qR_JG1E@linux.dev>
In-Reply-To: <20250604050902.3944054-2-jiaqiyan@google.com>
Hi Jiaqi,
On Wed, Jun 04, 2025 at 05:08:56AM +0000, Jiaqi Yan wrote:
> When APEI fails to handle a stage-2 synchronous external abort (SEA),
> today KVM directly injects an async SError to the VCPU then resumes it,
> which usually results in unpleasant guest kernel panic.
>
> One major situation of guest SEA is when vCPU consumes recoverable
> uncorrected memory error (UER). Although SError and guest kernel panic
> effectively stops the propagation of corrupted memory, there is room
> to recover from an UER in a more graceful manner.
>
> Alternatively KVM can redirect the synchronous SEA event to VMM to
> - Reduce blast radius if possible. VMM can inject a SEA to VCPU via
> KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison
> consumption or fault is not from guest kernel, blast radius can be
> limited to the triggering thread in guest userspace, so VM can
> keep running.
> - VMM can protect from future memory poison consumption by unmapping
> the page from stage-2, or interrupt guest of the poisoned guest page
> so guest kernel can unmap it from stage-1.
> - VMM can also track SEA events that VM customers care about, restart
> VM when certain number of distinct poison events have happened,
> provide observability to customers in log management UI.
>
> Introduce an userspace-visible feature to enable VMM to handle SEA:
> - KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior
> when host APEI fails to claim a SEA, userspace can opt in this new
> capability to let KVM exit to userspace during SEA if it is not
> caused by access on memory of stage-2 translation table.
> - KVM_EXIT_ARM_SEA. A new exit reason is introduced for this.
> KVM fills kvm_run.arm_sea with as much as possible information about
> the SEA, enabling VMM to emulate SEA to guest by itself.
> - Sanitized ESR_EL2. The general rule is to keep only the bits
> useful for userspace and relevant to guest memory. See code
> comments for why bits are hidden/reported.
> - If faulting guest virtual and physical addresses are available.
> - Faulting guest virtual address if available.
> - Faulting guest physical address if available.
>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
I was reviewing this locally and wound up making enough changes that it
just made more sense to share the diff. General comments:
- Avoid adding helpers to headers when they're used in a single
callsite / compilation unit
- Add some detail about FEAT_RAS where we may still exit to userspace
for host-controlled memory, as we cannot differentiate between a
stage-1 or stage-2 TTW SEA when taken on the descriptor PA
- Explicitly handle SEAs due to VNCR (I have a separate prereq patch)
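For anyone following along on the VMM side, here is a minimal sketch of
how userspace might decode the kvm_run.arm_sea payload from the uAPI hunk
below. It is an illustration under assumptions, not part of the patch:
the struct is a local mirror of the proposed layout, and sea_decode(),
struct sea_info and the SEA_* macro names are hypothetical. The bit
positions for EC, FnV, WnR and FSC follow the architectural ESR_ELx
layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the kvm_run.arm_sea payload proposed in this patch. */
struct arm_sea_exit {
	uint64_t flags;
	uint64_t esr;
	uint64_t gva;
	uint64_t gpa;
};

/* Same value as KVM_EXIT_ARM_SEA_FLAG_GPA_VALID in the uAPI hunk. */
#define SEA_FLAG_GPA_VALID	(1ULL << 0)

/* Architectural ESR_ELx fields that KVM's sanitized ESR keeps. */
#define SEA_ESR_EC_SHIFT	26
#define SEA_ESR_EC_MASK		(0x3FULL << SEA_ESR_EC_SHIFT)
#define SEA_ESR_FNV		(1ULL << 10)	/* FAR not valid */
#define SEA_ESR_WNR		(1ULL << 6)	/* write, not read */
#define SEA_ESR_FSC_MASK	0x3FULL

/* Hypothetical VMM-side summary of one SEA exit. */
struct sea_info {
	unsigned int ec;	/* exception class */
	unsigned int fsc;	/* fault status code */
	bool is_write;		/* only meaningful for data aborts */
	bool gva_valid;		/* FnV clear, so sea->gva can be used */
	bool gpa_valid;		/* GPA_VALID flag set, so sea->gpa can be used */
};

static struct sea_info sea_decode(const struct arm_sea_exit *sea)
{
	struct sea_info info = {
		.ec = (unsigned int)((sea->esr & SEA_ESR_EC_MASK) >>
				     SEA_ESR_EC_SHIFT),
		.fsc = (unsigned int)(sea->esr & SEA_ESR_FSC_MASK),
		.is_write = (sea->esr & SEA_ESR_WNR) != 0,
		.gva_valid = (sea->esr & SEA_ESR_FNV) == 0,
		.gpa_valid = (sea->flags & SEA_FLAG_GPA_VALID) != 0,
	};

	return info;
}
```

A VMM would presumably call something like this once KVM_RUN returns
with exit_reason == KVM_EXIT_ARM_SEA, and only then decide whether to
inject an abort via KVM_SET_VCPU_EVENTS or to stop the VM.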
From aac0bb8f90c43b5b17c3b4e50379cb8ca828812c Mon Sep 17 00:00:00 2001
From: Jiaqi Yan <jiaqiyan@google.com>
Date: Wed, 4 Jun 2025 05:08:56 +0000
Subject: [PATCH] KVM: arm64: VM exit to userspace to handle SEA
When APEI fails to handle a stage-2 synchronous external abort (SEA),
today KVM directly injects an async SError to the VCPU then resumes it,
which usually results in an unpleasant guest kernel panic.
One major source of guest SEAs is a vCPU consuming a recoverable
uncorrected memory error (UER). Although SError and guest kernel panic
effectively stop the propagation of corrupted memory, there is room
to recover from a UER in a more graceful manner.
Alternatively, KVM can redirect the synchronous SEA event to the VMM,
which can then:
- Reduce blast radius if possible. VMM can inject a SEA to VCPU via
  KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison
  consumption or fault is not from guest kernel, blast radius can be
  limited to the triggering thread in guest userspace, so VM can
  keep running.
- Protect against future memory poison consumption by unmapping the
  page from stage-2, or by interrupting the guest about the poisoned
  guest page so the guest kernel can unmap it from stage-1.
- Track SEA events that VM customers care about, restart the VM when
  a certain number of distinct poison events have happened, and
  provide observability to customers in a log management UI.
Introduce a userspace-visible feature to enable the VMM to handle SEA:
- KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior
  when host APEI fails to claim a SEA, userspace can opt in to this
  new capability to let KVM exit to userspace during SEA if it is not
  caused by an access to stage-2 translation table memory.
- KVM_EXIT_ARM_SEA. A new exit reason is introduced for this.
  KVM fills kvm_run.arm_sea with as much information as possible about
  the SEA, enabling the VMM to emulate SEA to the guest by itself.
  - Sanitized ESR_EL2. The general rule is to keep only the bits
    useful for userspace and relevant to guest memory. See code
    comments for why bits are hidden/reported.
  - Flags describing which faulting guest addresses are available.
  - Faulting guest virtual address if available.
  - Faulting guest physical address if available.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Link: https://lore.kernel.org/r/20250604050902.3944054-2-jiaqiyan@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/kvm/arm.c | 5 +++
arch/arm64/kvm/mmu.c | 67 ++++++++++++++++++++++++++++++-
include/uapi/linux/kvm.h | 10 +++++
4 files changed, 83 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index e54d29feb469..98ce2d58ac8d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -349,6 +349,8 @@ struct kvm_arch {
#define KVM_ARCH_FLAG_GUEST_HAS_SVE 9
/* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */
#define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS 10
+ /* Unhandled SEAs are taken to userspace */
+#define KVM_ARCH_FLAG_EXIT_SEA 11
unsigned long flags;
/* VM-wide vCPU feature set */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 7a1a8210ff91..aec6034db1e7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -133,6 +133,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
}
mutex_unlock(&kvm->lock);
break;
+ case KVM_CAP_ARM_SEA_TO_USER:
+ r = 0;
+ set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
+ break;
default:
break;
}
@@ -322,6 +326,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_IRQFD_RESAMPLE:
case KVM_CAP_COUNTER_OFFSET:
case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
+ case KVM_CAP_ARM_SEA_TO_USER:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a34924d75069..26b2e71994be 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1813,8 +1813,48 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
read_unlock(&vcpu->kvm->mmu_lock);
}
+/*
+ * Returns true if the SEA should be handled locally within KVM because the
+ * abort is caused by a kernel memory allocation (e.g. stage-2 table memory).
+ */
+static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
+{
+ /*
+ * Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external abort
+ * taken from a guest EL to EL2 is due to a host-imposed access (e.g.
+ * stage-2 PTW).
+ */
+ if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
+ return true;
+
+ /* KVM owns the VNCR when the vCPU isn't in a nested context. */
+ if (is_hyp_ctxt(vcpu) && (esr & ESR_ELx_VNCR))
+ return true;
+
+ /*
+ * Determining if an external abort during a table walk happened at
+ * stage-2 is only possible when S1PTW is set. Otherwise, since KVM
+ * sets HCR_EL2.TEA, SEAs due to a stage-1 walk (i.e. accessing the PA
+ * of the stage-1 descriptor) can reach here and are reported with a
+ * TTW ESR value.
+ */
+ return esr_fsc_is_sea_ttw(esr) && (esr & ESR_ELx_S1PTW);
+}
+
int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
{
+ u64 esr = kvm_vcpu_get_esr(vcpu);
+ struct kvm_run *run = vcpu->run;
+ struct kvm *kvm = vcpu->kvm;
+ u64 esr_mask = ESR_ELx_EC_MASK |
+ ESR_ELx_FnV |
+ ESR_ELx_EA |
+ ESR_ELx_CM |
+ ESR_ELx_WNR |
+ ESR_ELx_FSC;
+ u64 ipa;
+
+
/*
* Give APEI the opportunity to claim the abort before handling it
* within KVM. apei_claim_sea() expects to be called with IRQs
@@ -1824,7 +1864,32 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
if (apei_claim_sea(NULL) == 0)
return 1;
- return kvm_inject_serror(vcpu);
+ if (host_owns_sea(vcpu, esr) || !test_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags))
+ return kvm_inject_serror(vcpu);
+
+ /* ESR_ELx.SET is RES0 when FEAT_RAS isn't implemented. */
+ if (kvm_has_ras(kvm))
+ esr_mask |= ESR_ELx_SET_MASK;
+
+ /*
+ * Exit to userspace, and provide faulting guest virtual and physical
+ * addresses in case userspace wants to emulate SEA to guest by
+ * writing to FAR_ELx and HPFAR_ELx registers.
+ */
+ memset(&run->arm_sea, 0, sizeof(run->arm_sea));
+ run->exit_reason = KVM_EXIT_ARM_SEA;
+ run->arm_sea.esr = esr & esr_mask;
+
+ if (!(esr & ESR_ELx_FnV))
+ run->arm_sea.gva = kvm_vcpu_get_hfar(vcpu);
+
+ ipa = kvm_vcpu_get_fault_ipa(vcpu);
+ if (ipa != INVALID_GPA) {
+ run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GPA_VALID;
+ run->arm_sea.gpa = ipa;
+ }
+
+ return 0;
}
/**
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e4e566ff348b..b2cc3d74d769 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -179,6 +179,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_LOONGARCH_IOCSR 38
#define KVM_EXIT_MEMORY_FAULT 39
#define KVM_EXIT_TDX 40
+#define KVM_EXIT_ARM_SEA 41
/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -469,6 +470,14 @@ struct kvm_run {
} get_tdvmcall_info;
};
} tdx;
+ /* KVM_EXIT_ARM_SEA */
+ struct {
+#define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 0)
+ __u64 flags;
+ __u64 esr;
+ __u64 gva;
+ __u64 gpa;
+ } arm_sea;
/* Fix the size of the union. */
char padding[256];
};
@@ -957,6 +966,7 @@ struct kvm_enable_cap {
#define KVM_CAP_ARM_EL2_E2H0 241
#define KVM_CAP_RISCV_MP_STATE_RESET 242
#define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
+#define KVM_CAP_ARM_SEA_TO_USER 244
struct kvm_irq_routing_irqchip {
__u32 irqchip;
--
2.39.5