From: Marc Zyngier <maz@kernel.org>
To: Jiaqi Yan <jiaqiyan@google.com>
Cc: oliver.upton@linux.dev, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, pbonzini@redhat.com,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
kvm@vger.kernel.org, duenwen@google.com, rananta@google.com
Subject: Re: [RFC PATCH v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA
Date: Fri, 01 Nov 2024 09:03:16 +0000 [thread overview]
Message-ID: <86r07v1g2z.wl-maz@kernel.org> (raw)
In-Reply-To: <20241031212104.1429609-1-jiaqiyan@google.com>
On Thu, 31 Oct 2024 21:21:04 +0000,
Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> Currently KVM handles SEA in guest by injecting async SError into
> guest directly, bypassing VMM, usually results in guest kernel panic.
>
> One major situation of guest SEA is when vCPU consumes uncorrectable
> memory error on the physical memory. Although SError and guest kernel
> panic effectively stops the propagation of corrupted memory, it is not
> easy for VMM and guest to recover from memory error in a more graceful
> manner.
>
> Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> thread.
Can you elaborate on why the delivery of a signal is preferable to
simply exiting back to userspace with a description of the error?
Signals are usually not generated by KVM, and are a pretty twisted way
to generate an exit.
> In addition to the benifit that KVM's handling for SEA becomes aligned
> with core kernel behavior
> - The blast radius in VM can be limited to only the consuming thread
> in guest, instead of entire guest kernel, unless the consumption is
> from guest kernel.
> - VMM now has the chance to do its duties to stop the VM from repeatedly
> consuming corrupted data. For example, VMM can unmap the guest page
> from stage-2 table to intercept forseen memory poison consumption,
Not quite. The VMM doesn't manage stage-2. It can remove the page from
the VMA if it has it mapped, but that's it. The kernel deals with S2.
Which brings me to the next subject: when the kernel unmaps the page
at S2, it is likely to use CMOs. Can these CMOs create RAS error
themselves?
> and for every consumption injects SEA to EL1 with synthetic memory
> error CPER.
Why do we need to involve ACPI here? I would expect the production of
an architected error record instead. Or at least be given the option.
> Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> can opt in this new capability if it prefers SIGBUS than SError
> injection during VM init. Now SEA handling in KVM works as follows:
> 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
> enabled for the VM, and the SEA is NOT about translation table,
> send SIGBUS BUS_OBJERR signal with host virtual address.
And what if it is? S1 PTs are backed by userspace memory, like
anything else. I don't think we should have a different treatment of
those, because the HW wouldn't treat them differently either.
> 3. Otherwise directly inject async SError to guest.
>
> Tested on a machine running Siryn AmpereOne processor. A in-house VMM
> that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
> in VM allocated some memory buffer. The test used EINJ to inject an
> uncorrectable recoverable memory error at a page in the allocated memory
> buffer. The dummy application then consumed the memory error. Some hack
> was done to make core kernel's memory_failure triggered by poison
> generation to fail, so KVM had to deal with the SEA guest abort due to
> poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
> valid host virtual address of the poisoned page. VMM then injected a SEA
> into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
> the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
> guest survived and continued to run.
>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
> arch/arm64/include/asm/kvm_host.h | 2 +
> arch/arm64/include/asm/kvm_ras.h | 20 ++++----
> arch/arm64/kvm/Makefile | 2 +-
> arch/arm64/kvm/arm.c | 5 ++
> arch/arm64/kvm/kvm_ras.c | 77 +++++++++++++++++++++++++++++++
> arch/arm64/kvm/mmu.c | 8 +---
> include/uapi/linux/kvm.h | 1 +
> 7 files changed, 98 insertions(+), 17 deletions(-)
> create mode 100644 arch/arm64/kvm/kvm_ras.c
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index bf64fed9820ea..eb37a2489411a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -334,6 +334,8 @@ struct kvm_arch {
> /* Fine-Grained UNDEF initialised */
> #define KVM_ARCH_FLAG_FGU_INITIALIZED 8
> unsigned long flags;
> + /* Instead of injecting SError into guest, SIGBUS VMM */
> +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA 9
nit: why do you put this definition out of sequence (below 'flags')?
>
> /* VM-wide vCPU feature set */
> DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
> diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
> index 87e10d9a635b5..4bb7a424e3f6c 100644
> --- a/arch/arm64/include/asm/kvm_ras.h
> +++ b/arch/arm64/include/asm/kvm_ras.h
> @@ -11,15 +11,17 @@
> #include <asm/acpi.h>
>
> /*
> - * Was this synchronous external abort a RAS notification?
> - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> + * Handle synchronous external abort (SEA) in the following order:
> + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
> + * are all done.
> + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
> + * about translation table, send SIGBUS
> + * - si_code is BUS_OBJERR.
> + * - si_addr will be 0 when accurate HVA is unavailable.
> + * 3. Otherwise, directly inject an async SError to guest.
> + *
> + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
> */
> -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
> -{
> - /* apei_claim_sea(NULL) expects to mask interrupts itself */
> - lockdep_assert_irqs_enabled();
> -
> - return apei_claim_sea(NULL);
> -}
> +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
>
> #endif /* __ARM64_KVM_RAS_H__ */
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 3cf7adb2b5038..c4a3a6d4870e6 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> vgic/vgic-v3.o vgic/vgic-v4.o \
> vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> - vgic/vgic-its.o vgic/vgic-debug.o
> + vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
>
> kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
> kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 48cafb65d6acf..bb97ad678dbec 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> }
> mutex_unlock(&kvm->slots_lock);
> break;
> + case KVM_CAP_ARM_SIGBUS_ON_SEA:
> + r = 0;
> + set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);
Shouldn't this be somehow gated on the VM being RAS aware?
> + break;
> default:
> break;
> }
> @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_ARM_SYSTEM_SUSPEND:
> case KVM_CAP_IRQFD_RESAMPLE:
> case KVM_CAP_COUNTER_OFFSET:
> + case KVM_CAP_ARM_SIGBUS_ON_SEA:
> r = 1;
> break;
> case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
> new file mode 100644
> index 0000000000000..3225462bcbcda
> --- /dev/null
> +++ b/arch/arm64/kvm/kvm_ras.c
> @@ -0,0 +1,77 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/bitops.h>
> +#include <linux/kvm_host.h>
> +
> +#include <asm/kvm_emulate.h>
> +#include <asm/kvm_ras.h>
> +#include <asm/system_misc.h>
> +
> +/*
> + * For synchrnous external instruction or data abort, not on translation
> + * table walk or hardware update of translation table, is FAR_EL2 valid?
> + */
> +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
> +{
> + return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
> +}
> +
> +/*
> + * Was this synchronous external abort a RAS notification?
> + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> + */
> +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> +{
> + /* apei_claim_sea(NULL) expects to mask interrupts itself */
> + lockdep_assert_irqs_enabled();
> + return apei_claim_sea(NULL);
> +}
> +
> +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> +{
> + bool sigbus_on_sea;
> + int idx;
> + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> + u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> + gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> + /* When FnV is set, send 0 as si_addr like what do_sea() does. */
> + unsigned long hva = 0UL;
> +
> + /*
> + * For RAS the host kernel may handle this abort.
> + * There is no need to SIGBUS VMM, or pass the error into the guest.
> + */
> + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> + return;
> +
> + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> + &(vcpu->kvm->arch.flags));
> +
> + /*
> + * In addition to userspace opt-in, SIGBUS only makes sense if the
> + * abort is NOT about translation table walk and NOT about hardware
> + * update of translation table.
> + */
> + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
> +
> + /* Pass the error directly into the guest. */
> + if (!sigbus_on_sea) {
> + kvm_inject_vabt(vcpu);
> + return;
> + }
> +
> + if (kvm_vcpu_sea_far_valid(vcpu)) {
> + idx = srcu_read_lock(&vcpu->kvm->srcu);
> + hva = gfn_to_hva(vcpu->kvm, gfn);
> + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + }
> +
> + /*
> + * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
> + * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
> + * does if apei_claim_sea() fails.
> + */
> + arm64_notify_die("synchronous external abort",
> + current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);
This is the point where I really think we should simply trigger an
exit with all that syndrome information stashed in kvm_run, like any
other event requiring userspace help.
Also: where is the documentation?
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
next prev parent reply other threads:[~2024-11-01 14:09 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-31 21:21 [RFC PATCH v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA Jiaqi Yan
2024-10-31 21:46 ` Oliver Upton
2024-11-08 4:54 ` Jiaqi Yan
2024-11-01 9:03 ` Marc Zyngier [this message]
2024-11-04 5:02 ` Jiaqi Yan
2024-11-08 21:18 ` Jiaqi Yan
2024-11-12 8:55 ` Marc Zyngier
2024-11-12 8:51 ` Marc Zyngier
2024-11-19 23:57 ` Jiaqi Yan
2024-11-20 7:46 ` Oliver Upton
2024-12-06 1:02 ` Jiaqi Yan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=86r07v1g2z.wl-maz@kernel.org \
--to=maz@kernel.org \
--cc=catalin.marinas@arm.com \
--cc=duenwen@google.com \
--cc=jiaqiyan@google.com \
--cc=joey.gouly@arm.com \
--cc=kvm@vger.kernel.org \
--cc=kvmarm@lists.linux.dev \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=oliver.upton@linux.dev \
--cc=pbonzini@redhat.com \
--cc=rananta@google.com \
--cc=suzuki.poulose@arm.com \
--cc=will@kernel.org \
--cc=yuzenghui@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).