* Re: [PATCH v14 26/44] arm64: RMI: Allow populating initial contents
From: Gavin Shan @ 2026-05-28 5:30 UTC (permalink / raw)
To: Steven Price, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-27-steven.price@arm.com>
Hi Steve,
On 5/13/26 11:17 PM, Steven Price wrote:
> The VMM needs to populate the realm with some data before starting (e.g.
> a kernel and initrd). This is measured by the RMM and used as part of
> the attestation later on.
>
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes since v13:
> * Rename realm_create_protected_data_page() to realm_data_map_init().
> Changes since v12:
> * The ioctl now updates the structure with the amount populated rather
> than returning this through the ioctl return code.
> * Use the new RMM v2.0 range based RMI calls.
> * Adapt to upstream changes in kvm_gmem_populate().
> Changes since v11:
> * The multiplex CAP is gone and there's a new ioctl which makes use of
> the generic kvm_gmem_populate() functionality.
> Changes since v7:
> * Improve the error codes.
> * Other minor changes from review.
> Changes since v6:
> * Handle host potentially having a larger page size than the RMM
> granule.
> * Drop historic "par" (protected address range) from
> populate_par_region() - it doesn't exist within the current
> architecture.
> * Add a cond_resched() call in kvm_populate_realm().
> Changes since v5:
> * Refactor to use PFNs rather than tracking struct page in
> realm_create_protected_data_page().
> * Pull changes from a later patch (in the v5 series) for accessing
> pages from a guest memfd.
> * Do the populate in chunks to avoid holding locks for too long and
> triggering RCU stall warnings.
> ---
> arch/arm64/include/asm/kvm_rmi.h | 4 ++
> arch/arm64/kvm/Kconfig | 1 +
> arch/arm64/kvm/arm.c | 13 ++++
> arch/arm64/kvm/rmi.c | 106 +++++++++++++++++++++++++++++++
> 4 files changed, 124 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
> index 007249a13dbc..a2b6bc412a22 100644
> --- a/arch/arm64/include/asm/kvm_rmi.h
> +++ b/arch/arm64/include/asm/kvm_rmi.h
> @@ -88,6 +88,10 @@ int kvm_rec_enter(struct kvm_vcpu *vcpu);
> int kvm_rec_pre_enter(struct kvm_vcpu *vcpu);
> int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_status);
>
> +struct kvm_arm_rmi_populate;
> +
> +int kvm_arm_rmi_populate(struct kvm *kvm,
> + struct kvm_arm_rmi_populate *arg);
> void kvm_realm_unmap_range(struct kvm *kvm,
> unsigned long ipa,
> unsigned long size,
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 4e16719fda22..d0cd011cf672 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -38,6 +38,7 @@ menuconfig KVM
> select GUEST_PERF_EVENTS if PERF_EVENTS
> select KVM_GUEST_MEMFD
> select KVM_GENERIC_MEMORY_ATTRIBUTES
> + select HAVE_KVM_ARCH_GMEM_POPULATE
> help
> Support hosting virtualized guest machines.
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index ed88a203b892..073ba9181da9 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -2131,6 +2131,19 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> return -EFAULT;
> return kvm_vm_ioctl_get_reg_writable_masks(kvm, &range);
> }
> + case KVM_ARM_RMI_POPULATE: {
> + struct kvm_arm_rmi_populate req;
> + int ret;
> +
> + if (!kvm_is_realm(kvm))
> + return -ENXIO;
> + if (copy_from_user(&req, argp, sizeof(req)))
> + return -EFAULT;
> + ret = kvm_arm_rmi_populate(kvm, &req);
> + if (copy_to_user(argp, &req, sizeof(req)))
> + return -EFAULT;
> + return ret;
> + }
s/return ret/return 0; The variable 'ret' can be dropped.
> default:
> return -EINVAL;
> }
> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> index a89873a5eb77..209087bcf399 100644
> --- a/arch/arm64/kvm/rmi.c
> +++ b/arch/arm64/kvm/rmi.c
> @@ -486,6 +486,75 @@ void kvm_realm_unmap_range(struct kvm *kvm, unsigned long start,
> realm_unmap_private_range(kvm, start, end, may_block);
> }
>
> +static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
> + kvm_pfn_t dst_pfn, kvm_pfn_t src_pfn,
> + unsigned long flags)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + phys_addr_t rd = virt_to_phys(realm->rd);
> + phys_addr_t dst_phys, src_phys;
> + int ret;
> +
> + dst_phys = __pfn_to_phys(dst_pfn);
> + src_phys = __pfn_to_phys(src_pfn);
> +
> + if (rmi_delegate_page(dst_phys))
> + return -ENXIO;
> +
> + ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys, flags);
> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> + /* Create missing RTTs and retry */
> + int level = RMI_RETURN_INDEX(ret);
> +
> + KVM_BUG_ON(level == KVM_PGTABLE_LAST_LEVEL, kvm);
KVM_BUG_ON(level >= KVM_PGTABLE_LAST_LEVEL, kvm);> +
> + ret = realm_create_rtt_levels(realm, ipa, level,
> + KVM_PGTABLE_LAST_LEVEL, NULL);
> + if (!ret) {
> + ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys,
> + flags);
> + }
> + }
> +
> + if (ret) {
> + if (WARN_ON(rmi_undelegate_page(dst_phys))) {
> + /* Undelegate failed, so we leak the page */
> + get_page(pfn_to_page(dst_pfn));
> + }
> + }
> +
if (ret && WARN_ON(rmi_undelegate_page(dst_phys)) {
/* Leak the page that fails to be undelegated */
get_page(pfn_to_page(dst_pfn));
}
> + return ret;
> +}
> +
> +static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> + struct page *src_page, void *opaque)
> +{
> + unsigned long data_flags = *(unsigned long *)opaque;
> + phys_addr_t ipa = gfn_to_gpa(gfn);
> +
> + if (!src_page)
> + return -EOPNOTSUPP;
> +
> + return realm_data_map_init(kvm, ipa, pfn, page_to_pfn(src_page),
> + data_flags);
> +}
> +
> +static long populate_region(struct kvm *kvm,
> + gfn_t base_gfn,
> + unsigned long pages,
> + u64 uaddr,
> + unsigned long data_flags)
> +{
> + long ret = 0;
> +
> + mutex_lock(&kvm->slots_lock);
> + ret = kvm_gmem_populate(kvm, base_gfn, u64_to_user_ptr(uaddr), pages,
> + populate_region_cb, &data_flags);
> + mutex_unlock(&kvm->slots_lock);
> +
> + return ret;
> +}
> +
> enum ripas_action {
> RIPAS_INIT,
> RIPAS_SET,
> @@ -574,6 +643,43 @@ static int realm_ensure_created(struct kvm *kvm)
> return -ENXIO;
> }
>
> +int kvm_arm_rmi_populate(struct kvm *kvm,
> + struct kvm_arm_rmi_populate *args)
> +{
> + unsigned long data_flags = 0;
> + unsigned long ipa_start = args->base;
> + unsigned long ipa_end = ipa_start + args->size;
> + long pages_populated;
> + int ret;
> +
> + if (args->reserved ||
> + (args->flags & ~KVM_ARM_RMI_POPULATE_FLAGS_MEASURE) ||
> + !IS_ALIGNED(ipa_start, PAGE_SIZE) ||
> + !IS_ALIGNED(ipa_end, PAGE_SIZE) ||
> + !IS_ALIGNED(args->source_uaddr, PAGE_SIZE))
> + return -EINVAL;
> +
There are more conditions missed here:
args->size == 0, return 0;
args->base + args->size < args->base, return -EINVAL; // wrapped range
> + ret = realm_ensure_created(kvm);
> + if (ret)
> + return ret;
> +
> + if (args->flags & KVM_ARM_RMI_POPULATE_FLAGS_MEASURE)
> + data_flags |= RMI_MEASURE_CONTENT;
> +
> + pages_populated = populate_region(kvm, gpa_to_gfn(ipa_start),
> + args->size >> PAGE_SHIFT,
> + args->source_uaddr, data_flags);
> +
> + if (pages_populated < 0)
> + return pages_populated;
pages_populaged is 'unsigned long', this function returns a 'int' value.
> +
> + args->size -= pages_populated << PAGE_SHIFT;
> + args->source_uaddr += pages_populated << PAGE_SHIFT;
> + args->base += pages_populated << PAGE_SHIFT;
> +
> + return 0;
> +}
> +
> static void kvm_complete_ripas_change(struct kvm_vcpu *vcpu)
> {
> struct kvm *kvm = vcpu->kvm;
Thanks,
Gavin
^ permalink raw reply
* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Xu Yilun @ 2026-05-28 4:52 UTC (permalink / raw)
To: Sohil Mehta
Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
linux-kernel, kvm, yilun.xu, baolu.lu, zhenzhong.duan, xiaoyao.li
In-Reply-To: <e9b8083b-6747-4adf-9e48-ffae70dc6508@intel.com>
On Wed, May 27, 2026 at 10:09:41AM -0700, Sohil Mehta wrote:
> On 5/27/2026 3:38 AM, Xu Yilun wrote:
> >
> > Because for security purpose, these add-on features are always needed,
> > even if not all of them, so Extensions will most likely be enabled.
> >
>
> A cover letter is a good place to explain such nuances, alternate
> approaches, and tradeoffs.
>
> > And even if someone switched them off all and saved the memory, compared
> > to the memory of a typical TDX capable system (lets say 1TB), the saving
> > is still little (0.001%).
> >
>
> In this case percentages make it harder to understand. Does it need a
> fixed amount of memory (~50MB) irrespective of the feature or the number
> of features? If so, it would be good to mention that.
No the memory needed varies depends on the feature or the number of
features. But currently I see the total requirement is ~50MB.
Yes I can drop the percentage, just state the amount in MB.
>
>
> >> In addition, could you briefly describe the complexity we are trading off?
> >
> > If we delay the Extensions initialization to the first Extension
> > SEAMCALL, we need to maintain additional TDX state machine for
> > lifecycle, and we need mechanisms to synchronize parallel Extension
> > enabling request from multiple callers.
>
> This would be good to include in the cover as well.
Yes.
^ permalink raw reply
* Re: [PATCH v14 24/44] KVM: arm64: Handle realm MMIO emulation
From: Gavin Shan @ 2026-05-28 5:03 UTC (permalink / raw)
To: Steven Price, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-25-steven.price@arm.com>
Hi Steve,
On 5/13/26 11:17 PM, Steven Price wrote:
> MMIO emulation for a realm cannot be done directly with the VM's
> registers as they are protected from the host. However, for emulatable
> data aborts, the RMM uses GPRS[0] to provide the read/written value.
> We can transfer this from/to the equivalent VCPU's register entry and
> then depend on the generic MMIO handling code in KVM.
>
> For a MMIO read, the value is placed in the shared RecExit structure
> during kvm_handle_mmio_return() rather than in the VCPU's register
> entry.
>
> Signed-off-by: Steven Price <steven.price@arm.com>
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since v7:
> * New comment for rec_exit_sync_dabt() explaining the call to
> vcpu_set_reg().
> Changes since v5:
> * Inject SEA to the guest is an emulatable MMIO access triggers a data
> abort.
> * kvm_handle_mmio_return() - disable kvm_incr_pc() for a REC (as the PC
> isn't under the host's control) and move the REC_ENTER_EMULATED_MMIO
> flag setting to this location (as that tells the RMM to skip the
> instruction).
> ---
> arch/arm64/kvm/inject_fault.c | 4 +++-
> arch/arm64/kvm/mmio.c | 16 ++++++++++++----
> arch/arm64/kvm/rmi-exit.c | 14 ++++++++++++++
> 3 files changed, 29 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
> index 89982bd3345f..6492397b73d7 100644
> --- a/arch/arm64/kvm/inject_fault.c
> +++ b/arch/arm64/kvm/inject_fault.c
> @@ -228,7 +228,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt, u32 addr)
>
> static void __kvm_inject_sea(struct kvm_vcpu *vcpu, bool iabt, u64 addr)
> {
> - if (vcpu_el1_is_32bit(vcpu))
> + if (unlikely(vcpu_is_rec(vcpu)))
> + vcpu->arch.rec.run->enter.flags |= REC_ENTER_FLAG_INJECT_SEA;
> + else if (vcpu_el1_is_32bit(vcpu))
> inject_abt32(vcpu, iabt, addr);
> else
> inject_abt64(vcpu, iabt, addr);
> diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
> index e2285ed8c91d..6a8cb927fcca 100644
> --- a/arch/arm64/kvm/mmio.c
> +++ b/arch/arm64/kvm/mmio.c
> @@ -6,6 +6,7 @@
>
> #include <linux/kvm_host.h>
> #include <asm/kvm_emulate.h>
> +#include <asm/rmi_smc.h>
> #include <trace/events/kvm.h>
>
> #include "trace.h"
> @@ -138,14 +139,21 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu)
> trace_kvm_mmio(KVM_TRACE_MMIO_READ, len, run->mmio.phys_addr,
> &data);
> data = vcpu_data_host_to_guest(vcpu, data, len);
> - vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
> +
> + if (vcpu_is_rec(vcpu))
> + vcpu->arch.rec.run->enter.gprs[0] = data;
> + else
> + vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
> }
>
> /*
> * The MMIO instruction is emulated and should not be re-executed
> * in the guest.
> */
> - kvm_incr_pc(vcpu);
> + if (vcpu_is_rec(vcpu))
> + vcpu->arch.rec.run->enter.flags |= REC_ENTER_FLAG_EMULATED_MMIO;
> + else
> + kvm_incr_pc(vcpu);
>
> return 1;
> }
> @@ -167,14 +175,14 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> * No valid syndrome? Ask userspace for help if it has
> * volunteered to do so, and bail out otherwise.
> *
> - * In the protected VM case, there isn't much userspace can do
> + * In the protected/realm VM case, there isn't much userspace can do
> * though, so directly deliver an exception to the guest.
> */
> if (!kvm_vcpu_dabt_isvalid(vcpu)) {
> trace_kvm_mmio_nisv(*vcpu_pc(vcpu), esr,
> kvm_vcpu_get_hfar(vcpu), fault_ipa);
>
> - if (vcpu_is_protected(vcpu))
> + if (vcpu_is_protected(vcpu) || vcpu_is_rec(vcpu))
> return kvm_inject_sea_dabt(vcpu, kvm_vcpu_get_hfar(vcpu));
>
> if (test_bit(KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER,
> diff --git a/arch/arm64/kvm/rmi-exit.c b/arch/arm64/kvm/rmi-exit.c
> index e7c51b6cf6ce..8ec0d179eba2 100644
> --- a/arch/arm64/kvm/rmi-exit.c
> +++ b/arch/arm64/kvm/rmi-exit.c
> @@ -25,6 +25,20 @@ static int rec_exit_reason_notimpl(struct kvm_vcpu *vcpu)
>
> static int rec_exit_sync_dabt(struct kvm_vcpu *vcpu)
> {
> + struct realm_rec *rec = &vcpu->arch.rec;
> +
> + /*
> + * In the case of a write, copy over gprs[0] to the target GPR,
> + * preparing to handle MMIO write fault. The content to be written has
> + * been saved to gprs[0] by the RMM (even if another register was used
> + * by the guest). In the case of normal memory access this is redundant
> + * (the guest will replay the instruction), but the overhead is
> + * minimal.
> + */
> + if (kvm_vcpu_dabt_iswrite(vcpu) && kvm_vcpu_dabt_isvalid(vcpu))
> + vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu),
> + rec->run->exit.gprs[0]);
> +
{ } is needed here.
> return kvm_handle_guest_abort(vcpu);
> }
>
Thanks,
Gavin
^ permalink raw reply
* Re: [PATCH 01/15] x86/virt/tdx: Read global metadata for TDX Module Extensions
From: Xu Yilun @ 2026-05-28 4:25 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Xiaoyao Li, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
zhenzhong.duan
In-Reply-To: <ahcDvwEES7vqLLvg@thinkstation>
On Wed, May 27, 2026 at 04:35:36PM +0100, Kiryl Shutsemau wrote:
> On Mon, May 25, 2026 at 02:54:40PM +0800, Xiaoyao Li wrote:
> > On 5/22/2026 11:41 AM, Xu Yilun wrote:
> > ...
> > > +static __init int get_tdx_sys_info_ext(struct tdx_sys_info_ext *sysinfo_ext)
> > > +{
> > > + int ret = 0;
> > > + u64 val;
> > > +
> > > + if (!ret && !(ret = read_sys_metadata_field(0x3100000100000000, &val)))
> > > + sysinfo_ext->memory_pool_required_pages = val;
> > > + if (!ret && !(ret = read_sys_metadata_field(0x3100000000000001, &val)))
> > > + sysinfo_ext->ext_required = val;
> > > +
> > > + return ret;
> > > +}
> > > +
> > > static __init int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
> > > {
> > > int ret = 0;
> > > @@ -116,5 +129,8 @@ static __init int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
> > > ret = ret ?: get_tdx_sys_info_td_ctrl(&sysinfo->td_ctrl);
> > > ret = ret ?: get_tdx_sys_info_td_conf(&sysinfo->td_conf);
> > > + if (sysinfo->features.tdx_features0 & TDX_FEATURES0_EXT)
> > > + ret = ret ?: get_tdx_sys_info_ext(&sysinfo->ext);
> >
> > Is it correct to read "memory_pool_required_pages" and "ext_required" so
> > early in get_tdx_sys_info()? get_tdx_sys_info() is called before
> > config_tdx_module() which calls TDH.SYS.CONFIG.
> >
> > If I read the TDX module base spec correctly, the amount of memory for
> > extensions and EXT_REQUIRED field depends on the enabled features, which is
> > determined by TDH.SYS.CONFIG/TDH.SYS.UPDATE ?
Yes.
>
> This is my read too. Looks like we need a separate step after
> config_tdx_module() to readout config-dependatant metadata.
The timing for when metadata becomes valid is now variable, e.g., the
TDX QUOTING metadata is only valid after TDH.QUOTE.INIT [1].
Based on recent discussion, I think we should introduce runtime metadata
reading interfaces for specific metadata sets as needed, rather than
another catch-all step right after config_tdx_module(). See [2] for the
proposed approach for Extensions metadata.
[1]: https://lore.kernel.org/all/20260522034128.3144354-7-yilun.xu@linux.intel.com/
[2]: https://lore.kernel.org/all/ahXAL41ZmIDHmgfu@yilunxu-OptiPlex-7050/
^ permalink raw reply
* Re: [PATCH v14 22/44] arm64: RMI: Handle realm enter/exit
From: Gavin Shan @ 2026-05-28 4:38 UTC (permalink / raw)
To: Steven Price, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-23-steven.price@arm.com>
Hi Steve,
On 5/13/26 11:17 PM, Steven Price wrote:
> Entering a realm is done using a SMC call to the RMM. On exit the
> exit-codes need to be handled slightly differently to the normal KVM
> path so define our own functions for realm enter/exit and hook them
> in if the guest is a realm guest.
>
> Signed-off-by: Steven Price <steven.price@arm.com>
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> ---
> Chanegs since v13:
> * The RMM is now required to provide an ESR value with the correct
> information to emulate MMIO, so we no longer need to hardcode 0s in
> rec_exit_sys_reg().
> * The PSCI changes mean that there is a potential race when turning on
> a VCPU which can cause a RMI_ERROR_REC return. Exit to user space
> with -EAGAIN in this case.
> Changes since v12:
> * Call guest_state_{enter,exit}_irqoff() around rmi_rec_enter().
> * Add handling of the IRQ exception case where IRQs need to be briefly
> enabled before exiting guest timing.
> Changes since v8:
> * Introduce kvm_rec_pre_enter() called before entering an atomic
> section to handle operations that might require memory allocation
> (specifically completing a RIPAS change introduced in a later patch).
> * Updates to align with upstream changes to hpfar_el2 which now (ab)uses
> HPFAR_EL2_NS as a valid flag.
> * Fix exit reason when racing with PSCI shutdown to return
> KVM_EXIT_SHUTDOWN rather than KVM_EXIT_UNKNOWN.
> Changes since v7:
> * A return of 0 from kvm_handle_sys_reg() doesn't mean the register has
> been read (although that can never happen in the current code). Tidy
> up the condition to handle any future refactoring.
> Changes since v6:
> * Use vcpu_err() rather than pr_err/kvm_err when there is an associated
> vcpu to the error.
> * Return -EFAULT for KVM_EXIT_MEMORY_FAULT as per the documentation for
> this exit type.
> * Split code handling a RIPAS change triggered by the guest to the
> following patch.
> Changes since v5:
> * For a RIPAS_CHANGE request from the guest perform the actual RIPAS
> change on next entry rather than immediately on the exit. This allows
> the VMM to 'reject' a RIPAS change by refusing to continue
> scheduling.
> Changes since v4:
> * Rename handle_rme_exit() to handle_rec_exit()
> * Move the loop to copy registers into the REC enter structure from the
> to rec_exit_handlers callbacks to kvm_rec_enter(). This fixes a bug
> where the handler exits to user space and user space wants to modify
> the GPRS.
> * Some code rearrangement in rec_exit_ripas_change().
> Changes since v2:
> * realm_set_ipa_state() now provides an output parameter for the
> top_iap that was changed. Use this to signal the VMM with the correct
> range that has been transitioned.
> * Adapt to previous patch changes.
> ---
> arch/arm64/include/asm/kvm_rmi.h | 4 +
> arch/arm64/kvm/Makefile | 2 +-
> arch/arm64/kvm/arm.c | 26 ++++-
> arch/arm64/kvm/rmi-exit.c | 186 +++++++++++++++++++++++++++++++
> arch/arm64/kvm/rmi.c | 42 +++++++
> 5 files changed, 254 insertions(+), 6 deletions(-)
> create mode 100644 arch/arm64/kvm/rmi-exit.c
>
> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
> index d99bf4fc3c39..feb534a6678e 100644
> --- a/arch/arm64/include/asm/kvm_rmi.h
> +++ b/arch/arm64/include/asm/kvm_rmi.h
> @@ -84,6 +84,10 @@ void kvm_destroy_realm(struct kvm *kvm);
> void kvm_realm_destroy_rtts(struct kvm *kvm);
> void kvm_destroy_rec(struct kvm_vcpu *vcpu);
>
> +int kvm_rec_enter(struct kvm_vcpu *vcpu);
> +int kvm_rec_pre_enter(struct kvm_vcpu *vcpu);
> +int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_status);
> +
> static inline bool kvm_realm_is_private_address(struct realm *realm,
> unsigned long addr)
> {
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index ed3cf30eb06e..4a2d52fdb6a2 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -16,7 +16,7 @@ CFLAGS_handle_exit.o += -Wno-override-init
> kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> inject_fault.o va_layout.o handle_exit.o config.o \
> guest.o debug.o reset.o sys_regs.o stacktrace.o \
> - vgic-sys-reg-v3.o fpsimd.o pkvm.o rmi.o \
> + vgic-sys-reg-v3.o fpsimd.o pkvm.o rmi.o rmi-exit.o \
> arch_timer.o trng.o vmid.o emulate-nested.o nested.o at.o \
> vgic/vgic.o vgic/vgic-init.o \
> vgic/vgic-irqfd.o vgic/vgic-v2.o \
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 21d9dfdb1ea0..ed88a203b892 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1331,6 +1331,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> if (ret > 0)
> ret = check_vcpu_requests(vcpu);
>
> + if (ret > 0 && vcpu_is_rec(vcpu))
> + ret = kvm_rec_pre_enter(vcpu);
> +
> /*
> * Preparing the interrupts to be injected also
> * involves poking the GIC, which must be done in a
> @@ -1378,7 +1381,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> trace_kvm_entry(*vcpu_pc(vcpu));
> guest_timing_enter_irqoff();
>
> - ret = kvm_arm_vcpu_enter_exit(vcpu);
> + if (vcpu_is_rec(vcpu))
> + ret = kvm_rec_enter(vcpu);
> + else
> + ret = kvm_arm_vcpu_enter_exit(vcpu);
>
> vcpu->mode = OUTSIDE_GUEST_MODE;
> vcpu->stat.exits++;
> @@ -1424,7 +1430,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> * context synchronization event) is necessary to ensure that
> * pending interrupts are taken.
> */
> - if (ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ) {
> + if (ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ ||
> + (vcpu_is_rec(vcpu) &&
> + vcpu->arch.rec.run->exit.exit_reason == RMI_EXIT_IRQ)) {
> local_irq_enable();
> isb();
> local_irq_disable();
The condition could be posssibly imprecise because ARM_EXCEPTION_CODE(ret)
can be ARM_EXCEPTION_IRQ even for a REC. So the precise condition would be:
if ((!vcpu_is_rec(vcpu) && ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ) ||
(vcpu_is_rec(vcpu) && vcpu->arch.rec.run->exit.exit_reason == RMI_EXIT_IRQ)) {
> @@ -1436,8 +1444,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>
> trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
>
> - /* Exit types that need handling before we can be preempted */
> - handle_exit_early(vcpu, ret);
> + if (!vcpu_is_rec(vcpu)) {
> + /*
> + * Exit types that need handling before we can be
> + * preempted
> + */
> + handle_exit_early(vcpu, ret);
> + }
>
> kvm_nested_sync_hwstate(vcpu);
>
> @@ -1462,7 +1475,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> ret = ARM_EXCEPTION_IL;
> }
>
> - ret = handle_exit(vcpu, ret);
> + if (vcpu_is_rec(vcpu))
> + ret = handle_rec_exit(vcpu, ret);
> + else
> + ret = handle_exit(vcpu, ret);
> }
>
> /* Tell userspace about in-kernel device output levels */
> diff --git a/arch/arm64/kvm/rmi-exit.c b/arch/arm64/kvm/rmi-exit.c
> new file mode 100644
> index 000000000000..e7c51b6cf6ce
> --- /dev/null
> +++ b/arch/arm64/kvm/rmi-exit.c
> @@ -0,0 +1,186 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <kvm/arm_hypercalls.h>
> +#include <kvm/arm_psci.h>
> +
> +#include <asm/rmi_smc.h>
> +#include <asm/kvm_emulate.h>
> +#include <asm/kvm_rmi.h>
> +#include <asm/kvm_mmu.h>
> +
> +typedef int (*exit_handler_fn)(struct kvm_vcpu *vcpu);
> +
> +static int rec_exit_reason_notimpl(struct kvm_vcpu *vcpu)
> +{
> + struct realm_rec *rec = &vcpu->arch.rec;
> +
> + vcpu_err(vcpu, "Unhandled exit reason from realm (ESR: %#llx)\n",
> + rec->run->exit.esr);
> + return -ENXIO;
> +}
> +
s/rec->run->exit.esr/kvm_vcpu_get_esr(vcpu), rec->run->exit.esr has been
copied to the storage space pointed by kvm_vcpu_get_esr() in its caller.
> +static int rec_exit_sync_dabt(struct kvm_vcpu *vcpu)
> +{
> + return kvm_handle_guest_abort(vcpu);
> +}
> +
> +static int rec_exit_sync_iabt(struct kvm_vcpu *vcpu)
> +{
> + struct realm_rec *rec = &vcpu->arch.rec;
> +
> + vcpu_err(vcpu, "Unhandled instruction abort (ESR: %#llx).\n",
> + rec->run->exit.esr);
> + return -ENXIO;
> +}
> +
s/rec->run->exit.esr/kvm_vcpu_get_esr(vcpu)
> +static int rec_exit_sys_reg(struct kvm_vcpu *vcpu)
> +{
> + struct realm_rec *rec = &vcpu->arch.rec;
> + unsigned long esr = kvm_vcpu_get_esr(vcpu);
> + int rt = kvm_vcpu_sys_get_rt(vcpu);
> + bool is_write = (esr & ESR_ELx_SYS64_ISS_DIR_MASK) == ESR_ELx_SYS64_ISS_DIR_WRITE;
> + int ret;
> +
> + if (is_write)
> + vcpu_set_reg(vcpu, rt, rec->run->exit.gprs[rt]);
> +
> + ret = kvm_handle_sys_reg(vcpu);
> + if (!is_write)
> + rec->run->enter.gprs[rt] = vcpu_get_reg(vcpu, rt);
> +
> + return ret;
> +}
> +
> +static exit_handler_fn rec_exit_handlers[] = {
> + [0 ... ESR_ELx_EC_MAX] = rec_exit_reason_notimpl,
> + [ESR_ELx_EC_SYS64] = rec_exit_sys_reg,
> + [ESR_ELx_EC_DABT_LOW] = rec_exit_sync_dabt,
> + [ESR_ELx_EC_IABT_LOW] = rec_exit_sync_iabt
> +};
> +
> +static int rec_exit_psci(struct kvm_vcpu *vcpu)
> +{
> + struct realm_rec *rec = &vcpu->arch.rec;
> + int i;
> +
> + for (i = 0; i < REC_RUN_GPRS; i++)
> + vcpu_set_reg(vcpu, i, rec->run->exit.gprs[i]);
> +
> + return kvm_smccc_call_handler(vcpu);
> +}
> +
> +static int rec_exit_ripas_change(struct kvm_vcpu *vcpu)
> +{
> + struct kvm *kvm = vcpu->kvm;
> + struct realm *realm = &kvm->arch.realm;
> + struct realm_rec *rec = &vcpu->arch.rec;
> + unsigned long base = rec->run->exit.ripas_base;
> + unsigned long top = rec->run->exit.ripas_top;
> + unsigned long ripas = rec->run->exit.ripas_value;
> +
> + if (!kvm_realm_is_private_address(realm, base) ||
> + !kvm_realm_is_private_address(realm, top - 1)) {
> + vcpu_err(vcpu, "Invalid RIPAS_CHANGE for %#lx - %#lx, ripas: %#lx\n",
> + base, top, ripas);
> + /* Set RMI_REJECT bit */
> + rec->run->enter.flags = REC_ENTER_FLAG_RIPAS_RESPONSE;
> + return -EINVAL;
> + }
I doubt if the flag (REC_ENTER_FLAG_RIPAS_RESPONSE) will be handed over to RMM
since the negative return value forces we're exiting to VMM like QEMU where
how this problematic case can be handled is TBD.
> +
> + /* Exit to VMM, the actual RIPAS change is done on next entry */
> + kvm_prepare_memory_fault_exit(vcpu, base, top - base, false, false,
> + ripas == RMI_RAM);
> +
> + /*
> + * KVM_EXIT_MEMORY_FAULT requires an return code of -EFAULT, see the
> + * API documentation
> + */
> + return -EFAULT;
> +}
> +
> +static void update_arch_timer_irq_lines(struct kvm_vcpu *vcpu)
> +{
> + struct realm_rec *rec = &vcpu->arch.rec;
> +
> + __vcpu_assign_sys_reg(vcpu, CNTV_CTL_EL0, rec->run->exit.cntv_ctl);
> + __vcpu_assign_sys_reg(vcpu, CNTV_CVAL_EL0, rec->run->exit.cntv_cval);
> + __vcpu_assign_sys_reg(vcpu, CNTP_CTL_EL0, rec->run->exit.cntp_ctl);
> + __vcpu_assign_sys_reg(vcpu, CNTP_CVAL_EL0, rec->run->exit.cntp_cval);
> +
> + kvm_realm_timers_update(vcpu);
> +}
> +
> +/*
> + * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
> + * proper exit to userspace.
> + */
> +int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_ret)
> +{
> + struct realm_rec *rec = &vcpu->arch.rec;
> + u8 esr_ec = ESR_ELx_EC(rec->run->exit.esr);
> + unsigned long status, index;
> +
> + status = RMI_RETURN_STATUS(rec_run_ret);
> + index = RMI_RETURN_INDEX(rec_run_ret);
> +
> + /*
> + * If a PSCI_SYSTEM_OFF request raced with a vcpu executing, we might
> + * see the following status code and index indicating an attempt to run
> + * a REC when the RD state is SYSTEM_OFF. In this case, we just need to
> + * return to user space which can deal with the system event or will try
> + * to run the KVM VCPU again, at which point we will no longer attempt
> + * to enter the Realm because we will have a sleep request pending on
> + * the VCPU as a result of KVM's PSCI handling.
> + */
> + if (status == RMI_ERROR_REALM) {
> + vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> + return 0;
> + }
> +
> + /*
> + * If a VCPU has been turned on, but the REC state hasn't been updated
> + * we may experience RMI_ERROR_REC. Exit to the userspace with -EAGAIN
> + * for a retry.
> + */
> + if (status == RMI_ERROR_REC)
> + return -EAGAIN;
> + if (rec_run_ret)
> + return -ENXIO;
> +
> + vcpu->arch.fault.esr_el2 = rec->run->exit.esr;
> + vcpu->arch.fault.far_el2 = rec->run->exit.far;
> + /* HPFAR_EL2 is only valid for RMI_EXIT_SYNC */
> + vcpu->arch.fault.hpfar_el2 = 0;
> +
> + update_arch_timer_irq_lines(vcpu);
> +
> + /* Reset the emulation flags for the next run of the REC */
> + rec->run->enter.flags = 0;
> +
> + switch (rec->run->exit.exit_reason) {
> + case RMI_EXIT_SYNC:
> + /*
> + * HPFAR_EL2_NS is hijacked to indicate a valid HPFAR value,
> + * see __get_fault_info()
> + */
> + vcpu->arch.fault.hpfar_el2 = rec->run->exit.hpfar | HPFAR_EL2_NS;
> + return rec_exit_handlers[esr_ec](vcpu);
> + case RMI_EXIT_IRQ:
> + case RMI_EXIT_FIQ:
> + case RMI_EXIT_SERROR:
> + return 1;
> + case RMI_EXIT_PSCI:
> + return rec_exit_psci(vcpu);
> + case RMI_EXIT_RIPAS_CHANGE:
> + return rec_exit_ripas_change(vcpu);
> + }
> +
> + kvm_pr_unimpl("Unsupported exit reason: %u\n",
> + rec->run->exit.exit_reason);
> + vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> + return 0;
> +}
> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> index 353a5ca45e78..d8a5fb12db2d 100644
> --- a/arch/arm64/kvm/rmi.c
> +++ b/arch/arm64/kvm/rmi.c
> @@ -173,6 +173,48 @@ static int realm_ensure_created(struct kvm *kvm)
> return -ENXIO;
> }
>
> +/*
> + * kvm_rec_pre_enter - Complete operations before entering a REC
> + *
> + * Some operations require work to be completed before entering a realm. That
> + * work may require memory allocation so cannot be done in the kvm_rec_enter()
> + * call.
> + *
> + * Return: 1 if we should enter the guest
> + * 0 if we should exit to userspace
> + * < 0 if we should exit to userspace, where the return value indicates
> + * an error
> + */
> +int kvm_rec_pre_enter(struct kvm_vcpu *vcpu)
> +{
> + struct realm_rec *rec = &vcpu->arch.rec;
> +
> + if (kvm_realm_state(vcpu->kvm) != REALM_STATE_ACTIVE)
> + return -EINVAL;
> +
> + switch (rec->run->exit.exit_reason) {
> + case RMI_EXIT_HOST_CALL:
> + for (int i = 0; i < REC_RUN_GPRS; i++)
> + rec->run->enter.gprs[i] = vcpu_get_reg(vcpu, i);
> + break;
> + }
> +
> + return 1;
> +}
> +
> +int noinstr kvm_rec_enter(struct kvm_vcpu *vcpu)
> +{
> + struct realm_rec *rec = &vcpu->arch.rec;
> + int ret;
> +
> + guest_state_enter_irqoff();
> + ret = rmi_rec_enter(virt_to_phys(rec->rec_page),
> + virt_to_phys(rec->run));
> + guest_state_exit_irqoff();
> +
> + return ret;
> +}
> +
> static int kvm_create_rec(struct kvm_vcpu *vcpu)
> {
> struct user_pt_regs *vcpu_regs = vcpu_gp_regs(vcpu);
Thanks,
Gavin
^ permalink raw reply
* Re: [PATCH 01/15] x86/virt/tdx: Read global metadata for TDX Module Extensions
From: Xu Yilun @ 2026-05-28 3:48 UTC (permalink / raw)
To: Sohil Mehta
Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
linux-kernel, kvm, yilun.xu, baolu.lu, zhenzhong.duan, xiaoyao.li
In-Reply-To: <cdfa241c-2d8e-4aab-8491-1de4a54e8a59@intel.com>
On Wed, May 27, 2026 at 10:17:36AM -0700, Sohil Mehta wrote:
> On 5/27/2026 12:11 AM, Xu Yilun wrote:
>
> >>> +struct tdx_sys_info_ext {
> >>> + u16 memory_pool_required_pages;
> >>> + u8 ext_required;
> >>
> >> The name ext_required seems like a boolean. It is also used like a
> >> boolean later.
> >> if (!tdx_sysinfo.ext.ext_required)
> >> return 0;
> >>
> >> But, IIUC, is it actually a mask that lists any feature that needs
> >
> > No it is just a bool about Extentions needs to be initialized or not.
> >
> How does the kernel know which features need Extensions? Is there any
> hardware enumeration or the kernel just keeps a static list?
There is no HW enumeration, mm... seems this is an important reason that
we don't delay the Extensions enabling, kernel doesn't have to keep in
mind which features need Extensions.
^ permalink raw reply
* Re: [PATCH v14 21/44] KVM: arm64: Support timers in realm RECs
From: Gavin Shan @ 2026-05-28 4:11 UTC (permalink / raw)
To: Steven Price, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-22-steven.price@arm.com>
Hi Steve,
On 5/13/26 11:17 PM, Steven Price wrote:
> The RMM keeps track of the timer while the realm REC is running, but on
> exit to the normal world KVM is responsible for handling the timers.
>
> A later patch adds the support for propagating the timer values from the
> exit data structure and calling kvm_realm_timers_update().
>
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes since v12:
> * Adapt to upstream changes.
> Changes since v11:
> * Drop the kvm_is_realm() check from timer_set_offset(). We already
> ensure that the offset is 0 when calling the function.
> Changes since v10:
> * KVM_CAP_COUNTER_OFFSET is now already hidden by a previous patch.
> Changes since v9:
> * No need to move the call to kvm_timer_unblocking() in
> kvm_timer_vcpu_load().
> Changes since v7:
> * Hide KVM_CAP_COUNTER_OFFSET for realm guests.
> ---
> arch/arm64/kvm/arch_timer.c | 28 +++++++++++++++++++++++++---
> include/kvm/arm_arch_timer.h | 2 ++
> 2 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index cbea4d9ee955..88ed01edc136 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -470,6 +470,21 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
> timer_ctx);
> }
>
> +void kvm_realm_timers_update(struct kvm_vcpu *vcpu)
> +{
> + struct arch_timer_cpu *arch_timer = &vcpu->arch.timer_cpu;
> + int i;
> +
> + for (i = 0; i < NR_KVM_EL0_TIMERS; i++) {
> + struct arch_timer_context *timer = &arch_timer->timers[i];
> + bool status = timer_get_ctl(timer) & ARCH_TIMER_CTRL_IT_STAT;
> + bool level = kvm_timer_irq_can_fire(timer) && status;
> +
> + if (level != timer->irq.level)
> + kvm_timer_update_irq(vcpu, level, timer);
> + }
> +}
> +
> /* Only called for a fully emulated timer */
> static void timer_emulate(struct arch_timer_context *ctx)
> {
> @@ -1079,7 +1094,7 @@ static void timer_context_init(struct kvm_vcpu *vcpu, int timerid)
>
> ctxt->timer_id = timerid;
>
> - if (!kvm_vm_is_protected(vcpu->kvm)) {
> + if (!kvm_vm_is_protected(vcpu->kvm) && !kvm_is_realm(vcpu->kvm)) {
> if (timerid == TIMER_VTIMER)
> ctxt->offset.vm_offset = &kvm->arch.timer_data.voffset;
> else
s/!kvm_is_realm(vcpu->kvm)/!vcpu_is_rec(vcpu)
> @@ -1110,7 +1125,7 @@ void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu)
> timer_context_init(vcpu, i);
>
> /* Synchronize offsets across timers of a VM if not already provided */
> - if (!vcpu_is_protected(vcpu) &&
> + if (!vcpu_is_protected(vcpu) && !kvm_is_realm(vcpu->kvm) &&
> !test_bit(KVM_ARCH_FLAG_VM_COUNTER_OFFSET, &vcpu->kvm->arch.flags)) {
> timer_set_offset(vcpu_vtimer(vcpu), kvm_phys_timer_read());
> timer_set_offset(vcpu_ptimer(vcpu), 0);
Same as above.
> @@ -1611,6 +1626,13 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu)
> return -EINVAL;
> }
>
> + /*
> + * We don't use mapped IRQs for Realms because the RMI doesn't allow
> + * us setting the LR.HW bit in the VGIC.
> + */
> + if (vcpu_is_rec(vcpu))
> + return 0;
> +
> get_timer_map(vcpu, &map);
>
> ops = vgic_is_v5(vcpu->kvm) ? &arch_timer_irq_ops_vgic_v5 :
> @@ -1740,7 +1762,7 @@ int kvm_vm_ioctl_set_counter_offset(struct kvm *kvm,
> if (offset->reserved)
> return -EINVAL;
>
> - if (kvm_vm_is_protected(kvm))
> + if (kvm_vm_is_protected(kvm) || kvm_is_realm(kvm))
> return -EINVAL;
>
> mutex_lock(&kvm->lock);
> diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
> index bf8cc9589bd0..ffdb90dcad58 100644
> --- a/include/kvm/arm_arch_timer.h
> +++ b/include/kvm/arm_arch_timer.h
> @@ -113,6 +113,8 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
> int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
> int kvm_arm_timer_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>
> +void kvm_realm_timers_update(struct kvm_vcpu *vcpu);
> +
> u64 kvm_phys_timer_read(void);
>
> void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu);
Thanks,
Gavin
^ permalink raw reply
* Re: [PATCH v14 20/44] arm64: RMI: Support for the VGIC in realms
From: Gavin Shan @ 2026-05-28 4:07 UTC (permalink / raw)
To: Steven Price, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-21-steven.price@arm.com>
Hi Steve,
On 5/13/26 11:17 PM, Steven Price wrote:
> The RMM provides emulation of a VGIC to the realm guest. With RMM v2.0
> the registers are passed in the system registers so this works similar
> to a normal guest, but kvm_arch_vcpu_put() need reordering to early out,
> and realm guests don't support GICv2 even if the host does.
>
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes from v12:
> * GIC registers are now passed in the system registers rather than via
> rec_entry/rec_exit which removes most of the changes.
> Changes from v11:
> * Minor changes to align with the previous patches. Note that the VGIC
> handling will change with RMM v2.0.
> Changes from v10:
> * Make sure we sync the VGIC v4 state, and only populate valid lrs from
> the list.
> Changes from v9:
> * Copy gicv3_vmcr from the RMM at the same time as gicv3_hcr rather
> than having to handle that as a special case.
> Changes from v8:
> * Propagate gicv3_hcr to from the RMM.
> Changes from v5:
> * Handle RMM providing fewer GIC LRs than the hardware supports.
> ---
> arch/arm64/kvm/arm.c | 11 ++++++++---
> arch/arm64/kvm/vgic/vgic-init.c | 2 +-
> 2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 93d34762db91..21d9dfdb1ea0 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -786,19 +786,24 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> kvm_call_hyp_nvhe(__pkvm_vcpu_put);
> }
>
> + kvm_timer_vcpu_put(vcpu);
> + kvm_vgic_put(vcpu);
> +
> + vcpu->cpu = -1;
> +
> + if (vcpu_is_rec(vcpu))
> + return;
> +
For a REC, kvm_vcpu_{load, put}_debug() becomes unbalanced in kvm_arch_vcpu_{load, put}().
kvm_vcpu_load_debug() is called in kvm_arch_vcpu_load(), but kvm_vcpu_put_debug() won't
be called in kvm_arch_vcpu_put() after this whole series is applied.
> kvm_vcpu_put_debug(vcpu);
> kvm_arch_vcpu_put_fp(vcpu);
> if (has_vhe())
> kvm_vcpu_put_vhe(vcpu);
> - kvm_timer_vcpu_put(vcpu);
> - kvm_vgic_put(vcpu);
> kvm_vcpu_pmu_restore_host(vcpu);
> if (vcpu_has_nv(vcpu))
> kvm_vcpu_put_hw_mmu(vcpu);
> kvm_arm_vmid_clear_active();
>
> vcpu_clear_on_unsupported_cpu(vcpu);
> - vcpu->cpu = -1;
> }
>
> static void __kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu)
> diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
> index 933983bb2005..a9db963dfd23 100644
> --- a/arch/arm64/kvm/vgic/vgic-init.c
> +++ b/arch/arm64/kvm/vgic/vgic-init.c
> @@ -81,7 +81,7 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
> * the proper checks already.
> */
> if (type == KVM_DEV_TYPE_ARM_VGIC_V2 &&
> - !kvm_vgic_global_state.can_emulate_gicv2)
> + (!kvm_vgic_global_state.can_emulate_gicv2 || kvm_is_realm(kvm)))
> return -ENODEV;
>
> /*
Thanks,
Gavin
^ permalink raw reply
* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Borislav Petkov @ 2026-05-28 0:43 UTC (permalink / raw)
To: Dave Hansen
Cc: Ashish Kalra, tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <eea0497f-6930-43e3-947d-dae139e657ad@intel.com>
On Wed, May 27, 2026 at 02:38:05PM -0700, Dave Hansen wrote:
> This one is my doing.
I know.
But hey, maybe we should not disagree on the public ML because the submitter
might disappear like the last one. :-P
> wrmsr_on_cpus() is kinda a mess. I think it only has a single user. It's
> also not very flexible because it needs a 'struct msr __percpu *msrs'
> argument where each MSR has a value in memory.
Right, we did that a looong time ago.
The only reason I'd have for per-CPU MSR structs is reading different MSR
values on different cores, modifying only the bits you need and then *keeping*
the remaining values as they were. And that interface allows you to do that
while this new thing won't.
And I'm going to venture a guess here that adding a simpler interface which
simply forces a new value ontop of a whole MSR could cause a lot of subtle
bugs when people don't pay attention to keep the old values.
> The use case for RMPOPT is that all CPUs get the same value. It'd be a
> little awkward to go create a percpu data structure to duplcate the same
> value to call wrmsr_on_cpus(). The RMPOPT case is also arguably
> performance sensitive since it's done during boot. It should do the IPIs
> in parallel.
Oh sure, my meaning was to create something that serves both purposes.
> toggle_ecc_err_reporting(), on the other hand, is done at module init
> time. It's not really performance sensitive. It's probably pretty easy
> to zap wrmsr_on_cpus() and just have toggle_ecc_err_reporting() do
> something slightly less efficient.
Sure. That's fine.
> Yeah, the
>
> wrmsr_on_cpus()
> wrmsrq_on_cpus()
>
> naming pain is real. There's little chance of bugs coming from it
> because the function signatures are *SO* different. But, it certainly
> could confuse humans for a minute.
Yap.
> But the real solution to this is axing wrmsr_on_cpus().
Yap, for example. Basically reingeneering the whole
write-MSRs-on-multiple-CPUs functionality is what I meant.
> Which I think we could do after killing its one user which the attached
> (completely untested) patch does. The only downside of the patch is that it
> does RDMSR via IPIs one CPU at a time. But, looking at the code, I'm not
> sure anyone would care. If anyone did, I _think_ all those MSRs have the
> same value and the code could be simplified further. But that would take
> more than 3 minutes.
>
> It's also possible that my grepping was bad or I'm completely
> misunderstanding amd64_edac.c. Cluebat welcome if I'm being dense.
Looks ok to me, we can surely do that. I even hw to test it. I think...
> BTW, I also don't feel the need to make Ashish go do any of this edac
> cleanup. I think it can just be done in parallel. But I wouldn't stop
> him if he volunteered.
Why not?
It has always been the case: cleanups and bug fixes first, new features ontop.
So yeah, modulo figuring out how to redefine the *msr_on_cpus() interface,
I think this all makes sense.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply
* Re: [PATCH v3 09/41] clocksource: hyper-v: Don't save/restore TSC offset when using HV sched_clock
From: Wei Liu @ 2026-05-27 22:50 UTC (permalink / raw)
To: Sean Christopherson
Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-10-seanjc@google.com>
On Fri, May 15, 2026 at 12:19:10PM -0700, Sean Christopherson wrote:
> Now that Hyper-V overrides the sched_clock save/restore hooks if and only
> sched_clock itself is set to the Hyper-V reference counter, drop the
> invocation of the "old" save/restore callbacks. When the registration of
> the PV sched_clock was done separately from overriding the save/restore
> hooks, it was possible for Hyper-V to clobber the TSC save/restore
> callbacks without actually switching to the Hyper-V refcounter.
>
> Enabling a PV sched_clock is a one-way street, i.e. the kernel will never
> revert to using TSC for sched_clock, and so there is no need to invoke the
> TSC save/restore hooks (and if there was, it belongs in common PV code).
>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
^ permalink raw reply
* Re: [PATCH v3 08/41] clocksource: hyper-v: Drop wrappers to sched_clock save/restore helpers
From: Wei Liu @ 2026-05-27 22:50 UTC (permalink / raw)
To: Sean Christopherson
Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-9-seanjc@google.com>
On Fri, May 15, 2026 at 12:19:09PM -0700, Sean Christopherson wrote:
> Now that all of the Hyper-V reference counter sched_clock code is located
> in a single file, drop the superfluous wrappers for the save/restore flows.
>
> No functional change intended.
>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
^ permalink raw reply
* Re: [PATCH v3 07/41] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary
From: Wei Liu @ 2026-05-27 22:49 UTC (permalink / raw)
To: Sean Christopherson
Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-8-seanjc@google.com>
On Fri, May 15, 2026 at 12:19:08PM -0700, Sean Christopherson wrote:
> Register the Hyper-V reference counter (refcounter) callbacks for saving
> and restoring its PV sched_clock, if and only if the refcounter is
> actually being used for sched_clock. Currently, Hyper-V overrides the
> save/restore hooks if the reference TSC available, whereas the Hyper-V
> refcounter code only overrides sched_clock if the reference TSC is
> available *and* it's not invariant. The flaw is effectively papered over
> by invoking the "old" save/restore callbacks as part of save/restore, but
> that's unnecessary and fragile.
>
> To avoid introducing more complexity, and to allow for additional cleanups
> of the PV sched_clock code, move the save/restore hooks and logic into
> hyperv_timer.c and simply wire up the hooks when overriding sched_clock
> itself.
>
> Note, while the Hyper-V refcounter code is intended to be architecture
> neutral, CONFIG_PARAVIRT is firmly x86-only, i.e. adding a small amount of
> x86 specific code (which will be reduced in future cleanups) doesn't
> meaningfully pollute generic code.
>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
^ permalink raw reply
* Re: [PATCH v5 5/5] iommufd/vdevice: add TSM request ioctl
From: Dan Williams (nvidia) @ 2026-05-27 22:49 UTC (permalink / raw)
To: Aneesh Kumar K.V, Dan Williams (nvidia), Alexey Kardashevskiy,
linux-coco, iommu, linux-kernel, kvm
Cc: Bjorn Helgaas, Dan Williams, Jason Gunthorpe, Joerg Roedel,
Jonathan Cameron, Kevin Tian, Nicolin Chen, Samuel Ortiz,
Steven Price, Suzuki K Poulose, Will Deacon, Xu Yilun,
Shameer Kolothum, Paolo Bonzini, Tony Krowiak, Halil Pasic,
Jason Herne, Harald Freudenberger, Holger Dengler, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Alex Williamson, Matthew Rosato, Farhan Ali,
Eric Farman, linux-s390
In-Reply-To: <yq5aldd4spyc.fsf@kernel.org>
Aneesh Kumar K.V wrote:
> >> I am leaning towards the latter at this point.
> >
> > But we already have struct pci_tsm_ops::guest_req, which is specific to
> > the underlying CC architecture. From the above, pci_tsm_req_scope also
> > appears to carry the same information. Is that useful?
> >
>
> I think there is value in having the VMM express the guest’s
> confidential computing architecture, so that the TSM backend can
> validate whether it should handle that guest request ?.
Yes, that is the idea.
> So it would not be the IOMMU validating the scope value, but rather
> pci_tsm_ops::guest_req.
>
> static ssize_t cca_tsm_guest_req(struct pci_tdi *tdi, enum pci_tsm_req_scope scope,
> sockptr_t req, size_t req_len, sockptr_t resp,
> size_t resp_len, u64 *tsm_code)
> {
> struct pci_dev *pdev = tdi->pdev;
>
> /* reject the guest request if VMM was using the link tsm wrongly. The guest
> * was using a wrong CC archiecture with this link tsm
> */
> if (scope != TSM_REQ_TYPE_CCA)
> return -EINVAL;
Right, iommufd is tunneling TSM requests. The tunnel should have an
envelope of TSM_REQ_TYPE_* and an @op field. The TSM driver gets those
from iommufd, validates the envelope and then processes @req.
This self-consistency and explicitness also buys some future-proofing.
It allows for alternate command sets within an arch, cross TSM
implementation shared commands, IOMMUFD-to-TSM requests outside of guest
requests.
> Jason Gunthorpe <jgg@ziepe.ca> writes:
>
> > On Tue, May 26, 2026 at 11:17:50PM -0700, Dan Williams (nvidia) wrote:
> >
> >> In that case pci_tsm_req_scope becomes tsm_req_type and is just:
> >>
> >> TSM_REQ_TYPE_CCA
> >> TSM_REQ_TYPE_SEV
> >> TSM_REQ_TYPE_TDX
> >>
> >> I am leaning towards the latter at this point.
> >
> > Yeah, this sounds good. I would also include an common op field that
> > can be decoded by the TSM driver based on the TYPE above, and the
> > usual in/out message buffers.
>
> We already have iommufd_vdevice_tsm_op_ioctl() to handle common
> operations.
Per above, I believe this is about an @op value in a common location
that iommufd can forward to the backend for validation of guest
requests.
> Right now, it handles IOMMU_VDEVICE_TSM_BIND and
> IOMMU_VDEVICE_TSM_UNBIND. I guess we should move TSM_REQ_SET_TDI_STATE
> operations to that as well?
I think we can wait to move it to its own IOMMU operation unless/until
there is a need to set RUN outside of an explicit guest request, right?
^ permalink raw reply
* Re: [PATCH v2 3/5] KVM: guest_memfd: Handle errors from xa_store_range() when binding
From: Sean Christopherson @ 2026-05-27 22:43 UTC (permalink / raw)
To: Ackerley Tng
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <CAEvNRgHnxEO+jXqEKuewiDgFBDQDSr68A8MnJJSyf9K5AgbwWg@mail.gmail.com>
On Wed, May 27, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > On Fri, May 22, 2026, Ackerley Tng wrote:
> > If @entry is non-NULL, xa_store_range() pre-creates the entire range, before
> > storing anything into the range:
> >
> > if (entry) {
> > unsigned int order = BITS_PER_LONG;
> > if (last + 1)
> > order = __ffs(last + 1);
> > xas_set_order(&xas, last, order);
> > xas_create(&xas, true);
> > if (xas_error(&xas))
> > goto unlock;
> > }
> >
>
> xa_store_range() doesn't actually always iterate: if last + 1 is some
> clean power of 2, it'll create a higher order xarray node.
>
> Otherwise, it falls back to creating and storing 1 index/node at a time:
Ugh, _that's_ what the code is doing? Argh, I missed that "first" is incremented
by whatever the batch size happened to be.
first += xas_size(&xas); <====
} while (first <= last);
> if the above did manage to create an xarray node, xas_error() returns
> false, it goes on to the store below.
>
> > Yes, the API handles failure on the subsequent xas_store(), but I can't imagine
> > that failure is actually, barring garbage input from KVM:
> >
> > do {
> > xas_set_range(&xas, first, last);
> > xas_store(&xas, entry);
> > if (xas_error(&xas))
> > goto unlock;
> > first += xas_size(&xas);
> > } while (first <= last);
> >
>
> So if a later xas_create() fails because it runs out of memory, the
> earlier stores would have already been committed.
>
> This ignores -EEXIST being returned since earlier in kvm_gmem_bind()
> conflicts were already checked.
>
> > Purely from a design perspective, providing an API that can fail partway through
> > under normal operation, with no indication of where failure occured (AFAICT),
> > would be awful.
> >
>
> Do you mean the API of xas_store_range()?
No, I mean xa_store_range(). AFAICT, on failure, it doesn't actually communicate
"where" failure occurred. That's quite nasty.
> xas is updated by xas_set_range() so that should track the last store. Since
> the cleanup is storing NULLs and won't allocate, I thought it would be fine
> to just store NULL on the entire range on error.
Yeah, it's totally fine, and AFAICT the only remotely sane approach.
^ permalink raw reply
* [PATCH] MAINTAINERS: Move Rick Edgecombe to TDX maintainer
From: Rick Edgecombe @ 2026-05-27 22:13 UTC (permalink / raw)
To: kas, dave.hansen, x86, linux-coco, kvm, pbonzini, seanjc; +Cc: Rick Edgecombe
Per some offline discussion with Kiryl, he could use some help on the TDX
host side. I have worked on the TDX host side for the past few years
including wrangling the initial KVM support, and can help with this.
I am already listed as TDX reviewer. Move it to maintainer.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kiryl Shutsemau <kas@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
---
MAINTAINERS | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index 882214b0e7db5..a838ff047d891 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -28943,8 +28943,8 @@ F: arch/x86/kernel/unwind_*.c
X86 TRUST DOMAIN EXTENSIONS (TDX)
M: Kiryl Shutsemau <kas@kernel.org>
+M: Rick Edgecombe <rick.p.edgecombe@intel.com>
R: Dave Hansen <dave.hansen@linux.intel.com>
-R: Rick Edgecombe <rick.p.edgecombe@intel.com>
L: x86@kernel.org
L: linux-coco@lists.linux.dev
L: kvm@vger.kernel.org
--
2.54.0
^ permalink raw reply related
* Re: [PATCH v2 2/5] KVM: guest_memfd: Fix possible signed integer overflow
From: Sean Christopherson @ 2026-05-27 22:08 UTC (permalink / raw)
To: Ackerley Tng
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <CAEvNRgEW8nph=Hz-3pcQD7ZxRuAOcjcOfj8afgJMsRgESD-Wbw@mail.gmail.com>
On Wed, May 27, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > I like pgoff_t more than size_t because, for KVM, it's really all about addressing
> > memory, thanks to the offset into guest_memfd being associated 1:1 with a GPA.
>
> The offset into guest_memfd is associated 1:1 with a GPA, and this
> offset is
>
> offset = index << PAGE_SHIFT
>
> > It's not perfect, because GPAs are tracked as 64-bit values, whereas the kernel
> > restricts itself to "unsigned long". But that's a non-issue in practice since
> > guest_memfd is 64-bit only.
> >
> > But conceptually, I like tracking the gmem offset as a pgoff_t to tie it back
> > to using GPAs to offset/index into gmem. And for all intents and purposes, gmem
> > is nothing more than a glorified pagecache :-)
>
> So we actually want to use u64s for gmem offsets (where offset = index
> << PAGE_SHIFT),
Hrm, poking around, I guess what we really should use for the byte offset is
uoff_t. My only hesitation to using uoff_t was that it's hardly used anywhere,
but it does seem to fix exactly what we're trying to do.
I don't want to use a raw u64, because I dislike using u{8,16,32,64} (in KVM)
unless something absolutely _must_ be that size (and ideally _exactly_ that size).
Limiting use of raw uNN helps identify fields/variables that correspond to some
hardware asset, versus fields/variables that just need to be "big enough". It's
not a 100% comprehensive rule of anything, e.g. there are still many "naturally
sized" hardware assets that need to be tracked with "unsigned long", but I still
find it helpful/valuable to highlight hardware-derived fields/variables.
> and pgoff_t for indices, since indices (aka page
> offsets) are semantically the offset, counted in units of pages?
Yeah, I agree the distinction will help us differentiate between byte offsets
and pfn offsets, especially with another compile-time assert to show the
relationship:
BUILD_BUG_ON(sizeof(gpa_t) != sizeof(offset));
BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
> I pulled this conclusion together from filemap-related code like
> filemap_add_folio() takes a pgoff_t index, so I thought gmem should
> follow that and stick with pgoff_t for index/indices.
^ permalink raw reply
* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Dave Hansen @ 2026-05-27 21:38 UTC (permalink / raw)
To: Borislav Petkov, Ashish Kalra
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <20260527210603.GCahdcu8zvVjfKfGEL@fat_crate.local>
[-- Attachment #1: Type: text/plain, Size: 2757 bytes --]
On 5/27/26 14:06, Borislav Petkov wrote:
> On Mon, May 18, 2026 at 09:42:15PM +0000, Ashish Kalra wrote:
>> From: Ashish Kalra <ashish.kalra@amd.com>
>>
>> The existing wrmsr_on_cpus() takes a per-cpu struct msr array, requiring
>> callers to allocate and populate per-cpu storage even when every CPU
>> receives the same value. This is unnecessary overhead for the common
>> case of writing a single uniform u64 to a per-CPU MSR across multiple
>> CPUs.
>>
>> Add wrmsrq_on_cpus() which writes the same u64 value to the specified
>> MSR on all CPUs in the given cpumask.
>
> So let's add yet another function which name differs from the other one by
> a single letter and have people go look at the implementation to know which is
> which...?
>
> Instead of unifying what's there and extending this one to do what you want it
> to do?
>
> And now you have a wrmsrQ_on_cpus() but no rdmsrQ_on_cpus()?
>
> Because if you look at the code, you'll see how those are used: first you
> rdmsr on CPUs, modify each value and then wrmsr on same CPUs.
This one is my doing.
wrmsr_on_cpus() is kinda a mess. I think it only has a single user. It's
also not very flexible because it needs a 'struct msr __percpu *msrs'
argument where each MSR has a value in memory.
The use case for RMPOPT is that all CPUs get the same value. It'd be a
little awkward to go create a percpu data structure to duplcate the same
value to call wrmsr_on_cpus(). The RMPOPT case is also arguably
performance sensitive since it's done during boot. It should do the IPIs
in parallel.
toggle_ecc_err_reporting(), on the other hand, is done at module init
time. It's not really performance sensitive. It's probably pretty easy
to zap wrmsr_on_cpus() and just have toggle_ecc_err_reporting() do
something slightly less efficient.
Yeah, the
wrmsr_on_cpus()
wrmsrq_on_cpus()
naming pain is real. There's little chance of bugs coming from it
because the function signatures are *SO* different. But, it certainly
could confuse humans for a minute.
But the real solution to this is axing wrmsr_on_cpus(). Which I think we
could do after killing its one user which the attached (completely
untested) patch does. The only downside of the patch is that it does
RDMSR via IPIs one CPU at a time. But, looking at the code, I'm not sure
anyone would care. If anyone did, I _think_ all those MSRs have the same
value and the code could be simplified further. But that would take more
than 3 minutes.
It's also possible that my grepping was bad or I'm completely
misunderstanding amd64_edac.c. Cluebat welcome if I'm being dense.
BTW, I also don't feel the need to make Ashish go do any of this edac
cleanup. I think it can just be done in parallel. But I wouldn't stop
him if he volunteered.
[-- Attachment #2: axe-wrmsr_on_cpus.patch --]
[-- Type: text/x-patch, Size: 2609 bytes --]
---
b/drivers/edac/amd64_edac.c | 35 +++++++++++------------------------
1 file changed, 11 insertions(+), 24 deletions(-)
diff -puN drivers/edac/amd64_edac.c~axe-wrmsr_on_cpus drivers/edac/amd64_edac.c
--- a/drivers/edac/amd64_edac.c~axe-wrmsr_on_cpus 2026-05-27 14:25:07.881694152 -0700
+++ b/drivers/edac/amd64_edac.c 2026-05-27 14:33:03.278139821 -0700
@@ -14,8 +14,6 @@ static struct edac_pci_ctl_info *pci_ctl
static int ecc_enable_override;
module_param(ecc_enable_override, int, 0644);
-static struct msr __percpu *msrs;
-
static inline u32 get_umc_reg(struct amd64_pvt *pvt, u32 reg)
{
if (!pvt->flags.zn_regs_v2)
@@ -3215,14 +3213,14 @@ static bool nb_mce_bank_enabled_on_node(
get_cpus_on_this_dct_cpumask(mask, nid);
- rdmsr_on_cpus(mask, MSR_IA32_MCG_CTL, msrs);
-
for_each_cpu(cpu, mask) {
- struct msr *reg = per_cpu_ptr(msrs, cpu);
- nbe = reg->l & MSR_MCGCTL_NBE;
+ u64 mcg_ctl;
+
+ rdmsrq_on_cpu(cpu, MSR_IA32_MCG_CTL, &mcg_ctl);
+ nbe = mcg_ctl & MSR_MCGCTL_NBE;
edac_dbg(0, "core: %u, MCG_CTL: 0x%llx, NB MSR is %s\n",
- cpu, reg->q, str_enabled_disabled(nbe));
+ cpu, mcg_ctl, str_enabled_disabled(nbe));
if (!nbe)
goto out;
@@ -3246,26 +3244,25 @@ static int toggle_ecc_err_reporting(stru
get_cpus_on_this_dct_cpumask(cmask, nid);
- rdmsr_on_cpus(cmask, MSR_IA32_MCG_CTL, msrs);
-
for_each_cpu(cpu, cmask) {
+ u64 mcg_ctl;
- struct msr *reg = per_cpu_ptr(msrs, cpu);
+ rdmsrq_on_cpu(cpu, MSR_IA32_MCG_CTL, &mcg_ctl);
if (on) {
- if (reg->l & MSR_MCGCTL_NBE)
+ if (mcg_ctl & MSR_MCGCTL_NBE)
s->flags.nb_mce_enable = 1;
- reg->l |= MSR_MCGCTL_NBE;
+ mcg_ctl |= MSR_MCGCTL_NBE;
} else {
/*
* Turn off NB MCE reporting only when it was off before
*/
if (!s->flags.nb_mce_enable)
- reg->l &= ~MSR_MCGCTL_NBE;
+ mcg_ctl &= ~MSR_MCGCTL_NBE;
}
+ wrmsrq_on_cpu(cpu, MSR_IA32_MCG_CTL, mcg_ctl);
}
- wrmsr_on_cpus(cmask, MSR_IA32_MCG_CTL, msrs);
free_cpumask_var(cmask);
@@ -4153,10 +4150,6 @@ static int __init amd64_edac_init(void)
if (!ecc_stngs)
goto err_free;
- msrs = msrs_alloc();
- if (!msrs)
- goto err_free;
-
for (i = 0; i < amd_nb_num(); i++) {
err = probe_one_instance(i);
if (err) {
@@ -4190,9 +4183,6 @@ static int __init amd64_edac_init(void)
err_pci:
pci_ctl_dev = NULL;
- msrs_free(msrs);
- msrs = NULL;
-
err_free:
kfree(ecc_stngs);
ecc_stngs = NULL;
@@ -4220,9 +4210,6 @@ static void __exit amd64_edac_exit(void)
ecc_stngs = NULL;
pci_ctl_dev = NULL;
-
- msrs_free(msrs);
- msrs = NULL;
}
module_init(amd64_edac_init);
_
^ permalink raw reply
* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Borislav Petkov @ 2026-05-27 21:06 UTC (permalink / raw)
To: Ashish Kalra
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <c9fe5c2fef063f5006cc9bfa03eec824ac015db7.1779133590.git.ashish.kalra@amd.com>
On Mon, May 18, 2026 at 09:42:15PM +0000, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@amd.com>
>
> The existing wrmsr_on_cpus() takes a per-cpu struct msr array, requiring
> callers to allocate and populate per-cpu storage even when every CPU
> receives the same value. This is unnecessary overhead for the common
> case of writing a single uniform u64 to a per-CPU MSR across multiple
> CPUs.
>
> Add wrmsrq_on_cpus() which writes the same u64 value to the specified
> MSR on all CPUs in the given cpumask.
So let's add yet another function which name differs from the other one by
a single letter and have people go look at the implementation to know which is
which...?
Instead of unifying what's there and extending this one to do what you want it
to do?
And now you have a wrmsrQ_on_cpus() but no rdmsrQ_on_cpus()?
Because if you look at the code, you'll see how those are used: first you
rdmsr on CPUs, modify each value and then wrmsr on same CPUs.
So no, try again pls.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply
* Re: [PATCH v5 1/7] x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag
From: Borislav Petkov @ 2026-05-27 20:17 UTC (permalink / raw)
To: Ashish Kalra
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <305b625c0528f16a95542001c66e643fbd3a2622.1779133590.git.ashish.kalra@amd.com>
On Mon, May 18, 2026 at 09:41:53PM +0000, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@amd.com>
>
> Add a flag indicating whether RMPOPT instruction is supported.
>
> RMPOPT is a new instruction designed to minimize the performance
> overhead of RMP checks on the hypervisor and on non-SNP guests by
> allowing RMP checks to be skipped when 1G regions of memory are known
> not to contain any SEV-SNP guest memory.
Streamline:
"RMPOPT is a new instruction that reduces the performance overhead of RMP
checks for the hypervisor and non‑SNP guests by allowing those checks to be
skipped when 1‑GB memory regions are known to contain no SEV‑SNP guest
memory."
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply
* Re: [PATCH v2 2/5] KVM: guest_memfd: Fix possible signed integer overflow
From: Ackerley Tng @ 2026-05-27 20:17 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <ahdFSxXnWhhsjhKe@google.com>
Sean Christopherson <seanjc@google.com> writes:
>
> [...snip...]
>
>> >> [Use size_t for size instead of u64]
>> >
>> > Why? Oh, right, because kvm_memory_slot.npages is an "unsigned long". The
>> > discrepancy between a u64 for the offset and a size_t for the size is confusing,
>> > as they are both conceptually in the same "domain".
>> >
>> > Rather than u64 and size_t, we should use pgoff_t, which is what KVM already uses
>> > as the storage for kvm_memory_slot.gmem.pgoff.
>> >
>>
>> I picked size_t more because I thought it was semantically correct to
>> use the size type for a size. size_t may have different sizes (64 vs
>> 32), but in the comparison offset + size > i_size_read(inode), size is
>> promoted to 64 bits, and signed inode size is cast to unsigned for
>> comparison, so I think that works.
>>
>> pgoff_t is also unsigned, but I think that should be reserved for page
>> offsets/indices?
>
> Just to avoid confusion over the definition of an offset/idnex:
>
> * The type of an index into the pagecache.
>
> I.e. it's not the 12-bit offset into a 4KiB page. Which I'm pretty sure you were
> saying as well, just want to ensure we're on the same page.
>
Wait yes I meant index as in the key if you think of the xarray of the
page cache as a key-value store.
> I like pgoff_t more than size_t because, for KVM, it's really all about addressing
> memory, thanks to the offset into guest_memfd being associated 1:1 with a GPA.
The offset into guest_memfd is associated 1:1 with a GPA, and this
offset is
offset = index << PAGE_SHIFT
> It's not perfect, because GPAs are tracked as 64-bit values, whereas the kernel
> restricts itself to "unsigned long". But that's a non-issue in practice since
> guest_memfd is 64-bit only.
>
> But conceptually, I like tracking the gmem offset as a pgoff_t to tie it back
> to using GPAs to offset/index into gmem. And for all intents and purposes, gmem
> is nothing more than a glorified pagecache :-)
So we actually want to use u64s for gmem offsets (where offset = index
<< PAGE_SHIFT), and pgoff_t for indices, since indices (aka page
offsets) are semantically the offset, counted in units of pages?
I pulled this conclusion together from filemap-related code like
filemap_add_folio() takes a pgoff_t index, so I thought gmem should
follow that and stick with pgoff_t for index/indices.
^ permalink raw reply
* Re: [PATCH v2 2/5] KVM: guest_memfd: Fix possible signed integer overflow
From: Sean Christopherson @ 2026-05-27 19:26 UTC (permalink / raw)
To: Ackerley Tng
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <CAEvNRgEN6syKVLjvFX-s=xU=r3CBZ3KtmeKvVYOC09uvFGXSFg@mail.gmail.com>
On Wed, May 27, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
> > For shortlogs (and changeloges), when possible, describe the _change_ itself, not
> > its impact is. Sometimes "Fix xyz" is the best shortlog, e.g. when fixing build
> > failures, but here, I would go with:
> >
> > KVM: guest_memfd: Treat memslot binding offset+size as unsigned values
> >
> > for two reasons. First, it provides a lot more context for future readers, versus
> > "Fix possible signed integer overflow" which doesn't even capture what flow is
> > affected, how the overflow is being fixed, etc. Second, if the fix is wrong,
> > incomplete, etc., we don't end up with a follow-up patch that start with "Really
> > fix ...".
> >
>
> Thanks for explaining!
>
> > Oh, actually, three reasons. This doesn't only affect the overflow check. The
> > check on a negative offset is flawed, as it means KVM would incorrectly reject
> > bindings with (comically) large offsets.
> >
>
> Makes sense.
>
> > LOL, four. There is no bug. The size of the memslot is ((1UL << 31) - 1)
> > pages, i.e. 0x7FF_FFFFF000:
> >
> > if (id < KVM_USER_MEM_SLOTS &&
> > (mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
> > return -EINVAL;
> >
> > and so "loff_t size" can never be negative.
>
> I think the bug was that the sum of offset + size in kvm_gmem_bind()
> when interpreted as signed integers could be smaller than
> i_size_read(inode) and allow binding.
>
> So IIUC even if size is small (and not negative), nothing catches a
> large enough offset where offset + size (interpreted as unsigned
> integers) doesn't overflow, but offset + size (interpreted as signed
> integers) overflows.
Oooh, duh, if @offset is positive, but @offset+size is negative. Yes, that's a
real bug, confirmed via selftest. I'll send a fix along with a selftest testcase.
Thanks much!
> >> Fixes: a7800aa80ea4d ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
> >> Signed-off-by: Sean Christopherson <seanjc@google.com>
> >> [Use size_t for size instead of u64]
> >
> > Why? Oh, right, because kvm_memory_slot.npages is an "unsigned long". The
> > discrepancy between a u64 for the offset and a size_t for the size is confusing,
> > as they are both conceptually in the same "domain".
> >
> > Rather than u64 and size_t, we should use pgoff_t, which is what KVM already uses
> > as the storage for kvm_memory_slot.gmem.pgoff.
> >
>
> I picked size_t more because I thought it was semantically correct to
> use the size type for a size. size_t may have different sizes (64 vs
> 32), but in the comparison offset + size > i_size_read(inode), size is
> promoted to 64 bits, and signed inode size is cast to unsigned for
> comparison, so I think that works.
>
> pgoff_t is also unsigned, but I think that should be reserved for page
> offsets/indices?
Just to avoid confusion over the definition of an offset/idnex:
* The type of an index into the pagecache.
I.e. it's not the 12-bit offset into a 4KiB page. Which I'm pretty sure you were
saying as well, just want to ensure we're on the same page.
I like pgoff_t more than size_t because, for KVM, it's really all about addressing
memory, thanks to the offset into guest_memfd being associated 1:1 with a GPA.
It's not perfect, because GPAs are tracked as 64-bit values, whereas the kernel
restricts itself to "unsigned long". But that's a non-issue in practice since
guest_memfd is 64-bit only.
But conceptually, I like tracking the gmem offset as a pgoff_t to tie it back
to using GPAs to offset/index into gmem. And for all intents and purposes, gmem
is nothing more than a glorified pagecache :-)
^ permalink raw reply
* Re: [PATCH v2 5/5] KVM: SNP: Mark source page dirty in sev_gmem_post_populate
From: Ackerley Tng @ 2026-05-27 19:14 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <ahXOiy0dBSBOQli2@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Fri, May 22, 2026, Ackerley Tng wrote:
>> Mark the folio as dirty after copying data into the source page in
>> sev_gmem_post_populate. After the memcpy, failing to mark the page dirty
>> can lead to the memory management subsystem discarding the changes if the
>> page is reclaimed or otherwise processed by the swap subsystem.
>>
>> Fixes: 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>> arch/x86/kvm/svm/sev.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index dbf75326a40f4..1a361f08c7a3d 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2395,6 +2395,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> void *dst_vaddr = kmap_local_pfn(pfn);
>>
>> memcpy(src_vaddr, dst_vaddr, PAGE_SIZE);
>> + folio_mark_dirty(page_folio(src_page));
>
> I'd rather use set_page_dirty(). I'll fixup when applying, unless someon objects.
>
I was looking for page, dirty but set_page_dirty() somehow escaped my
search. This works, thanks!
>> kunmap_local(dst_vaddr);
>> kunmap_local(src_vaddr);
>>
>> --
>> 2.54.0.794.g4f17f83d09-goog
>>
^ permalink raw reply
* Re: [PATCH v2 3/5] KVM: guest_memfd: Handle errors from xa_store_range() when binding
From: Ackerley Tng @ 2026-05-27 19:11 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <ahXMzPz1cQl6kZge@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Fri, May 22, 2026, Ackerley Tng wrote:
>> Unhandled errors from xa_store_range() means kvm_gmem_bind() might falsely
>> reporting success, leading to false assumptions in guest_memfd's lifecycle
>> later.
>>
>> On error, restore the unbound state and return the error to userspace.
>>
>> Fixes: a7800aa80ea4d ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>> virt/kvm/guest_memfd.c | 11 +++++++++--
>> 1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index d203135969d13..5b4911ffa208a 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -648,6 +648,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>> struct inode *inode;
>> struct file *file;
>> int r = -EINVAL;
>> + void *result;
>
> I would rather go with "xr". "result" is too generic, e.g. begs the question of
> "result of what?"
>
Good to go with xr too.
> Actually, I don't think we even need an intermediate variable.
>
>> BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
>>
>> @@ -688,7 +689,14 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>> if (kvm_gmem_supports_mmap(inode))
>> slot->flags |= KVM_MEMSLOT_GMEM_ONLY;
>>
>> - xa_store_range(&f->bindings, start, end - 1, slot, GFP_KERNEL);
>> + result = xa_store_range(&f->bindings, start, end - 1, slot, GFP_KERNEL);
>> + if (xa_is_err(result)) {
>> + r = xa_err(result);
>> + xa_store_range(&f->bindings, start, end - 1, NULL, GFP_KERNEL);
>
> I'm not convinced this is necessary. Sashiko "asked" the question:
>
> : If xa_store_range() fails midway through storing a large range (for example,
> : returning -ENOMEM), does it leave the already-processed entries in the
> : f->bindings XArray?
> :
> : When this error is propagated back, the caller __kvm_set_memory_region()
> : will abort the operation and free the memslot without calling
> : kvm_gmem_unbind().
> :
> : Since the partial XArray updates aren't rolled back here, could this leave
> : dangling pointers to the freed memslot in f->bindings? If so, when the file
> : is eventually closed, kvm_gmem_release() might iterate over these dangling
> : pointers and write to slot->gmem.file, resulting in a use-after-free.
>
> but I think Sashiko is hallicunating.
>
When I updated this I kind of just assumed xa_store_range() always
iterates indices (so for a range [0, 10], it would store 11 times), and
an earlier index could be set, and a later store could result in
-ENOMEM.
Since you called this out, I dug into it more.
> If @entry is non-NULL, xa_store_range() pre-creates the entire range, before
> storing anything into the range:
>
> if (entry) {
> unsigned int order = BITS_PER_LONG;
> if (last + 1)
> order = __ffs(last + 1);
> xas_set_order(&xas, last, order);
> xas_create(&xas, true);
> if (xas_error(&xas))
> goto unlock;
> }
>
xa_store_range() doesn't actually always iterate: if last + 1 is some
clean power of 2, it'll create a higher order xarray node.
Otherwise, it falls back to creating and storing 1 index/node at a time:
if the above did manage to create an xarray node, xas_error() returns
false, it goes on to the store below.
> Yes, the API handles failure on the subsequent xas_store(), but I can't imagine
> that failure is actually, barring garbage input from KVM:
>
> do {
> xas_set_range(&xas, first, last);
> xas_store(&xas, entry);
> if (xas_error(&xas))
> goto unlock;
> first += xas_size(&xas);
> } while (first <= last);
>
So if a later xas_create() fails because it runs out of memory, the
earlier stores would have already been committed.
This ignores -EEXIST being returned since earlier in kvm_gmem_bind()
conflicts were already checked.
> Purely from a design perspective, providing an API that can fail partway through
> under normal operation, with no indication of where failure occured (AFAICT),
> would be awful.
>
Do you mean the API of xas_store_range()? xas is updated by
xas_set_range() so that should track the last store. Since the cleanup
is storing NULLs and won't allocate, I thought it would be fine to just
store NULL on the entire range on error.
>> + } else {
>> + r = 0;
>> + }
>> +
>> filemap_invalidate_unlock(inode->i_mapping);
>>
>> /*
>>
>> [...snip...]
>>
^ permalink raw reply
* Re: [PATCH v2 2/5] KVM: guest_memfd: Fix possible signed integer overflow
From: Ackerley Tng @ 2026-05-27 18:26 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <ahXB_tBWakUUjt6q@google.com>
Sean Christopherson <seanjc@google.com> writes:
> For shortlogs (and changeloges), when possible, describe the _change_ itself, not
> its impact is. Sometimes "Fix xyz" is the best shortlog, e.g. when fixing build
> failures, but here, I would go with:
>
> KVM: guest_memfd: Treat memslot binding offset+size as unsigned values
>
> for two reasons. First, it provides a lot more context for future readers, versus
> "Fix possible signed integer overflow" which doesn't even capture what flow is
> affected, how the overflow is being fixed, etc. Second, if the fix is wrong,
> incomplete, etc., we don't end up with a follow-up patch that start with "Really
> fix ...".
>
Thanks for explaining!
> Oh, actually, three reasons. This doesn't only affect the overflow check. The
> check on a negative offset is flawed, as it means KVM would incorrectly reject
> bindings with (comically) large offsets.
>
Makes sense.
> LOL, four. There is no bug. The size of the memslot is ((1UL << 31) - 1)
> pages, i.e. 0x7FF_FFFFF000:
>
> if (id < KVM_USER_MEM_SLOTS &&
> (mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
> return -EINVAL;
>
> and so "loff_t size" can never be negative.
>
I think the bug was that the sum of offset + size in kvm_gmem_bind()
when interpreted as signed integers could be smaller than
i_size_read(inode) and allow binding.
So IIUC even if size is small (and not negative), nothing catches a
large enough offset where offset + size (interpreted as unsigned
integers) doesn't overflow, but offset + size (interpreted as signed
integers) overflows.
> As for the offset, the negative check is intentional, because KVM_CREATE_GUEST_MEMFD
> takes loff_t for the size, and so an offset that is negative would also be larger
> than the size of the file.
>
> I still think it's worth taking unsigned values, because teasing out all of that
> information wasn't exactly easy.
>
Yup it's still easier this way, and your proposed shortlog is good.
> On Fri, May 22, 2026, Ackerley Tng wrote:
>> From: Sean Christopherson <seanjc@google.com>
>>
>> The caller, kvm_set_memory_region(), checks for an overflow in an unsigned
>> u64 guest_memfd_offset. When guest_memfd_offset is passed to kvm_gmem_bind,
>> it is cast into a signed 64-bit integer.
>>
>> Hence, a large 64-bit offset could result in a negative loff_t, which could
>> result in the overflow checks failing.
>>
>> Make kvm_gmem_bind() take u64 instead of loff_t to consistently deal with
>> unsigned values to avoid this issue.
>>
>> Fixes: a7800aa80ea4d ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> [Use size_t for size instead of u64]
>
> Why? Oh, right, because kvm_memory_slot.npages is an "unsigned long". The
> discrepancy between a u64 for the offset and a size_t for the size is confusing,
> as they are both conceptually in the same "domain".
>
> Rather than u64 and size_t, we should use pgoff_t, which is what KVM already uses
> as the storage for kvm_memory_slot.gmem.pgoff.
>
I picked size_t more because I thought it was semantically correct to
use the size type for a size. size_t may have different sizes (64 vs
32), but in the comparison offset + size > i_size_read(inode), size is
promoted to 64 bits, and signed inode size is cast to unsigned for
comparison, so I think that works.
pgoff_t is also unsigned, but I think that should be reserved for page
offsets/indices? Is that the way to think when choosing types? What does
"domain" mean above?
> I'll send a new version as a standalone patch.
^ permalink raw reply
* Re: [PATCH v2 0/5] guest_memfd fixes for bind and populate
From: Sean Christopherson @ 2026-05-27 18:19 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Kiryl Shutsemau, Rick Edgecombe, Vishal Annapurve, Yan Zhao,
Michael Roth, Isaku Yamahata, Chao Peng, Xiaoyao Li, Zongyao Chen,
Ackerley Tng
Cc: kvm, linux-kernel, linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <20260522-fix-sev-gmem-post-populate-v2-0-3f196bfad5a1@google.com>
On Fri, 22 May 2026 15:46:05 -0700, Ackerley Tng wrote:
> This series is a group of fixes for the bind and populate flows for
> guest_memfd, and fixes some issues reported by Sashiko after reviewing the
> guest_memfd in-place conversions series [1] and another fixup series Sean
> posted [3].
>
> Changes in v2:
>
> [...]
Applied 1, 4, and 5 to kvm-x86 sev, with massaged shortlogs+changelogs.
[1/5] KVM: SEV: Pin source page for write when adding CPUID data for SNP guest
https://github.com/kvm-x86/linux/commit/f13e90059908
[2/5] KVM: guest_memfd: Fix possible signed integer overflow
[SKIP]
[3/5] KVM: guest_memfd: Handle errors from xa_store_range() when binding
[SKIP]
[4/5] KVM: SEV: Unmap local kmaps in LIFO order, per highmem requirements
https://github.com/kvm-x86/linux/commit/138f5f9cbe37
[5/5] KVM: SEV: Mark source page dirty when writing back CPUID data on failure
https://github.com/kvm-x86/linux/commit/97cd21d57e9b
--
https://github.com/kvm-x86/linux/tree/next
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox