Linux Confidential Computing Development

* Re: [PATCH v5 5/5] iommufd/vdevice: add TSM request ioctl
From: Dan Williams (nvidia) @ 2026-05-27 22:49 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Dan Williams (nvidia), Alexey Kardashevskiy,
	linux-coco, iommu, linux-kernel, kvm
  Cc: Bjorn Helgaas, Dan Williams, Jason Gunthorpe, Joerg Roedel,
	Jonathan Cameron, Kevin Tian, Nicolin Chen, Samuel Ortiz,
	Steven Price, Suzuki K Poulose, Will Deacon, Xu Yilun,
	Shameer Kolothum, Paolo Bonzini, Tony Krowiak, Halil Pasic,
	Jason Herne, Harald Freudenberger, Holger Dengler, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Alex Williamson, Matthew Rosato, Farhan Ali,
	Eric Farman, linux-s390
In-Reply-To: <yq5aldd4spyc.fsf@kernel.org>

Aneesh Kumar K.V wrote:
> >> I am leaning towards the latter at this point.
> >
> > But we already have struct pci_tsm_ops::guest_req, which is specific to
> > the underlying CC architecture. From the above, pci_tsm_req_scope also
> > appears to carry the same information. Is that useful?
> >
> 
> I think there is value in having the VMM express the guest’s
> confidential computing architecture, so that the TSM backend can
> validate whether it should handle that guest request?

Yes, that is the idea.

> So it would not be the IOMMU validating the scope value, but rather
> pci_tsm_ops::guest_req.
> 
> static ssize_t cca_tsm_guest_req(struct pci_tdi *tdi, enum pci_tsm_req_scope scope,
> 		sockptr_t req, size_t req_len, sockptr_t resp,
> 		size_t resp_len, u64 *tsm_code)
> {
> 	struct pci_dev *pdev = tdi->pdev;
> 
> 	/* reject the guest request if VMM was using the link tsm wrongly. The guest
> 	 * was using a wrong CC architecture with this link tsm
> 	 */
> 	if (scope != TSM_REQ_TYPE_CCA)
> 		return -EINVAL;

Right, iommufd is tunneling TSM requests. The tunnel should have an
envelope of TSM_REQ_TYPE_* and an @op field. The TSM driver gets those
from iommufd, validates the envelope and then processes @req.

This self-consistency and explicitness also buys some future-proofing.
It allows for alternate command sets within an arch, commands shared
across TSM implementations, and IOMMUFD-to-TSM requests outside of guest
requests.
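
To make the envelope idea concrete, here is a rough userspace sketch, not
the actual iommufd or TSM driver code. Only the TSM_REQ_TYPE_* names come
from this thread; struct tsm_req_envelope, struct tsm_backend, the max_op
limit and tsm_backend_validate() are hypothetical names invented for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Request types as proposed in this thread. */
enum tsm_req_type {
	TSM_REQ_TYPE_CCA,
	TSM_REQ_TYPE_SEV,
	TSM_REQ_TYPE_TDX,
};

/* Hypothetical envelope that the tunnel would forward unmodified. */
struct tsm_req_envelope {
	enum tsm_req_type type;	/* guest CC architecture, set by the VMM */
	uint32_t op;		/* decoded by the backend according to @type */
	const void *req;	/* opaque request payload */
	size_t req_len;
};

/* Hypothetical description of what one TSM backend accepts. */
struct tsm_backend {
	enum tsm_req_type type;
	uint32_t max_op;
};

/*
 * Validate the envelope before looking at @req. Returns 0 on success or
 * a negative errno, in the spirit of the -EINVAL check in the quoted
 * cca_tsm_guest_req() snippet.
 */
static int tsm_backend_validate(const struct tsm_backend *be,
				const struct tsm_req_envelope *env)
{
	assert(be && env);

	if (env->type != be->type)
		return -EINVAL;		/* wrong CC architecture for this TSM */
	if (env->op > be->max_op)
		return -EOPNOTSUPP;	/* op not known to this backend */
	if (!env->req && env->req_len)
		return -EINVAL;		/* length given without a buffer */
	return 0;
}
```

The point of the sketch is only that the tunnel never interprets @req;
the backend rejects a mismatched type or op before the payload is parsed.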

> Jason Gunthorpe <jgg@ziepe.ca> writes:
> 
> > On Tue, May 26, 2026 at 11:17:50PM -0700, Dan Williams (nvidia) wrote:
> >
> >> In that case pci_tsm_req_scope becomes tsm_req_type and is just:
> >> 
> >> TSM_REQ_TYPE_CCA
> >> TSM_REQ_TYPE_SEV
> >> TSM_REQ_TYPE_TDX
> >> 
> >> I am leaning towards the latter at this point.
> >
> > Yeah, this sounds good. I would also include an common op field that
> > can be decoded by the TSM driver based on the TYPE above, and the
> > usual in/out message buffers.
> 
> We already have iommufd_vdevice_tsm_op_ioctl() to handle common
> operations.

Per above, I believe this is about an @op value in a common location
that iommufd can forward to the backend for validation of guest
requests.

> Right now, it handles IOMMU_VDEVICE_TSM_BIND and
> IOMMU_VDEVICE_TSM_UNBIND. I guess we should move TSM_REQ_SET_TDI_STATE
> operations to that as well?

I think we can wait to move it to its own IOMMU operation unless/until
there is a need to set RUN outside of an explicit guest request, right?


* Re: [PATCH v3 07/41] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary
From: Wei Liu @ 2026-05-27 22:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-8-seanjc@google.com>

On Fri, May 15, 2026 at 12:19:08PM -0700, Sean Christopherson wrote:
> Register the Hyper-V reference counter (refcounter) callbacks for saving
> and restoring its PV sched_clock, if and only if the refcounter is
> actually being used for sched_clock.  Currently, Hyper-V overrides the
> save/restore hooks if the reference TSC is available, whereas the Hyper-V
> refcounter code only overrides sched_clock if the reference TSC is
> available *and* it's not invariant.  The flaw is effectively papered over
> by invoking the "old" save/restore callbacks as part of save/restore, but
> that's unnecessary and fragile.
> 
> To avoid introducing more complexity, and to allow for additional cleanups
> of the PV sched_clock code, move the save/restore hooks and logic into
> hyperv_timer.c and simply wire up the hooks when overriding sched_clock
> itself.
> 
> Note, while the Hyper-V refcounter code is intended to be architecture
> neutral, CONFIG_PARAVIRT is firmly x86-only, i.e. adding a small amount of
> x86 specific code (which will be reduced in future cleanups) doesn't
> meaningfully pollute generic code.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Acked-by: Wei Liu <wei.liu@kernel.org>


* Re: [PATCH v3 08/41] clocksource: hyper-v: Drop wrappers to sched_clock save/restore helpers
From: Wei Liu @ 2026-05-27 22:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-9-seanjc@google.com>

On Fri, May 15, 2026 at 12:19:09PM -0700, Sean Christopherson wrote:
> Now that all of the Hyper-V reference counter sched_clock code is located
> in a single file, drop the superfluous wrappers for the save/restore flows.
> 
> No functional change intended.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Acked-by: Wei Liu <wei.liu@kernel.org>


* Re: [PATCH v3 09/41] clocksource: hyper-v: Don't save/restore TSC offset when using HV sched_clock
From: Wei Liu @ 2026-05-27 22:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-10-seanjc@google.com>

On Fri, May 15, 2026 at 12:19:10PM -0700, Sean Christopherson wrote:
> Now that Hyper-V overrides the sched_clock save/restore hooks if and only if
> sched_clock itself is set to the Hyper-V reference counter, drop the
> invocation of the "old" save/restore callbacks.  When the registration of
> the PV sched_clock was done separately from overriding the save/restore
> hooks, it was possible for Hyper-V to clobber the TSC save/restore
> callbacks without actually switching to the Hyper-V refcounter.
> 
> Enabling a PV sched_clock is a one-way street, i.e. the kernel will never
> revert to using TSC for sched_clock, and so there is no need to invoke the
> TSC save/restore hooks (and if there was, it belongs in common PV code).
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Acked-by: Wei Liu <wei.liu@kernel.org>


* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Borislav Petkov @ 2026-05-28  0:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ashish Kalra, tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <eea0497f-6930-43e3-947d-dae139e657ad@intel.com>

On Wed, May 27, 2026 at 02:38:05PM -0700, Dave Hansen wrote:
> This one is my doing.

I know.

But hey, maybe we should not disagree on the public ML because the submitter
might disappear like the last one. :-P

> wrmsr_on_cpus() is kinda a mess. I think it only has a single user. It's
> also not very flexible because it needs a 'struct msr __percpu *msrs'
> argument where each MSR has a value in memory.

Right, we did that a looong time ago.

The only reason I'd have for per-CPU MSR structs is reading different MSR
values on different cores, modifying only the bits you need and then *keeping*
the remaining values as they were. And that interface allows you to do that
while this new thing won't.

And I'm going to venture a guess here that adding a simpler interface which
simply forces a new value on top of a whole MSR could cause a lot of subtle
bugs when people don't pay attention to keeping the old values.
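
The difference between the two styles can be sketched in plain userspace
C. This is a simulation only: the array stands in for one MSR on each
CPU, and sim_msr, sim_read(), sim_write_all() and sim_rmw_all() are
made-up names, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_SIM_CPUS 4

/* Simulated per-CPU copies of one MSR; real code would use RDMSR/WRMSR. */
static uint64_t sim_msr[NR_SIM_CPUS];

/* Read the simulated MSR of one CPU. */
static uint64_t sim_read(size_t cpu)
{
	assert(cpu < NR_SIM_CPUS);
	return sim_msr[cpu];
}

/* Force the same value onto every CPU, discarding whatever was there. */
static void sim_write_all(uint64_t val)
{
	for (size_t cpu = 0; cpu < NR_SIM_CPUS; cpu++)
		sim_msr[cpu] = val;
}

/* Read-modify-write: change only the bits in @mask, keep the rest per CPU. */
static void sim_rmw_all(uint64_t mask, uint64_t val)
{
	for (size_t cpu = 0; cpu < NR_SIM_CPUS; cpu++)
		sim_msr[cpu] = (sim_msr[cpu] & ~mask) | (val & mask);
}
```

With different starting values per CPU, sim_rmw_all() leaves the
unrelated bits alone, while sim_write_all() makes every CPU identical,
which is fine for a case like RMPOPT but loses state otherwise.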

> The use case for RMPOPT is that all CPUs get the same value. It'd be a
> little awkward to go create a percpu data structure to duplicate the same
> value to call wrmsr_on_cpus(). The RMPOPT case is also arguably
> performance sensitive since it's done during boot. It should do the IPIs
> in parallel.

Oh sure, my meaning was to create something that serves both purposes.

> toggle_ecc_err_reporting(), on the other hand, is done at module init
> time. It's not really performance sensitive. It's probably pretty easy
> to zap wrmsr_on_cpus() and just have toggle_ecc_err_reporting() do
> something slightly less efficient.

Sure. That's fine.

> Yeah, the
> 
> 	wrmsr_on_cpus()
> 	wrmsrq_on_cpus()
> 
> naming pain is real. There's little chance of bugs coming from it
> because the function signatures are *SO* different. But, it certainly
> could confuse humans for a minute.

Yap.

> But the real solution to this is axing wrmsr_on_cpus(). 

Yap, for example. Basically reengineering the whole
write-MSRs-on-multiple-CPUs functionality is what I meant.

> Which I think we could do after killing its one user which the attached
> (completely untested) patch does. The only downside of the patch is that it
> does RDMSR via IPIs one CPU at a time. But, looking at the code, I'm not
> sure anyone would care. If anyone did, I _think_ all those MSRs have the
> same value and the code could be simplified further. But that would take
> more than 3 minutes.
> 
> It's also possible that my grepping was bad or I'm completely
> misunderstanding amd64_edac.c. Cluebat welcome if I'm being dense.

Looks ok to me, we can surely do that. I even have hw to test it. I think...

> BTW, I also don't feel the need to make Ashish go do any of this edac
> cleanup. I think it can just be done in parallel. But I wouldn't stop
> him if he volunteered.

Why not?

It has always been the case: cleanups and bug fixes first, new features on top.

So yeah, modulo figuring out how to redefine the *msr_on_cpus() interface,
I think this all makes sense.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v14 20/44] arm64: RMI: Support for the VGIC in realms
From: Gavin Shan @ 2026-05-28  4:07 UTC (permalink / raw)
  To: Steven Price, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-21-steven.price@arm.com>

Hi Steve,

On 5/13/26 11:17 PM, Steven Price wrote:
> The RMM provides emulation of a VGIC to the realm guest. With RMM v2.0
> the registers are passed in the system registers so this works similar
> to a normal guest, but kvm_arch_vcpu_put() need reordering to early out,
> and realm guests don't support GICv2 even if the host does.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes from v12:
>   * GIC registers are now passed in the system registers rather than via
>     rec_entry/rec_exit which removes most of the changes.
> Changes from v11:
>   * Minor changes to align with the previous patches. Note that the VGIC
>     handling will change with RMM v2.0.
> Changes from v10:
>   * Make sure we sync the VGIC v4 state, and only populate valid lrs from
>     the list.
> Changes from v9:
>   * Copy gicv3_vmcr from the RMM at the same time as gicv3_hcr rather
>     than having to handle that as a special case.
> Changes from v8:
>   * Propagate gicv3_hcr to from the RMM.
> Changes from v5:
>   * Handle RMM providing fewer GIC LRs than the hardware supports.
> ---
>   arch/arm64/kvm/arm.c            | 11 ++++++++---
>   arch/arm64/kvm/vgic/vgic-init.c |  2 +-
>   2 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 93d34762db91..21d9dfdb1ea0 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -786,19 +786,24 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>   		kvm_call_hyp_nvhe(__pkvm_vcpu_put);
>   	}
>   
> +	kvm_timer_vcpu_put(vcpu);
> +	kvm_vgic_put(vcpu);
> +
> +	vcpu->cpu = -1;
> +
> +	if (vcpu_is_rec(vcpu))
> +		return;
> +

For a REC, kvm_vcpu_{load, put}_debug() becomes unbalanced in kvm_arch_vcpu_{load, put}().
kvm_vcpu_load_debug() is called in kvm_arch_vcpu_load(), but kvm_vcpu_put_debug() won't
be called in kvm_arch_vcpu_put() after this whole series is applied.
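
A toy model of the imbalance, for illustration only. The counter stands
in for whatever state kvm_vcpu_load_debug() sets up; struct sim_vcpu and
the sim_*() helpers are hypothetical, and sim_put_balanced() is just one
possible shape, not what the series does.

```c
#include <assert.h>
#include <stdbool.h>

struct sim_vcpu {
	bool is_rec;
	int debug_refs;		/* +1 on load, -1 on put */
};

/* Stands in for the kvm_vcpu_load_debug() call on the load path. */
static void sim_load(struct sim_vcpu *v)
{
	v->debug_refs++;
}

/* Put path shaped like the quoted hunk: a REC returns before the put. */
static void sim_put_early_return(struct sim_vcpu *v)
{
	if (v->is_rec)
		return;		/* REC never reaches the put below */
	assert(v->debug_refs > 0);
	v->debug_refs--;
}

/* Alternative shape: undo the load before the early return. */
static void sim_put_balanced(struct sim_vcpu *v)
{
	assert(v->debug_refs > 0);
	v->debug_refs--;
	if (v->is_rec)
		return;
	/* remaining non-REC teardown would follow here */
}
```

Skipping the load for a REC would balance the pair just as well; which
side to change is a question for the series.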

>   	kvm_vcpu_put_debug(vcpu);
>   	kvm_arch_vcpu_put_fp(vcpu);
>   	if (has_vhe())
>   		kvm_vcpu_put_vhe(vcpu);
> -	kvm_timer_vcpu_put(vcpu);
> -	kvm_vgic_put(vcpu);
>   	kvm_vcpu_pmu_restore_host(vcpu);
>   	if (vcpu_has_nv(vcpu))
>   		kvm_vcpu_put_hw_mmu(vcpu);
>   	kvm_arm_vmid_clear_active();
>   
>   	vcpu_clear_on_unsupported_cpu(vcpu);
> -	vcpu->cpu = -1;
>   }
>   
>   static void __kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu)
> diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
> index 933983bb2005..a9db963dfd23 100644
> --- a/arch/arm64/kvm/vgic/vgic-init.c
> +++ b/arch/arm64/kvm/vgic/vgic-init.c
> @@ -81,7 +81,7 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
>   	 * the proper checks already.
>   	 */
>   	if (type == KVM_DEV_TYPE_ARM_VGIC_V2 &&
> -		!kvm_vgic_global_state.can_emulate_gicv2)
> +	    (!kvm_vgic_global_state.can_emulate_gicv2 || kvm_is_realm(kvm)))
>   		return -ENODEV;
>   
>   	/*

Thanks,
Gavin



* Re: [PATCH v14 21/44] KVM: arm64: Support timers in realm RECs
From: Gavin Shan @ 2026-05-28  4:11 UTC (permalink / raw)
  To: Steven Price, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-22-steven.price@arm.com>

Hi Steve,

On 5/13/26 11:17 PM, Steven Price wrote:
> The RMM keeps track of the timer while the realm REC is running, but on
> exit to the normal world KVM is responsible for handling the timers.
> 
> A later patch adds the support for propagating the timer values from the
> exit data structure and calling kvm_realm_timers_update().
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes since v12:
>   * Adapt to upstream changes.
> Changes since v11:
>   * Drop the kvm_is_realm() check from timer_set_offset(). We already
>     ensure that the offset is 0 when calling the function.
> Changes since v10:
>   * KVM_CAP_COUNTER_OFFSET is now already hidden by a previous patch.
> Changes since v9:
>   * No need to move the call to kvm_timer_unblocking() in
>     kvm_timer_vcpu_load().
> Changes since v7:
>   * Hide KVM_CAP_COUNTER_OFFSET for realm guests.
> ---
>   arch/arm64/kvm/arch_timer.c  | 28 +++++++++++++++++++++++++---
>   include/kvm/arm_arch_timer.h |  2 ++
>   2 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index cbea4d9ee955..88ed01edc136 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -470,6 +470,21 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
>   			    timer_ctx);
>   }
>   
> +void kvm_realm_timers_update(struct kvm_vcpu *vcpu)
> +{
> +	struct arch_timer_cpu *arch_timer = &vcpu->arch.timer_cpu;
> +	int i;
> +
> +	for (i = 0; i < NR_KVM_EL0_TIMERS; i++) {
> +		struct arch_timer_context *timer = &arch_timer->timers[i];
> +		bool status = timer_get_ctl(timer) & ARCH_TIMER_CTRL_IT_STAT;
> +		bool level = kvm_timer_irq_can_fire(timer) && status;
> +
> +		if (level != timer->irq.level)
> +			kvm_timer_update_irq(vcpu, level, timer);
> +	}
> +}
> +
>   /* Only called for a fully emulated timer */
>   static void timer_emulate(struct arch_timer_context *ctx)
>   {
> @@ -1079,7 +1094,7 @@ static void timer_context_init(struct kvm_vcpu *vcpu, int timerid)
>   
>   	ctxt->timer_id = timerid;
>   
> -	if (!kvm_vm_is_protected(vcpu->kvm)) {
> +	if (!kvm_vm_is_protected(vcpu->kvm) && !kvm_is_realm(vcpu->kvm)) {
>   		if (timerid == TIMER_VTIMER)
>   			ctxt->offset.vm_offset = &kvm->arch.timer_data.voffset;
>   		else

s/!kvm_is_realm(vcpu->kvm)/!vcpu_is_rec(vcpu)

> @@ -1110,7 +1125,7 @@ void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu)
>   		timer_context_init(vcpu, i);
>   
>   	/* Synchronize offsets across timers of a VM if not already provided */
> -	if (!vcpu_is_protected(vcpu) &&
> +	if (!vcpu_is_protected(vcpu) && !kvm_is_realm(vcpu->kvm) &&
>   	    !test_bit(KVM_ARCH_FLAG_VM_COUNTER_OFFSET, &vcpu->kvm->arch.flags)) {
>   		timer_set_offset(vcpu_vtimer(vcpu), kvm_phys_timer_read());
>   		timer_set_offset(vcpu_ptimer(vcpu), 0);

Same as above.

> @@ -1611,6 +1626,13 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu)
>   		return -EINVAL;
>   	}
>   
> +	/*
> +	 * We don't use mapped IRQs for Realms because the RMI doesn't allow
> +	 * us setting the LR.HW bit in the VGIC.
> +	 */
> +	if (vcpu_is_rec(vcpu))
> +		return 0;
> +
>   	get_timer_map(vcpu, &map);
>   
>   	ops = vgic_is_v5(vcpu->kvm) ? &arch_timer_irq_ops_vgic_v5 :
> @@ -1740,7 +1762,7 @@ int kvm_vm_ioctl_set_counter_offset(struct kvm *kvm,
>   	if (offset->reserved)
>   		return -EINVAL;
>   
> -	if (kvm_vm_is_protected(kvm))
> +	if (kvm_vm_is_protected(kvm) || kvm_is_realm(kvm))
>   		return -EINVAL;
>   
>   	mutex_lock(&kvm->lock);
> diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
> index bf8cc9589bd0..ffdb90dcad58 100644
> --- a/include/kvm/arm_arch_timer.h
> +++ b/include/kvm/arm_arch_timer.h
> @@ -113,6 +113,8 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>   int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>   int kvm_arm_timer_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>   
> +void kvm_realm_timers_update(struct kvm_vcpu *vcpu);
> +
>   u64 kvm_phys_timer_read(void);
>   
>   void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu);

Thanks,
Gavin



* Re: [PATCH 01/15] x86/virt/tdx: Read global metadata for TDX Module Extensions
From: Xu Yilun @ 2026-05-28  3:48 UTC (permalink / raw)
  To: Sohil Mehta
  Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, yilun.xu, baolu.lu, zhenzhong.duan, xiaoyao.li
In-Reply-To: <cdfa241c-2d8e-4aab-8491-1de4a54e8a59@intel.com>

On Wed, May 27, 2026 at 10:17:36AM -0700, Sohil Mehta wrote:
> On 5/27/2026 12:11 AM, Xu Yilun wrote:
> 
> >>> +struct tdx_sys_info_ext {
> >>> +	u16 memory_pool_required_pages;
> >>> +	u8 ext_required;
> >>
> >> The name ext_required seems like a boolean. It is also used like a
> >> boolean later.
> >> 	if (!tdx_sysinfo.ext.ext_required)
> >> 		return 0;
> >>
> >> But, IIUC, is it actually a mask that lists any feature that needs
> > 
> > No, it is just a bool indicating whether Extensions need to be initialized.
> > 
> How does the kernel know which features need Extensions? Is there any
> hardware enumeration or the kernel just keeps a static list?

There is no HW enumeration. Hmm... this seems to be an important reason
not to delay enabling Extensions: the kernel then doesn't have to keep
track of which features need Extensions.


* Re: [PATCH v14 22/44] arm64: RMI: Handle realm enter/exit
From: Gavin Shan @ 2026-05-28  4:38 UTC (permalink / raw)
  To: Steven Price, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-23-steven.price@arm.com>

Hi Steve,

On 5/13/26 11:17 PM, Steven Price wrote:
> Entering a realm is done using a SMC call to the RMM. On exit the
> exit-codes need to be handled slightly differently to the normal KVM
> path so define our own functions for realm enter/exit and hook them
> in if the guest is a realm guest.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> ---
> Changes since v13:
>   * The RMM is now required to provide an ESR value with the correct
>     information to emulate MMIO, so we no longer need to hardcode 0s in
>     rec_exit_sys_reg().
>   * The PSCI changes mean that there is a potential race when turning on
>     a VCPU which can cause a RMI_ERROR_REC return. Exit to user space
>     with -EAGAIN in this case.
> Changes since v12:
>   * Call guest_state_{enter,exit}_irqoff() around rmi_rec_enter().
>   * Add handling of the IRQ exception case where IRQs need to be briefly
>     enabled before exiting guest timing.
> Changes since v8:
>   * Introduce kvm_rec_pre_enter() called before entering an atomic
>     section to handle operations that might require memory allocation
>     (specifically completing a RIPAS change introduced in a later patch).
>   * Updates to align with upstream changes to hpfar_el2 which now (ab)uses
>     HPFAR_EL2_NS as a valid flag.
>   * Fix exit reason when racing with PSCI shutdown to return
>     KVM_EXIT_SHUTDOWN rather than KVM_EXIT_UNKNOWN.
> Changes since v7:
>   * A return of 0 from kvm_handle_sys_reg() doesn't mean the register has
>     been read (although that can never happen in the current code). Tidy
>     up the condition to handle any future refactoring.
> Changes since v6:
>   * Use vcpu_err() rather than pr_err/kvm_err when there is an associated
>     vcpu to the error.
>   * Return -EFAULT for KVM_EXIT_MEMORY_FAULT as per the documentation for
>     this exit type.
>   * Split code handling a RIPAS change triggered by the guest to the
>     following patch.
> Changes since v5:
>   * For a RIPAS_CHANGE request from the guest perform the actual RIPAS
>     change on next entry rather than immediately on the exit. This allows
>     the VMM to 'reject' a RIPAS change by refusing to continue
>     scheduling.
> Changes since v4:
>   * Rename handle_rme_exit() to handle_rec_exit()
>   * Move the loop to copy registers into the REC enter structure from the
>     to rec_exit_handlers callbacks to kvm_rec_enter(). This fixes a bug
>     where the handler exits to user space and user space wants to modify
>     the GPRS.
>   * Some code rearrangement in rec_exit_ripas_change().
> Changes since v2:
>   * realm_set_ipa_state() now provides an output parameter for the
>     top_iap that was changed. Use this to signal the VMM with the correct
>     range that has been transitioned.
>   * Adapt to previous patch changes.
> ---
>   arch/arm64/include/asm/kvm_rmi.h |   4 +
>   arch/arm64/kvm/Makefile          |   2 +-
>   arch/arm64/kvm/arm.c             |  26 ++++-
>   arch/arm64/kvm/rmi-exit.c        | 186 +++++++++++++++++++++++++++++++
>   arch/arm64/kvm/rmi.c             |  42 +++++++
>   5 files changed, 254 insertions(+), 6 deletions(-)
>   create mode 100644 arch/arm64/kvm/rmi-exit.c
> 
> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
> index d99bf4fc3c39..feb534a6678e 100644
> --- a/arch/arm64/include/asm/kvm_rmi.h
> +++ b/arch/arm64/include/asm/kvm_rmi.h
> @@ -84,6 +84,10 @@ void kvm_destroy_realm(struct kvm *kvm);
>   void kvm_realm_destroy_rtts(struct kvm *kvm);
>   void kvm_destroy_rec(struct kvm_vcpu *vcpu);
>   
> +int kvm_rec_enter(struct kvm_vcpu *vcpu);
> +int kvm_rec_pre_enter(struct kvm_vcpu *vcpu);
> +int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_status);
> +
>   static inline bool kvm_realm_is_private_address(struct realm *realm,
>   						unsigned long addr)
>   {
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index ed3cf30eb06e..4a2d52fdb6a2 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -16,7 +16,7 @@ CFLAGS_handle_exit.o += -Wno-override-init
>   kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
>   	 inject_fault.o va_layout.o handle_exit.o config.o \
>   	 guest.o debug.o reset.o sys_regs.o stacktrace.o \
> -	 vgic-sys-reg-v3.o fpsimd.o pkvm.o rmi.o \
> +	 vgic-sys-reg-v3.o fpsimd.o pkvm.o rmi.o rmi-exit.o \
>   	 arch_timer.o trng.o vmid.o emulate-nested.o nested.o at.o \
>   	 vgic/vgic.o vgic/vgic-init.o \
>   	 vgic/vgic-irqfd.o vgic/vgic-v2.o \
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 21d9dfdb1ea0..ed88a203b892 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1331,6 +1331,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>   		if (ret > 0)
>   			ret = check_vcpu_requests(vcpu);
>   
> +		if (ret > 0 && vcpu_is_rec(vcpu))
> +			ret = kvm_rec_pre_enter(vcpu);
> +
>   		/*
>   		 * Preparing the interrupts to be injected also
>   		 * involves poking the GIC, which must be done in a
> @@ -1378,7 +1381,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>   		trace_kvm_entry(*vcpu_pc(vcpu));
>   		guest_timing_enter_irqoff();
>   
> -		ret = kvm_arm_vcpu_enter_exit(vcpu);
> +		if (vcpu_is_rec(vcpu))
> +			ret = kvm_rec_enter(vcpu);
> +		else
> +			ret = kvm_arm_vcpu_enter_exit(vcpu);
>   
>   		vcpu->mode = OUTSIDE_GUEST_MODE;
>   		vcpu->stat.exits++;
> @@ -1424,7 +1430,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>   		 * context synchronization event) is necessary to ensure that
>   		 * pending interrupts are taken.
>   		 */
> -		if (ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ) {
> +		if (ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ ||
> +		    (vcpu_is_rec(vcpu) &&
> +		     vcpu->arch.rec.run->exit.exit_reason == RMI_EXIT_IRQ)) {
>   			local_irq_enable();
>   			isb();
>   			local_irq_disable();

The condition could possibly be imprecise because ARM_EXCEPTION_CODE(ret)
can be ARM_EXCEPTION_IRQ even for a REC. So the precise condition would be:

		if ((!vcpu_is_rec(vcpu) && ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ) ||
		    (vcpu_is_rec(vcpu) && vcpu->arch.rec.run->exit.exit_reason == RMI_EXIT_IRQ)) {
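
The two forms can be compared exhaustively with plain booleans. This is
a standalone sketch: the three flags stand in for vcpu_is_rec(vcpu),
ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ and exit_reason ==
RMI_EXIT_IRQ, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition as written in the quoted hunk. */
static bool irq_window_as_posted(bool is_rec, bool exc_is_irq,
				 bool rmi_exit_is_irq)
{
	return exc_is_irq || (is_rec && rmi_exit_is_irq);
}

/* Suggested form: each source is consulted only for its own vCPU type. */
static bool irq_window_as_suggested(bool is_rec, bool exc_is_irq,
				    bool rmi_exit_is_irq)
{
	return (!is_rec && exc_is_irq) || (is_rec && rmi_exit_is_irq);
}

/* Count the input combinations on which the two forms disagree. */
static int count_disagreements(void)
{
	int n = 0;

	for (int bits = 0; bits < 8; bits++) {
		bool is_rec = bits & 4, exc = bits & 2, rmi = bits & 1;

		if (irq_window_as_posted(is_rec, exc, rmi) !=
		    irq_window_as_suggested(is_rec, exc, rmi)) {
			/* Only divergence: a REC whose ret decodes as IRQ. */
			assert(is_rec && exc && !rmi);
			n++;
		}
	}
	return n;
}
```

The forms differ in exactly one of the eight cases, a REC whose return
code happens to decode as ARM_EXCEPTION_IRQ while the RMI exit reason is
something else.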

> @@ -1436,8 +1444,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>   
>   		trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
>   
> -		/* Exit types that need handling before we can be preempted */
> -		handle_exit_early(vcpu, ret);
> +		if (!vcpu_is_rec(vcpu)) {
> +			/*
> +			 * Exit types that need handling before we can be
> +			 * preempted
> +			 */
> +			handle_exit_early(vcpu, ret);
> +		}
>   
>   		kvm_nested_sync_hwstate(vcpu);
>   
> @@ -1462,7 +1475,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>   			ret = ARM_EXCEPTION_IL;
>   		}
>   
> -		ret = handle_exit(vcpu, ret);
> +		if (vcpu_is_rec(vcpu))
> +			ret = handle_rec_exit(vcpu, ret);
> +		else
> +			ret = handle_exit(vcpu, ret);
>   	}
>   
>   	/* Tell userspace about in-kernel device output levels */
> diff --git a/arch/arm64/kvm/rmi-exit.c b/arch/arm64/kvm/rmi-exit.c
> new file mode 100644
> index 000000000000..e7c51b6cf6ce
> --- /dev/null
> +++ b/arch/arm64/kvm/rmi-exit.c
> @@ -0,0 +1,186 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <kvm/arm_hypercalls.h>
> +#include <kvm/arm_psci.h>
> +
> +#include <asm/rmi_smc.h>
> +#include <asm/kvm_emulate.h>
> +#include <asm/kvm_rmi.h>
> +#include <asm/kvm_mmu.h>
> +
> +typedef int (*exit_handler_fn)(struct kvm_vcpu *vcpu);
> +
> +static int rec_exit_reason_notimpl(struct kvm_vcpu *vcpu)
> +{
> +	struct realm_rec *rec = &vcpu->arch.rec;
> +
> +	vcpu_err(vcpu, "Unhandled exit reason from realm (ESR: %#llx)\n",
> +		 rec->run->exit.esr);
> +	return -ENXIO;
> +}
> +

s/rec->run->exit.esr/kvm_vcpu_get_esr(vcpu)/, since rec->run->exit.esr has
already been copied by the caller into the storage that kvm_vcpu_get_esr()
reads.

> +static int rec_exit_sync_dabt(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_handle_guest_abort(vcpu);
> +}
> +
> +static int rec_exit_sync_iabt(struct kvm_vcpu *vcpu)
> +{
> +	struct realm_rec *rec = &vcpu->arch.rec;
> +
> +	vcpu_err(vcpu, "Unhandled instruction abort (ESR: %#llx).\n",
> +		 rec->run->exit.esr);
> +	return -ENXIO;
> +}
> +

s/rec->run->exit.esr/kvm_vcpu_get_esr(vcpu)

> +static int rec_exit_sys_reg(struct kvm_vcpu *vcpu)
> +{
> +	struct realm_rec *rec = &vcpu->arch.rec;
> +	unsigned long esr = kvm_vcpu_get_esr(vcpu);
> +	int rt = kvm_vcpu_sys_get_rt(vcpu);
> +	bool is_write = (esr & ESR_ELx_SYS64_ISS_DIR_MASK) == ESR_ELx_SYS64_ISS_DIR_WRITE;
> +	int ret;
> +
> +	if (is_write)
> +		vcpu_set_reg(vcpu, rt, rec->run->exit.gprs[rt]);
> +
> +	ret = kvm_handle_sys_reg(vcpu);
> +	if (!is_write)
> +		rec->run->enter.gprs[rt] = vcpu_get_reg(vcpu, rt);
> +
> +	return ret;
> +}
> +
> +static exit_handler_fn rec_exit_handlers[] = {
> +	[0 ... ESR_ELx_EC_MAX]	= rec_exit_reason_notimpl,
> +	[ESR_ELx_EC_SYS64]	= rec_exit_sys_reg,
> +	[ESR_ELx_EC_DABT_LOW]	= rec_exit_sync_dabt,
> +	[ESR_ELx_EC_IABT_LOW]	= rec_exit_sync_iabt
> +};
> +
> +static int rec_exit_psci(struct kvm_vcpu *vcpu)
> +{
> +	struct realm_rec *rec = &vcpu->arch.rec;
> +	int i;
> +
> +	for (i = 0; i < REC_RUN_GPRS; i++)
> +		vcpu_set_reg(vcpu, i, rec->run->exit.gprs[i]);
> +
> +	return kvm_smccc_call_handler(vcpu);
> +}
> +
> +static int rec_exit_ripas_change(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	struct realm *realm = &kvm->arch.realm;
> +	struct realm_rec *rec = &vcpu->arch.rec;
> +	unsigned long base = rec->run->exit.ripas_base;
> +	unsigned long top = rec->run->exit.ripas_top;
> +	unsigned long ripas = rec->run->exit.ripas_value;
> +
> +	if (!kvm_realm_is_private_address(realm, base) ||
> +	    !kvm_realm_is_private_address(realm, top - 1)) {
> +		vcpu_err(vcpu, "Invalid RIPAS_CHANGE for %#lx - %#lx, ripas: %#lx\n",
> +			 base, top, ripas);
> +		/* Set RMI_REJECT bit */
> +		rec->run->enter.flags = REC_ENTER_FLAG_RIPAS_RESPONSE;
> +		return -EINVAL;
> +	}

I doubt the flag (REC_ENTER_FLAG_RIPAS_RESPONSE) will be handed over to the RMM,
since the negative return value forces an exit to the VMM (e.g. QEMU), and how
the VMM should handle this problematic case is TBD.

> +
> +	/* Exit to VMM, the actual RIPAS change is done on next entry */
> +	kvm_prepare_memory_fault_exit(vcpu, base, top - base, false, false,
> +				      ripas == RMI_RAM);
> +
> +	/*
> +	 * KVM_EXIT_MEMORY_FAULT requires an return code of -EFAULT, see the
> +	 * API documentation
> +	 */
> +	return -EFAULT;
> +}
> +
> +static void update_arch_timer_irq_lines(struct kvm_vcpu *vcpu)
> +{
> +	struct realm_rec *rec = &vcpu->arch.rec;
> +
> +	__vcpu_assign_sys_reg(vcpu, CNTV_CTL_EL0, rec->run->exit.cntv_ctl);
> +	__vcpu_assign_sys_reg(vcpu, CNTV_CVAL_EL0, rec->run->exit.cntv_cval);
> +	__vcpu_assign_sys_reg(vcpu, CNTP_CTL_EL0, rec->run->exit.cntp_ctl);
> +	__vcpu_assign_sys_reg(vcpu, CNTP_CVAL_EL0, rec->run->exit.cntp_cval);
> +
> +	kvm_realm_timers_update(vcpu);
> +}
> +
> +/*
> + * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
> + * proper exit to userspace.
> + */
> +int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_ret)
> +{
> +	struct realm_rec *rec = &vcpu->arch.rec;
> +	u8 esr_ec = ESR_ELx_EC(rec->run->exit.esr);
> +	unsigned long status, index;
> +
> +	status = RMI_RETURN_STATUS(rec_run_ret);
> +	index = RMI_RETURN_INDEX(rec_run_ret);
> +
> +	/*
> +	 * If a PSCI_SYSTEM_OFF request raced with a vcpu executing, we might
> +	 * see the following status code and index indicating an attempt to run
> +	 * a REC when the RD state is SYSTEM_OFF.  In this case, we just need to
> +	 * return to user space which can deal with the system event or will try
> +	 * to run the KVM VCPU again, at which point we will no longer attempt
> +	 * to enter the Realm because we will have a sleep request pending on
> +	 * the VCPU as a result of KVM's PSCI handling.
> +	 */
> +	if (status == RMI_ERROR_REALM) {
> +		vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> +		return 0;
> +	}
> +
> +	/*
> +	 * If a VCPU has been turned on, but the REC state hasn't been updated
> +	 * we may experience RMI_ERROR_REC. Exit to the userspace with -EAGAIN
> +	 * for a retry.
> +	 */
> +	if (status == RMI_ERROR_REC)
> +		return -EAGAIN;
> +	if (rec_run_ret)
> +		return -ENXIO;
> +
> +	vcpu->arch.fault.esr_el2 = rec->run->exit.esr;
> +	vcpu->arch.fault.far_el2 = rec->run->exit.far;
> +	/* HPFAR_EL2 is only valid for RMI_EXIT_SYNC */
> +	vcpu->arch.fault.hpfar_el2 = 0;
> +
> +	update_arch_timer_irq_lines(vcpu);
> +
> +	/* Reset the emulation flags for the next run of the REC */
> +	rec->run->enter.flags = 0;
> +
> +	switch (rec->run->exit.exit_reason) {
> +	case RMI_EXIT_SYNC:
> +		/*
> +		 * HPFAR_EL2_NS is hijacked to indicate a valid HPFAR value,
> +		 * see __get_fault_info()
> +		 */
> +		vcpu->arch.fault.hpfar_el2 = rec->run->exit.hpfar | HPFAR_EL2_NS;
> +		return rec_exit_handlers[esr_ec](vcpu);
> +	case RMI_EXIT_IRQ:
> +	case RMI_EXIT_FIQ:
> +	case RMI_EXIT_SERROR:
> +		return 1;
> +	case RMI_EXIT_PSCI:
> +		return rec_exit_psci(vcpu);
> +	case RMI_EXIT_RIPAS_CHANGE:
> +		return rec_exit_ripas_change(vcpu);
> +	}
> +
> +	kvm_pr_unimpl("Unsupported exit reason: %u\n",
> +		      rec->run->exit.exit_reason);
> +	vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> +	return 0;
> +}
> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> index 353a5ca45e78..d8a5fb12db2d 100644
> --- a/arch/arm64/kvm/rmi.c
> +++ b/arch/arm64/kvm/rmi.c
> @@ -173,6 +173,48 @@ static int realm_ensure_created(struct kvm *kvm)
>   	return -ENXIO;
>   }
>   
> +/*
> + * kvm_rec_pre_enter - Complete operations before entering a REC
> + *
> + * Some operations require work to be completed before entering a realm. That
> + * work may require memory allocation so cannot be done in the kvm_rec_enter()
> + * call.
> + *
> + * Return: 1 if we should enter the guest
> + *	   0 if we should exit to userspace
> + *	   < 0 if we should exit to userspace, where the return value indicates
> + *	   an error
> + */
> +int kvm_rec_pre_enter(struct kvm_vcpu *vcpu)
> +{
> +	struct realm_rec *rec = &vcpu->arch.rec;
> +
> +	if (kvm_realm_state(vcpu->kvm) != REALM_STATE_ACTIVE)
> +		return -EINVAL;
> +
> +	switch (rec->run->exit.exit_reason) {
> +	case RMI_EXIT_HOST_CALL:
> +		for (int i = 0; i < REC_RUN_GPRS; i++)
> +			rec->run->enter.gprs[i] = vcpu_get_reg(vcpu, i);
> +		break;
> +	}
> +
> +	return 1;
> +}
> +
> +int noinstr kvm_rec_enter(struct kvm_vcpu *vcpu)
> +{
> +	struct realm_rec *rec = &vcpu->arch.rec;
> +	int ret;
> +
> +	guest_state_enter_irqoff();
> +	ret = rmi_rec_enter(virt_to_phys(rec->rec_page),
> +			    virt_to_phys(rec->run));
> +	guest_state_exit_irqoff();
> +
> +	return ret;
> +}
> +
>   static int kvm_create_rec(struct kvm_vcpu *vcpu)
>   {
>   	struct user_pt_regs *vcpu_regs = vcpu_gp_regs(vcpu);

Thanks,
Gavin



* Re: [PATCH 01/15] x86/virt/tdx: Read global metadata for TDX Module Extensions
From: Xu Yilun @ 2026-05-28  4:25 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Xiaoyao Li, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan
In-Reply-To: <ahcDvwEES7vqLLvg@thinkstation>

On Wed, May 27, 2026 at 04:35:36PM +0100, Kiryl Shutsemau wrote:
> On Mon, May 25, 2026 at 02:54:40PM +0800, Xiaoyao Li wrote:
> > On 5/22/2026 11:41 AM, Xu Yilun wrote:
> > ...
> > > +static __init int get_tdx_sys_info_ext(struct tdx_sys_info_ext *sysinfo_ext)
> > > +{
> > > +	int ret = 0;
> > > +	u64 val;
> > > +
> > > +	if (!ret && !(ret = read_sys_metadata_field(0x3100000100000000, &val)))
> > > +		sysinfo_ext->memory_pool_required_pages = val;
> > > +	if (!ret && !(ret = read_sys_metadata_field(0x3100000000000001, &val)))
> > > +		sysinfo_ext->ext_required = val;
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >   static __init int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
> > >   {
> > >   	int ret = 0;
> > > @@ -116,5 +129,8 @@ static __init int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
> > >   	ret = ret ?: get_tdx_sys_info_td_ctrl(&sysinfo->td_ctrl);
> > >   	ret = ret ?: get_tdx_sys_info_td_conf(&sysinfo->td_conf);
> > > +	if (sysinfo->features.tdx_features0 & TDX_FEATURES0_EXT)
> > > +		ret = ret ?: get_tdx_sys_info_ext(&sysinfo->ext);
> > 
> > Is it correct to read "memory_pool_required_pages" and "ext_required" so
> > early in get_tdx_sys_info()? get_tdx_sys_info() is called before
> > config_tdx_module() which calls TDH.SYS.CONFIG.
> > 
> > If I read the TDX module base spec correctly, the amount of memory for
> > extensions and EXT_REQUIRED field depends on the enabled features, which is
> > determined by TDH.SYS.CONFIG/TDH.SYS.UPDATE ?

Yes.

> 
> This is my read too. Looks like we need a separate step after
> config_tdx_module() to readout config-dependatant metadata.


The timing for when metadata becomes valid is now variable, e.g., the
TDX QUOTING metadata is only valid after TDH.QUOTE.INIT [1].

Based on recent discussion, I think we should introduce runtime metadata
reading interfaces for specific metadata sets as needed, rather than
another catch-all step right after config_tdx_module(). See [2] for the
proposed approach for Extensions metadata.

[1]: https://lore.kernel.org/all/20260522034128.3144354-7-yilun.xu@linux.intel.com/
[2]: https://lore.kernel.org/all/ahXAL41ZmIDHmgfu@yilunxu-OptiPlex-7050/
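
As a rough illustration of the "read at the point of use" idea, here is a
user-space sketch. It is not the interface proposed in [2]: every model_*
name, the -EAGAIN return before configuration, and the dummy field values
are assumptions made up for the example. Only the two field IDs are taken
from the quoted patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model state: set once the Extensions are configured. */
static bool model_ext_configured;

/*
 * Stand-in for read_sys_metadata_field(). In this model a field is only
 * readable after configuration, and the value returned is a dummy.
 */
static int model_read_field(uint64_t field_id, uint64_t *val)
{
	if (!model_ext_configured)
		return -EAGAIN;

	*val = field_id & 0xff;
	return 0;
}

struct model_sys_info_ext {
	uint64_t memory_pool_required_pages;
	uint64_t ext_required;
};

/* Called by the consumer when it needs the data, not during early init. */
static int model_get_sys_info_ext(struct model_sys_info_ext *ext)
{
	uint64_t val;
	int ret;

	ret = model_read_field(0x3100000100000000ULL, &val);
	if (ret)
		return ret;
	ext->memory_pool_required_pages = val;

	ret = model_read_field(0x3100000000000001ULL, &val);
	if (ret)
		return ret;
	ext->ext_required = val;

	return 0;
}
```

The point of the shape is that each metadata set gets its own reader, called
by whoever needs it once that set is known to be valid, so no single init
step has to guess when every set becomes readable.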



* Re: [PATCH v14 24/44] KVM: arm64: Handle realm MMIO emulation
From: Gavin Shan @ 2026-05-28  5:03 UTC (permalink / raw)
  To: Steven Price, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-25-steven.price@arm.com>

Hi Steve,

On 5/13/26 11:17 PM, Steven Price wrote:
> MMIO emulation for a realm cannot be done directly with the VM's
> registers as they are protected from the host. However, for emulatable
> data aborts, the RMM uses GPRS[0] to provide the read/written value.
> We can transfer this from/to the equivalent VCPU's register entry and
> then depend on the generic MMIO handling code in KVM.
> 
> For a MMIO read, the value is placed in the shared RecExit structure
> during kvm_handle_mmio_return() rather than in the VCPU's register
> entry.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since v7:
>   * New comment for rec_exit_sync_dabt() explaining the call to
>     vcpu_set_reg().
> Changes since v5:
>   * Inject SEA to the guest is an emulatable MMIO access triggers a data
>     abort.
>   * kvm_handle_mmio_return() - disable kvm_incr_pc() for a REC (as the PC
>     isn't under the host's control) and move the REC_ENTER_EMULATED_MMIO
>     flag setting to this location (as that tells the RMM to skip the
>     instruction).
> ---
>   arch/arm64/kvm/inject_fault.c |  4 +++-
>   arch/arm64/kvm/mmio.c         | 16 ++++++++++++----
>   arch/arm64/kvm/rmi-exit.c     | 14 ++++++++++++++
>   3 files changed, 29 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
> index 89982bd3345f..6492397b73d7 100644
> --- a/arch/arm64/kvm/inject_fault.c
> +++ b/arch/arm64/kvm/inject_fault.c
> @@ -228,7 +228,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt, u32 addr)
>   
>   static void __kvm_inject_sea(struct kvm_vcpu *vcpu, bool iabt, u64 addr)
>   {
> -	if (vcpu_el1_is_32bit(vcpu))
> +	if (unlikely(vcpu_is_rec(vcpu)))
> +		vcpu->arch.rec.run->enter.flags |= REC_ENTER_FLAG_INJECT_SEA;
> +	else if (vcpu_el1_is_32bit(vcpu))
>   		inject_abt32(vcpu, iabt, addr);
>   	else
>   		inject_abt64(vcpu, iabt, addr);
> diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
> index e2285ed8c91d..6a8cb927fcca 100644
> --- a/arch/arm64/kvm/mmio.c
> +++ b/arch/arm64/kvm/mmio.c
> @@ -6,6 +6,7 @@
>   
>   #include <linux/kvm_host.h>
>   #include <asm/kvm_emulate.h>
> +#include <asm/rmi_smc.h>
>   #include <trace/events/kvm.h>
>   
>   #include "trace.h"
> @@ -138,14 +139,21 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu)
>   		trace_kvm_mmio(KVM_TRACE_MMIO_READ, len, run->mmio.phys_addr,
>   			       &data);
>   		data = vcpu_data_host_to_guest(vcpu, data, len);
> -		vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
> +
> +		if (vcpu_is_rec(vcpu))
> +			vcpu->arch.rec.run->enter.gprs[0] = data;
> +		else
> +			vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
>   	}
>   
>   	/*
>   	 * The MMIO instruction is emulated and should not be re-executed
>   	 * in the guest.
>   	 */
> -	kvm_incr_pc(vcpu);
> +	if (vcpu_is_rec(vcpu))
> +		vcpu->arch.rec.run->enter.flags |= REC_ENTER_FLAG_EMULATED_MMIO;
> +	else
> +		kvm_incr_pc(vcpu);
>   
>   	return 1;
>   }
> @@ -167,14 +175,14 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
>   	 * No valid syndrome? Ask userspace for help if it has
>   	 * volunteered to do so, and bail out otherwise.
>   	 *
> -	 * In the protected VM case, there isn't much userspace can do
> +	 * In the protected/realm VM case, there isn't much userspace can do
>   	 * though, so directly deliver an exception to the guest.
>   	 */
>   	if (!kvm_vcpu_dabt_isvalid(vcpu)) {
>   		trace_kvm_mmio_nisv(*vcpu_pc(vcpu), esr,
>   				    kvm_vcpu_get_hfar(vcpu), fault_ipa);
>   
> -		if (vcpu_is_protected(vcpu))
> +		if (vcpu_is_protected(vcpu) || vcpu_is_rec(vcpu))
>   			return kvm_inject_sea_dabt(vcpu, kvm_vcpu_get_hfar(vcpu));
>   
>   		if (test_bit(KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER,
> diff --git a/arch/arm64/kvm/rmi-exit.c b/arch/arm64/kvm/rmi-exit.c
> index e7c51b6cf6ce..8ec0d179eba2 100644
> --- a/arch/arm64/kvm/rmi-exit.c
> +++ b/arch/arm64/kvm/rmi-exit.c
> @@ -25,6 +25,20 @@ static int rec_exit_reason_notimpl(struct kvm_vcpu *vcpu)
>   
>   static int rec_exit_sync_dabt(struct kvm_vcpu *vcpu)
>   {
> +	struct realm_rec *rec = &vcpu->arch.rec;
> +
> +	/*
> +	 * In the case of a write, copy over gprs[0] to the target GPR,
> +	 * preparing to handle MMIO write fault. The content to be written has
> +	 * been saved to gprs[0] by the RMM (even if another register was used
> +	 * by the guest). In the case of normal memory access this is redundant
> +	 * (the guest will replay the instruction), but the overhead is
> +	 * minimal.
> +	 */
> +	if (kvm_vcpu_dabt_iswrite(vcpu) && kvm_vcpu_dabt_isvalid(vcpu))
> +		vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu),
> +			     rec->run->exit.gprs[0]);
> +

{ } is needed here.

>   	return kvm_handle_guest_abort(vcpu);
>   }
>   

Thanks,
Gavin



* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Xu Yilun @ 2026-05-28  4:52 UTC (permalink / raw)
  To: Sohil Mehta
  Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, yilun.xu, baolu.lu, zhenzhong.duan, xiaoyao.li
In-Reply-To: <e9b8083b-6747-4adf-9e48-ffae70dc6508@intel.com>

On Wed, May 27, 2026 at 10:09:41AM -0700, Sohil Mehta wrote:
> On 5/27/2026 3:38 AM, Xu Yilun wrote:
> > 
> > Because for security purpose, these add-on features are always needed,
> > even if not all of them, so Extensions will most likely be enabled.
> > 
> 
> A cover letter is a good place to explain such nuances, alternate
> approaches, and tradeoffs.
> 
> > And even if someone switched them off all and saved the memory, compared
> > to the memory of a typical TDX capable system (lets say 1TB), the saving
> > is still little (0.001%).
> > 
> 
> In this case percentages make it harder to understand. Does it need a
> fixed amount of memory (~50MB) irrespective of the feature or the number
> of features? If so, it would be good to mention that.

No, the memory needed varies depending on the features and the number of
features enabled. But currently I see the total requirement is ~50MB.

Yes, I can drop the percentage and just state the amount in MB.

> 
> 
> >> In addition, could you briefly describe the complexity we are trading off?
> > 
> > If we delay the Extensions initialization to the first Extension
> > SEAMCALL, we need to maintain additional TDX state machine for
> > lifecycle, and we need mechanisms to synchronize parallel Extension
> > enabling request from multiple callers.
> 
> This would be good to include in the cover as well.

Yes.


* Re: [PATCH v14 26/44] arm64: RMI: Allow populating initial contents
From: Gavin Shan @ 2026-05-28  5:30 UTC (permalink / raw)
  To: Steven Price, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-27-steven.price@arm.com>

Hi Steve,

On 5/13/26 11:17 PM, Steven Price wrote:
> The VMM needs to populate the realm with some data before starting (e.g.
> a kernel and initrd). This is measured by the RMM and used as part of
> the attestation later on.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes since v13:
>   * Rename realm_create_protected_data_page() to realm_data_map_init().
> Changes since v12:
>   * The ioctl now updates the structure with the amount populated rather
>     than returning this through the ioctl return code.
>   * Use the new RMM v2.0 range based RMI calls.
>   * Adapt to upstream changes in kvm_gmem_populate().
> Changes since v11:
>   * The multiplex CAP is gone and there's a new ioctl which makes use of
>     the generic kvm_gmem_populate() functionality.
> Changes since v7:
>   * Improve the error codes.
>   * Other minor changes from review.
> Changes since v6:
>   * Handle host potentially having a larger page size than the RMM
>     granule.
>   * Drop historic "par" (protected address range) from
>     populate_par_region() - it doesn't exist within the current
>     architecture.
>   * Add a cond_resched() call in kvm_populate_realm().
> Changes since v5:
>   * Refactor to use PFNs rather than tracking struct page in
>     realm_create_protected_data_page().
>   * Pull changes from a later patch (in the v5 series) for accessing
>     pages from a guest memfd.
>   * Do the populate in chunks to avoid holding locks for too long and
>     triggering RCU stall warnings.
> ---
>   arch/arm64/include/asm/kvm_rmi.h |   4 ++
>   arch/arm64/kvm/Kconfig           |   1 +
>   arch/arm64/kvm/arm.c             |  13 ++++
>   arch/arm64/kvm/rmi.c             | 106 +++++++++++++++++++++++++++++++
>   4 files changed, 124 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
> index 007249a13dbc..a2b6bc412a22 100644
> --- a/arch/arm64/include/asm/kvm_rmi.h
> +++ b/arch/arm64/include/asm/kvm_rmi.h
> @@ -88,6 +88,10 @@ int kvm_rec_enter(struct kvm_vcpu *vcpu);
>   int kvm_rec_pre_enter(struct kvm_vcpu *vcpu);
>   int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_status);
>   
> +struct kvm_arm_rmi_populate;
> +
> +int kvm_arm_rmi_populate(struct kvm *kvm,
> +			 struct kvm_arm_rmi_populate *arg);
>   void kvm_realm_unmap_range(struct kvm *kvm,
>   			   unsigned long ipa,
>   			   unsigned long size,
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 4e16719fda22..d0cd011cf672 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -38,6 +38,7 @@ menuconfig KVM
>   	select GUEST_PERF_EVENTS if PERF_EVENTS
>   	select KVM_GUEST_MEMFD
>   	select KVM_GENERIC_MEMORY_ATTRIBUTES
> +	select HAVE_KVM_ARCH_GMEM_POPULATE
>   	help
>   	  Support hosting virtualized guest machines.
>   
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index ed88a203b892..073ba9181da9 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -2131,6 +2131,19 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>   			return -EFAULT;
>   		return kvm_vm_ioctl_get_reg_writable_masks(kvm, &range);
>   	}
> +	case KVM_ARM_RMI_POPULATE: {
> +		struct kvm_arm_rmi_populate req;
> +		int ret;
> +
> +		if (!kvm_is_realm(kvm))
> +			return -ENXIO;
> +		if (copy_from_user(&req, argp, sizeof(req)))
> +			return -EFAULT;
> +		ret = kvm_arm_rmi_populate(kvm, &req);
> +		if (copy_to_user(argp, &req, sizeof(req)))
> +			return -EFAULT;
> +		return ret;
> +	}

s/return ret/return 0; The variable 'ret' can be dropped.

>   	default:
>   		return -EINVAL;
>   	}
> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> index a89873a5eb77..209087bcf399 100644
> --- a/arch/arm64/kvm/rmi.c
> +++ b/arch/arm64/kvm/rmi.c
> @@ -486,6 +486,75 @@ void kvm_realm_unmap_range(struct kvm *kvm, unsigned long start,
>   		realm_unmap_private_range(kvm, start, end, may_block);
>   }
>   
> +static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
> +			       kvm_pfn_t dst_pfn, kvm_pfn_t src_pfn,
> +			       unsigned long flags)
> +{
> +	struct realm *realm = &kvm->arch.realm;
> +	phys_addr_t rd = virt_to_phys(realm->rd);
> +	phys_addr_t dst_phys, src_phys;
> +	int ret;
> +
> +	dst_phys = __pfn_to_phys(dst_pfn);
> +	src_phys = __pfn_to_phys(src_pfn);
> +
> +	if (rmi_delegate_page(dst_phys))
> +		return -ENXIO;
> +
> +	ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys, flags);
> +	if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> +		/* Create missing RTTs and retry */
> +		int level = RMI_RETURN_INDEX(ret);
> +
> +		KVM_BUG_ON(level == KVM_PGTABLE_LAST_LEVEL, kvm);

		KVM_BUG_ON(level >= KVM_PGTABLE_LAST_LEVEL, kvm);

> +
> +		ret = realm_create_rtt_levels(realm, ipa, level,
> +					      KVM_PGTABLE_LAST_LEVEL, NULL);
> +		if (!ret) {
> +			ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys,
> +						    flags);
> +		}
> +	}
> +
> +	if (ret) {
> +		if (WARN_ON(rmi_undelegate_page(dst_phys))) {
> +			/* Undelegate failed, so we leak the page */
> +			get_page(pfn_to_page(dst_pfn));
> +		}
> +	}
> +

	if (ret && WARN_ON(rmi_undelegate_page(dst_phys))) {
		/* Leak the page that fails to be undelegated */
		get_page(pfn_to_page(dst_pfn));
	}

> +	return ret;
> +}
> +
> +static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> +			      struct page *src_page, void *opaque)
> +{
> +	unsigned long data_flags = *(unsigned long *)opaque;
> +	phys_addr_t ipa = gfn_to_gpa(gfn);
> +
> +	if (!src_page)
> +		return -EOPNOTSUPP;
> +
> +	return realm_data_map_init(kvm, ipa, pfn, page_to_pfn(src_page),
> +				   data_flags);
> +}
> +
> +static long populate_region(struct kvm *kvm,
> +			    gfn_t base_gfn,
> +			    unsigned long pages,
> +			    u64 uaddr,
> +			    unsigned long data_flags)
> +{
> +	long ret = 0;
> +
> +	mutex_lock(&kvm->slots_lock);
> +	ret = kvm_gmem_populate(kvm, base_gfn, u64_to_user_ptr(uaddr), pages,
> +				populate_region_cb, &data_flags);
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	return ret;
> +}
> +
>   enum ripas_action {
>   	RIPAS_INIT,
>   	RIPAS_SET,
> @@ -574,6 +643,43 @@ static int realm_ensure_created(struct kvm *kvm)
>   	return -ENXIO;
>   }
>   
> +int kvm_arm_rmi_populate(struct kvm *kvm,
> +			 struct kvm_arm_rmi_populate *args)
> +{
> +	unsigned long data_flags = 0;
> +	unsigned long ipa_start = args->base;
> +	unsigned long ipa_end = ipa_start + args->size;
> +	long pages_populated;
> +	int ret;
> +
> +	if (args->reserved ||
> +	    (args->flags & ~KVM_ARM_RMI_POPULATE_FLAGS_MEASURE) ||
> +	    !IS_ALIGNED(ipa_start, PAGE_SIZE) ||
> +	    !IS_ALIGNED(ipa_end, PAGE_SIZE) ||
> +	    !IS_ALIGNED(args->source_uaddr, PAGE_SIZE))
> +		return -EINVAL;
> +

Two more conditions are missing here:

	args->size == 0, return 0;
	args->base + args->size < args->base, return -EINVAL;  // wrapped range
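
Something along these lines, as a standalone sketch. The model_* names and
the 4K page size are assumptions for illustration only, not the patch's code:

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE	4096ULL	/* assumption: 4K pages */

/*
 * Returns 1 if there is a range to populate, 0 for an empty range
 * (nothing to do), and -EINVAL for a misaligned or wrapping range.
 */
static int model_check_populate_range(uint64_t base, uint64_t size)
{
	/* Both ends page aligned: base and size aligned implies end aligned. */
	if ((base | size) & (MODEL_PAGE_SIZE - 1))
		return -EINVAL;

	/* Empty request: succeed without touching the realm. */
	if (size == 0)
		return 0;

	/* Unsigned addition wraps, so a wrapped end is smaller than base. */
	if (base + size < base)
		return -EINVAL;

	return 1;
}
```

In kernel code the wrap check would more likely be written with
check_add_overflow() from <linux/overflow.h>; the open-coded comparison is
used here only to keep the sketch free of kernel headers.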

> +	ret = realm_ensure_created(kvm);
> +	if (ret)
> +		return ret;
> +
> +	if (args->flags & KVM_ARM_RMI_POPULATE_FLAGS_MEASURE)
> +		data_flags |= RMI_MEASURE_CONTENT;
> +
> +	pages_populated = populate_region(kvm, gpa_to_gfn(ipa_start),
> +					  args->size >> PAGE_SHIFT,
> +					  args->source_uaddr, data_flags);
> +
> +	if (pages_populated < 0)
> +		return pages_populated;

pages_populated is 'long', while this function returns an 'int' value.

> +
> +	args->size -= pages_populated << PAGE_SHIFT;
> +	args->source_uaddr += pages_populated << PAGE_SHIFT;
> +	args->base += pages_populated << PAGE_SHIFT;
> +
> +	return 0;
> +}
> +
>   static void kvm_complete_ripas_change(struct kvm_vcpu *vcpu)
>   {
>   	struct kvm *kvm = vcpu->kvm;

Thanks,
Gavin



* Re: [RFC PATCH v4 01/14] coco: host: arm64: Add host TSM callback and IDE stream allocation support
From: Dan Williams (nvidia) @ 2026-05-28  5:47 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm), linux-coco, kvmarm, linux-arm-kernel,
	linux-kernel
  Cc: Aneesh Kumar K.V (Arm), Alexey Kardashevskiy, Catalin Marinas,
	Dan Williams, Jason Gunthorpe, Jonathan Cameron, Marc Zyngier,
	Samuel Ortiz, Steven Price, Suzuki K Poulose, Will Deacon,
	Xu Yilun
In-Reply-To: <20260427065121.916615-2-aneesh.kumar@kernel.org>

Aneesh Kumar K.V (Arm) wrote:
> Register the TSM callback when the DA feature is supported by KVM.
> 
> This driver handles IDE stream setup for both the root port and PCIe
> endpoints. Root port IDE stream enablement itself is managed by RMM.
> 
> In addition, the driver registers pci_tsm_ops with the TSM subsystem.

Do you want to call out that this is an infrastructure / scaffolding
patch that only handles the PCI-TSM skeleton? The CCA meat comes later,
in particular IDE key management. Tell a bit more of the story.

Otherwise, mostly looks good.

Minor comments below...

> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
>  arch/arm64/include/asm/rmi_smc.h         |   2 +
>  drivers/firmware/smccc/rmm.c             |  12 ++
>  drivers/firmware/smccc/rmm.h             |   8 +
>  drivers/firmware/smccc/smccc.c           |   1 +
>  drivers/virt/coco/Kconfig                |   2 +
>  drivers/virt/coco/Makefile               |   1 +
>  drivers/virt/coco/arm-cca-host/Kconfig   |  19 ++
>  drivers/virt/coco/arm-cca-host/Makefile  |   5 +
>  drivers/virt/coco/arm-cca-host/arm-cca.c | 225 +++++++++++++++++++++++
>  drivers/virt/coco/arm-cca-host/rmi-da.h  |  46 +++++
>  10 files changed, 321 insertions(+)
>  create mode 100644 drivers/virt/coco/arm-cca-host/Kconfig
>  create mode 100644 drivers/virt/coco/arm-cca-host/Makefile
>  create mode 100644 drivers/virt/coco/arm-cca-host/arm-cca.c
>  create mode 100644 drivers/virt/coco/arm-cca-host/rmi-da.h
> 
> diff --git a/arch/arm64/include/asm/rmi_smc.h b/arch/arm64/include/asm/rmi_smc.h
> index fa23818e1b4c..109d6cc6ef37 100644
> --- a/arch/arm64/include/asm/rmi_smc.h
> +++ b/arch/arm64/include/asm/rmi_smc.h
[..]
> diff --git a/drivers/firmware/smccc/rmm.c b/drivers/firmware/smccc/rmm.c
> index 2a6187df3285..7444cc3a588c 100644
> --- a/drivers/firmware/smccc/rmm.c
> +++ b/drivers/firmware/smccc/rmm.c
[..]
> diff --git a/drivers/firmware/smccc/rmm.h b/drivers/firmware/smccc/rmm.h
> index a47a650d4f51..37d0d95a099e 100644
> --- a/drivers/firmware/smccc/rmm.h
> +++ b/drivers/firmware/smccc/rmm.h
[..]
> diff --git a/drivers/firmware/smccc/smccc.c b/drivers/firmware/smccc/smccc.c
> index fc9b44b7c687..2bf2d59e686d 100644
> --- a/drivers/firmware/smccc/smccc.c
> +++ b/drivers/firmware/smccc/smccc.c
> @@ -97,6 +97,7 @@ static int __init smccc_devices_init(void)
>  		 * the required SMCCC function IDs at a supported revision.
>  		 */
>  		register_rsi_device(pdev);
> +		register_rmi_device(pdev);
>  	}

Would splitting the above three hunks make this series stand on its own
relative to the base CCA series? I assume likely not as soon as we get
to patch2.

Otherwise, just curious what your intended merge strategy is for this,
tsm.git or arm.git, and what help this needs?

[..]
snip code that looks good.

> diff --git a/drivers/virt/coco/arm-cca-host/Makefile b/drivers/virt/coco/arm-cca-host/Makefile
> new file mode 100644
> index 000000000000..c236827f002c
> --- /dev/null
> +++ b/drivers/virt/coco/arm-cca-host/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +obj-$(CONFIG_ARM_CCA_HOST) += arm-cca-host.o
> +
> +arm-cca-host-y	+=  arm-cca.o
> diff --git a/drivers/virt/coco/arm-cca-host/arm-cca.c b/drivers/virt/coco/arm-cca-host/arm-cca.c
> new file mode 100644
> index 000000000000..67f7e80106e8
> --- /dev/null
> +++ b/drivers/virt/coco/arm-cca-host/arm-cca.c
> @@ -0,0 +1,225 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2026 ARM Ltd.
> + */
> +
> +#include <linux/auxiliary_bus.h>
> +#include <linux/pci-tsm.h>
> +#include <linux/pci-ide.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/tsm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/cleanup.h>
> +
> +#include "rmi-da.h"
> +
> +/* Total number of stream id supported at root port level */
> +#define MAX_STREAM_ID	256
> +
> +static struct pci_tsm *cca_tsm_pci_probe(struct tsm_dev *tsm_dev, struct pci_dev *pdev)
> +{
> +	int ret;
> +
> +	if (!is_pci_tsm_pf0(pdev)) {
> +		struct cca_host_fn_dsc *fn_dsc __free(kfree) =
> +			kzalloc(sizeof(*fn_dsc), GFP_KERNEL);

kzalloc_obj(*fn_dsc)

> +
> +		if (!fn_dsc)
> +			return NULL;
> +
> +		ret = pci_tsm_link_constructor(pdev, &fn_dsc->pci, tsm_dev);
> +		if (ret)
> +			return NULL;
> +
> +		return &no_free_ptr(fn_dsc)->pci;
> +	}
> +
> +	if (!pdev->ide_cap)
> +		return NULL;

Bailing early?

Maybe the RMM knows something about this device not needing IDE? I have
a similar question in patch2 around trusted sources for whether a device
is internal or not. 

> +
> +	struct cca_host_pf0_ep_dsc *pf0_ep_dsc __free(kfree) =
> +		kzalloc(sizeof(*pf0_ep_dsc), GFP_KERNEL);
> +	if (!pf0_ep_dsc)
> +		return NULL;
> +
> +	ret = pci_tsm_pf0_constructor(pdev, &pf0_ep_dsc->pci, tsm_dev);
> +	if (ret)
> +		return NULL;
> +
> +	pci_dbg(pdev, "tsm enabled\n");
> +	return &no_free_ptr(pf0_ep_dsc)->pci.base_tsm;
> +}
> +
> +static void cca_tsm_pci_remove(struct pci_tsm *tsm)
> +{
> +	struct pci_dev *pdev = tsm->pdev;
> +
> +	if (is_pci_tsm_pf0(pdev)) {
> +		struct cca_host_pf0_ep_dsc *pf0_ep_dsc = to_cca_pf0_ep_dsc(pdev);
> +
> +		pci_tsm_pf0_destructor(&pf0_ep_dsc->pci);
> +		kfree(pf0_ep_dsc);
> +	} else {
> +		kfree(to_cca_fn_dsc(pdev));
> +	}
> +}
> +
> +/* For now global for simplicity. Protected by pci_tsm_rwsem */
> +static DECLARE_BITMAP(cca_stream_ids, MAX_STREAM_ID);
> +static int alloc_stream_id(struct pci_host_bridge *hb)
> +{
> +	int stream_id;
> +
> +redo_alloc:
> +	stream_id = find_first_zero_bit(cca_stream_ids, MAX_STREAM_ID);
> +	if (stream_id == MAX_STREAM_ID)
> +		return stream_id;
> +
> +	if (ida_exists(&hb->ide_stream_ids_ida, stream_id)) {
> +		/* mark the stream allocated in the global bitmap. */
> +		set_bit(stream_id, cca_stream_ids);
> +		goto redo_alloc;
> +	}
> +	return stream_id;

Is 256 total an RMM limit, and/or does it require globally unique
stream-ids? If not you could do what SEV-TIO does and just set stream-id
== stream-index.

> +}
> +
> +static inline bool cca_pdev_need_sel_ide_streams(struct pci_dev *pdev)
> +{
> +	return pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT;
> +}
> +
> +static int cca_tsm_connect(struct pci_dev *pdev)
> +{
> +	struct pci_dev *rp = pcie_find_root_port(pdev);
> +	struct cca_host_pf0_ep_dsc *pf0_ep_dsc;
> +	struct pci_ide *ide;
> +	int ret, stream_id = 0;
> +
> +	/* Only function 0 supports connect in host */
> +	if (WARN_ON(!is_pci_tsm_pf0(pdev)))
> +		return -EIO;
> +
> +	pf0_ep_dsc = to_cca_pf0_ep_dsc(pdev);
> +	if (cca_pdev_need_sel_ide_streams(pdev)) {
> +		/* Allocate stream id */
> +		stream_id = alloc_stream_id(pci_find_host_bridge(pdev->bus));
> +		if (stream_id == MAX_STREAM_ID)
> +			return -EBUSY;
> +		set_bit(stream_id, cca_stream_ids);
> +
> +		ide = pci_ide_stream_alloc(pdev);
> +		if (!ide) {
> +			ret = -ENOMEM;
> +			goto err_stream_alloc;
> +		}
> +
> +		pf0_ep_dsc->sel_stream = ide;
> +		ide->stream_id = stream_id;
> +		ret = pci_ide_stream_register(ide);
> +		if (ret)
> +			goto err_stream;
> +		/*
> +		 * Configure IDE capability for target device
> +		 *
> +		 * Some test devices work only with DEFAULT_STREAM enabled.
> +		 * For simplicity, enable DEFAULT_STREAM for all devices. A
> +		 * future decent solution may be to have a quirk table to
> +		 * specify which devices need DEFAULT_STREAM.
> +		 */
> +		ide->partner[PCI_IDE_EP].default_stream = 1;
> +		pci_ide_stream_setup(pdev, ide);
> +		pci_ide_stream_setup(rp, ide);
> +
> +		ret = tsm_ide_stream_register(ide);
> +		if (ret)
> +			goto err_tsm;
> +
> +		/*
> +		 * Once ide is setup, enable the stream at the endpoint
> +		 * Root port will be done by RMM
> +		 */
> +		pci_ide_stream_enable(pdev, ide);

The end point of these patches follows the spec recommendation of
delaying enable until after key programming.

> +	}
> +	return 0;

Should this be making security claims to userspace without taking any
action for non-endpoint devices that happen to be passed in?

Thinking about a bisection case this should either fail here, print a
message that is removed in the final enabling patch, or do the
__maybe_unused arrangement to land all the CCA bits first and then do
this hookup. Up to you.


* Re: [PATCH v14 28/44] arm64: RMI: Create the realm descriptor
From: Gavin Shan @ 2026-05-28  5:51 UTC (permalink / raw)
  To: Steven Price, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-29-steven.price@arm.com>

Hi Steve,

On 5/13/26 11:17 PM, Steven Price wrote:
> Creating a realm involves first creating a realm descriptor (RD). This
> involves passing the configuration information to the RMM. Do this as
> part of realm_ensure_created() so that the realm is created when it is
> first needed.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes since v13:
>   * The RMM no longer uses AUX granules, so no need to ask it how many it
>     needs.
>   * Adapted to other changes.
> Changes since v12:
>   * Since RMM page size is now equal to the host's page size various
>     calculations are simplified.
>   * Switch to using range based APIs to delegate/undelegate.
>   * VMID handling is now handled entirely by the RMM.
> ---
>   arch/arm64/kvm/rmi.c | 88 +++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 86 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> index fb96bcaa73ed..cae29fd3353c 100644
> --- a/arch/arm64/kvm/rmi.c
> +++ b/arch/arm64/kvm/rmi.c
> @@ -418,6 +418,77 @@ static void realm_unmap_shared_range(struct kvm *kvm,
>   			     start, end);
>   }
>   
> +static int realm_create_rd(struct kvm *kvm)
> +{
> +	struct realm *realm = &kvm->arch.realm;
> +	struct realm_params *params = realm->params;
> +	void *rd = NULL;
> +	phys_addr_t rd_phys, params_phys;
> +	size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
> +	int r;
> +
> +	realm->ia_bits = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
> +
> +	if (WARN_ON(realm->rd || !realm->params))
> +		return -EEXIST;
> +
> +	rd = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
> +	if (!rd)
> +		return -ENOMEM;
> +
> +	rd_phys = virt_to_phys(rd);
> +	if (rmi_delegate_page(rd_phys)) {
> +		r = -ENXIO;
> +		goto free_rd;
> +	}
> +
> +	if (rmi_delegate_range(kvm->arch.mmu.pgd_phys, pgd_size)) {
> +		r = -ENXIO;
> +		goto out_undelegate_tables;
> +	}
> +
> +	params->s2sz = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
> +	params->rtt_level_start = get_start_level(realm);
> +	params->rtt_num_start = pgd_size / PAGE_SIZE;
> +	params->rtt_base = kvm->arch.mmu.pgd_phys;
> +
> +	if (kvm->arch.arm_pmu) {
> +		params->pmu_num_ctrs = kvm->arch.nr_pmu_counters;
> +		params->flags |= RMI_REALM_PARAM_FLAG_PMU;
> +	}
> +
> +	if (kvm_lpa2_is_enabled())
> +		params->flags |= RMI_REALM_PARAM_FLAG_LPA2;
> +
> +	params_phys = virt_to_phys(params);
> +
> +	if (rmi_realm_create(rd_phys, params_phys)) {
> +		r = -ENXIO;
> +		goto out_undelegate_tables;
> +	}
> +
> +	realm->rd = rd;
> +	kvm_set_realm_state(kvm, REALM_STATE_NEW);
> +	/* The realm is up, free the parameters.  */
> +	free_page((unsigned long)realm->params);
> +	realm->params = NULL;
> +
> +	return 0;
> +
> +out_undelegate_tables:
> +	if (WARN_ON(rmi_undelegate_range(kvm->arch.mmu.pgd_phys, pgd_size))) {
> +		/* Leak the pages if they cannot be returned */
> +		kvm->arch.mmu.pgt = NULL;
> +	}

In the latest RMM implementation (topics/rmm-v2.0-poc_2), rmi_delegate_range() works
at granule (4KB) granularity and can fail on any granule. For example, if the root RTT
spans 16 granules and rmi_delegate_range() fails on the first one, we end up
undelegating all 16 granules even though none of them was ever delegated to the RMM.
That eventually leads to an error and a memory leak.

To handle this, rmi_delegate_range() could be improved to return the number of granules
that have been delegated. The caller can then use the return value to handle the error
case by passing the correct range to rmi_undelegate_range().
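
A rough userspace sketch of that idea, assuming a delegate call that
stops at the first failing granule and reports how many granules it
handled. Everything prefixed with sim_ is hypothetical and only models
the RMM with a boolean array; it is not the kernel or RMM interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_GRANULES 64

/* Hypothetical stand-in for the RMM's per-granule delegation state. */
static bool granule_delegated[MAX_GRANULES];
/* Index of the granule whose delegation should fail, or -1 for none. */
static long fail_at = -1;

/* Delegate one granule: 0 on success, -1 on (simulated) failure. */
static int sim_delegate_granule(size_t idx)
{
	if ((long)idx == fail_at)
		return -1;
	granule_delegated[idx] = true;
	return 0;
}

/* Undelegate one granule: -1 if it was never delegated. */
static int sim_undelegate_granule(size_t idx)
{
	if (!granule_delegated[idx])
		return -1;
	granule_delegated[idx] = false;
	return 0;
}

/*
 * Delegate up to @count granules starting at @first. Returns the number
 * of granules actually delegated; less than @count means the call
 * stopped at the first failing granule.
 */
static size_t sim_delegate_range(size_t first, size_t count)
{
	size_t done;

	assert(first + count <= MAX_GRANULES);
	for (done = 0; done < count; done++) {
		if (sim_delegate_granule(first + done))
			break;
	}
	return done;
}

/* Undelegate @count granules; returns how many of them failed. */
static size_t sim_undelegate_range(size_t first, size_t count)
{
	size_t i, errors = 0;

	assert(first + count <= MAX_GRANULES);
	for (i = 0; i < count; i++) {
		if (sim_undelegate_granule(first + i))
			errors++;
	}
	return errors;
}

/* All-or-nothing wrapper: roll back only the delegated prefix. */
static int sim_delegate_all(size_t first, size_t count)
{
	size_t done = sim_delegate_range(first, count);

	if (done == count)
		return 0;
	sim_undelegate_range(first, done);
	return -1;
}
```

With the count in hand, the error path touches only granules that were
really delegated, so a failure on the first granule undelegates nothing.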

> +	if (WARN_ON(rmi_undelegate_page(rd_phys))) {
> +		/* Leak the page if it isn't returned */
> +		return r;
> +	}
> +free_rd:
> +	free_page((unsigned long)rd);
> +	return r;
> +}
> +
>   static void realm_unmap_private_range(struct kvm *kvm,
>   				      unsigned long start,
>   				      unsigned long end,
> @@ -647,8 +718,21 @@ static int realm_init_ipa_state(struct kvm *kvm,
>   
>   static int realm_ensure_created(struct kvm *kvm)
>   {
> -	/* Provided in later patch */
> -	return -ENXIO;
> +	int ret;
> +
> +	switch (kvm_realm_state(kvm)) {
> +	case REALM_STATE_NONE:
> +		break;
> +	case REALM_STATE_NEW:
> +		return 0;
> +	case REALM_STATE_DEAD:
> +		return -ENXIO;
> +	default:
> +		return -EBUSY;
> +	}
> +
> +	ret = realm_create_rd(kvm);
> +	return ret;
>   }
>   
>   static int set_ripas_of_protected_regions(struct kvm *kvm)

Thanks,
Gavin


^ permalink raw reply

* Re: [PATCH v14 32/44] KVM: arm64: Handle Realm PSCI requests
From: Gavin Shan @ 2026-05-28  6:55 UTC (permalink / raw)
  To: Steven Price, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-33-steven.price@arm.com>

Hi Steve,

On 5/13/26 11:17 PM, Steven Price wrote:
> The RMM needs to be informed of the target REC when a PSCI call is made
> with an MPIDR argument.
> 
> This requirement will be removed in a future release of the RMM 2.0
> specification but is still required for v2.0-bet1.
> 
> Co-developed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes since v13:
>   * The ioctl KVM_ARM_VCPU_RMI_PSCI_COMPLETE has gone. The RMI call is
>     made automatically just before entering the REC again.
> Changes since v12:
>   * Change return code for non-realms to -ENXIO to better represent that
>     the ioctl is invalid for non-realms (checkpatch is insistent that
>     "ENOSYS means 'invalid syscall nr' and nothing else").
> Changes since v11:
>   * RMM->RMI renaming.
> Changes since v6:
>   * Use vcpu_is_rec() rather than kvm_is_realm(vcpu->kvm).
>   * Minor renaming/formatting fixes.
> ---
>   arch/arm64/include/asm/kvm_rmi.h |  3 ++
>   arch/arm64/kvm/psci.c            | 15 ++++++++-
>   arch/arm64/kvm/rmi.c             | 58 ++++++++++++++++++++++++++++++++
>   3 files changed, 75 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
> index b65cfec10dee..eacf82a7467d 100644
> --- a/arch/arm64/include/asm/kvm_rmi.h
> +++ b/arch/arm64/include/asm/kvm_rmi.h
> @@ -109,6 +109,9 @@ int realm_map_non_secure(struct realm *realm,
>   			 unsigned long size,
>   			 enum kvm_pgtable_prot prot,
>   			 struct kvm_mmu_memory_cache *memcache);
> +int realm_psci_complete(struct kvm_vcpu *source,
> +			struct kvm_vcpu *target,
> +			unsigned long status);
>   
>   static inline bool kvm_realm_is_private_address(struct realm *realm,
>   						unsigned long addr)
> diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
> index 3b5dbe9a0a0e..a2cd55dc7b5b 100644
> --- a/arch/arm64/kvm/psci.c
> +++ b/arch/arm64/kvm/psci.c
> @@ -103,7 +103,6 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
>   
>   	reset_state->reset = true;
>   	kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
> -

This change isn't supposed to be part of this patch :-)

>   	/*
>   	 * Make sure the reset request is observed if the RUNNABLE mp_state is
>   	 * observed.
> @@ -142,6 +141,20 @@ static unsigned long kvm_psci_vcpu_affinity_info(struct kvm_vcpu *vcpu)
>   	/* Ignore other bits of target affinity */
>   	target_affinity &= target_affinity_mask;
>   
> +	if (vcpu_is_rec(vcpu)) {
> +		struct kvm_vcpu *target_vcpu;
> +
> +		/* RMM supports only zero affinity level */
> +		if (lowest_affinity_level != 0)
> +			return PSCI_RET_INVALID_PARAMS;
> +
> +		target_vcpu = kvm_mpidr_to_vcpu(kvm, target_affinity);
> +		if (!target_vcpu)
> +			return PSCI_RET_INVALID_PARAMS;
> +
> +		return PSCI_RET_SUCCESS;
> +	}
> +
>   	/*
>   	 * If one or more VCPU matching target affinity are running
>   	 * then ON else OFF
> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> index 761b38a4071c..2b03e962ee41 100644
> --- a/arch/arm64/kvm/rmi.c
> +++ b/arch/arm64/kvm/rmi.c
> @@ -3,6 +3,7 @@
>    * Copyright (C) 2023-2025 ARM Ltd.
>    */
>   
> +#include <uapi/linux/psci.h>
>   #include <linux/kvm_host.h>
>   
>   #include <asm/kvm_emulate.h>
> @@ -127,6 +128,25 @@ static void free_rtt(phys_addr_t phys)
>   	kvm_account_pgtable_pages(phys_to_virt(phys), -1);
>   }
>   
> +int realm_psci_complete(struct kvm_vcpu *source, struct kvm_vcpu *target,
> +			unsigned long status)
> +{
> +	int ret;
> +
> +	/*
> +	 * XXX: RMM-v2.0 doesn't require the target REC address for completing
> +	 * PSCI requests. Temporary hack until RMM implementation catches up
> +	 * to the full spec.
> +	 */
> +	ret = rmi_psci_complete(virt_to_phys(source->arch.rec.rec_page),
> +				virt_to_phys(target->arch.rec.rec_page),
> +				status);
> +	if (ret)
> +		return -EINVAL;

		return -ENXIO;

> +
> +	return 0;
> +}
> +
>   static int realm_rtt_create(struct realm *realm,
>   			    unsigned long addr,
>   			    int level,
> @@ -1004,6 +1024,41 @@ static void kvm_complete_ripas_change(struct kvm_vcpu *vcpu)
>   	rec->run->exit.ripas_base = base;
>   }
>   
> +static void kvm_rec_complete_psci(struct kvm_vcpu *vcpu)
> +{
> +	struct rec_run *run = vcpu->arch.rec.run;
> +	unsigned long status = PSCI_RET_DENIED;
> +	unsigned long ret = vcpu_get_reg(vcpu, 0);
> +	struct kvm_vcpu *target;
> +
> +	switch (run->exit.gprs[0]) {
> +	/*
> +	 * XXX: RMM-v2.0 doesn't cause RMI_EXIT_PSCI for AFFINITY_INFO
> +	 * Temporary hack until tf-RMM gets the REC to MPIDR mapping via
> +	 * RD Auxiliary granules.
> +	 * For now always report SUCCESS
> +	 */
> +	case PSCI_0_2_FN64_AFFINITY_INFO:
> +		status = PSCI_RET_SUCCESS;
> +		break;
> +	case PSCI_0_2_FN64_CPU_ON: {
> +		if (ret != PSCI_RET_SUCCESS &&
> +		    ret != PSCI_RET_ALREADY_ON)
> +			status = PSCI_RET_DENIED;
> +		else
> +			status = PSCI_RET_SUCCESS;
> +		break;
> +	}
> +	default:
> +		return;
> +	}
> +
> +	target = kvm_mpidr_to_vcpu(vcpu->kvm, run->exit.gprs[1]);
> +	/* RMM makes sure that we don't get RMI_EXIT_PSCI for invalid mpidrs */
> +	if (target)
> +		realm_psci_complete(vcpu, target, status);
> +}
> +
>   /*
>    * kvm_rec_pre_enter - Complete operations before entering a REC
>    *
> @@ -1028,6 +1083,9 @@ int kvm_rec_pre_enter(struct kvm_vcpu *vcpu)
>   		for (int i = 0; i < REC_RUN_GPRS; i++)
>   			rec->run->enter.gprs[i] = vcpu_get_reg(vcpu, i);
>   		break;
> +	case RMI_EXIT_PSCI:
> +		kvm_rec_complete_psci(vcpu);
> +		break;
>   	case RMI_EXIT_RIPAS_CHANGE:
>   		kvm_complete_ripas_change(vcpu);
>   		break;

Thanks,
Gavin


^ permalink raw reply

* Re: SVSM Development Call May 27th, 2026
From: Jörg Rödel @ 2026-05-28  6:52 UTC (permalink / raw)
  To: Stefano Garzarella; +Cc: coconut-svsm, linux-coco
In-Reply-To: <CAGxU2F5hP=7pA1rKRvq5sgr0t2y1YoUzYCmH8hzaCS58U4+Y3A@mail.gmail.com>

Thanks Stefano for running this week's meeting! Also thanks to Nicola and Tanish
for introducing themselves and their GSoC projects. Welcome to the COCONUT
community :)

The minutes for the meeting are now available for review:

	https://github.com/coconut-svsm/governance/pull/109

-Joerg

^ permalink raw reply

* Re: [PATCH v14 14/44] arm64: RMI: Basic infrastructure for creating a realm.
From: Marc Zyngier @ 2026-05-28  7:10 UTC (permalink / raw)
  To: Steven Price
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-15-steven.price@arm.com>

On Wed, 13 May 2026 14:17:22 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> Introduce the skeleton functions for creating and destroying a realm.
> The IPA size requested is checked against what the RMM supports.
> 
> The actual work of constructing the realm will be added in future
> patches.

Again, $SUBJECT doesn't reflect that this is purely a KVM patch.

> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes since v13:
>  * Rebased and updated to RMM-v2.0-bet1.
>  * Auxiliary granules have been removed in RMM-v2.0-bet1
> Changes since v12:
>  * Drop the RMM_PAGE_{SHIFT,SIZE} defines - the RMM is now configured to
>    be the same as the host's page size.
>  * Rework delegate/undelegate functions to use the new RMI range based
>    operations.
> Changes since v11:
>  * Major rework to drop the realm configuration and make the
>    construction of realms implicit rather than driven by the VMM
>    directly.
>  * The code to create RDs, handle VMIDs etc is moved to later patches.
> Changes since v10:
>  * Rename from RME to RMI.
>  * Move the stage2 cleanup to a later patch.
> Changes since v9:
>  * Avoid walking the stage 2 page tables when destroying the realm -
>    the real ones are not accessible to the non-secure world, and the RMM
>    may leave junk in the physical pages when returning them.
>  * Fix an error path in realm_create_rd() to actually return an error value.
> Changes since v8:
>  * Fix free_delegated_granule() to not call kvm_account_pgtable_pages();
>    a separate wrapper will be introduced in a later patch to deal with
>    RTTs.
>  * Minor code cleanups following review.
> Changes since v7:
>  * Minor code cleanup following Gavin's review.
> Changes since v6:
>  * Separate RMM RTT calculations from host PAGE_SIZE. This allows the
>    host page size to be larger than 4k while still communicating with an
>    RMM which uses 4k granules.
> Changes since v5:
>  * Introduce free_delegated_granule() to replace many
>    undelegate/free_page() instances and centralise the comment on
>    leaking when the undelegate fails.
>  * Several other minor improvements suggested by reviews - thanks for
>    the feedback!
> Changes since v2:
>  * Improved commit description.
>  * Improved return failures for rmi_check_version().
>  * Clear contents of PGD after it has been undelegated in case the RMM
>    left stale data.
>  * Minor changes to reflect changes in previous patches.
> ---
>  arch/arm64/include/asm/kvm_emulate.h | 29 ++++++++++++++
>  arch/arm64/include/asm/kvm_rmi.h     | 51 +++++++++++++++++++++++++
>  arch/arm64/kvm/arm.c                 | 12 ++++++
>  arch/arm64/kvm/mmu.c                 | 12 +++++-
>  arch/arm64/kvm/rmi.c                 | 57 ++++++++++++++++++++++++++++
>  5 files changed, 159 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index 5bf3d7e1d92c..82fd777bd9bb 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -688,4 +688,33 @@ static inline void vcpu_set_hcrx(struct kvm_vcpu *vcpu)
>  			vcpu->arch.hcrx_el2 |= HCRX_EL2_EnASR;
>  	}
>  }
> +
> +static inline bool kvm_is_realm(struct kvm *kvm)
> +{
> +	if (static_branch_unlikely(&kvm_rmi_is_available))
> +		return kvm->arch.is_realm;
> +	return false;
> +}
> +
> +static inline enum realm_state kvm_realm_state(struct kvm *kvm)
> +{
> +	return READ_ONCE(kvm->arch.realm.state);
> +}
> +
> +static inline void kvm_set_realm_state(struct kvm *kvm,
> +				       enum realm_state new_state)
> +{
> +	WRITE_ONCE(kvm->arch.realm.state, new_state);
> +}
> +
> +static inline bool kvm_realm_is_created(struct kvm *kvm)
> +{
> +	return kvm_is_realm(kvm) && kvm_realm_state(kvm) != REALM_STATE_NONE;
> +}
> +
> +static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
> +{
> +	return false;
> +}
> +
>  #endif /* __ARM64_KVM_EMULATE_H__ */
> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
> index 4936007947fd..9de34983ee52 100644
> --- a/arch/arm64/include/asm/kvm_rmi.h
> +++ b/arch/arm64/include/asm/kvm_rmi.h
> @@ -6,12 +6,63 @@
>  #ifndef __ASM_KVM_RMI_H
>  #define __ASM_KVM_RMI_H
>  
> +#include <asm/rmi_smc.h>
> +
> +/**
> + * enum realm_state - State of a Realm
> + */
> +enum realm_state {
> +	/**
> +	 * @REALM_STATE_NONE:
> +	 *      Realm has not yet been created. rmi_realm_create() has not
> +	 *      yet been called.
> +	 */
> +	REALM_STATE_NONE,
> +	/**
> +	 * @REALM_STATE_NEW:
> +	 *      Realm is under construction, rmi_realm_create() has been
> +	 *      called, but it is not yet activated. Pages may be populated.
> +	 */
> +	REALM_STATE_NEW,
> +	/**
> +	 * @REALM_STATE_ACTIVE:
> +	 *      Realm has been created and is eligible for execution with
> +	 *      rmi_rec_enter(). Pages may no longer be populated with
> +	 *      rmi_data_create().
> +	 */
> +	REALM_STATE_ACTIVE,
> +	/**
> +	 * @REALM_STATE_DYING:
> +	 *      Realm is in the process of being destroyed or has already been
> +	 *      destroyed.
> +	 */
> +	REALM_STATE_DYING,
> +	/**
> +	 * @REALM_STATE_DEAD:
> +	 *      Realm has been destroyed.
> +	 */
> +	REALM_STATE_DEAD
> +};

What is the ABI status of this state? Is it purely internal to KVM? Or
is it something that the RMM actively tracks?

> +
>  /**
>   * struct realm - Additional per VM data for a Realm
> + *
> + * @state: The lifetime state machine for the realm
> + * @rd: Kernel mapping of the Realm Descriptor (RD)
> + * @params: Parameters for the RMI_REALM_CREATE command
> + * @ia_bits: Number of valid Input Address bits in the IPA
>   */
>  struct realm {
> +	enum realm_state state;
> +	void *rd;

Why is this void? Doesn't it have a proper type?

> +	struct realm_params *params;
> +	unsigned int ia_bits;

Consider reordering this structure to avoid holes.
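
To illustrate the holes, here is a small userspace sketch, assuming a
typical LP64 target where pointers are 8 bytes and an enum is 4 bytes.
The sketch_ names are hypothetical stand-ins that only mirror the field
types of the struct above, with void * used for both pointers.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_state { SKETCH_NONE, SKETCH_NEW };

/* Field order as posted: a 4-byte member on each side of the pointers. */
struct sketch_realm_posted {
	enum sketch_state state;
	void *rd;
	void *params;
	unsigned int ia_bits;
};

/* Pointers first, then the two 4-byte members packed together. */
struct sketch_realm_reordered {
	void *rd;
	void *params;
	enum sketch_state state;
	unsigned int ia_bits;
};

/* Sum of the member sizes, i.e. the size with no padding at all. */
static size_t sketch_payload(void)
{
	return sizeof(enum sketch_state) + 2 * sizeof(void *) +
	       sizeof(unsigned int);
}

/* Bytes of padding the compiler added to a struct of @struct_size. */
static size_t sketch_padding(size_t struct_size)
{
	assert(struct_size >= sketch_payload());
	return struct_size - sketch_payload();
}
```

Under the LP64 assumption, the posted order pads after state and after
ia_bits (32 bytes in total), while the reordered variant needs no
padding (24 bytes). pahole reports the same kind of information for the
real kernel structure.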

>  };
>  
>  void kvm_init_rmi(void);
> +u32 kvm_realm_ipa_limit(void);

The use of 'realm' is confusing. This is not a per-realm property, but
something global. I'd rather reserve the term 'realm' for CCA VMs (cue
the two prototypes below).

> +
> +int kvm_init_realm(struct kvm *kvm);
> +void kvm_destroy_realm(struct kvm *kvm);
>  
>  #endif /* __ASM_KVM_RMI_H */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 247e03b33035..18251e561524 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -264,6 +264,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  
>  	bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES);
>  
> +	/* Initialise the realm bits after the generic bits are enabled */
> +	if (kvm_is_realm(kvm)) {
> +		ret = kvm_init_realm(kvm);
> +		if (ret)
> +			goto err_uninit_mmu;
> +	}
> +
>  	return 0;
>  
>  err_uninit_mmu:
> @@ -326,6 +333,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  	kvm_unshare_hyp(kvm, kvm + 1);
>  
>  	kvm_arm_teardown_hypercalls(kvm);
> +	if (kvm_is_realm(kvm))
> +		kvm_destroy_realm(kvm);
>  }
>  
>  static bool kvm_has_full_ptr_auth(void)
> @@ -486,6 +495,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  		else
>  			r = kvm_supports_cacheable_pfnmap();
>  		break;
> +	case KVM_CAP_ARM_RMI:
> +		r = static_key_enabled(&kvm_rmi_is_available);
> +		break;
>  
>  	default:
>  		r = 0;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d089c107d9b7..ba8286472286 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -877,10 +877,14 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>  
>  static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
>  {
> +	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>  	u32 kvm_ipa_limit = get_kvm_ipa_limit();
>  	u64 mmfr0, mmfr1;
>  	u32 phys_shift;
>  
> +	if (kvm_is_realm(kvm))
> +		kvm_ipa_limit = kvm_realm_ipa_limit();
> +
>  	phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
>  	if (is_protected_kvm_enabled()) {
>  		phys_shift = kvm_ipa_limit;
> @@ -974,6 +978,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>  		return -EINVAL;
>  	}
>  
> +	mmu->arch = &kvm->arch;
> +
>  	err = kvm_init_ipa_range(mmu, type);
>  	if (err)
>  		return err;
> @@ -982,7 +988,6 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>  	if (!pgt)
>  		return -ENOMEM;
>  
> -	mmu->arch = &kvm->arch;

Why move this init?

>  	err = KVM_PGT_FN(kvm_pgtable_stage2_init)(pgt, mmu, &kvm_s2_mm_ops);
>  	if (err)
>  		goto out_free_pgtable;
> @@ -1114,7 +1119,10 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>  	write_unlock(&kvm->mmu_lock);
>  
>  	if (pgt) {
> -		kvm_stage2_destroy(pgt);
> +		if (!kvm_is_realm(kvm))
> +			kvm_stage2_destroy(pgt);
> +		else
> +			kvm_pgtable_stage2_destroy_pgd(pgt);

Why can't you make kvm_stage2_destroy() do the right thing? Surely the
PTs have to be reclaimed one way or another.

>  		kfree(pgt);
>  	}
>  }
> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> index 6e28b669ded2..f51ec667445e 100644
> --- a/arch/arm64/kvm/rmi.c
> +++ b/arch/arm64/kvm/rmi.c
> @@ -5,6 +5,8 @@
>  
>  #include <linux/kvm_host.h>
>  
> +#include <asm/kvm_emulate.h>
> +#include <asm/kvm_mmu.h>
>  #include <asm/kvm_pgtable.h>
>  #include <asm/rmi_cmds.h>
>  #include <asm/virt.h>
> @@ -14,6 +16,61 @@ static bool rmi_has_feature(unsigned long feature)
>  	return !!u64_get_bits(rmm_feat_reg0, feature);
>  }
>  
> +u32 kvm_realm_ipa_limit(void)
> +{
> +	return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
> +}
> +
> +void kvm_destroy_realm(struct kvm *kvm)
> +{
> +	struct realm *realm = &kvm->arch.realm;
> +	size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
> +
> +	if (realm->params) {
> +		free_page((unsigned long)realm->params);
> +		realm->params = NULL;
> +	}
> +
> +	if (!kvm_realm_is_created(kvm))
> +		return;
> +
> +	kvm_set_realm_state(kvm, REALM_STATE_DYING);
> +
> +	write_lock(&kvm->mmu_lock);
> +	kvm_stage2_unmap_range(&kvm->arch.mmu, 0,
> +			       BIT(realm->ia_bits - 1), true);
> +	write_unlock(&kvm->mmu_lock);
> +
> +	if (realm->rd) {
> +		phys_addr_t rd_phys = virt_to_phys(realm->rd);
> +
> +		if (WARN_ON(rmi_realm_terminate(rd_phys)))
> +			return;
> +
> +		if (WARN_ON(rmi_realm_destroy(rd_phys)))
> +			return;
> +		free_delegated_page(rd_phys);
> +		realm->rd = NULL;
> +	}
> +
> +	if (WARN_ON(rmi_undelegate_range(kvm->arch.mmu.pgd_phys, pgd_size)))
> +		return;
> +
> +	kvm_set_realm_state(kvm, REALM_STATE_DEAD);
> +
> +	/* Now that the Realm is destroyed, free the entry level RTTs */
> +	kvm_free_stage2_pgd(&kvm->arch.mmu);
> +}

This really needs documentation: what happens at each stage? What
memory is reclaimed when?

But even more importantly, why is this built in a completely parallel
way, potentially deviating from the existing KVM S2 management?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: Move Rick Edgecombe to TDX maintainer
From: Kiryl Shutsemau @ 2026-05-28  9:45 UTC (permalink / raw)
  To: Rick Edgecombe; +Cc: dave.hansen, x86, linux-coco, kvm, pbonzini, seanjc
In-Reply-To: <20260527221342.415814-1-rick.p.edgecombe@intel.com>

On Wed, May 27, 2026 at 03:13:42PM -0700, Rick Edgecombe wrote:
> Per some offline discussion with Kiryl, he could use some help on the TDX
> host side. I have worked on the TDX host side for the past few years
> including wrangling the initial KVM support, and can help with this.
> 
> I am already listed as TDX reviewer. Move it to maintainer.
> 
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Kiryl Shutsemau <kas@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH v3 2/2] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Kiryl Shutsemau @ 2026-05-28 10:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Rick Edgecombe, Kuppuswamy Sathyanarayanan,
	Kai Huang, Sean Christopherson, Borys Tsyrulnikov, linux-kernel,
	linux-coco, kvm, stable
In-Reply-To: <5ed6121c-314e-4cf0-9a11-b0661c87c694@intel.com>

On Wed, May 27, 2026 at 10:45:28AM -0700, Dave Hansen wrote:
> On 5/27/26 05:05, Kiryl Shutsemau (Meta) wrote:
> ...
> > -	/* Update part of the register affected by the emulated instruction */
> > -	regs->ax &= ~mask;
> > +	/*
> > +	 * IN writes the result into a sub-register of RAX. Only the
> > +	 * 32-bit form zero-extends; the smaller forms leave the upper
> > +	 * bits untouched:
> > +	 *
> > +	 *   insn  dest  size  bits written     bits preserved
> > +	 *   inb   AL    1     RAX[ 7: 0]       RAX[63: 8]
> > +	 *   inw   AX    2     RAX[15: 0]       RAX[63:16]
> > +	 *   inl   EAX   4     RAX[63: 0]       (none, zero-extended)
> > +	 *
> > +	 * 'mask' only covers the low 'size' bytes, which is exactly the
> > +	 * range affected for size 1 and 2. For size 4 the write also
> > +	 * clears RAX[63:32], so widen the clear-mask.
> > +	 */
> > +	if (size == 4)
> > +		regs->ax = 0;
> > +	else
> > +		regs->ax &= ~mask;
> > +
> 
> Is there any way we could do this with fewer comments and more code?
> 
> I mean, there are only three cases. Why have:
> 
> 	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
> 
> When there are only 3 possible cases:
> 
> 	1 => 0xff
> 	2 => 0xffff
> 	4 => 0xffffffff
> 
> and one of those cases needs a special case on top of it.
> 
> Maybe something like this?
> 
> 	/* Clear out part of RAX so part of args.r11 can be OR'd in: */
> 	switch (size) {
> 	case 1:
> 		/* inb consumes lower 8 bits of r11: */
> 		regs->ax &= ~GENMASK_ULL(7, 0);
> 		args.r11 &=  GENMASK_ULL(7, 0);
> 		break;
> 	case 2:
> 		/* inw consumes lower 16 bits of r11: */
> 		regs->ax &= ~GENMASK_ULL(15, 0);
> 		args.r11 &=  GENMASK_ULL(15, 0);
> 		break;
> 	case 4:
> 		/* inl is weird and zeros the whole register: */
> 		regs->ax &= ~GENMASK_ULL(63, 0);
> 		/* But only consumes 32-bits from r11: */
> 		args.r11 &=  GENMASK_ULL(31, 0);
> 		break;
> 	default:
> 		/* Probable TDX module bug. Illegal in[bwl] size: */
> 		WARN_ON_ONCE(1);
> 		success = 0;
> 	}
> 
> 	if (success)
> 		regs->ax |= args.r11;
> 
> It might need a temporary variable for args.r11, but you get the point.
> That's basically the data from the comment but written as code.

I hate how verbose it is. All these GENMASK_ULL() make it hard to
follow.

What about the patch below? It is inspired by KVM's assign_register().

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 65119362f9a2..460b9fbabf14 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -693,8 +693,8 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
 		.r13 = PORT_READ,
 		.r14 = port,
 	};
-	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
 	bool success;
+	u32 val;
 
 	/*
 	 * Emulate the I/O read via hypercall. More info about ABI can be found
@@ -703,10 +703,33 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
 	 */
 	success = !__tdx_hypercall(&args);
 
-	/* Update part of the register affected by the emulated instruction */
-	regs->ax &= ~mask;
 	if (success)
-		regs->ax |= args.r11 & mask;
+		val = args.r11;
+	else
+		val = 0;
+
+	/*
+	 * IN writes the result into a sub-register of RAX.
+	 *
+	 * Only the 32-bit form zero-extends; the smaller forms leave
+	 * the upper bits untouched.
+	 */
+	switch (size) {
+	case 1:
+		*(u8 *)&regs->ax = (u8)val;
+		break;
+	case 2:
+		*(u16 *)&regs->ax = (u16)val;
+		break;
+	case 4:
+		/* zero-extended */
+		regs->ax = val;
+		break;
+	default:
+		/* Probable TDX module bug. Illegal in[bwl] size. */
+		WARN_ON_ONCE(1);
+		break;
+	}
 
 	return success;
 }
-- 
  Kiryl Shutsemau / Kirill A. Shutemov
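
The sub-register semantics discussed in this thread can be checked in
isolation with a userspace sketch. It assumes RAX is modelled as a plain
uint64_t and uses masks instead of the pointer casts in the patch, so it
does not depend on byte order. emulate_in_result() is a hypothetical
helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Merge the result of an emulated IN into a 64-bit RAX value.
 *
 *   size 1 (inb): replaces RAX[7:0],  preserves RAX[63:8]
 *   size 2 (inw): replaces RAX[15:0], preserves RAX[63:16]
 *   size 4 (inl): replaces RAX[31:0] and zeroes RAX[63:32]
 *
 * Returns false for any other size and leaves *rax untouched.
 */
static bool emulate_in_result(uint64_t *rax, int size, uint32_t val)
{
	assert(rax);

	switch (size) {
	case 1:
		*rax = (*rax & ~UINT64_C(0xff)) | (val & 0xff);
		return true;
	case 2:
		*rax = (*rax & ~UINT64_C(0xffff)) | (val & 0xffff);
		return true;
	case 4:
		/* Assigning a 32-bit value zero-extends into 64 bits. */
		*rax = val;
		return true;
	default:
		return false;
	}
}
```

Only the size 4 case differs from a simple "clear the low bytes and OR
in the value" approach, which is the case the original mask-based code
got wrong.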

^ permalink raw reply related

* [PATCH v2 0/5] KVM/x86: Drop "1" as MSR emulation return value
From: Juergen Gross @ 2026-05-28 11:13 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-coco
  Cc: Juergen Gross, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Kiryl Shutsemau, Rick Edgecombe

Get rid of the literal "1" used as general error return value in KVM
MSR emulation. It can easily be replaced by negative errno values
instead.

This is meant to avoid confusion with the literal "1" used as return
value for "return to guest".
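
As a rough illustration of the two conventions, here is a userspace
sketch. All sketch_ and SKETCH_ names, the MSR index, and the reserved
bit mask are made up for the example and do not correspond to KVM code;
the caller is only one possible way to translate between the two
conventions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_MSR_INDEX	0x10u			/* hypothetical MSR */
#define SKETCH_RSVD_BITS	(~UINT64_C(0xff))	/* hypothetical mask */

/* Exit handler convention: 1 resumes the guest, 0 exits to userspace. */
enum { SKETCH_EXIT_TO_USER = 0, SKETCH_RESUME_GUEST = 1 };

/*
 * MSR emulation convention after the change: 0 on success, negative
 * errno on failure. A bare 1 is never returned, so the result cannot
 * be mistaken for SKETCH_RESUME_GUEST.
 */
static int sketch_set_msr(uint32_t index, uint64_t data, uint64_t *store)
{
	assert(store);

	if (index != SKETCH_MSR_INDEX)
		return -ENOENT;
	if (data & SKETCH_RSVD_BITS)
		return -EINVAL;

	*store = data;
	return 0;
}

/*
 * Hypothetical caller: maps the emulation result to "inject a fault"
 * and always reports "resume guest" in the exit handler convention.
 */
static int sketch_handle_wrmsr(uint32_t index, uint64_t data,
			       uint64_t *store, int *inject_fault)
{
	int ret = sketch_set_msr(index, data, store);

	assert(inject_fault);
	*inject_fault = ret < 0;
	return SKETCH_RESUME_GUEST;
}
```

With negative errno values, a check such as "ret < 0" is unambiguous at
the boundary between the two conventions.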

Changes in V2:
- series carved out from initial "KVM: Avoid literal numbers as return
  values" series
- don't use new KVM_MSR_RET_* defines, but 0 and -errno

Juergen Gross (5):
  KVM/x86: Change comment before KVM_MSR_RET_* defines
  KVM/x86: Return -errno instead of "1" for APIC related MSR emulation
  KVM/x86: Return -errno instead of "1" for Hyper-V related MSR
    emulation
  KVM/x86: Return -errno instead of "1" for VMX related MSR emulation
  KVM/x86: Return -errno instead of "1" for SVM related MSR emulation

 arch/x86/kvm/hyperv.c        | 72 +++++++++++++--------------
 arch/x86/kvm/lapic.c         | 39 +++++++--------
 arch/x86/kvm/svm/pmu.c       |  4 +-
 arch/x86/kvm/svm/svm.c       | 36 +++++++-------
 arch/x86/kvm/vmx/nested.c    |  2 +-
 arch/x86/kvm/vmx/pmu_intel.c | 16 +++---
 arch/x86/kvm/vmx/tdx.c       | 10 ++--
 arch/x86/kvm/vmx/vmx.c       | 96 ++++++++++++++++++------------------
 arch/x86/kvm/x86.h           |  4 +-
 9 files changed, 139 insertions(+), 140 deletions(-)

-- 
2.54.0


^ permalink raw reply

* [PATCH v2 4/5] KVM/x86: Return -errno instead of "1" for VMX related MSR emulation
From: Juergen Gross @ 2026-05-28 11:13 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-coco
  Cc: Juergen Gross, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe
In-Reply-To: <20260528111357.264809-1-jgross@suse.com>

Instead of a literal "1" for signalling an error, use a negative errno
value in the emulation code of VMX related MSR registers.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2:
- use -errno instead of KVM_MSR_RET_ERR
---
 arch/x86/kvm/vmx/nested.c    |  2 +-
 arch/x86/kvm/vmx/pmu_intel.c | 16 +++---
 arch/x86/kvm/vmx/tdx.c       | 10 ++--
 arch/x86/kvm/vmx/vmx.c       | 96 ++++++++++++++++++------------------
 4 files changed, 62 insertions(+), 62 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 3fe88f29be7a..2236f15ffab2 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1611,7 +1611,7 @@ int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata)
 		*pdata = msrs->vmfunc_controls;
 		break;
 	default:
-		return 1;
+		return -EINVAL;
 	}
 
 	return 0;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 27eb76e6b6a0..4f7e354c4b50 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -362,7 +362,7 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		} else if (intel_pmu_handle_lbr_msrs_access(vcpu, msr_info, true)) {
 			break;
 		}
-		return 1;
+		return -EINVAL;
 	}
 
 	return 0;
@@ -379,14 +379,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	switch (msr) {
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
 		if (data & pmu->fixed_ctr_ctrl_rsvd)
-			return 1;
+			return -EINVAL;
 
 		if (pmu->fixed_ctr_ctrl != data)
 			reprogram_fixed_counters(pmu, data);
 		break;
 	case MSR_IA32_PEBS_ENABLE:
 		if (data & pmu->pebs_enable_rsvd)
-			return 1;
+			return -EINVAL;
 
 		if (pmu->pebs_enable != data) {
 			diff = pmu->pebs_enable ^ data;
@@ -396,13 +396,13 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_DS_AREA:
 		if (is_noncanonical_msr_address(data, vcpu))
-			return 1;
+			return -EINVAL;
 
 		pmu->ds_area = data;
 		break;
 	case MSR_PEBS_DATA_CFG:
 		if (data & pmu->pebs_data_cfg_rsvd)
-			return 1;
+			return -EINVAL;
 
 		pmu->pebs_data_cfg = data;
 		break;
@@ -411,7 +411,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		    (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
 			if ((msr & MSR_PMC_FULL_WIDTH_BIT) &&
 			    (data & ~pmu->counter_bitmask[KVM_PMC_GP]))
-				return 1;
+				return -EINVAL;
 
 			if (!msr_info->host_initiated &&
 			    !(msr & MSR_PMC_FULL_WIDTH_BIT))
@@ -427,7 +427,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			    (pmu->raw_event_mask & HSW_IN_TX_CHECKPOINTED))
 				reserved_bits ^= HSW_IN_TX_CHECKPOINTED;
 			if (data & reserved_bits)
-				return 1;
+				return -EINVAL;
 
 			if (data != pmc->eventsel) {
 				pmc->eventsel = data;
@@ -439,7 +439,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			break;
 		}
 		/* Not a known PMU MSR. */
-		return 1;
+		return -EINVAL;
 	}
 
 	return 0;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 04ce321ebdf3..acc3242af4f4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2158,12 +2158,12 @@ int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 		return 0;
 	case MSR_IA32_MCG_EXT_CTL:
 		if (!msr->host_initiated && !(vcpu->arch.mcg_cap & MCG_LMCE_P))
-			return 1;
+			return -EINVAL;
 		msr->data = vcpu->arch.mcg_ext_ctl;
 		return 0;
 	default:
 		if (!tdx_has_emulated_msr(msr->index))
-			return 1;
+			return -EACCES;
 
 		return kvm_get_msr_common(vcpu, msr);
 	}
@@ -2175,15 +2175,15 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_IA32_MCG_EXT_CTL:
 		if ((!msr->host_initiated && !(vcpu->arch.mcg_cap & MCG_LMCE_P)) ||
 		    (msr->data & ~MCG_EXT_CTL_LMCE_EN))
-			return 1;
+			return -EINVAL;
 		vcpu->arch.mcg_ext_ctl = msr->data;
 		return 0;
 	default:
 		if (tdx_is_read_only_msr(msr->index))
-			return 1;
+			return -EACCES;
 
 		if (!tdx_has_emulated_msr(msr->index))
-			return 1;
+			return -EACCES;
 
 		return kvm_set_msr_common(vcpu, msr);
 	}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b9103de01428..2eee599fca30 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2076,7 +2076,7 @@ int vmx_get_feature_msr(u32 msr, u64 *data)
 	switch (msr) {
 	case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR:
 		if (!nested)
-			return 1;
+			return -EINVAL;
 		return vmx_get_vmx_msr(&vmcs_config.nested, msr, data);
 	default:
 		return KVM_MSR_RET_UNSUPPORTED;
@@ -2111,18 +2111,18 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_TSX_CTRL:
 		if (!msr_info->host_initiated &&
 		    !(vcpu->arch.arch_capabilities & ARCH_CAP_TSX_CTRL_MSR))
-			return 1;
+			return -EINVAL;
 		goto find_uret_msr;
 	case MSR_IA32_UMWAIT_CONTROL:
 		if (!msr_info->host_initiated && !vmx_has_waitpkg(vmx))
-			return 1;
+			return -EINVAL;
 
 		msr_info->data = vmx->msr_ia32_umwait_control;
 		break;
 	case MSR_IA32_SPEC_CTRL:
 		if (!msr_info->host_initiated &&
 		    !guest_has_spec_ctrl_msr(vcpu))
-			return 1;
+			return -EINVAL;
 
 		msr_info->data = to_vmx(vcpu)->spec_ctrl;
 		break;
@@ -2139,14 +2139,14 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!kvm_mpx_supported() ||
 		    (!msr_info->host_initiated &&
 		     !guest_cpu_cap_has(vcpu, X86_FEATURE_MPX)))
-			return 1;
+			return -EINVAL;
 		msr_info->data = vmcs_read64(GUEST_BNDCFGS);
 		break;
 	case MSR_IA32_MCG_EXT_CTL:
 		if (!msr_info->host_initiated &&
 		    !(vmx->msr_ia32_feature_control &
 		      FEAT_CTL_LMCE_ENABLED))
-			return 1;
+			return -EINVAL;
 		msr_info->data = vcpu->arch.mcg_ext_ctl;
 		break;
 	case MSR_IA32_FEAT_CTL:
@@ -2155,16 +2155,16 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_SGXLEPUBKEYHASH0 ... MSR_IA32_SGXLEPUBKEYHASH3:
 		if (!msr_info->host_initiated &&
 		    !guest_cpu_cap_has(vcpu, X86_FEATURE_SGX_LC))
-			return 1;
+			return -EINVAL;
 		msr_info->data = to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash
 			[msr_info->index - MSR_IA32_SGXLEPUBKEYHASH0];
 		break;
 	case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR:
 		if (!guest_cpu_cap_has(vcpu, X86_FEATURE_VMX))
-			return 1;
+			return -EINVAL;
 		if (vmx_get_vmx_msr(&vmx->nested.msrs, msr_info->index,
 				    &msr_info->data))
-			return 1;
+			return -EINVAL;
 #ifdef CONFIG_KVM_HYPERV
 		/*
 		 * Enlightened VMCS v1 doesn't have certain VMCS fields but
@@ -2180,19 +2180,19 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_RTIT_CTL:
 		if (!vmx_pt_mode_is_host_guest())
-			return 1;
+			return -EINVAL;
 		msr_info->data = vmx->pt_desc.guest.ctl;
 		break;
 	case MSR_IA32_RTIT_STATUS:
 		if (!vmx_pt_mode_is_host_guest())
-			return 1;
+			return -EINVAL;
 		msr_info->data = vmx->pt_desc.guest.status;
 		break;
 	case MSR_IA32_RTIT_CR3_MATCH:
 		if (!vmx_pt_mode_is_host_guest() ||
 			!intel_pt_validate_cap(vmx->pt_desc.caps,
 						PT_CAP_cr3_filtering))
-			return 1;
+			return -EINVAL;
 		msr_info->data = vmx->pt_desc.guest.cr3_match;
 		break;
 	case MSR_IA32_RTIT_OUTPUT_BASE:
@@ -2201,7 +2201,7 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 					PT_CAP_topa_output) &&
 			 !intel_pt_validate_cap(vmx->pt_desc.caps,
 					PT_CAP_single_range_output)))
-			return 1;
+			return -EINVAL;
 		msr_info->data = vmx->pt_desc.guest.output_base;
 		break;
 	case MSR_IA32_RTIT_OUTPUT_MASK:
@@ -2210,14 +2210,14 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 					PT_CAP_topa_output) &&
 			 !intel_pt_validate_cap(vmx->pt_desc.caps,
 					PT_CAP_single_range_output)))
-			return 1;
+			return -EINVAL;
 		msr_info->data = vmx->pt_desc.guest.output_mask;
 		break;
 	case MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B:
 		index = msr_info->index - MSR_IA32_RTIT_ADDR0_A;
 		if (!vmx_pt_mode_is_host_guest() ||
 		    (index >= 2 * vmx->pt_desc.num_address_ranges))
-			return 1;
+			return -EINVAL;
 		if (index % 2)
 			msr_info->data = vmx->pt_desc.guest.addr_b[index / 2];
 		else
@@ -2359,7 +2359,7 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_DEBUGCTLMSR:
 		if (!vmx_is_valid_debugctl(vcpu, data, msr_info->host_initiated))
-			return 1;
+			return -EINVAL;
 
 		data &= vmx_get_supported_debugctl(vcpu, msr_info->host_initiated);
 
@@ -2377,10 +2377,10 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!kvm_mpx_supported() ||
 		    (!msr_info->host_initiated &&
 		     !guest_cpu_cap_has(vcpu, X86_FEATURE_MPX)))
-			return 1;
+			return -EINVAL;
 		if (is_noncanonical_msr_address(data & PAGE_MASK, vcpu) ||
 		    (data & MSR_IA32_BNDCFGS_RSVD))
-			return 1;
+			return -EINVAL;
 
 		if (is_guest_mode(vcpu) &&
 		    ((vmx->nested.msrs.entry_ctls_high & VM_ENTRY_LOAD_BNDCFGS) ||
@@ -2391,21 +2391,21 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_UMWAIT_CONTROL:
 		if (!msr_info->host_initiated && !vmx_has_waitpkg(vmx))
-			return 1;
+			return -EINVAL;
 
 		/* The reserved bit 1 and non-32 bit [63:32] should be zero */
 		if (data & (BIT_ULL(1) | GENMASK_ULL(63, 32)))
-			return 1;
+			return -EINVAL;
 
 		vmx->msr_ia32_umwait_control = data;
 		break;
 	case MSR_IA32_SPEC_CTRL:
 		if (!msr_info->host_initiated &&
 		    !guest_has_spec_ctrl_msr(vcpu))
-			return 1;
+			return -EINVAL;
 
 		if (kvm_spec_ctrl_test_value(data))
-			return 1;
+			return -EINVAL;
 
 		vmx->spec_ctrl = data;
 		if (!data)
@@ -2430,9 +2430,9 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_TSX_CTRL:
 		if (!msr_info->host_initiated &&
 		    !(vcpu->arch.arch_capabilities & ARCH_CAP_TSX_CTRL_MSR))
-			return 1;
+			return -EINVAL;
 		if (data & ~(TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR))
-			return 1;
+			return -EINVAL;
 		goto find_uret_msr;
 	case MSR_IA32_CR_PAT:
 		ret = kvm_set_msr_common(vcpu, msr_info);
@@ -2451,12 +2451,12 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		     !(to_vmx(vcpu)->msr_ia32_feature_control &
 		       FEAT_CTL_LMCE_ENABLED)) ||
 		    (data & ~MCG_EXT_CTL_LMCE_EN))
-			return 1;
+			return -EINVAL;
 		vcpu->arch.mcg_ext_ctl = data;
 		break;
 	case MSR_IA32_FEAT_CTL:
 		if (!is_vmx_feature_control_msr_valid(vmx, msr_info))
-			return 1;
+			return -EINVAL;
 
 		vmx->msr_ia32_feature_control = data;
 		if (msr_info->host_initiated && data == 0)
@@ -2481,70 +2481,70 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		    (!guest_cpu_cap_has(vcpu, X86_FEATURE_SGX_LC) ||
 		    ((vmx->msr_ia32_feature_control & FEAT_CTL_LOCKED) &&
 		    !(vmx->msr_ia32_feature_control & FEAT_CTL_SGX_LC_ENABLED))))
-			return 1;
+			return -EINVAL;
 		vmx->msr_ia32_sgxlepubkeyhash
 			[msr_index - MSR_IA32_SGXLEPUBKEYHASH0] = data;
 		break;
 	case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR:
 		if (!msr_info->host_initiated)
-			return 1; /* they are read-only */
+			return -EINVAL; /* they are read-only */
 		if (!guest_cpu_cap_has(vcpu, X86_FEATURE_VMX))
-			return 1;
+			return -EINVAL;
 		return vmx_set_vmx_msr(vcpu, msr_index, data);
 	case MSR_IA32_RTIT_CTL:
 		if (!vmx_pt_mode_is_host_guest() ||
 			vmx_rtit_ctl_check(vcpu, data) ||
 			vmx->nested.vmxon)
-			return 1;
+			return -EINVAL;
 		vmcs_write64(GUEST_IA32_RTIT_CTL, data);
 		vmx->pt_desc.guest.ctl = data;
 		pt_update_intercept_for_msr(vcpu);
 		break;
 	case MSR_IA32_RTIT_STATUS:
 		if (!pt_can_write_msr(vmx))
-			return 1;
+			return -EINVAL;
 		if (data & MSR_IA32_RTIT_STATUS_MASK)
-			return 1;
+			return -EINVAL;
 		vmx->pt_desc.guest.status = data;
 		break;
 	case MSR_IA32_RTIT_CR3_MATCH:
 		if (!pt_can_write_msr(vmx))
-			return 1;
+			return -EINVAL;
 		if (!intel_pt_validate_cap(vmx->pt_desc.caps,
 					   PT_CAP_cr3_filtering))
-			return 1;
+			return -EINVAL;
 		vmx->pt_desc.guest.cr3_match = data;
 		break;
 	case MSR_IA32_RTIT_OUTPUT_BASE:
 		if (!pt_can_write_msr(vmx))
-			return 1;
+			return -EINVAL;
 		if (!intel_pt_validate_cap(vmx->pt_desc.caps,
 					   PT_CAP_topa_output) &&
 		    !intel_pt_validate_cap(vmx->pt_desc.caps,
 					   PT_CAP_single_range_output))
-			return 1;
+			return -EINVAL;
 		if (!pt_output_base_valid(vcpu, data))
-			return 1;
+			return -EINVAL;
 		vmx->pt_desc.guest.output_base = data;
 		break;
 	case MSR_IA32_RTIT_OUTPUT_MASK:
 		if (!pt_can_write_msr(vmx))
-			return 1;
+			return -EINVAL;
 		if (!intel_pt_validate_cap(vmx->pt_desc.caps,
 					   PT_CAP_topa_output) &&
 		    !intel_pt_validate_cap(vmx->pt_desc.caps,
 					   PT_CAP_single_range_output))
-			return 1;
+			return -EINVAL;
 		vmx->pt_desc.guest.output_mask = data;
 		break;
 	case MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B:
 		if (!pt_can_write_msr(vmx))
-			return 1;
+			return -EINVAL;
 		index = msr_info->index - MSR_IA32_RTIT_ADDR0_A;
 		if (index >= 2 * vmx->pt_desc.num_address_ranges)
-			return 1;
+			return -EINVAL;
 		if (is_noncanonical_msr_address(data, vcpu))
-			return 1;
+			return -EINVAL;
 		if (index % 2)
 			vmx->pt_desc.guest.addr_b[index / 2] = data;
 		else
@@ -2563,20 +2563,20 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (data & PERF_CAP_LBR_FMT) {
 			if ((data & PERF_CAP_LBR_FMT) !=
 			    (kvm_caps.supported_perf_cap & PERF_CAP_LBR_FMT))
-				return 1;
+				return -EINVAL;
 			if (!cpuid_model_is_consistent(vcpu))
-				return 1;
+				return -EINVAL;
 		}
 		if (data & PERF_CAP_PEBS_FORMAT) {
 			if ((data & PERF_CAP_PEBS_MASK) !=
 			    (kvm_caps.supported_perf_cap & PERF_CAP_PEBS_MASK))
-				return 1;
+				return -EINVAL;
 			if (!guest_cpu_cap_has(vcpu, X86_FEATURE_DS))
-				return 1;
+				return -EINVAL;
 			if (!guest_cpu_cap_has(vcpu, X86_FEATURE_DTES64))
-				return 1;
+				return -EINVAL;
 			if (!cpuid_model_is_consistent(vcpu))
-				return 1;
+				return -EINVAL;
 		}
 		ret = kvm_set_msr_common(vcpu, msr_info);
 		break;
-- 
2.54.0



* Re: [PATCH v2 0/5] KVM/x86: Drop "1" as MSR emulation return value
From: Juergen Gross @ 2026-05-28 11:35 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-coco
  Cc: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Vitaly Kuznetsov,
	Kiryl Shutsemau, Rick Edgecombe
In-Reply-To: <20260528111357.264809-1-jgross@suse.com>


On 28.05.26 13:13, Juergen Gross wrote:
> Get rid of the literal "1" used as general error return value in KVM
> MSR emulation. It can easily be replaced by negative errno values
> instead.
> 
> This is meant to avoid confusion with the literal "1" used as return
> value for "return to guest".
> 
> Changes in V2:
> - series carved out from initial "KVM: Avoid literal numbers as return
>    values" series
> - don't use new KVM_MSR_RET_* defines, but 0 and -errno
> 
> Juergen Gross (5):
>    KVM/x86: Change comment before KVM_MSR_RET_* defines
>    KVM/x86: Return -errno instead of "1" for APIC related MSR emulation
>    KVM/x86: Return -errno instead of "1" for Hyper-V related MSR
>      emulation
>    KVM/x86: Return -errno instead of "1" for VMX related MSR emulation
>    KVM/x86: Return -errno instead of "1" for SVM related MSR emulation
> 
>   arch/x86/kvm/hyperv.c        | 72 +++++++++++++--------------
>   arch/x86/kvm/lapic.c         | 39 +++++++--------
>   arch/x86/kvm/svm/pmu.c       |  4 +-
>   arch/x86/kvm/svm/svm.c       | 36 +++++++-------
>   arch/x86/kvm/vmx/nested.c    |  2 +-
>   arch/x86/kvm/vmx/pmu_intel.c | 16 +++---
>   arch/x86/kvm/vmx/tdx.c       | 10 ++--
>   arch/x86/kvm/vmx/vmx.c       | 96 ++++++++++++++++++------------------
>   arch/x86/kvm/x86.h           |  4 +-
>   9 files changed, 139 insertions(+), 140 deletions(-)
> 

Oh, sorry, there is another patch. Resending.


Juergen


* [PATCH v2 0/6] KVM/x86: Drop "1" as MSR emulation return value
From: Juergen Gross @ 2026-05-28 11:35 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-coco
  Cc: Juergen Gross, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Kiryl Shutsemau, Rick Edgecombe,
	David Woodhouse, Paul Durrant

Get rid of the literal "1" used as a general error return value in KVM
MSR emulation. It can easily be replaced by negative errno values.

This is meant to avoid confusion with the literal "1" used as the return
value for "return to guest".
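
To illustrate the ambiguity described above, here is a minimal sketch, not
taken from the series itself. All names prefixed with sketch_ / SKETCH_ are
hypothetical and exist only for this example; the only assumption carried
over from the text is that "1" elsewhere means "return to guest".

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical constant: elsewhere, a literal 1 means "return to guest". */
enum { SKETCH_RETURN_TO_GUEST = 1 };

/* Old style (hypothetical handler): a literal 1 signals an error. */
static int sketch_set_msr_old(uint64_t data, uint64_t reserved_bits)
{
	if (data & reserved_bits)
		return 1;	/* numerically equal to SKETCH_RETURN_TO_GUEST */
	return 0;
}

/* New style (hypothetical handler): 0 on success, negative errno on error. */
static int sketch_set_msr_new(uint64_t data, uint64_t reserved_bits)
{
	if (data & reserved_bits)
		return -EINVAL;	/* cannot be mistaken for "return to guest" */
	return 0;
}

/* With the new convention a caller can test for failure with a sign check. */
static int sketch_msr_failed(int ret)
{
	return ret < 0;
}
```

With the old style, a reader seeing "return 1" has to know which convention
the enclosing function follows. With the new style the sign alone tells an
error apart from both success and "return to guest".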

Changes in V2:
- series carved out from initial "KVM: Avoid literal numbers as return
  values" series
- don't use new KVM_MSR_RET_* defines, but 0 and -errno

Juergen Gross (6):
  KVM/x86: Change comment before KVM_MSR_RET_* defines
  KVM/x86: Return -errno instead of "1" for APIC related MSR emulation
  KVM/x86: Return -errno instead of "1" for Hyper-V related MSR
    emulation
  KVM/x86: Return -errno instead of "1" for VMX related MSR emulation
  KVM/x86: Return -errno instead of "1" for SVM related MSR emulation
  KVM/x86: Return -errno instead of "1" for common MSR emulation

 arch/x86/kvm/hyperv.c        |  72 ++++++++++++-------------
 arch/x86/kvm/lapic.c         |  39 +++++++-------
 arch/x86/kvm/mtrr.c          |   6 +--
 arch/x86/kvm/pmu.c           |   8 +--
 arch/x86/kvm/svm/pmu.c       |   4 +-
 arch/x86/kvm/svm/svm.c       |  36 ++++++-------
 arch/x86/kvm/vmx/nested.c    |   2 +-
 arch/x86/kvm/vmx/pmu_intel.c |  16 +++---
 arch/x86/kvm/vmx/tdx.c       |  10 ++--
 arch/x86/kvm/vmx/vmx.c       |  96 ++++++++++++++++-----------------
 arch/x86/kvm/x86.c           | 102 +++++++++++++++++------------------
 arch/x86/kvm/x86.h           |   4 +-
 arch/x86/kvm/xen.c           |  10 ++--
 13 files changed, 202 insertions(+), 203 deletions(-)

-- 
2.54.0



* [PATCH v2 4/6] KVM/x86: Return -errno instead of "1" for VMX related MSR emulation
From: Juergen Gross @ 2026-05-28 11:36 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-coco
  Cc: Juergen Gross, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe
In-Reply-To: <20260528113605.267111-1-jgross@suse.com>

Instead of a literal "1" for signalling an error, use a negative errno
value in the emulation code of VMX-related MSRs.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2:
- use -errno instead of KVM_MSR_RET_ERR
---
 arch/x86/kvm/vmx/nested.c    |  2 +-
 arch/x86/kvm/vmx/pmu_intel.c | 16 +++---
 arch/x86/kvm/vmx/tdx.c       | 10 ++--
 arch/x86/kvm/vmx/vmx.c       | 96 ++++++++++++++++++------------------
 4 files changed, 62 insertions(+), 62 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 3fe88f29be7a..2236f15ffab2 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1611,7 +1611,7 @@ int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata)
 		*pdata = msrs->vmfunc_controls;
 		break;
 	default:
-		return 1;
+		return -EINVAL;
 	}
 
 	return 0;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 27eb76e6b6a0..4f7e354c4b50 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -362,7 +362,7 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		} else if (intel_pmu_handle_lbr_msrs_access(vcpu, msr_info, true)) {
 			break;
 		}
-		return 1;
+		return -EINVAL;
 	}
 
 	return 0;
@@ -379,14 +379,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	switch (msr) {
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
 		if (data & pmu->fixed_ctr_ctrl_rsvd)
-			return 1;
+			return -EINVAL;
 
 		if (pmu->fixed_ctr_ctrl != data)
 			reprogram_fixed_counters(pmu, data);
 		break;
 	case MSR_IA32_PEBS_ENABLE:
 		if (data & pmu->pebs_enable_rsvd)
-			return 1;
+			return -EINVAL;
 
 		if (pmu->pebs_enable != data) {
 			diff = pmu->pebs_enable ^ data;
@@ -396,13 +396,13 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_DS_AREA:
 		if (is_noncanonical_msr_address(data, vcpu))
-			return 1;
+			return -EINVAL;
 
 		pmu->ds_area = data;
 		break;
 	case MSR_PEBS_DATA_CFG:
 		if (data & pmu->pebs_data_cfg_rsvd)
-			return 1;
+			return -EINVAL;
 
 		pmu->pebs_data_cfg = data;
 		break;
@@ -411,7 +411,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		    (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
 			if ((msr & MSR_PMC_FULL_WIDTH_BIT) &&
 			    (data & ~pmu->counter_bitmask[KVM_PMC_GP]))
-				return 1;
+				return -EINVAL;
 
 			if (!msr_info->host_initiated &&
 			    !(msr & MSR_PMC_FULL_WIDTH_BIT))
@@ -427,7 +427,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			    (pmu->raw_event_mask & HSW_IN_TX_CHECKPOINTED))
 				reserved_bits ^= HSW_IN_TX_CHECKPOINTED;
 			if (data & reserved_bits)
-				return 1;
+				return -EINVAL;
 
 			if (data != pmc->eventsel) {
 				pmc->eventsel = data;
@@ -439,7 +439,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			break;
 		}
 		/* Not a known PMU MSR. */
-		return 1;
+		return -EINVAL;
 	}
 
 	return 0;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 04ce321ebdf3..acc3242af4f4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2158,12 +2158,12 @@ int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 		return 0;
 	case MSR_IA32_MCG_EXT_CTL:
 		if (!msr->host_initiated && !(vcpu->arch.mcg_cap & MCG_LMCE_P))
-			return 1;
+			return -EINVAL;
 		msr->data = vcpu->arch.mcg_ext_ctl;
 		return 0;
 	default:
 		if (!tdx_has_emulated_msr(msr->index))
-			return 1;
+			return -EACCES;
 
 		return kvm_get_msr_common(vcpu, msr);
 	}
@@ -2175,15 +2175,15 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_IA32_MCG_EXT_CTL:
 		if ((!msr->host_initiated && !(vcpu->arch.mcg_cap & MCG_LMCE_P)) ||
 		    (msr->data & ~MCG_EXT_CTL_LMCE_EN))
-			return 1;
+			return -EINVAL;
 		vcpu->arch.mcg_ext_ctl = msr->data;
 		return 0;
 	default:
 		if (tdx_is_read_only_msr(msr->index))
-			return 1;
+			return -EACCES;
 
 		if (!tdx_has_emulated_msr(msr->index))
-			return 1;
+			return -EACCES;
 
 		return kvm_set_msr_common(vcpu, msr);
 	}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b9103de01428..2eee599fca30 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2076,7 +2076,7 @@ int vmx_get_feature_msr(u32 msr, u64 *data)
 	switch (msr) {
 	case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR:
 		if (!nested)
-			return 1;
+			return -EINVAL;
 		return vmx_get_vmx_msr(&vmcs_config.nested, msr, data);
 	default:
 		return KVM_MSR_RET_UNSUPPORTED;
@@ -2111,18 +2111,18 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_TSX_CTRL:
 		if (!msr_info->host_initiated &&
 		    !(vcpu->arch.arch_capabilities & ARCH_CAP_TSX_CTRL_MSR))
-			return 1;
+			return -EINVAL;
 		goto find_uret_msr;
 	case MSR_IA32_UMWAIT_CONTROL:
 		if (!msr_info->host_initiated && !vmx_has_waitpkg(vmx))
-			return 1;
+			return -EINVAL;
 
 		msr_info->data = vmx->msr_ia32_umwait_control;
 		break;
 	case MSR_IA32_SPEC_CTRL:
 		if (!msr_info->host_initiated &&
 		    !guest_has_spec_ctrl_msr(vcpu))
-			return 1;
+			return -EINVAL;
 
 		msr_info->data = to_vmx(vcpu)->spec_ctrl;
 		break;
@@ -2139,14 +2139,14 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!kvm_mpx_supported() ||
 		    (!msr_info->host_initiated &&
 		     !guest_cpu_cap_has(vcpu, X86_FEATURE_MPX)))
-			return 1;
+			return -EINVAL;
 		msr_info->data = vmcs_read64(GUEST_BNDCFGS);
 		break;
 	case MSR_IA32_MCG_EXT_CTL:
 		if (!msr_info->host_initiated &&
 		    !(vmx->msr_ia32_feature_control &
 		      FEAT_CTL_LMCE_ENABLED))
-			return 1;
+			return -EINVAL;
 		msr_info->data = vcpu->arch.mcg_ext_ctl;
 		break;
 	case MSR_IA32_FEAT_CTL:
@@ -2155,16 +2155,16 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_SGXLEPUBKEYHASH0 ... MSR_IA32_SGXLEPUBKEYHASH3:
 		if (!msr_info->host_initiated &&
 		    !guest_cpu_cap_has(vcpu, X86_FEATURE_SGX_LC))
-			return 1;
+			return -EINVAL;
 		msr_info->data = to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash
 			[msr_info->index - MSR_IA32_SGXLEPUBKEYHASH0];
 		break;
 	case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR:
 		if (!guest_cpu_cap_has(vcpu, X86_FEATURE_VMX))
-			return 1;
+			return -EINVAL;
 		if (vmx_get_vmx_msr(&vmx->nested.msrs, msr_info->index,
 				    &msr_info->data))
-			return 1;
+			return -EINVAL;
 #ifdef CONFIG_KVM_HYPERV
 		/*
 		 * Enlightened VMCS v1 doesn't have certain VMCS fields but
@@ -2180,19 +2180,19 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_RTIT_CTL:
 		if (!vmx_pt_mode_is_host_guest())
-			return 1;
+			return -EINVAL;
 		msr_info->data = vmx->pt_desc.guest.ctl;
 		break;
 	case MSR_IA32_RTIT_STATUS:
 		if (!vmx_pt_mode_is_host_guest())
-			return 1;
+			return -EINVAL;
 		msr_info->data = vmx->pt_desc.guest.status;
 		break;
 	case MSR_IA32_RTIT_CR3_MATCH:
 		if (!vmx_pt_mode_is_host_guest() ||
 			!intel_pt_validate_cap(vmx->pt_desc.caps,
 						PT_CAP_cr3_filtering))
-			return 1;
+			return -EINVAL;
 		msr_info->data = vmx->pt_desc.guest.cr3_match;
 		break;
 	case MSR_IA32_RTIT_OUTPUT_BASE:
@@ -2201,7 +2201,7 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 					PT_CAP_topa_output) &&
 			 !intel_pt_validate_cap(vmx->pt_desc.caps,
 					PT_CAP_single_range_output)))
-			return 1;
+			return -EINVAL;
 		msr_info->data = vmx->pt_desc.guest.output_base;
 		break;
 	case MSR_IA32_RTIT_OUTPUT_MASK:
@@ -2210,14 +2210,14 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 					PT_CAP_topa_output) &&
 			 !intel_pt_validate_cap(vmx->pt_desc.caps,
 					PT_CAP_single_range_output)))
-			return 1;
+			return -EINVAL;
 		msr_info->data = vmx->pt_desc.guest.output_mask;
 		break;
 	case MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B:
 		index = msr_info->index - MSR_IA32_RTIT_ADDR0_A;
 		if (!vmx_pt_mode_is_host_guest() ||
 		    (index >= 2 * vmx->pt_desc.num_address_ranges))
-			return 1;
+			return -EINVAL;
 		if (index % 2)
 			msr_info->data = vmx->pt_desc.guest.addr_b[index / 2];
 		else
@@ -2359,7 +2359,7 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_DEBUGCTLMSR:
 		if (!vmx_is_valid_debugctl(vcpu, data, msr_info->host_initiated))
-			return 1;
+			return -EINVAL;
 
 		data &= vmx_get_supported_debugctl(vcpu, msr_info->host_initiated);
 
@@ -2377,10 +2377,10 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!kvm_mpx_supported() ||
 		    (!msr_info->host_initiated &&
 		     !guest_cpu_cap_has(vcpu, X86_FEATURE_MPX)))
-			return 1;
+			return -EINVAL;
 		if (is_noncanonical_msr_address(data & PAGE_MASK, vcpu) ||
 		    (data & MSR_IA32_BNDCFGS_RSVD))
-			return 1;
+			return -EINVAL;
 
 		if (is_guest_mode(vcpu) &&
 		    ((vmx->nested.msrs.entry_ctls_high & VM_ENTRY_LOAD_BNDCFGS) ||
@@ -2391,21 +2391,21 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_UMWAIT_CONTROL:
 		if (!msr_info->host_initiated && !vmx_has_waitpkg(vmx))
-			return 1;
+			return -EINVAL;
 
 		/* The reserved bit 1 and non-32 bit [63:32] should be zero */
 		if (data & (BIT_ULL(1) | GENMASK_ULL(63, 32)))
-			return 1;
+			return -EINVAL;
 
 		vmx->msr_ia32_umwait_control = data;
 		break;
 	case MSR_IA32_SPEC_CTRL:
 		if (!msr_info->host_initiated &&
 		    !guest_has_spec_ctrl_msr(vcpu))
-			return 1;
+			return -EINVAL;
 
 		if (kvm_spec_ctrl_test_value(data))
-			return 1;
+			return -EINVAL;
 
 		vmx->spec_ctrl = data;
 		if (!data)
@@ -2430,9 +2430,9 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_TSX_CTRL:
 		if (!msr_info->host_initiated &&
 		    !(vcpu->arch.arch_capabilities & ARCH_CAP_TSX_CTRL_MSR))
-			return 1;
+			return -EINVAL;
 		if (data & ~(TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR))
-			return 1;
+			return -EINVAL;
 		goto find_uret_msr;
 	case MSR_IA32_CR_PAT:
 		ret = kvm_set_msr_common(vcpu, msr_info);
@@ -2451,12 +2451,12 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		     !(to_vmx(vcpu)->msr_ia32_feature_control &
 		       FEAT_CTL_LMCE_ENABLED)) ||
 		    (data & ~MCG_EXT_CTL_LMCE_EN))
-			return 1;
+			return -EINVAL;
 		vcpu->arch.mcg_ext_ctl = data;
 		break;
 	case MSR_IA32_FEAT_CTL:
 		if (!is_vmx_feature_control_msr_valid(vmx, msr_info))
-			return 1;
+			return -EINVAL;
 
 		vmx->msr_ia32_feature_control = data;
 		if (msr_info->host_initiated && data == 0)
@@ -2481,70 +2481,70 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		    (!guest_cpu_cap_has(vcpu, X86_FEATURE_SGX_LC) ||
 		    ((vmx->msr_ia32_feature_control & FEAT_CTL_LOCKED) &&
 		    !(vmx->msr_ia32_feature_control & FEAT_CTL_SGX_LC_ENABLED))))
-			return 1;
+			return -EINVAL;
 		vmx->msr_ia32_sgxlepubkeyhash
 			[msr_index - MSR_IA32_SGXLEPUBKEYHASH0] = data;
 		break;
 	case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR:
 		if (!msr_info->host_initiated)
-			return 1; /* they are read-only */
+			return -EINVAL; /* they are read-only */
 		if (!guest_cpu_cap_has(vcpu, X86_FEATURE_VMX))
-			return 1;
+			return -EINVAL;
 		return vmx_set_vmx_msr(vcpu, msr_index, data);
 	case MSR_IA32_RTIT_CTL:
 		if (!vmx_pt_mode_is_host_guest() ||
 			vmx_rtit_ctl_check(vcpu, data) ||
 			vmx->nested.vmxon)
-			return 1;
+			return -EINVAL;
 		vmcs_write64(GUEST_IA32_RTIT_CTL, data);
 		vmx->pt_desc.guest.ctl = data;
 		pt_update_intercept_for_msr(vcpu);
 		break;
 	case MSR_IA32_RTIT_STATUS:
 		if (!pt_can_write_msr(vmx))
-			return 1;
+			return -EINVAL;
 		if (data & MSR_IA32_RTIT_STATUS_MASK)
-			return 1;
+			return -EINVAL;
 		vmx->pt_desc.guest.status = data;
 		break;
 	case MSR_IA32_RTIT_CR3_MATCH:
 		if (!pt_can_write_msr(vmx))
-			return 1;
+			return -EINVAL;
 		if (!intel_pt_validate_cap(vmx->pt_desc.caps,
 					   PT_CAP_cr3_filtering))
-			return 1;
+			return -EINVAL;
 		vmx->pt_desc.guest.cr3_match = data;
 		break;
 	case MSR_IA32_RTIT_OUTPUT_BASE:
 		if (!pt_can_write_msr(vmx))
-			return 1;
+			return -EINVAL;
 		if (!intel_pt_validate_cap(vmx->pt_desc.caps,
 					   PT_CAP_topa_output) &&
 		    !intel_pt_validate_cap(vmx->pt_desc.caps,
 					   PT_CAP_single_range_output))
-			return 1;
+			return -EINVAL;
 		if (!pt_output_base_valid(vcpu, data))
-			return 1;
+			return -EINVAL;
 		vmx->pt_desc.guest.output_base = data;
 		break;
 	case MSR_IA32_RTIT_OUTPUT_MASK:
 		if (!pt_can_write_msr(vmx))
-			return 1;
+			return -EINVAL;
 		if (!intel_pt_validate_cap(vmx->pt_desc.caps,
 					   PT_CAP_topa_output) &&
 		    !intel_pt_validate_cap(vmx->pt_desc.caps,
 					   PT_CAP_single_range_output))
-			return 1;
+			return -EINVAL;
 		vmx->pt_desc.guest.output_mask = data;
 		break;
 	case MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B:
 		if (!pt_can_write_msr(vmx))
-			return 1;
+			return -EINVAL;
 		index = msr_info->index - MSR_IA32_RTIT_ADDR0_A;
 		if (index >= 2 * vmx->pt_desc.num_address_ranges)
-			return 1;
+			return -EINVAL;
 		if (is_noncanonical_msr_address(data, vcpu))
-			return 1;
+			return -EINVAL;
 		if (index % 2)
 			vmx->pt_desc.guest.addr_b[index / 2] = data;
 		else
@@ -2563,20 +2563,20 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (data & PERF_CAP_LBR_FMT) {
 			if ((data & PERF_CAP_LBR_FMT) !=
 			    (kvm_caps.supported_perf_cap & PERF_CAP_LBR_FMT))
-				return 1;
+				return -EINVAL;
 			if (!cpuid_model_is_consistent(vcpu))
-				return 1;
+				return -EINVAL;
 		}
 		if (data & PERF_CAP_PEBS_FORMAT) {
 			if ((data & PERF_CAP_PEBS_MASK) !=
 			    (kvm_caps.supported_perf_cap & PERF_CAP_PEBS_MASK))
-				return 1;
+				return -EINVAL;
 			if (!guest_cpu_cap_has(vcpu, X86_FEATURE_DS))
-				return 1;
+				return -EINVAL;
 			if (!guest_cpu_cap_has(vcpu, X86_FEATURE_DTES64))
-				return 1;
+				return -EINVAL;
 			if (!cpuid_model_is_consistent(vcpu))
-				return 1;
+				return -EINVAL;
 		}
 		ret = kvm_set_msr_common(vcpu, msr_info);
 		break;
-- 
2.54.0

