Linux Confidential Computing Development

Linux Confidential Computing Development
 help / color / mirror / Atom feed

* Re: [PATCH v2 03/15] KVM: x86/xen: Don't truncate RAX when handling hypercall from protected guest
From: David Woodhouse @ 2026-06-04 21:48 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
	Kiryl Shutsemau, Paul Durrant, Dave Hansen, Rick Edgecombe, kvm,
	x86, linux-coco, linux-kernel, Yosry Ahmed, Kai Huang
In-Reply-To: <dc62e58e-b6ee-41e1-84a5-0716822fefc8@linux.intel.com>

[-- Attachment #1: Type: text/plain, Size: 587 bytes --]

On Mon, 2026-05-18 at 17:55 +0800, Binbin Wu wrote:
>  
> > I'm suggesting that you clean up longmode→is_64bit for the *hypercalls*
> > but leave 'long_mode' as is.
> > 
> 
> Yes, will only do it for is_64_bit_hypercall().

If you did this, I'm not sure I saw it? 

In response to https://lore.kernel.org/all/aiHPPUk5DY7rH-zL@v4bel/#r I
now find myself with both 'longmode' (current vCPU mode, should be
called is_64bit), and 'long_mode' (latched VM-wide mode) in the *same*
function.

I cannot live with that; I'm going to do the longmode→is_64bit change
locally.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v6 10/11] x86/virt/tdx: Enable Dynamic PAMT
From: Chao Gao @ 2026-06-05  5:25 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Rick Edgecombe, bp, dave.hansen, hpa, kvm, linux-coco, linux-doc,
	linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
	vannapurve, x86, yan.y.zhao, kai.huang, Kirill A. Shutemov
In-Reply-To: <aiGyIQvudD5ZF3lf@thinkstation>

On Thu, Jun 04, 2026 at 06:14:17PM +0100, Kiryl Shutsemau wrote:
>On Mon, May 25, 2026 at 07:35:14PM -0700, Rick Edgecombe wrote:
>> @@ -152,7 +156,12 @@ const struct tdx_sys_info *tdx_get_sysinfo(void);
>>  
>>  static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
>>  {
>> -	return false; /* To be enabled when kernel is ready */
>> +	/*
>> +	 * The TDX Module's internal Dynamic PAMT tree structure can't
>> +	 * handle physical addresses with more than 48 bits.
>> +	 */
>> +	return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT &&
>> +	       boot_cpu_data.x86_phys_bits <= 48;
>
>Should we warn for >48?

Maybe we should drop this check. If the TDX module cannot handle that case,
advertising TDX_FEATURES0_DYNAMIC_PAMT is a bug and should be fixed by the
module.

^ permalink raw reply

* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Chao Gao @ 2026-06-05  5:40 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Huang, Kai, Hansen, Dave, Zhao, Yan Y,
	seanjc@google.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com,
	linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
	Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
	x86@kernel.org
In-Reply-To: <aiGq7XjmMrsqdBY5@thinkstation>

On Thu, Jun 04, 2026 at 05:59:02PM +0100, Kiryl Shutsemau wrote:
>On Tue, May 26, 2026 at 04:42:24PM +0000, Edgecombe, Rick P wrote:
>> On Tue, 2026-05-26 at 16:57 +0800, Chao Gao wrote:
>> > > -	scoped_guard(spinlock, &pamt_lock) {
>> > 
>> > This converts the scoped_guard() added by the previous patch to
>> > explicit lock/unlock and goto. It would reduce code churn if the
>> > previous patch used that form directly.
>> 
>> Yea, it's a good point. I actually debated doing it, but decided not to because
>> the scoped version is cleaner for the non-optimized version. But for
>> reviewability, never doing the scoped version is probably better.
>
>I don't see a reason why we can't keep the scoped_guard() on get side.

One additional reason to drop scoped_guard() is that it mixes cleanup helpers
with goto, which is discouraged. See [*]

 :Lastly, given that the benefit of cleanup helpers is removal of “goto”, and
 :that the “goto” statement can jump between scopes, the expectation is that
 :usage of “goto” and cleanup helpers is never mixed in the same function.

Removing scoped_guard() here also reduces indentation.

*: https://www.kernel.org/doc/html/v7.1-rc6/core-api/cleanup.html

>
>On put side, we cannot get atomic_get_and_lock() semantics without
>dropping the scoped_guard().
>
>Maybe we should keep it for get?
>
>-- 
>  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-05  6:23 UTC (permalink / raw)
  To: Steven Price, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-30-steven.price@arm.com>

Hi Steve,

On 5/13/26 11:17 PM, Steven Price wrote:
> At runtime if the realm guest accesses memory which hasn't yet been
> mapped then KVM needs to either populate the region or fault the guest.
> 
> For memory in the lower (protected) region of IPA a fresh page is
> provided to the RMM which will zero the contents. For memory in the
> upper (shared) region of IPA, the memory from the memslot is mapped
> into the realm VM non secure.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes since v13:
>   * Numerous changes due to rebasing.
>   * Fix addr_range_desc() to encode the correct block size.
> Changes since v12:
>   * Switch to RMM v2.0 range based APIs.
> Changes since v11:
>   * Adapt to upstream changes.
> Changes since v10:
>   * RME->RMI renaming.
>   * Adapt to upstream gmem changes.
> Changes since v9:
>   * Fix call to kvm_stage2_unmap_range() in kvm_free_stage2_pgd() to set
>     may_block to avoid stall warnings.
>   * Minor coding style fixes.
> Changes since v8:
>   * Propagate the may_block flag.
>   * Minor comments and coding style changes.
> Changes since v7:
>   * Remove redundant WARN_ONs for realm_create_rtt_levels() - it will
>     internally WARN when necessary.
> Changes since v6:
>   * Handle PAGE_SIZE being larger than RMM granule size.
>   * Some minor renaming following review comments.
> Changes since v5:
>   * Reduce use of struct page in preparation for supporting the RMM
>     having a different page size to the host.
>   * Handle a race when delegating a page where another CPU has faulted on
>     a the same page (and already delegated the physical page) but not yet
>     mapped it. In this case simply return to the guest to either use the
>     mapping from the other CPU (or refault if the race is lost).
>   * The changes to populate_par_region() are moved into the previous
>     patch where they belong.
> Changes since v4:
>   * Code cleanup following review feedback.
>   * Drop the PTE_SHARED bit when creating unprotected page table entries.
>     This is now set by the RMM and the host has no control of it and the
>     spec requires the bit to be set to zero.
> Changes since v2:
>   * Avoid leaking memory if failing to map it in the realm.
>   * Correctly mask RTT based on LPA2 flag (see rtt_get_phys()).
>   * Adapt to changes in previous patches.
> ---
>   arch/arm64/include/asm/kvm_emulate.h |   8 ++
>   arch/arm64/include/asm/kvm_rmi.h     |  12 ++
>   arch/arm64/kvm/mmu.c                 | 128 ++++++++++++++++----
>   arch/arm64/kvm/rmi.c                 | 173 +++++++++++++++++++++++++++
>   4 files changed, 301 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index 2e69fe494716..8b6f9d26b5d8 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -712,6 +712,14 @@ static inline bool kvm_realm_is_created(struct kvm *kvm)
>   	return kvm_is_realm(kvm) && kvm_realm_state(kvm) != REALM_STATE_NONE;
>   }
>   
> +static inline gpa_t kvm_gpa_from_fault(struct kvm *kvm, phys_addr_t ipa)
> +{
> +	if (!kvm_is_realm(kvm))
> +		return ipa;
> +
> +	return ipa & ~BIT(kvm->arch.realm.ia_bits - 1);
> +}
> +
>   static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
>   {
>   	return kvm_is_realm(vcpu->kvm);
> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
> index a2b6bc412a22..b65cfec10dee 100644
> --- a/arch/arm64/include/asm/kvm_rmi.h
> +++ b/arch/arm64/include/asm/kvm_rmi.h
> @@ -6,6 +6,7 @@
>   #ifndef __ASM_KVM_RMI_H
>   #define __ASM_KVM_RMI_H
>   
> +#include <asm/kvm_pgtable.h>
>   #include <asm/rmi_smc.h>
>   
>   /**
> @@ -97,6 +98,17 @@ void kvm_realm_unmap_range(struct kvm *kvm,
>   			   unsigned long size,
>   			   bool unmap_private,
>   			   bool may_block);
> +int realm_map_protected(struct kvm *kvm,
> +			unsigned long base_ipa,
> +			kvm_pfn_t pfn,
> +			unsigned long size,
> +			struct kvm_mmu_memory_cache *memcache);
> +int realm_map_non_secure(struct realm *realm,
> +			 unsigned long ipa,
> +			 kvm_pfn_t pfn,
> +			 unsigned long size,
> +			 enum kvm_pgtable_prot prot,
> +			 struct kvm_mmu_memory_cache *memcache);
>   
>   static inline bool kvm_realm_is_private_address(struct realm *realm,
>   						unsigned long addr)
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index ac2a0f0106b0..776ffe56d17e 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -334,8 +334,15 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
>   
>   	lockdep_assert_held_write(&kvm->mmu_lock);
>   	WARN_ON(size & ~PAGE_MASK);
> -	WARN_ON(stage2_apply_range(mmu, start, end, KVM_PGT_FN(kvm_pgtable_stage2_unmap),
> -				   may_block));
> +
> +	if (kvm_is_realm(kvm)) {
> +		kvm_realm_unmap_range(kvm, start, size, !only_shared,
> +				      may_block);
> +	} else {
> +		WARN_ON(stage2_apply_range(mmu, start, end,
> +					   KVM_PGT_FN(kvm_pgtable_stage2_unmap),
> +					   may_block));
> +	}
>   }
>   
>   void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
> @@ -358,7 +365,10 @@ static void stage2_flush_memslot(struct kvm *kvm,
>   	phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
>   	phys_addr_t end = addr + PAGE_SIZE * memslot->npages;
>   
> -	kvm_stage2_flush_range(&kvm->arch.mmu, addr, end);
> +	if (kvm_is_realm(kvm))
> +		kvm_realm_unmap_range(kvm, addr, end - addr, false, true);
> +	else
> +		kvm_stage2_flush_range(&kvm->arch.mmu, addr, end);
>   }
>   
>   /**
> @@ -1103,6 +1113,10 @@ void stage2_unmap_vm(struct kvm *kvm)
>   	struct kvm_memory_slot *memslot;
>   	int idx, bkt;
>   
> +	/* For realms this is handled by the RMM so nothing to do here */
> +	if (kvm_is_realm(kvm))
> +		return;
> +
>   	idx = srcu_read_lock(&kvm->srcu);
>   	mmap_read_lock(current->mm);
>   	write_lock(&kvm->mmu_lock);
> @@ -1528,6 +1542,29 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>   	return vma->vm_flags & VM_MTE_ALLOWED;
>   }
>   
> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
> +			 kvm_pfn_t pfn, unsigned long map_size,
> +			 enum kvm_pgtable_prot prot,
> +			 struct kvm_mmu_memory_cache *memcache)
> +{
> +	struct realm *realm = &kvm->arch.realm;
> +
> +	/*
> +	 * Write permission is required for now even though it's possible to
> +	 * map unprotected pages (granules) as read-only. It's impossible to
> +	 * map protected pages (granules) as read-only.
> +	 */
> +	if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
> +		return -EFAULT;
> +

I'm a bit concerned with this. We don't have KVM_PGTABLE_PROT_W set in @prot
if the stage2 fault is raised due to memory read. With -EFAULT returned to VMM
(e.g. QEMU), the vCPU continuous execution is stopped and system won't be
working any more.

> +	ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
> +	if (!kvm_realm_is_private_address(realm, ipa))
> +		return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
> +					    memcache);
> +
> +	return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
> +}
> +
>   static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
>   {
>   	switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>   	bool write_fault, exec_fault;
>   	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>   	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> -	struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
> +	struct kvm_vcpu *vcpu = s2fd->vcpu;
> +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> +	gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>   	unsigned long mmu_seq;
>   	struct page *page;
> -	struct kvm *kvm = s2fd->vcpu->kvm;
> +	struct kvm *kvm = vcpu->kvm;
>   	void *memcache;
>   	kvm_pfn_t pfn;
>   	gfn_t gfn;
>   	int ret;
>   
> -	memcache = get_mmu_memcache(s2fd->vcpu);
> -	ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> +	if (kvm_is_realm(vcpu->kvm)) {
> +		/* check for memory attribute mismatch */
> +		bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> +		/*
> +		 * For Realms, the shared address is an alias of the private
> +		 * PA with the top bit set. Thus if the fault address matches
> +		 * the GPA then it is the private alias.
> +		 */
> +		bool is_priv_fault = (gpa == s2fd->fault_ipa);
> +
> +		if (is_priv_gfn != is_priv_fault) {
> +			kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
> +						      kvm_is_write_fault(vcpu),
> +						      false,
> +						      is_priv_fault);
> +			/*
> +			 * KVM_EXIT_MEMORY_FAULT requires an return code of
> +			 * -EFAULT, see the API documentation
> +			 */
> +			return -EFAULT;
> +		}
> +	}
> +
> +	memcache = get_mmu_memcache(vcpu);
> +	ret = topup_mmu_memcache(vcpu, memcache);
>   	if (ret)
>   		return ret;
>   
>   	if (s2fd->nested)
>   		gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
>   	else
> -		gfn = s2fd->fault_ipa >> PAGE_SHIFT;
> +		gfn = gpa >> PAGE_SHIFT;
>   
> -	write_fault = kvm_is_write_fault(s2fd->vcpu);
> -	exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
> +	write_fault = kvm_is_write_fault(vcpu);
> +	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>   
>   	VM_WARN_ON_ONCE(write_fault && exec_fault);
>   
> @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>   
>   	ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
>   	if (ret) {
> -		kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE,
> +		kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>   					      write_fault, exec_fault, false);
>   		return ret;
>   	}
> @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>   	kvm_fault_lock(kvm);
>   	if (mmu_invalidate_retry(kvm, mmu_seq)) {
>   		ret = -EAGAIN;
> -		goto out_unlock;
> +		goto out_release_page;
> +	}
> +
> +	if (kvm_is_realm(kvm)) {
> +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
> +				    PAGE_SIZE, KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W, memcache);
> +		goto out_release_page;
>   	}
>   
>   	ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
>   						 __pfn_to_phys(pfn), prot,
>   						 memcache, flags);
>   
> -out_unlock:
> +out_release_page:
>   	kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
>   	kvm_fault_unlock(kvm);
>   
> @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd,
>   	 * mapping size to ensure we find the right PFN and lay down the
>   	 * mapping in the right place.
>   	 */
> -	s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT;
> +	s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize)) >> PAGE_SHIFT;
>   
>   	s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
>   
> @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
>   		prot &= ~KVM_NV_GUEST_MAP_SZ;
>   		ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
>   								 prot, flags);
> +	} else if (kvm_is_realm(kvm)) {
> +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
> +				    prot, memcache);
>   	} else {
>   		ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
>   							 __pfn_to_phys(pfn), prot,

For the case kvm_is_realm(), need we adjust 's2fd->fault_ipa' for the sake of
huge pages. In kvm_s2_fault_map(), @gfn and @pfn may have been adjusted by
transparent_hugepage_adjust() to be aligned with huge page size. If the
adjustment happened in transparent_hugepage_adjust(), we need to align
s2fd->fault_ipa down to the huge page size either.


> @@ -2214,6 +2285,13 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>   	return 0;
>   }
>   
> +static bool shared_ipa_fault(struct kvm *kvm, phys_addr_t fault_ipa)
> +{
> +	gpa_t gpa = kvm_gpa_from_fault(kvm, fault_ipa);
> +
> +	return (gpa != fault_ipa);
> +}
> +
>   /**
>    * kvm_handle_guest_abort - handles all 2nd stage aborts
>    * @vcpu:	the VCPU pointer
> @@ -2324,8 +2402,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>   		nested = &nested_trans;
>   	}
>   
> -	gfn = ipa >> PAGE_SHIFT;
> +	gfn = kvm_gpa_from_fault(vcpu->kvm, ipa) >> PAGE_SHIFT;
>   	memslot = gfn_to_memslot(vcpu->kvm, gfn);
> +
>   	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
>   	write_fault = kvm_is_write_fault(vcpu);
>   	if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
> @@ -2368,7 +2447,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>   		 * of the page size.
>   		 */
>   		ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(vcpu));
> -		ret = io_mem_abort(vcpu, ipa);
> +		ret = io_mem_abort(vcpu, kvm_gpa_from_fault(vcpu->kvm, ipa));
>   		goto out_unlock;
>   	}
>   
> @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>   				!write_fault &&
>   				!kvm_vcpu_trap_is_exec_fault(vcpu));
>   
> -		if (kvm_slot_has_gmem(memslot))
> +		if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
>   			ret = gmem_abort(&s2fd);
>   		else
>   			ret = user_mem_abort(&s2fd);
> @@ -2433,6 +2512,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
>   		return false;
>   
> +	/* We don't support aging for Realms */
> +	if (kvm_is_realm(kvm))
> +		return true;
> +
>   	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
>   						   range->start << PAGE_SHIFT,
>   						   size, true);
> @@ -2449,6 +2532,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
>   		return false;
>   
> +	/* We don't support aging for Realms */
> +	if (kvm_is_realm(kvm))
> +		return true;
> +
>   	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
>   						   range->start << PAGE_SHIFT,
>   						   size, false);
> @@ -2628,10 +2715,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>   		return -EFAULT;
>   
>   	/*
> -	 * Only support guest_memfd backed memslots with mappable memory, since
> -	 * there aren't any CoCo VMs that support only private memory on arm64.
> +	 * Only support guest_memfd backed memslots with mappable memory,
> +	 * unless the guest is a CCA realm guest.
>   	 */
> -	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
> +	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new) &&
> +	    !kvm_is_realm(kvm))
>   		return -EINVAL;
>   
>   	hva = new->userspace_addr;
> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> index cae29fd3353c..761b38a4071c 100644
> --- a/arch/arm64/kvm/rmi.c
> +++ b/arch/arm64/kvm/rmi.c
> @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
>   	return ret;
>   }
>   
> +static unsigned long addr_range_desc(unsigned long phys, unsigned long size)
> +{
> +	unsigned long out = 0;
> +
> +	switch (size) {
> +	case P4D_SIZE:
> +		out = 3 | (1 << 2);
> +		break;
> +	case PUD_SIZE:
> +		out = 2 | (1 << 2);
> +		break;
> +	case PMD_SIZE:
> +		out = 1 | (1 << 2);
> +		break;
> +	case PAGE_SIZE:
> +		out = 0 | (1 << 2);
> +		break;
> +	default:
> +		/*
> +		 * Only support mapping at the page level granulatity when
> +		 * it's an unusual length. This should get us back onto a larger
> +		 * block size for the subsequent mappings.
> +		 */
> +		out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
> +		break;
> +	}
> +
> +	WARN_ON(phys & ~PAGE_MASK);
> +
> +	out |= phys & PAGE_MASK;
> +
> +	return out;
> +}
> +
> +int realm_map_protected(struct kvm *kvm,
> +			unsigned long ipa,
> +			kvm_pfn_t pfn,
> +			unsigned long map_size,
> +			struct kvm_mmu_memory_cache *memcache)
> +{
> +	struct realm *realm = &kvm->arch.realm;
> +	phys_addr_t phys = __pfn_to_phys(pfn);
> +	phys_addr_t base_phys = phys;
> +	phys_addr_t rd = virt_to_phys(realm->rd);
> +	unsigned long base_ipa = ipa;
> +	unsigned long ipa_top = ipa + map_size;
> +	int ret = 0;
> +
> +	if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
> +		    !IS_ALIGNED(ipa, map_size)))
> +		return -EINVAL;
> +
> +	if (rmi_delegate_range(phys, map_size)) {
> +		/*
> +		 * It's likely we raced with another VCPU on the same
> +		 * fault. Assume the other VCPU has handled the fault
> +		 * and return to the guest.
> +		 */
> +		return 0;
> +	}
> +
> +	while (ipa < ipa_top) {
> +		unsigned long flags = RMI_ADDR_TYPE_SINGLE;
> +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
> +		unsigned long out_top;
> +
> +		ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
> +				       &out_top);
> +
> +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> +			/* Create missing RTTs and retry */
> +			int level = RMI_RETURN_INDEX(ret);
> +
> +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> +			ret = realm_create_rtt_levels(realm, ipa, level,
> +						      KVM_PGTABLE_LAST_LEVEL,
> +						      memcache);
> +			if (ret)
> +				goto err_undelegate;
> +
> +			ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
> +					       range_desc, &out_top);
> +		}
> +
> +		if (WARN_ON(ret))
> +			goto err_undelegate;
> +
> +		phys += out_top - ipa;
> +		ipa = out_top;
> +	}
> +
> +	return 0;
> +
> +err_undelegate:
> +	realm_unmap_private_range(kvm, base_ipa, ipa, true);
> +	if (WARN_ON(rmi_undelegate_range(base_phys, map_size))) {
> +		/* Page can't be returned to NS world so is lost */
> +		get_page(phys_to_page(base_phys));
> +	}
> +	return -ENXIO;
> +}
> +
> +int realm_map_non_secure(struct realm *realm,
> +			 unsigned long ipa,
> +			 kvm_pfn_t pfn,
> +			 unsigned long size,
> +			 enum kvm_pgtable_prot prot,
> +			 struct kvm_mmu_memory_cache *memcache)
> +{
> +	unsigned long attr, flags = 0;
> +	phys_addr_t rd = virt_to_phys(realm->rd);
> +	phys_addr_t phys = __pfn_to_phys(pfn);
> +	unsigned long ipa_top = ipa + size;
> +	int ret;
> +
> +	if (WARN_ON(!IS_ALIGNED(size, PAGE_SIZE) ||
> +		    !IS_ALIGNED(ipa, size)))
> +		return -EINVAL;
> +
> +	switch (prot & (KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC)) {
> +	case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
> +		return -EINVAL;
> +	case KVM_PGTABLE_PROT_DEVICE:
> +		attr = MT_S2_FWB_DEVICE_nGnRE;
> +		break;
> +	case KVM_PGTABLE_PROT_NORMAL_NC:
> +		attr = MT_S2_FWB_NORMAL_NC;
> +		break;
> +	default:
> +		attr = MT_S2_FWB_NORMAL;
> +	}
> +
> +	flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_MEMATTR, attr);
> +
> +	if (prot & KVM_PGTABLE_PROT_R)
> +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_READ);
> +	if (prot & KVM_PGTABLE_PROT_W)
> +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_WRITE);
> +
> +	flags |= RMI_ADDR_TYPE_SINGLE;
> +
> +	while (ipa < ipa_top) {
> +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
> +		unsigned long out_top;
> +
> +		ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags, range_desc,
> +					 &out_top);
> +
> +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> +			/* Create missing RTTs and retry */
> +			int level = RMI_RETURN_INDEX(ret);
> +
> +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> +			ret = realm_create_rtt_levels(realm, ipa, level,
> +						      KVM_PGTABLE_LAST_LEVEL,
> +						      memcache);
> +			if (ret)
> +				return ret;
> +
> +			ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags,
> +						 range_desc, &out_top);
> +		}
> +
> +		if (WARN_ON(ret))
> +			return ret;
> +
> +		phys += out_top - ipa;
> +		ipa = out_top;
> +	}
> +
> +	return 0;
> +}
> +
>   static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>   			      struct page *src_page, void *opaque)
>   {

Thanks,
Gavin


^ permalink raw reply

* Re: [PATCH v4 1/3] x86/tdx: Fix off-by-one in port I/O handling
From: Binbin Wu @ 2026-06-05  7:08 UTC (permalink / raw)
  To: Kiryl Shutsemau (Meta)
  Cc: tglx, mingo, bp, dave.hansen, seanjc, pbonzini,
	sathyanarayanan.kuppuswamy, kai.huang, xiaoyao.li,
	rick.p.edgecombe, david.laight.linux, ak, djbw, tsyrulnikov.borys,
	x86, kvm, linux-coco, linux-kernel, stable
In-Reply-To: <e5a75bb68a6a778c95cac2ef77acd55cfd24d389.1780584300.git.kas@kernel.org>



On 6/4/2026 10:46 PM, Kiryl Shutsemau (Meta) wrote:
> handle_in() and handle_out() in arch/x86/coco/tdx/tdx.c use:
> 
>     u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
> 
> GENMASK(h, l) includes bit h. For size=1 (INB), this produces
> GENMASK(8, 0) = 0x1FF (9 bits) instead of GENMASK(7, 0) = 0xFF (8
> bits). The mask is one bit too wide for all I/O sizes.
> 
> Fix the mask calculation.
> 
> Fixes: 03149948832a ("x86/tdx: Port I/O: Add runtime hypercalls")
> Reported-by: Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
> Link: https://lore.kernel.org/all/CAKw_Dz96rfSQc6Rn+9QBcUFHhmkK+9zu+P=bxowfZwxrATCBRg@mail.gmail.com/
> Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Cc: stable@vger.kernel.org

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

> ---
>  arch/x86/coco/tdx/tdx.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 186915a17c50..65119362f9a2 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -693,7 +693,7 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
>  		.r13 = PORT_READ,
>  		.r14 = port,
>  	};
> -	u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
> +	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
>  	bool success;
>  
>  	/*
> @@ -713,7 +713,7 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
>  
>  static bool handle_out(struct pt_regs *regs, int size, int port)
>  {
> -	u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
> +	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
>  
>  	/*
>  	 * Emulate the I/O write via hypercall. More info about ABI can be found


^ permalink raw reply

* Re: [PATCH v4 3/3] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Binbin Wu @ 2026-06-05  7:10 UTC (permalink / raw)
  To: Kiryl Shutsemau (Meta)
  Cc: tglx, mingo, bp, dave.hansen, seanjc, pbonzini,
	sathyanarayanan.kuppuswamy, kai.huang, xiaoyao.li,
	rick.p.edgecombe, david.laight.linux, ak, djbw, tsyrulnikov.borys,
	x86, kvm, linux-coco, linux-kernel, stable
In-Reply-To: <ca503ae3de72d90956fcaf5dbc0760ec20f5a5e0.1780584300.git.kas@kernel.org>



On 6/4/2026 10:47 PM, Kiryl Shutsemau (Meta) wrote:
> According to x86 architecture rules, 32-bit operations zero-extend the
> result to 64 bits. The current implementation of handle_in() only masks
> the lower 32 bits, which preserves the upper 32 bits of RAX when a
> 32-bit port IN instruction is emulated.
> 
> Use insn_assign_reg() to write the result back into RAX with proper
> partial-register-write semantics: 1- and 2-byte forms leave the upper
> bits untouched, the 4-byte form zero-extends to the full register.
> 
> Fixes: 03149948832a ("x86/tdx: Port I/O: Add runtime hypercalls")
> Reported-by: Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
> Link: https://lore.kernel.org/all/CAKw_Dz96rfSQc6Rn+9QBcUFHhmkK+9zu+P=bxowfZwxrATCBRg@mail.gmail.com/
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Cc: stable@vger.kernel.org

I think the concern sashiko commented in patch 2 is valid.

But for this patch itself,
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

> ---
>  arch/x86/coco/tdx/tdx.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 65119362f9a2..41cc23cc63dd 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -693,8 +693,8 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
>  		.r13 = PORT_READ,
>  		.r14 = port,
>  	};
> -	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
>  	bool success;
> +	u64 val;
>  
>  	/*
>  	 * Emulate the I/O read via hypercall. More info about ABI can be found
> @@ -702,11 +702,9 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
>  	 * "TDG.VP.VMCALL<Instruction.IO>".
>  	 */
>  	success = !__tdx_hypercall(&args);
> +	val = success ? args.r11 : 0;
>  
> -	/* Update part of the register affected by the emulated instruction */
> -	regs->ax &= ~mask;
> -	if (success)
> -		regs->ax |= args.r11 & mask;
> +	insn_assign_reg(&regs->ax, val, size);
>  
>  	return success;
>  }


^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Lorenzo Pieralisi @ 2026-06-05  7:28 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
	Will Deacon, James Morse, Oliver Upton, Suzuki K Poulose,
	Zenghui Yu, linux-arm-kernel, linux-kernel, Joey Gouly,
	Alexandru Elisei, Christoffer Dall, Fuad Tabba, linux-coco,
	Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <3359f788-07fa-41a1-9ac7-45c58577c1fa@redhat.com>

On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:

[...]

> > +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
> > +			 kvm_pfn_t pfn, unsigned long map_size,
> > +			 enum kvm_pgtable_prot prot,
> > +			 struct kvm_mmu_memory_cache *memcache)
> > +{
> > +	struct realm *realm = &kvm->arch.realm;
> > +
> > +	/*
> > +	 * Write permission is required for now even though it's possible to
> > +	 * map unprotected pages (granules) as read-only. It's impossible to
> > +	 * map protected pages (granules) as read-only.
> > +	 */
> > +	if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
> > +		return -EFAULT;
> > +
> 
> I'm a bit concerned with this. We don't have KVM_PGTABLE_PROT_W set in @prot
> if the stage2 fault is raised due to memory read. With -EFAULT returned to VMM
> (e.g. QEMU), the vCPU continuous execution is stopped and system won't be
> working any more.
> 
> > +	ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
> > +	if (!kvm_realm_is_private_address(realm, ipa))
> > +		return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
> > +					    memcache);
> > +
> > +	return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
> > +}
> > +
> >   static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
> >   {
> >   	switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
> > @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> >   	bool write_fault, exec_fault;
> >   	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
> >   	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > -	struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
> > +	struct kvm_vcpu *vcpu = s2fd->vcpu;
> > +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > +	gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
> >   	unsigned long mmu_seq;
> >   	struct page *page;
> > -	struct kvm *kvm = s2fd->vcpu->kvm;
> > +	struct kvm *kvm = vcpu->kvm;
> >   	void *memcache;
> >   	kvm_pfn_t pfn;
> >   	gfn_t gfn;
> >   	int ret;
> > -	memcache = get_mmu_memcache(s2fd->vcpu);
> > -	ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> > +	if (kvm_is_realm(vcpu->kvm)) {
> > +		/* check for memory attribute mismatch */
> > +		bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > +		/*
> > +		 * For Realms, the shared address is an alias of the private
> > +		 * PA with the top bit set. Thus if the fault address matches
> > +		 * the GPA then it is the private alias.
> > +		 */
> > +		bool is_priv_fault = (gpa == s2fd->fault_ipa);
> > +
> > +		if (is_priv_gfn != is_priv_fault) {
> > +			kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
> > +						      kvm_is_write_fault(vcpu),
> > +						      false,
> > +						      is_priv_fault);
> > +			/*
> > +			 * KVM_EXIT_MEMORY_FAULT requires an return code of
> > +			 * -EFAULT, see the API documentation
> > +			 */
> > +			return -EFAULT;
> > +		}
> > +	}
> > +
> > +	memcache = get_mmu_memcache(vcpu);
> > +	ret = topup_mmu_memcache(vcpu, memcache);
> >   	if (ret)
> >   		return ret;
> >   	if (s2fd->nested)
> >   		gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
> >   	else
> > -		gfn = s2fd->fault_ipa >> PAGE_SHIFT;
> > +		gfn = gpa >> PAGE_SHIFT;
> > -	write_fault = kvm_is_write_fault(s2fd->vcpu);
> > -	exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
> > +	write_fault = kvm_is_write_fault(vcpu);
> > +	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> >   	VM_WARN_ON_ONCE(write_fault && exec_fault);
> > @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> >   	ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
> >   	if (ret) {
> > -		kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE,
> > +		kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
> >   					      write_fault, exec_fault, false);
> >   		return ret;
> >   	}
> > @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> >   	kvm_fault_lock(kvm);
> >   	if (mmu_invalidate_retry(kvm, mmu_seq)) {
> >   		ret = -EAGAIN;
> > -		goto out_unlock;
> > +		goto out_release_page;
> > +	}
> > +
> > +	if (kvm_is_realm(kvm)) {
> > +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
> > +				    PAGE_SIZE, KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W, memcache);
> > +		goto out_release_page;
> >   	}
> >   	ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
> >   						 __pfn_to_phys(pfn), prot,
> >   						 memcache, flags);
> > -out_unlock:
> > +out_release_page:
> >   	kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
> >   	kvm_fault_unlock(kvm);
> > @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd,
> >   	 * mapping size to ensure we find the right PFN and lay down the
> >   	 * mapping in the right place.
> >   	 */
> > -	s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT;
> > +	s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize)) >> PAGE_SHIFT;
> >   	s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
> > @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
> >   		prot &= ~KVM_NV_GUEST_MAP_SZ;
> >   		ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
> >   								 prot, flags);
> > +	} else if (kvm_is_realm(kvm)) {
> > +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
> > +				    prot, memcache);
> >   	} else {
> >   		ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
> >   							 __pfn_to_phys(pfn), prot,
> 
> For the case kvm_is_realm(), need we adjust 's2fd->fault_ipa' for the sake of
> huge pages. In kvm_s2_fault_map(), @gfn and @pfn may have been adjusted by
> transparent_hugepage_adjust() to be aligned with huge page size. If the
> adjustment happened in transparent_hugepage_adjust(), we need to align
> s2fd->fault_ipa down to the huge page size either.

All of the above + some RMM changes are needed to get QEmu VMM going
with anon pages guest memory backing - currently testing various
configurations in the background.

Thanks,
Lorenzo

> > @@ -2214,6 +2285,13 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> >   	return 0;
> >   }
> > +static bool shared_ipa_fault(struct kvm *kvm, phys_addr_t fault_ipa)
> > +{
> > +	gpa_t gpa = kvm_gpa_from_fault(kvm, fault_ipa);
> > +
> > +	return (gpa != fault_ipa);
> > +}
> > +
> >   /**
> >    * kvm_handle_guest_abort - handles all 2nd stage aborts
> >    * @vcpu:	the VCPU pointer
> > @@ -2324,8 +2402,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >   		nested = &nested_trans;
> >   	}
> > -	gfn = ipa >> PAGE_SHIFT;
> > +	gfn = kvm_gpa_from_fault(vcpu->kvm, ipa) >> PAGE_SHIFT;
> >   	memslot = gfn_to_memslot(vcpu->kvm, gfn);
> > +
> >   	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
> >   	write_fault = kvm_is_write_fault(vcpu);
> >   	if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
> > @@ -2368,7 +2447,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >   		 * of the page size.
> >   		 */
> >   		ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(vcpu));
> > -		ret = io_mem_abort(vcpu, ipa);
> > +		ret = io_mem_abort(vcpu, kvm_gpa_from_fault(vcpu->kvm, ipa));
> >   		goto out_unlock;
> >   	}
> > @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >   				!write_fault &&
> >   				!kvm_vcpu_trap_is_exec_fault(vcpu));
> > -		if (kvm_slot_has_gmem(memslot))
> > +		if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
> >   			ret = gmem_abort(&s2fd);
> >   		else
> >   			ret = user_mem_abort(&s2fd);
> > @@ -2433,6 +2512,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >   	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
> >   		return false;
> > +	/* We don't support aging for Realms */
> > +	if (kvm_is_realm(kvm))
> > +		return true;
> > +
> >   	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
> >   						   range->start << PAGE_SHIFT,
> >   						   size, true);
> > @@ -2449,6 +2532,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >   	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
> >   		return false;
> > +	/* We don't support aging for Realms */
> > +	if (kvm_is_realm(kvm))
> > +		return true;
> > +
> >   	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
> >   						   range->start << PAGE_SHIFT,
> >   						   size, false);
> > @@ -2628,10 +2715,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> >   		return -EFAULT;
> >   	/*
> > -	 * Only support guest_memfd backed memslots with mappable memory, since
> > -	 * there aren't any CoCo VMs that support only private memory on arm64.
> > +	 * Only support guest_memfd backed memslots with mappable memory,
> > +	 * unless the guest is a CCA realm guest.
> >   	 */
> > -	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
> > +	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new) &&
> > +	    !kvm_is_realm(kvm))
> >   		return -EINVAL;
> >   	hva = new->userspace_addr;
> > diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> > index cae29fd3353c..761b38a4071c 100644
> > --- a/arch/arm64/kvm/rmi.c
> > +++ b/arch/arm64/kvm/rmi.c
> > @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
> >   	return ret;
> >   }
> > +static unsigned long addr_range_desc(unsigned long phys, unsigned long size)
> > +{
> > +	unsigned long out = 0;
> > +
> > +	switch (size) {
> > +	case P4D_SIZE:
> > +		out = 3 | (1 << 2);
> > +		break;
> > +	case PUD_SIZE:
> > +		out = 2 | (1 << 2);
> > +		break;
> > +	case PMD_SIZE:
> > +		out = 1 | (1 << 2);
> > +		break;
> > +	case PAGE_SIZE:
> > +		out = 0 | (1 << 2);
> > +		break;
> > +	default:
> > +		/*
> > +		 * Only support mapping at the page level granulatity when
> > +		 * it's an unusual length. This should get us back onto a larger
> > +		 * block size for the subsequent mappings.
> > +		 */
> > +		out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
> > +		break;
> > +	}
> > +
> > +	WARN_ON(phys & ~PAGE_MASK);
> > +
> > +	out |= phys & PAGE_MASK;
> > +
> > +	return out;
> > +}
> > +
> > +int realm_map_protected(struct kvm *kvm,
> > +			unsigned long ipa,
> > +			kvm_pfn_t pfn,
> > +			unsigned long map_size,
> > +			struct kvm_mmu_memory_cache *memcache)
> > +{
> > +	struct realm *realm = &kvm->arch.realm;
> > +	phys_addr_t phys = __pfn_to_phys(pfn);
> > +	phys_addr_t base_phys = phys;
> > +	phys_addr_t rd = virt_to_phys(realm->rd);
> > +	unsigned long base_ipa = ipa;
> > +	unsigned long ipa_top = ipa + map_size;
> > +	int ret = 0;
> > +
> > +	if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
> > +		    !IS_ALIGNED(ipa, map_size)))
> > +		return -EINVAL;
> > +
> > +	if (rmi_delegate_range(phys, map_size)) {
> > +		/*
> > +		 * It's likely we raced with another VCPU on the same
> > +		 * fault. Assume the other VCPU has handled the fault
> > +		 * and return to the guest.
> > +		 */
> > +		return 0;
> > +	}
> > +
> > +	while (ipa < ipa_top) {
> > +		unsigned long flags = RMI_ADDR_TYPE_SINGLE;
> > +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
> > +		unsigned long out_top;
> > +
> > +		ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
> > +				       &out_top);
> > +
> > +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> > +			/* Create missing RTTs and retry */
> > +			int level = RMI_RETURN_INDEX(ret);
> > +
> > +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> > +			ret = realm_create_rtt_levels(realm, ipa, level,
> > +						      KVM_PGTABLE_LAST_LEVEL,
> > +						      memcache);
> > +			if (ret)
> > +				goto err_undelegate;
> > +
> > +			ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
> > +					       range_desc, &out_top);
> > +		}
> > +
> > +		if (WARN_ON(ret))
> > +			goto err_undelegate;
> > +
> > +		phys += out_top - ipa;
> > +		ipa = out_top;
> > +	}
> > +
> > +	return 0;
> > +
> > +err_undelegate:
> > +	realm_unmap_private_range(kvm, base_ipa, ipa, true);
> > +	if (WARN_ON(rmi_undelegate_range(base_phys, map_size))) {
> > +		/* Page can't be returned to NS world so is lost */
> > +		get_page(phys_to_page(base_phys));
> > +	}
> > +	return -ENXIO;
> > +}
> > +
> > +int realm_map_non_secure(struct realm *realm,
> > +			 unsigned long ipa,
> > +			 kvm_pfn_t pfn,
> > +			 unsigned long size,
> > +			 enum kvm_pgtable_prot prot,
> > +			 struct kvm_mmu_memory_cache *memcache)
> > +{
> > +	unsigned long attr, flags = 0;
> > +	phys_addr_t rd = virt_to_phys(realm->rd);
> > +	phys_addr_t phys = __pfn_to_phys(pfn);
> > +	unsigned long ipa_top = ipa + size;
> > +	int ret;
> > +
> > +	if (WARN_ON(!IS_ALIGNED(size, PAGE_SIZE) ||
> > +		    !IS_ALIGNED(ipa, size)))
> > +		return -EINVAL;
> > +
> > +	switch (prot & (KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC)) {
> > +	case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
> > +		return -EINVAL;
> > +	case KVM_PGTABLE_PROT_DEVICE:
> > +		attr = MT_S2_FWB_DEVICE_nGnRE;
> > +		break;
> > +	case KVM_PGTABLE_PROT_NORMAL_NC:
> > +		attr = MT_S2_FWB_NORMAL_NC;
> > +		break;
> > +	default:
> > +		attr = MT_S2_FWB_NORMAL;
> > +	}
> > +
> > +	flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_MEMATTR, attr);
> > +
> > +	if (prot & KVM_PGTABLE_PROT_R)
> > +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_READ);
> > +	if (prot & KVM_PGTABLE_PROT_W)
> > +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_WRITE);
> > +
> > +	flags |= RMI_ADDR_TYPE_SINGLE;
> > +
> > +	while (ipa < ipa_top) {
> > +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
> > +		unsigned long out_top;
> > +
> > +		ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags, range_desc,
> > +					 &out_top);
> > +
> > +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> > +			/* Create missing RTTs and retry */
> > +			int level = RMI_RETURN_INDEX(ret);
> > +
> > +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> > +			ret = realm_create_rtt_levels(realm, ipa, level,
> > +						      KVM_PGTABLE_LAST_LEVEL,
> > +						      memcache);
> > +			if (ret)
> > +				return ret;
> > +
> > +			ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags,
> > +						 range_desc, &out_top);
> > +		}
> > +
> > +		if (WARN_ON(ret))
> > +			return ret;
> > +
> > +		phys += out_top - ipa;
> > +		ipa = out_top;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >   static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >   			      struct page *src_page, void *opaque)
> >   {
> 
> Thanks,
> Gavin
> 

^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-05  8:11 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
	Will Deacon, James Morse, Oliver Upton, Suzuki K Poulose,
	Zenghui Yu, linux-arm-kernel, linux-kernel, Joey Gouly,
	Alexandru Elisei, Christoffer Dall, Fuad Tabba, linux-coco,
	Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <aiJ6u83O0nVUtPyv@lpieralisi>

On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
> 
> [...]
> 
>>> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
>>> +			 kvm_pfn_t pfn, unsigned long map_size,
>>> +			 enum kvm_pgtable_prot prot,
>>> +			 struct kvm_mmu_memory_cache *memcache)
>>> +{
>>> +	struct realm *realm = &kvm->arch.realm;
>>> +
>>> +	/*
>>> +	 * Write permission is required for now even though it's possible to
>>> +	 * map unprotected pages (granules) as read-only. It's impossible to
>>> +	 * map protected pages (granules) as read-only.
>>> +	 */
>>> +	if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
>>> +		return -EFAULT;
>>> +
>>
>> I'm a bit concerned with this. We don't have KVM_PGTABLE_PROT_W set in @prot
>> if the stage2 fault is raised due to memory read. With -EFAULT returned to VMM
>> (e.g. QEMU), the vCPU continuous execution is stopped and system won't be
>> working any more.
>>
>>> +	ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
>>> +	if (!kvm_realm_is_private_address(realm, ipa))
>>> +		return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
>>> +					    memcache);
>>> +
>>> +	return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
>>> +}
>>> +
>>>    static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
>>>    {
>>>    	switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
>>> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>>>    	bool write_fault, exec_fault;
>>>    	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>>>    	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>>> -	struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
>>> +	struct kvm_vcpu *vcpu = s2fd->vcpu;
>>> +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
>>> +	gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>>>    	unsigned long mmu_seq;
>>>    	struct page *page;
>>> -	struct kvm *kvm = s2fd->vcpu->kvm;
>>> +	struct kvm *kvm = vcpu->kvm;
>>>    	void *memcache;
>>>    	kvm_pfn_t pfn;
>>>    	gfn_t gfn;
>>>    	int ret;
>>> -	memcache = get_mmu_memcache(s2fd->vcpu);
>>> -	ret = topup_mmu_memcache(s2fd->vcpu, memcache);
>>> +	if (kvm_is_realm(vcpu->kvm)) {
>>> +		/* check for memory attribute mismatch */
>>> +		bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
>>> +		/*
>>> +		 * For Realms, the shared address is an alias of the private
>>> +		 * PA with the top bit set. Thus if the fault address matches
>>> +		 * the GPA then it is the private alias.
>>> +		 */
>>> +		bool is_priv_fault = (gpa == s2fd->fault_ipa);
>>> +
>>> +		if (is_priv_gfn != is_priv_fault) {
>>> +			kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>> +						      kvm_is_write_fault(vcpu),
>>> +						      false,
>>> +						      is_priv_fault);
>>> +			/*
>>> +			 * KVM_EXIT_MEMORY_FAULT requires an return code of
>>> +			 * -EFAULT, see the API documentation
>>> +			 */
>>> +			return -EFAULT;
>>> +		}
>>> +	}
>>> +
>>> +	memcache = get_mmu_memcache(vcpu);
>>> +	ret = topup_mmu_memcache(vcpu, memcache);
>>>    	if (ret)
>>>    		return ret;
>>>    	if (s2fd->nested)
>>>    		gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
>>>    	else
>>> -		gfn = s2fd->fault_ipa >> PAGE_SHIFT;
>>> +		gfn = gpa >> PAGE_SHIFT;
>>> -	write_fault = kvm_is_write_fault(s2fd->vcpu);
>>> -	exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
>>> +	write_fault = kvm_is_write_fault(vcpu);
>>> +	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>>>    	VM_WARN_ON_ONCE(write_fault && exec_fault);
>>> @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>>>    	ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
>>>    	if (ret) {
>>> -		kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE,
>>> +		kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>>    					      write_fault, exec_fault, false);
>>>    		return ret;
>>>    	}
>>> @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>>>    	kvm_fault_lock(kvm);
>>>    	if (mmu_invalidate_retry(kvm, mmu_seq)) {
>>>    		ret = -EAGAIN;
>>> -		goto out_unlock;
>>> +		goto out_release_page;
>>> +	}
>>> +
>>> +	if (kvm_is_realm(kvm)) {
>>> +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
>>> +				    PAGE_SIZE, KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W, memcache);
>>> +		goto out_release_page;
>>>    	}
>>>    	ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
>>>    						 __pfn_to_phys(pfn), prot,
>>>    						 memcache, flags);
>>> -out_unlock:
>>> +out_release_page:
>>>    	kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
>>>    	kvm_fault_unlock(kvm);
>>> @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd,
>>>    	 * mapping size to ensure we find the right PFN and lay down the
>>>    	 * mapping in the right place.
>>>    	 */
>>> -	s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT;
>>> +	s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize)) >> PAGE_SHIFT;
>>>    	s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
>>> @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
>>>    		prot &= ~KVM_NV_GUEST_MAP_SZ;
>>>    		ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
>>>    								 prot, flags);
>>> +	} else if (kvm_is_realm(kvm)) {
>>> +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
>>> +				    prot, memcache);
>>>    	} else {
>>>    		ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
>>>    							 __pfn_to_phys(pfn), prot,
>>
>> For the case kvm_is_realm(), need we adjust 's2fd->fault_ipa' for the sake of
>> huge pages. In kvm_s2_fault_map(), @gfn and @pfn may have been adjusted by
>> transparent_hugepage_adjust() to be aligned with huge page size. If the
>> adjustment happened in transparent_hugepage_adjust(), we need to align
>> s2fd->fault_ipa down to the huge page size either.
> 
> All of the above + some RMM changes are needed to get QEmu VMM going
> with anon pages guest memory backing - currently testing various
> configurations in the background.
> 

I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and found
that memory slots backed by THP are broken. With THP disabled on the host and
other fixes (mentioned in my prevous replies) applied on the top of this (v14)
series, I'm able to boot a realm guest with rebased QEMU series [2], plus more
fxies on the top.

[1] https://git.codelinaro.org/linaro/dcap/qemu.git  (branch: cca/latest)
[2] https://git.qemu.org/git/qemu.git                (branch: cca/gavin)

Lorenzo, You may be saying there is someone making QEMU to support ARM/CCA?
If so, I'm not sure if there is a QEMU repository for me to try?

Thanks,
Gavin

> Thanks,
> Lorenzo
> 
>>> @@ -2214,6 +2285,13 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>>>    	return 0;
>>>    }
>>> +static bool shared_ipa_fault(struct kvm *kvm, phys_addr_t fault_ipa)
>>> +{
>>> +	gpa_t gpa = kvm_gpa_from_fault(kvm, fault_ipa);
>>> +
>>> +	return (gpa != fault_ipa);
>>> +}
>>> +
>>>    /**
>>>     * kvm_handle_guest_abort - handles all 2nd stage aborts
>>>     * @vcpu:	the VCPU pointer
>>> @@ -2324,8 +2402,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>>    		nested = &nested_trans;
>>>    	}
>>> -	gfn = ipa >> PAGE_SHIFT;
>>> +	gfn = kvm_gpa_from_fault(vcpu->kvm, ipa) >> PAGE_SHIFT;
>>>    	memslot = gfn_to_memslot(vcpu->kvm, gfn);
>>> +
>>>    	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
>>>    	write_fault = kvm_is_write_fault(vcpu);
>>>    	if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
>>> @@ -2368,7 +2447,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>>    		 * of the page size.
>>>    		 */
>>>    		ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(vcpu));
>>> -		ret = io_mem_abort(vcpu, ipa);
>>> +		ret = io_mem_abort(vcpu, kvm_gpa_from_fault(vcpu->kvm, ipa));
>>>    		goto out_unlock;
>>>    	}
>>> @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>>    				!write_fault &&
>>>    				!kvm_vcpu_trap_is_exec_fault(vcpu));
>>> -		if (kvm_slot_has_gmem(memslot))
>>> +		if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
>>>    			ret = gmem_abort(&s2fd);
>>>    		else
>>>    			ret = user_mem_abort(&s2fd);
>>> @@ -2433,6 +2512,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>>>    	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
>>>    		return false;
>>> +	/* We don't support aging for Realms */
>>> +	if (kvm_is_realm(kvm))
>>> +		return true;
>>> +
>>>    	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
>>>    						   range->start << PAGE_SHIFT,
>>>    						   size, true);
>>> @@ -2449,6 +2532,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>>>    	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
>>>    		return false;
>>> +	/* We don't support aging for Realms */
>>> +	if (kvm_is_realm(kvm))
>>> +		return true;
>>> +
>>>    	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
>>>    						   range->start << PAGE_SHIFT,
>>>    						   size, false);
>>> @@ -2628,10 +2715,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>>    		return -EFAULT;
>>>    	/*
>>> -	 * Only support guest_memfd backed memslots with mappable memory, since
>>> -	 * there aren't any CoCo VMs that support only private memory on arm64.
>>> +	 * Only support guest_memfd backed memslots with mappable memory,
>>> +	 * unless the guest is a CCA realm guest.
>>>    	 */
>>> -	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
>>> +	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new) &&
>>> +	    !kvm_is_realm(kvm))
>>>    		return -EINVAL;
>>>    	hva = new->userspace_addr;
>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>> index cae29fd3353c..761b38a4071c 100644
>>> --- a/arch/arm64/kvm/rmi.c
>>> +++ b/arch/arm64/kvm/rmi.c
>>> @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
>>>    	return ret;
>>>    }
>>> +static unsigned long addr_range_desc(unsigned long phys, unsigned long size)
>>> +{
>>> +	unsigned long out = 0;
>>> +
>>> +	switch (size) {
>>> +	case P4D_SIZE:
>>> +		out = 3 | (1 << 2);
>>> +		break;
>>> +	case PUD_SIZE:
>>> +		out = 2 | (1 << 2);
>>> +		break;
>>> +	case PMD_SIZE:
>>> +		out = 1 | (1 << 2);
>>> +		break;
>>> +	case PAGE_SIZE:
>>> +		out = 0 | (1 << 2);
>>> +		break;
>>> +	default:
>>> +		/*
>>> +		 * Only support mapping at the page level granulatity when
>>> +		 * it's an unusual length. This should get us back onto a larger
>>> +		 * block size for the subsequent mappings.
>>> +		 */
>>> +		out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
>>> +		break;
>>> +	}
>>> +
>>> +	WARN_ON(phys & ~PAGE_MASK);
>>> +
>>> +	out |= phys & PAGE_MASK;
>>> +
>>> +	return out;
>>> +}
>>> +
>>> +int realm_map_protected(struct kvm *kvm,
>>> +			unsigned long ipa,
>>> +			kvm_pfn_t pfn,
>>> +			unsigned long map_size,
>>> +			struct kvm_mmu_memory_cache *memcache)
>>> +{
>>> +	struct realm *realm = &kvm->arch.realm;
>>> +	phys_addr_t phys = __pfn_to_phys(pfn);
>>> +	phys_addr_t base_phys = phys;
>>> +	phys_addr_t rd = virt_to_phys(realm->rd);
>>> +	unsigned long base_ipa = ipa;
>>> +	unsigned long ipa_top = ipa + map_size;
>>> +	int ret = 0;
>>> +
>>> +	if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
>>> +		    !IS_ALIGNED(ipa, map_size)))
>>> +		return -EINVAL;
>>> +
>>> +	if (rmi_delegate_range(phys, map_size)) {
>>> +		/*
>>> +		 * It's likely we raced with another VCPU on the same
>>> +		 * fault. Assume the other VCPU has handled the fault
>>> +		 * and return to the guest.
>>> +		 */
>>> +		return 0;
>>> +	}
>>> +
>>> +	while (ipa < ipa_top) {
>>> +		unsigned long flags = RMI_ADDR_TYPE_SINGLE;
>>> +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
>>> +		unsigned long out_top;
>>> +
>>> +		ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
>>> +				       &out_top);
>>> +
>>> +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>>> +			/* Create missing RTTs and retry */
>>> +			int level = RMI_RETURN_INDEX(ret);
>>> +
>>> +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
>>> +			ret = realm_create_rtt_levels(realm, ipa, level,
>>> +						      KVM_PGTABLE_LAST_LEVEL,
>>> +						      memcache);
>>> +			if (ret)
>>> +				goto err_undelegate;
>>> +
>>> +			ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
>>> +					       range_desc, &out_top);
>>> +		}
>>> +
>>> +		if (WARN_ON(ret))
>>> +			goto err_undelegate;
>>> +
>>> +		phys += out_top - ipa;
>>> +		ipa = out_top;
>>> +	}
>>> +
>>> +	return 0;
>>> +
>>> +err_undelegate:
>>> +	realm_unmap_private_range(kvm, base_ipa, ipa, true);
>>> +	if (WARN_ON(rmi_undelegate_range(base_phys, map_size))) {
>>> +		/* Page can't be returned to NS world so is lost */
>>> +		get_page(phys_to_page(base_phys));
>>> +	}
>>> +	return -ENXIO;
>>> +}
>>> +
>>> +int realm_map_non_secure(struct realm *realm,
>>> +			 unsigned long ipa,
>>> +			 kvm_pfn_t pfn,
>>> +			 unsigned long size,
>>> +			 enum kvm_pgtable_prot prot,
>>> +			 struct kvm_mmu_memory_cache *memcache)
>>> +{
>>> +	unsigned long attr, flags = 0;
>>> +	phys_addr_t rd = virt_to_phys(realm->rd);
>>> +	phys_addr_t phys = __pfn_to_phys(pfn);
>>> +	unsigned long ipa_top = ipa + size;
>>> +	int ret;
>>> +
>>> +	if (WARN_ON(!IS_ALIGNED(size, PAGE_SIZE) ||
>>> +		    !IS_ALIGNED(ipa, size)))
>>> +		return -EINVAL;
>>> +
>>> +	switch (prot & (KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC)) {
>>> +	case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
>>> +		return -EINVAL;
>>> +	case KVM_PGTABLE_PROT_DEVICE:
>>> +		attr = MT_S2_FWB_DEVICE_nGnRE;
>>> +		break;
>>> +	case KVM_PGTABLE_PROT_NORMAL_NC:
>>> +		attr = MT_S2_FWB_NORMAL_NC;
>>> +		break;
>>> +	default:
>>> +		attr = MT_S2_FWB_NORMAL;
>>> +	}
>>> +
>>> +	flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_MEMATTR, attr);
>>> +
>>> +	if (prot & KVM_PGTABLE_PROT_R)
>>> +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_READ);
>>> +	if (prot & KVM_PGTABLE_PROT_W)
>>> +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_WRITE);
>>> +
>>> +	flags |= RMI_ADDR_TYPE_SINGLE;
>>> +
>>> +	while (ipa < ipa_top) {
>>> +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
>>> +		unsigned long out_top;
>>> +
>>> +		ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags, range_desc,
>>> +					 &out_top);
>>> +
>>> +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>>> +			/* Create missing RTTs and retry */
>>> +			int level = RMI_RETURN_INDEX(ret);
>>> +
>>> +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
>>> +			ret = realm_create_rtt_levels(realm, ipa, level,
>>> +						      KVM_PGTABLE_LAST_LEVEL,
>>> +						      memcache);
>>> +			if (ret)
>>> +				return ret;
>>> +
>>> +			ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags,
>>> +						 range_desc, &out_top);
>>> +		}
>>> +
>>> +		if (WARN_ON(ret))
>>> +			return ret;
>>> +
>>> +		phys += out_top - ipa;
>>> +		ipa = out_top;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>>    static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>>>    			      struct page *src_page, void *opaque)
>>>    {
>>
>> Thanks,
>> Gavin
>>
> 


^ permalink raw reply

* Re: [PATCH 03/15] x86/virt/tdx: Make TDX Module initialize Extensions
From: Tony Lindgren @ 2026-06-05  8:46 UTC (permalink / raw)
  To: Xu Yilun
  Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <20260522034128.3144354-4-yilun.xu@linux.intel.com>

On Fri, May 22, 2026 at 11:41:16AM +0800, Xu Yilun wrote:
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1200,6 +1200,22 @@ static u64 to_hpa_list_info(struct page *root, unsigned int nr_pages)
>  	       FIELD_PREP(HPA_LIST_INFO_LAST_ENTRY, nr_pages - 1);
>  }
>  
> +/* Initialize the TDX Module Extensions then Extension-SEAMCALLs can be used */
> +static int tdx_ext_init(void)
> +{
> +	struct tdx_module_args args = {};
> +	u64 r;
> +
> +	do {
> +		r = seamcall(TDH_EXT_INIT, &args);
> +	} while (r == TDX_INTERRUPTED_RESUMABLE);
> +
> +	if (r != TDX_SUCCESS)
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
>  static int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
>  {
>  	struct tdx_module_args args = {

How about "Initialize the TDX Module Extensions for Extension-SEAMCALLs"
above for the comment?

Other than that:

Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>

^ permalink raw reply

* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Suzuki K Poulose @ 2026-06-05  8:54 UTC (permalink / raw)
  To: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgF43RBv77RgM67kXRRHDnQw4L5uwQTuvkJHzkHJWB1mag@mail.gmail.com>

On 04/06/2026 20:05, Ackerley Tng wrote:
> Suzuki K Poulose <suzuki.poulose@arm.com> writes:
> 
>>
>> [...snip...]
>>
>>> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
>>> +ignored completely. Otherwise, ``uaddr`` is required if
>>> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
>>> +in the latter case guest memory can be initialized directly from userspace
>>> +prior to converting it to private and passing the GPA range on to this
>>> +interface.
>>
>> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in
>> the process of making it "private" ? i.e., the contents of a SNP shared
>> page are preserved while transitioning to "SNP Private" (via RMP
>> update).
>>
>> Suzuki
>>
> 
> The following is the guest_memfd perspective, I didn't look at the SNP
> spec:
> 
> Do you mean specifically for KVM_SEV_SNP_PAGE_TYPE_ZERO, or for any
> type?
> 
> guest_memfd has no plans to do any special zeroing based on type.
> 
> guest_memfd decoupled zeroing from preparation a while ago (Michael had
> some patches), so zeroing is supposed to be once during folio ownership
> by guest_memfd, tracked by the uptodate flag, and preparation is tracked
> outside of guest_memfd. So far only SNP does preparation.

I am talking about the SEV SNP conversions (specifically quoted in my 
response), I will follow up on Michael's response.

Suzuki


^ permalink raw reply

* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Suzuki K Poulose @ 2026-06-05  9:06 UTC (permalink / raw)
  To: Michael Roth
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <4muegrza5iyyhqx6wevdlssnb6wvlc4m4wmuz5hmd3xikkftc4@3e2lpuq6tjgr>

On 04/06/2026 21:11, Michael Roth wrote:
> On Thu, Jun 04, 2026 at 04:29:19PM +0100, Suzuki K Poulose wrote:
>> On 23/05/2026 01:18, Ackerley Tng via B4 Relay wrote:
>>> From: Michael Roth <michael.roth@amd.com>
>>>
>>> For vm_memory_attributes=1, in-place conversion/population is not
>>> supported, so the initial contents necessarily must need to come
>>> from a separate src address, which is enforced by the current
>>> implementation. However, for vm_memory_attributes=0, it is possible for
>>> guest memory to be initialized directly from userspace by mmap()'ing the
>>> guest_memfd and writing to it while the corresponding GPA ranges are in
>>> a 'shared' state before converting them to the 'private' state expected
>>> by KVM_SEV_SNP_LAUNCH_UPDATE.
>>>
>>> Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
>>> for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
>>> SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
>>> copy in data from a separate memory location. Continue to enforce
>>> non-NULL for the original vm_memory_attributes=1 case.
>>>
>>> Signed-off-by: Michael Roth <michael.roth@amd.com>
>>> [Added src_page check in error handling path when the firmware command fails]
>>> [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
>>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>>
>>
>>
>>> ---
>>>    Documentation/virt/kvm/x86/amd-memory-encryption.rst | 15 +++++++++++----
>>>    arch/x86/kvm/svm/sev.c                               | 18 +++++++++++++-----
>>>    virt/kvm/kvm_main.c                                  |  1 +
>>>    3 files changed, 25 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
>>> index b2395dd4769de..43085f65b2d85 100644
>>> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
>>> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
>>> @@ -503,7 +503,8 @@ secrets.
>>>    It is required that the GPA ranges initialized by this command have had the
>>>    KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
>>> -for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
>>> +for KVM_SET_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES2 for more details on
>>> +this aspect.
>>>    Upon success, this command is not guaranteed to have processed the entire
>>>    range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
>>> @@ -511,9 +512,15 @@ range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
>>>    remaining range that has yet to be processed. The caller should continue
>>>    calling this command until those fields indicate the entire range has been
>>>    processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
>>> -range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
>>> -buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
>>> -``uaddr`` will be ignored completely.
>>> +range plus 1, and ``uaddr`` (if specified) is the last byte of the
>>> +userspace-provided source buffer address plus 1.
>>> +
>>> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
>>> +ignored completely. Otherwise, ``uaddr`` is required if
>>> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
>>> +in the latter case guest memory can be initialized directly from userspace
>>> +prior to converting it to private and passing the GPA range on to this
>>> +interface.
>>
>> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in the
>> process of making it "private" ? i.e., the contents of a SNP shared
>> page are preserved while transitioning to "SNP Private" (via RMP
>> update).
> 
> sev_gmem_prepare() does sort of destroy contents since it finalizes the
> shared->private conversion which puts the page in an unusable state
> until the guest 'accepts' it as private memory and re-initializes the
> contents.
> 
> But that's run-time, when the guest is doing conversions. The
> documentation here is relating to initialization time when we are
> setting up the initial pre-encrypted/pre-measured guest memory image,
> via SNP_LAUNCH_UPDATE. That path calls into kvm_gmem_populate(), and it
> is then sev_gmem_post_populate() callback that actually finalizes the
> shared->private conversion. The sev_gmem_prepare() hook doesn't get used
> in this flow (kvm_gmem_populate() calls __kvm_gmem_get_pfn() which skips
> preparation).

Thanks, thats the bit I was missing. Skipping the prepare path, with 
__kvm_gmem_get_pfn(). I was under the assumption that 
kvm_arch_gmem_prepare() was called for all PFNs allocated from gmem
and how SNP was handling this populate case.


Thanks
Suzuki


> 
> -Mike
> 
>>
>> Suzuki
>>
>>

^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-05 11:20 UTC (permalink / raw)
  To: Steven Price, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-30-steven.price@arm.com>

Hi Steve,

On 5/13/26 11:17 PM, Steven Price wrote:
> At runtime if the realm guest accesses memory which hasn't yet been
> mapped then KVM needs to either populate the region or fault the guest.
> 
> For memory in the lower (protected) region of IPA a fresh page is
> provided to the RMM which will zero the contents. For memory in the
> upper (shared) region of IPA, the memory from the memslot is mapped
> into the realm VM non secure.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes since v13:
>   * Numerous changes due to rebasing.
>   * Fix addr_range_desc() to encode the correct block size.
> Changes since v12:
>   * Switch to RMM v2.0 range based APIs.
> Changes since v11:
>   * Adapt to upstream changes.
> Changes since v10:
>   * RME->RMI renaming.
>   * Adapt to upstream gmem changes.
> Changes since v9:
>   * Fix call to kvm_stage2_unmap_range() in kvm_free_stage2_pgd() to set
>     may_block to avoid stall warnings.
>   * Minor coding style fixes.
> Changes since v8:
>   * Propagate the may_block flag.
>   * Minor comments and coding style changes.
> Changes since v7:
>   * Remove redundant WARN_ONs for realm_create_rtt_levels() - it will
>     internally WARN when necessary.
> Changes since v6:
>   * Handle PAGE_SIZE being larger than RMM granule size.
>   * Some minor renaming following review comments.
> Changes since v5:
>   * Reduce use of struct page in preparation for supporting the RMM
>     having a different page size to the host.
>   * Handle a race when delegating a page where another CPU has faulted on
>     a the same page (and already delegated the physical page) but not yet
>     mapped it. In this case simply return to the guest to either use the
>     mapping from the other CPU (or refault if the race is lost).
>   * The changes to populate_par_region() are moved into the previous
>     patch where they belong.
> Changes since v4:
>   * Code cleanup following review feedback.
>   * Drop the PTE_SHARED bit when creating unprotected page table entries.
>     This is now set by the RMM and the host has no control of it and the
>     spec requires the bit to be set to zero.
> Changes since v2:
>   * Avoid leaking memory if failing to map it in the realm.
>   * Correctly mask RTT based on LPA2 flag (see rtt_get_phys()).
>   * Adapt to changes in previous patches.
> ---
>   arch/arm64/include/asm/kvm_emulate.h |   8 ++
>   arch/arm64/include/asm/kvm_rmi.h     |  12 ++
>   arch/arm64/kvm/mmu.c                 | 128 ++++++++++++++++----
>   arch/arm64/kvm/rmi.c                 | 173 +++++++++++++++++++++++++++
>   4 files changed, 301 insertions(+), 20 deletions(-)
> 

[...]

> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>   	bool write_fault, exec_fault;
>   	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>   	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> -	struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
> +	struct kvm_vcpu *vcpu = s2fd->vcpu;
> +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> +	gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>   	unsigned long mmu_seq;
>   	struct page *page;
> -	struct kvm *kvm = s2fd->vcpu->kvm;
> +	struct kvm *kvm = vcpu->kvm;
>   	void *memcache;
>   	kvm_pfn_t pfn;
>   	gfn_t gfn;
>   	int ret;
>   
> -	memcache = get_mmu_memcache(s2fd->vcpu);
> -	ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> +	if (kvm_is_realm(vcpu->kvm)) {
> +		/* check for memory attribute mismatch */
> +		bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> +		/*
> +		 * For Realms, the shared address is an alias of the private
> +		 * PA with the top bit set. Thus if the fault address matches
> +		 * the GPA then it is the private alias.
> +		 */
> +		bool is_priv_fault = (gpa == s2fd->fault_ipa);
> +
> +		if (is_priv_gfn != is_priv_fault) {
> +			kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
> +						      kvm_is_write_fault(vcpu),
> +						      false,
> +						      is_priv_fault);
> +			/*
> +			 * KVM_EXIT_MEMORY_FAULT requires an return code of
> +			 * -EFAULT, see the API documentation
> +			 */
> +			return -EFAULT;
> +		}
> +	}
> +

For a Realm, gmem_abort() is called by kvm_handle_guest_abort() only when
we're faulting in the private (protected) space.

     if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
         ret = gmem_abort(&s2fd);
     else
         ret = user_mem_abort(&s2fd);

With the condition, this block of code can be simplied to handle conversion
(shared -> private) instead of both directions.

     /* Convert the shared address to the private adress for Realm */
     if (kvm_is_realm(vcpu->kvm) &&
         !kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT)) {
         /*
          * KVM_EXIT_MEMORY_FAULT requires an return code of
          * -EFAULT, see the API documentation
          */
         kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
                                       kvm_is_write_fault(vcpu),
                                       false, true);
         return -EFAULT;
     }


[...]

> @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>   				!write_fault &&
>   				!kvm_vcpu_trap_is_exec_fault(vcpu));
>   
> -		if (kvm_slot_has_gmem(memslot))
> +		if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
>   			ret = gmem_abort(&s2fd);
>   		else
>   			ret = user_mem_abort(&s2fd);
gmem_abort() is only called for faults in the protected (private) space.

Thanks,
Gavin


^ permalink raw reply

* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Kiryl Shutsemau @ 2026-06-05 11:42 UTC (permalink / raw)
  To: Chao Gao
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Huang, Kai, Hansen, Dave, Zhao, Yan Y,
	seanjc@google.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com,
	linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
	Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
	x86@kernel.org
In-Reply-To: <aiJhScChLZkH44eB@intel.com>

On Fri, Jun 05, 2026 at 01:40:25PM +0800, Chao Gao wrote:
> On Thu, Jun 04, 2026 at 05:59:02PM +0100, Kiryl Shutsemau wrote:
> >On Tue, May 26, 2026 at 04:42:24PM +0000, Edgecombe, Rick P wrote:
> >> On Tue, 2026-05-26 at 16:57 +0800, Chao Gao wrote:
> >> > > -	scoped_guard(spinlock, &pamt_lock) {
> >> > 
> >> > This converts the scoped_guard() added by the previous patch to
> >> > explicit lock/unlock and goto. It would reduce code churn if the
> >> > previous patch used that form directly.
> >> 
> >> Yea, it's a good point. I actually debated doing it, but decided not to because
> >> the scoped version is cleaner for the non-optimized version. But for
> >> reviewability, never doing the scoped version is probably better.
> >
> >I don't see a reason why we can't keep the scoped_guard() on get side.
> 
> One additional reason to drop scoped_guard() is that it mixes cleanup helpers
> with goto, which is discouraged. See [*]
> 
>  :Lastly, given that the benefit of cleanup helpers is removal of “goto”, and
>  :that the “goto” statement can jump between scopes, the expectation is that
>  :usage of “goto” and cleanup helpers is never mixed in the same function.

Fair enough.

But it can also be address if we free the PAMT page array with the guard
too :P

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH v4 3/3] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Kiryl Shutsemau @ 2026-06-05 11:57 UTC (permalink / raw)
  To: dave.hansen, Binbin Wu
  Cc: tglx, mingo, bp, seanjc, pbonzini, sathyanarayanan.kuppuswamy,
	kai.huang, xiaoyao.li, rick.p.edgecombe, david.laight.linux, ak,
	djbw, tsyrulnikov.borys, x86, kvm, linux-coco, linux-kernel,
	stable
In-Reply-To: <22c789c3-13b1-4c39-898f-2eec3bce98c1@linux.intel.com>

On Fri, Jun 05, 2026 at 03:10:39PM +0800, Binbin Wu wrote:
> 
> 
> On 6/4/2026 10:47 PM, Kiryl Shutsemau (Meta) wrote:
> > According to x86 architecture rules, 32-bit operations zero-extend the
> > result to 64 bits. The current implementation of handle_in() only masks
> > the lower 32 bits, which preserves the upper 32 bits of RAX when a
> > 32-bit port IN instruction is emulated.
> > 
> > Use insn_assign_reg() to write the result back into RAX with proper
> > partial-register-write semantics: 1- and 2-byte forms leave the upper
> > bits untouched, the 4-byte form zero-extends to the full register.
> > 
> > Fixes: 03149948832a ("x86/tdx: Port I/O: Add runtime hypercalls")
> > Reported-by: Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
> > Link: https://lore.kernel.org/all/CAKw_Dz96rfSQc6Rn+9QBcUFHhmkK+9zu+P=bxowfZwxrATCBRg@mail.gmail.com/
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > Cc: stable@vger.kernel.org
> 
> I think the concern sashiko commented in patch 2 is valid.

Yeah. I guess I'll just use the KVM implementation verbatim.

Dave, any objections?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Thomas Gleixner @ 2026-06-05 12:33 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <20260529144435.704127-2-seanjc@google.com>

On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
> frequency.

That's misleading because fixed frequency means that the frequency does
not change, i.e. X86_FEATURE_CONSTANT_TSC is set. But
X86_FEATURE_CONSTANT_TSC does not imply that the frequency can be read
from CPUID/MSRs.

> In practice, this is likely one big nop, as re-calibration is
> used only for SMP=n kernels, and only for hardware that is 20+ years old,
> i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.

recalibrate_cpu_khz() is only invoked from Intel P4 and AMD K7 CPU
frequency drivers, which means that's absolutely not interesting and
neither X86_FEATURE_CONSTANT_TSC nor X86_FEATURE_TSC_KNOWN_FREQ can be
set on those systems.

IOW, this patch is pointless voodoo ware.

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCH v4 02/47] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: Thomas Gleixner @ 2026-06-05 12:37 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <20260529144435.704127-3-seanjc@google.com>

On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
>  		cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
> -		crystal_khz = eax_base_mhz * 1000 *
> -			eax_denominator / ebx_numerator;
> +		info.crystal_khz = eax_base_mhz * 1000 *
> +			info.denominator / info.numerator;

Please get rid of this ugly line break. You have 100 characters.


^ permalink raw reply

* [POC] KVM: selftests: Verify conversion works with TDX
From: Ackerley Tng @ 2026-06-05 13:41 UTC (permalink / raw)
  To: devnull+ackerleytng.google.com
  Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
	baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
	dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
	jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
	linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
	mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
	pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
	shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
	tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
	yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <20260522-gmem-inplace-conversion-v7-0-2f0fae496530@google.com>

This POC shows that conversions works with TDX:

1. Find 2 pages in GVA space, map those twice, once as private and once as
   shared. This avoids having to manipulate page tables in the guest.
2. Use memory as private memory in the guest.
3. Request to convert memory to shared.
4. Write shared memory in the guest, check in the host.
5. Write shared memory in the host, check in the guest.
6. Request to convert memory to private.
7. Use memory as private memory in the guest.

I based this on Lisa's series at [1].

[1] https://lore.kernel.org/all/20260521-tdx-selftests-v13-v13-0-6983ae4c3a4d@google.com/

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/x86/tdx_vm_test.c | 154 ++++++++++++++++++
 1 file changed, 154 insertions(+)

diff --git a/tools/testing/selftests/kvm/x86/tdx_vm_test.c b/tools/testing/selftests/kvm/x86/tdx_vm_test.c
index 7cdcaf33b585b..093921af7d93e 100644
--- a/tools/testing/selftests/kvm/x86/tdx_vm_test.c
+++ b/tools/testing/selftests/kvm/x86/tdx_vm_test.c
@@ -26,6 +26,160 @@ TEST(verify_td_lifecycle)
 	kvm_vm_free(vm);
 }

+static gva_t conversions_private_gva;
+static gpa_t conversions_private_gpa;
+static gva_t conversions_shared_gva;
+static gpa_t conversions_shared_gpa;
+static size_t conversions_size;
+
+u64 tdx_map_gpa(u64 gpa, u64 size)
+{
+#define TDG_VP_VMCALL 0
+#define TDG_VP_VMCALL_MAP_GPA 0x10001
+#define TDVMCALL_EXPOSE_REGS_MASK 0xFC00
+	register u64 r10_reg asm("r10") = TDG_VP_VMCALL;
+	register u64 r11_reg asm("r11") = TDG_VP_VMCALL_MAP_GPA;
+	register u64 r12_reg asm("r12") = gpa;
+	register u64 r13_reg asm("r13") = size;
+	register u64 rax_reg asm("rax") = TDG_VP_VMCALL;
+	register u64 rcx_reg asm("rcx") = TDVMCALL_EXPOSE_REGS_MASK;
+
+	asm volatile(
+	 ".byte 0x66,0x0f,0x01,0xcc" /* tdcall */
+	 : "+r" (r10_reg), "+r" (r11_reg)
+	 : "r" (r12_reg), "r" (r13_reg), "r" (rax_reg), "r" (rcx_reg)
+	 : "cc", "memory"
+	);
+
+	return r10_reg;
+}
+
+enum accept_page_level {
+	PAGE_LEVEL_4K = 0,
+	PAGE_LEVEL_2M,
+};
+
+u64 tdx_accept_page(u64 gpa, enum accept_page_level level)
+{
+#define TDG_MEM_PAGE_ACCEPT 6
+	register u64 rax_reg asm("rax") = TDG_MEM_PAGE_ACCEPT;
+	register u64 rcx_reg asm("rcx") = gpa | level;
+
+	asm volatile(
+	 ".byte 0x66,0x0f,0x01,0xcc" /* tdcall */
+	 : "+r" (rax_reg)
+	 : "r" (rcx_reg)
+	 : "cc", "memory"
+	);
+
+	return rax_reg;
+}
+
+static void handle_hypercall_map_gpa(struct kvm_vcpu *vcpu)
+{
+	struct kvm_run *run = vcpu->run;
+	u64 attributes;
+	size_t size;
+	gpa_t gpa;
+
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_HYPERCALL);
+	TEST_ASSERT_EQ(run->hypercall.nr, KVM_HC_MAP_GPA_RANGE);
+	TEST_ASSERT_EQ(run->hypercall.flags, KVM_EXIT_HYPERCALL_LONG_MODE);
+
+	gpa = run->hypercall.args[0];
+	size = run->hypercall.args[1] * PAGE_SIZE;
+	attributes = 0;
+	if (run->hypercall.args[2] & KVM_MAP_GPA_RANGE_ENCRYPTED)
+		attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+	vm_mem_set_memory_attributes(vcpu->vm, gpa, size, attributes);
+}
+
+#define CONVERSIONS_PRIVATE_VAL 'A'
+#define CONVERSIONS_GUEST_SHARED_VAL 'B'
+#define CONVERSIONS_HOST_SHARED_VAL 'C'
+#define CONVERSIONS_STAGE_WROTE_SHARED 0x99
+
+static void guest_code_conversions(void)
+{
+	char *addr;
+
+	addr = (void *)conversions_private_gva;
+	WRITE_ONCE(*addr, CONVERSIONS_PRIVATE_VAL);
+	GUEST_ASSERT_EQ(READ_ONCE(*addr), CONVERSIONS_PRIVATE_VAL);
+
+	GUEST_ASSERT_EQ(tdx_map_gpa(conversions_shared_gpa, conversions_size), 0);
+
+	addr = (void *)conversions_shared_gva;
+	WRITE_ONCE(*addr, CONVERSIONS_GUEST_SHARED_VAL);
+	GUEST_ASSERT_EQ(READ_ONCE(*addr), CONVERSIONS_GUEST_SHARED_VAL);
+
+	GUEST_SYNC(CONVERSIONS_STAGE_WROTE_SHARED);
+
+	GUEST_ASSERT_EQ(READ_ONCE(*addr), CONVERSIONS_HOST_SHARED_VAL);
+
+	GUEST_ASSERT_EQ(tdx_map_gpa(conversions_private_gpa, conversions_size), 0);
+	GUEST_ASSERT_EQ(tdx_accept_page(conversions_private_gpa, PAGE_LEVEL_4K), 0);
+
+	addr = (void *)conversions_private_gva;
+	WRITE_ONCE(*addr, CONVERSIONS_PRIVATE_VAL);
+	GUEST_ASSERT_EQ(READ_ONCE(*addr), CONVERSIONS_PRIVATE_VAL);
+
+	GUEST_DONE();
+}
+
+TEST(verify_conversions)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	struct ucall uc;
+	char *test_hva;
+
+	vm = __vm_create(VM_SHAPE_TDX, 1, 0);
+	vcpu = vm_vcpu_add(vm, 0, guest_code_conversions);
+
+	conversions_size = getpagesize();
+
+	conversions_private_gva = vm_alloc_page(vm);
+	conversions_shared_gva = vm_alloc_shared(vm, conversions_size,
+						 KVM_UTIL_MIN_VADDR,
+						 MEM_REGION_TEST_DATA);
+	conversions_private_gpa = addr_gva2gpa(vm, conversions_private_gva);
+	conversions_shared_gpa = conversions_private_gpa | BIT_ULL(vm->pa_bits - 1);
+
+	vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
+
+	sync_global_to_guest(vm, conversions_size);
+	sync_global_to_guest(vm, conversions_private_gva);
+	sync_global_to_guest(vm, conversions_private_gpa);
+	sync_global_to_guest(vm, conversions_shared_gva);
+	sync_global_to_guest(vm, conversions_shared_gpa);
+
+	kvm_arch_vm_finalize_vcpus(vm);
+
+	test_hva = addr_gva2hva(vm, conversions_shared_gva);
+
+	vcpu_run(vcpu);
+	handle_hypercall_map_gpa(vcpu);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC);
+	TEST_ASSERT_EQ(uc.args[1], CONVERSIONS_STAGE_WROTE_SHARED);
+
+	TEST_ASSERT_EQ(READ_ONCE(*test_hva), CONVERSIONS_GUEST_SHARED_VAL);
+
+	WRITE_ONCE(*test_hva, CONVERSIONS_HOST_SHARED_VAL);
+	TEST_ASSERT_EQ(READ_ONCE(*test_hva), CONVERSIONS_HOST_SHARED_VAL);
+
+	vcpu_run(vcpu);
+	handle_hypercall_map_gpa(vcpu);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_DONE);
+
+	kvm_vm_free(vm);
+}
+
 int main(int argc, char **argv)
 {
 	TEST_REQUIRE(is_tdx_supported());
--
2.54.0.1032.g2f8565e1d1-goog

^ permalink raw reply related

* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Ackerley Tng @ 2026-06-05 13:58 UTC (permalink / raw)
  To: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
  Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-19-6983ae4c3a4d@google.com>

Lisa Wang <wyihan@google.com> writes:

> From: Sagi Shahar <sagis@google.com>
>
> Finalize TDX VM after creation to make it runnable.
>
> Signed-off-by: Sagi Shahar <sagis@google.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Lisa Wang <wyihan@google.com>
> ---
>  tools/testing/selftests/kvm/lib/x86/processor.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c
> index d84c629a1945..842cac168e99 100644
> --- a/tools/testing/selftests/kvm/lib/x86/processor.c
> +++ b/tools/testing/selftests/kvm/lib/x86/processor.c
> @@ -1479,6 +1479,12 @@ bool kvm_arch_has_default_irqchip(void)
>  	return true;
>  }
>
> +void kvm_arch_vm_finalize_vcpus(struct kvm_vm *vm)
> +{
> +	if (is_tdx_vm(vm))
> +		tdx_vm_finalize(vm);
> +}
> +

This doesn't necessarily block this series, we could (re)move this
later: I'm not sure if kvm_arch_vm_finalize_vcpus() is the correct place
to be finalizing the VM.

Was kvm_arch_vm_finalize_vcpus() supposed to be for finalizing vCPUs
instead?

The awkward part is that kvm_arch_vm_finalize_vcpus() is called from
__vm_create_with_vcpus().

While building this POC to test conversions [1] I only wanted to create
the vm and vcpus and didn't want to finalize yet, since I still needed
to do more mappings in the guest (and I needed the vm pointer to do
mappings in the guest).

Would calling tdx_vm_finalize() from within vcpu_run(), just once, be
too magical?

It's also possible to have some kvm_vm_finalize() call that can be
explicitly and manually invoked from selftests just for CoCo selftests.

[1] https://lore.kernel.org/all/20260605134153.204152-1-ackerleytng@google.com/

>  void setup_smram(struct kvm_vm *vm, struct kvm_vcpu *vcpu, u64 smram_gpa,
>  		 const void *smi_handler, size_t handler_size)
>  {
>
> --
> 2.54.0.746.g67dd491aae-goog

^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Lorenzo Pieralisi @ 2026-06-05 14:35 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
	Will Deacon, James Morse, Oliver Upton, Suzuki K Poulose,
	Zenghui Yu, linux-arm-kernel, linux-kernel, Joey Gouly,
	Alexandru Elisei, Christoffer Dall, Fuad Tabba, linux-coco,
	Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <ecef952b-e8c6-4102-933b-c99c46f14431@redhat.com>

On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
> > On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
> > 
> > [...]
> > 
> > > > +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
> > > > +			 kvm_pfn_t pfn, unsigned long map_size,
> > > > +			 enum kvm_pgtable_prot prot,
> > > > +			 struct kvm_mmu_memory_cache *memcache)
> > > > +{
> > > > +	struct realm *realm = &kvm->arch.realm;
> > > > +
> > > > +	/*
> > > > +	 * Write permission is required for now even though it's possible to
> > > > +	 * map unprotected pages (granules) as read-only. It's impossible to
> > > > +	 * map protected pages (granules) as read-only.
> > > > +	 */
> > > > +	if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
> > > > +		return -EFAULT;
> > > > +
> > > 
> > > I'm a bit concerned with this. We don't have KVM_PGTABLE_PROT_W set in @prot
> > > if the stage2 fault is raised due to memory read. With -EFAULT returned to VMM
> > > (e.g. QEMU), the vCPU continuous execution is stopped and system won't be
> > > working any more.
> > > 
> > > > +	ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
> > > > +	if (!kvm_realm_is_private_address(realm, ipa))
> > > > +		return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
> > > > +					    memcache);
> > > > +
> > > > +	return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
> > > > +}
> > > > +
> > > >    static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
> > > >    {
> > > >    	switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
> > > > @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> > > >    	bool write_fault, exec_fault;
> > > >    	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
> > > >    	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > > > -	struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
> > > > +	struct kvm_vcpu *vcpu = s2fd->vcpu;
> > > > +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > > > +	gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
> > > >    	unsigned long mmu_seq;
> > > >    	struct page *page;
> > > > -	struct kvm *kvm = s2fd->vcpu->kvm;
> > > > +	struct kvm *kvm = vcpu->kvm;
> > > >    	void *memcache;
> > > >    	kvm_pfn_t pfn;
> > > >    	gfn_t gfn;
> > > >    	int ret;
> > > > -	memcache = get_mmu_memcache(s2fd->vcpu);
> > > > -	ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> > > > +	if (kvm_is_realm(vcpu->kvm)) {
> > > > +		/* check for memory attribute mismatch */
> > > > +		bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > > > +		/*
> > > > +		 * For Realms, the shared address is an alias of the private
> > > > +		 * PA with the top bit set. Thus if the fault address matches
> > > > +		 * the GPA then it is the private alias.
> > > > +		 */
> > > > +		bool is_priv_fault = (gpa == s2fd->fault_ipa);
> > > > +
> > > > +		if (is_priv_gfn != is_priv_fault) {
> > > > +			kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
> > > > +						      kvm_is_write_fault(vcpu),
> > > > +						      false,
> > > > +						      is_priv_fault);
> > > > +			/*
> > > > +			 * KVM_EXIT_MEMORY_FAULT requires an return code of
> > > > +			 * -EFAULT, see the API documentation
> > > > +			 */
> > > > +			return -EFAULT;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	memcache = get_mmu_memcache(vcpu);
> > > > +	ret = topup_mmu_memcache(vcpu, memcache);
> > > >    	if (ret)
> > > >    		return ret;
> > > >    	if (s2fd->nested)
> > > >    		gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
> > > >    	else
> > > > -		gfn = s2fd->fault_ipa >> PAGE_SHIFT;
> > > > +		gfn = gpa >> PAGE_SHIFT;
> > > > -	write_fault = kvm_is_write_fault(s2fd->vcpu);
> > > > -	exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
> > > > +	write_fault = kvm_is_write_fault(vcpu);
> > > > +	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > > >    	VM_WARN_ON_ONCE(write_fault && exec_fault);
> > > > @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> > > >    	ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
> > > >    	if (ret) {
> > > > -		kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE,
> > > > +		kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
> > > >    					      write_fault, exec_fault, false);
> > > >    		return ret;
> > > >    	}
> > > > @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> > > >    	kvm_fault_lock(kvm);
> > > >    	if (mmu_invalidate_retry(kvm, mmu_seq)) {
> > > >    		ret = -EAGAIN;
> > > > -		goto out_unlock;
> > > > +		goto out_release_page;
> > > > +	}
> > > > +
> > > > +	if (kvm_is_realm(kvm)) {
> > > > +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
> > > > +				    PAGE_SIZE, KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W, memcache);
> > > > +		goto out_release_page;
> > > >    	}
> > > >    	ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
> > > >    						 __pfn_to_phys(pfn), prot,
> > > >    						 memcache, flags);
> > > > -out_unlock:
> > > > +out_release_page:
> > > >    	kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
> > > >    	kvm_fault_unlock(kvm);
> > > > @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd,
> > > >    	 * mapping size to ensure we find the right PFN and lay down the
> > > >    	 * mapping in the right place.
> > > >    	 */
> > > > -	s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT;
> > > > +	s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize)) >> PAGE_SHIFT;
> > > >    	s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
> > > > @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
> > > >    		prot &= ~KVM_NV_GUEST_MAP_SZ;
> > > >    		ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
> > > >    								 prot, flags);
> > > > +	} else if (kvm_is_realm(kvm)) {
> > > > +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
> > > > +				    prot, memcache);
> > > >    	} else {
> > > >    		ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
> > > >    							 __pfn_to_phys(pfn), prot,
> > > 
> > > For the case kvm_is_realm(), need we adjust 's2fd->fault_ipa' for the sake of
> > > huge pages. In kvm_s2_fault_map(), @gfn and @pfn may have been adjusted by
> > > transparent_hugepage_adjust() to be aligned with huge page size. If the
> > > adjustment happened in transparent_hugepage_adjust(), we need to align
> > > s2fd->fault_ipa down to the huge page size either.
> > 
> > All of the above + some RMM changes are needed to get QEmu VMM going
> > with anon pages guest memory backing - currently testing various
> > configurations in the background.
> > 
> 
> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and found
> that memory slots backed by THP are broken. With THP disabled on the host and
> other fixes (mentioned in my prevous replies) applied on the top of this (v14)
> series, I'm able to boot a realm guest with rebased QEMU series [2], plus more
> fxies on the top.
> 
> [1] https://git.codelinaro.org/linaro/dcap/qemu.git  (branch: cca/latest)
> [2] https://git.qemu.org/git/qemu.git                (branch: cca/gavin)
> 
> Lorenzo, You may be saying there is someone making QEMU to support ARM/CCA?

Mathieu and I are working on that yes and with Steven/Suzuki to fix the THP
issues you pointed out above.

> If so, I'm not sure if there is a QEMU repository for me to try?

We should be able to submit patches by end of June - we shall let you know
whether we can make something available earlier.

Thanks,
Lorenzo

> 
> Thanks,
> Gavin
> 
> > Thanks,
> > Lorenzo
> > 
> > > > @@ -2214,6 +2285,13 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > > >    	return 0;
> > > >    }
> > > > +static bool shared_ipa_fault(struct kvm *kvm, phys_addr_t fault_ipa)
> > > > +{
> > > > +	gpa_t gpa = kvm_gpa_from_fault(kvm, fault_ipa);
> > > > +
> > > > +	return (gpa != fault_ipa);
> > > > +}
> > > > +
> > > >    /**
> > > >     * kvm_handle_guest_abort - handles all 2nd stage aborts
> > > >     * @vcpu:	the VCPU pointer
> > > > @@ -2324,8 +2402,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > > >    		nested = &nested_trans;
> > > >    	}
> > > > -	gfn = ipa >> PAGE_SHIFT;
> > > > +	gfn = kvm_gpa_from_fault(vcpu->kvm, ipa) >> PAGE_SHIFT;
> > > >    	memslot = gfn_to_memslot(vcpu->kvm, gfn);
> > > > +
> > > >    	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
> > > >    	write_fault = kvm_is_write_fault(vcpu);
> > > >    	if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
> > > > @@ -2368,7 +2447,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > > >    		 * of the page size.
> > > >    		 */
> > > >    		ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(vcpu));
> > > > -		ret = io_mem_abort(vcpu, ipa);
> > > > +		ret = io_mem_abort(vcpu, kvm_gpa_from_fault(vcpu->kvm, ipa));
> > > >    		goto out_unlock;
> > > >    	}
> > > > @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > > >    				!write_fault &&
> > > >    				!kvm_vcpu_trap_is_exec_fault(vcpu));
> > > > -		if (kvm_slot_has_gmem(memslot))
> > > > +		if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
> > > >    			ret = gmem_abort(&s2fd);
> > > >    		else
> > > >    			ret = user_mem_abort(&s2fd);
> > > > @@ -2433,6 +2512,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > >    	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
> > > >    		return false;
> > > > +	/* We don't support aging for Realms */
> > > > +	if (kvm_is_realm(kvm))
> > > > +		return true;
> > > > +
> > > >    	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
> > > >    						   range->start << PAGE_SHIFT,
> > > >    						   size, true);
> > > > @@ -2449,6 +2532,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > >    	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
> > > >    		return false;
> > > > +	/* We don't support aging for Realms */
> > > > +	if (kvm_is_realm(kvm))
> > > > +		return true;
> > > > +
> > > >    	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
> > > >    						   range->start << PAGE_SHIFT,
> > > >    						   size, false);
> > > > @@ -2628,10 +2715,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > > >    		return -EFAULT;
> > > >    	/*
> > > > -	 * Only support guest_memfd backed memslots with mappable memory, since
> > > > -	 * there aren't any CoCo VMs that support only private memory on arm64.
> > > > +	 * Only support guest_memfd backed memslots with mappable memory,
> > > > +	 * unless the guest is a CCA realm guest.
> > > >    	 */
> > > > -	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
> > > > +	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new) &&
> > > > +	    !kvm_is_realm(kvm))
> > > >    		return -EINVAL;
> > > >    	hva = new->userspace_addr;
> > > > diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> > > > index cae29fd3353c..761b38a4071c 100644
> > > > --- a/arch/arm64/kvm/rmi.c
> > > > +++ b/arch/arm64/kvm/rmi.c
> > > > @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
> > > >    	return ret;
> > > >    }
> > > > +static unsigned long addr_range_desc(unsigned long phys, unsigned long size)
> > > > +{
> > > > +	unsigned long out = 0;
> > > > +
> > > > +	switch (size) {
> > > > +	case P4D_SIZE:
> > > > +		out = 3 | (1 << 2);
> > > > +		break;
> > > > +	case PUD_SIZE:
> > > > +		out = 2 | (1 << 2);
> > > > +		break;
> > > > +	case PMD_SIZE:
> > > > +		out = 1 | (1 << 2);
> > > > +		break;
> > > > +	case PAGE_SIZE:
> > > > +		out = 0 | (1 << 2);
> > > > +		break;
> > > > +	default:
> > > > +		/*
> > > > +		 * Only support mapping at the page level granulatity when
> > > > +		 * it's an unusual length. This should get us back onto a larger
> > > > +		 * block size for the subsequent mappings.
> > > > +		 */
> > > > +		out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
> > > > +		break;
> > > > +	}
> > > > +
> > > > +	WARN_ON(phys & ~PAGE_MASK);
> > > > +
> > > > +	out |= phys & PAGE_MASK;
> > > > +
> > > > +	return out;
> > > > +}
> > > > +
> > > > +int realm_map_protected(struct kvm *kvm,
> > > > +			unsigned long ipa,
> > > > +			kvm_pfn_t pfn,
> > > > +			unsigned long map_size,
> > > > +			struct kvm_mmu_memory_cache *memcache)
> > > > +{
> > > > +	struct realm *realm = &kvm->arch.realm;
> > > > +	phys_addr_t phys = __pfn_to_phys(pfn);
> > > > +	phys_addr_t base_phys = phys;
> > > > +	phys_addr_t rd = virt_to_phys(realm->rd);
> > > > +	unsigned long base_ipa = ipa;
> > > > +	unsigned long ipa_top = ipa + map_size;
> > > > +	int ret = 0;
> > > > +
> > > > +	if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
> > > > +		    !IS_ALIGNED(ipa, map_size)))
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (rmi_delegate_range(phys, map_size)) {
> > > > +		/*
> > > > +		 * It's likely we raced with another VCPU on the same
> > > > +		 * fault. Assume the other VCPU has handled the fault
> > > > +		 * and return to the guest.
> > > > +		 */
> > > > +		return 0;
> > > > +	}
> > > > +
> > > > +	while (ipa < ipa_top) {
> > > > +		unsigned long flags = RMI_ADDR_TYPE_SINGLE;
> > > > +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
> > > > +		unsigned long out_top;
> > > > +
> > > > +		ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
> > > > +				       &out_top);
> > > > +
> > > > +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> > > > +			/* Create missing RTTs and retry */
> > > > +			int level = RMI_RETURN_INDEX(ret);
> > > > +
> > > > +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> > > > +			ret = realm_create_rtt_levels(realm, ipa, level,
> > > > +						      KVM_PGTABLE_LAST_LEVEL,
> > > > +						      memcache);
> > > > +			if (ret)
> > > > +				goto err_undelegate;
> > > > +
> > > > +			ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
> > > > +					       range_desc, &out_top);
> > > > +		}
> > > > +
> > > > +		if (WARN_ON(ret))
> > > > +			goto err_undelegate;
> > > > +
> > > > +		phys += out_top - ipa;
> > > > +		ipa = out_top;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +
> > > > +err_undelegate:
> > > > +	realm_unmap_private_range(kvm, base_ipa, ipa, true);
> > > > +	if (WARN_ON(rmi_undelegate_range(base_phys, map_size))) {
> > > > +		/* Page can't be returned to NS world so is lost */
> > > > +		get_page(phys_to_page(base_phys));
> > > > +	}
> > > > +	return -ENXIO;
> > > > +}
> > > > +
> > > > +int realm_map_non_secure(struct realm *realm,
> > > > +			 unsigned long ipa,
> > > > +			 kvm_pfn_t pfn,
> > > > +			 unsigned long size,
> > > > +			 enum kvm_pgtable_prot prot,
> > > > +			 struct kvm_mmu_memory_cache *memcache)
> > > > +{
> > > > +	unsigned long attr, flags = 0;
> > > > +	phys_addr_t rd = virt_to_phys(realm->rd);
> > > > +	phys_addr_t phys = __pfn_to_phys(pfn);
> > > > +	unsigned long ipa_top = ipa + size;
> > > > +	int ret;
> > > > +
> > > > +	if (WARN_ON(!IS_ALIGNED(size, PAGE_SIZE) ||
> > > > +		    !IS_ALIGNED(ipa, size)))
> > > > +		return -EINVAL;
> > > > +
> > > > +	switch (prot & (KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC)) {
> > > > +	case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
> > > > +		return -EINVAL;
> > > > +	case KVM_PGTABLE_PROT_DEVICE:
> > > > +		attr = MT_S2_FWB_DEVICE_nGnRE;
> > > > +		break;
> > > > +	case KVM_PGTABLE_PROT_NORMAL_NC:
> > > > +		attr = MT_S2_FWB_NORMAL_NC;
> > > > +		break;
> > > > +	default:
> > > > +		attr = MT_S2_FWB_NORMAL;
> > > > +	}
> > > > +
> > > > +	flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_MEMATTR, attr);
> > > > +
> > > > +	if (prot & KVM_PGTABLE_PROT_R)
> > > > +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_READ);
> > > > +	if (prot & KVM_PGTABLE_PROT_W)
> > > > +		flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_WRITE);
> > > > +
> > > > +	flags |= RMI_ADDR_TYPE_SINGLE;
> > > > +
> > > > +	while (ipa < ipa_top) {
> > > > +		unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
> > > > +		unsigned long out_top;
> > > > +
> > > > +		ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags, range_desc,
> > > > +					 &out_top);
> > > > +
> > > > +		if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> > > > +			/* Create missing RTTs and retry */
> > > > +			int level = RMI_RETURN_INDEX(ret);
> > > > +
> > > > +			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> > > > +			ret = realm_create_rtt_levels(realm, ipa, level,
> > > > +						      KVM_PGTABLE_LAST_LEVEL,
> > > > +						      memcache);
> > > > +			if (ret)
> > > > +				return ret;
> > > > +
> > > > +			ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags,
> > > > +						 range_desc, &out_top);
> > > > +		}
> > > > +
> > > > +		if (WARN_ON(ret))
> > > > +			return ret;
> > > > +
> > > > +		phys += out_top - ipa;
> > > > +		ipa = out_top;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > >    static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > >    			      struct page *src_page, void *opaque)
> > > >    {
> > > 
> > > Thanks,
> > > Gavin
> > > 
> > 
> 

^ permalink raw reply

* Re: [PATCH v14 17/44] arm64: RMI: RTT tear down
From: Steven Price @ 2026-06-05 15:01 UTC (permalink / raw)
  To: Wei-Lin Chang, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, Lorenzo.Pieralisi2
In-Reply-To: <7egwow26r6sbbtm53mujbhpwyts2utzv2ddth7554kqwbk7k7d@iptjpvvbsc2n>

On 26/05/2026 23:27, Wei-Lin Chang wrote:
> Hi,
> 
> On Wed, May 13, 2026 at 02:17:25PM +0100, Steven Price wrote:
>> The RMM owns the stage 2 page tables for a realm, and KVM must request
>> that the RMM creates/destroys entries as necessary. The physical pages
>> to store the page tables are delegated to the realm as required, and can
>> be undelegated when no longer used.
>>
>> Creating new RTTs is the easy part, tearing down is a little more
>> tricky. The result of realm_rtt_destroy() can be used to effectively
>> walk the tree and destroy the entries (undelegating pages that were
>> given to the realm).
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>  * Avoid the double call of kvm_free_stage2_pgd() by splitting the work
>>    across that and a new function kvm_realm_uninit_stage2() which is
>>    only called for realm guests.
>> Changes since v12:
>>  * Simplify some functions now we know RMM page size is the same as the
>>    host's.
>> Changes since v11:
>>  * Moved some code from earlier in the series to this one so that it's
>>    added when it's first used.
>> Changes since v10:
>>  * RME->RMI rename.
>>  * Some code to handle freeing stage 2 PGD moved into this patch where
>>    it belongs.
>> Changes since v9:
>>  * Add a comment clarifying that root level RTTs are not destroyed until
>>    after the RD is destroyed.
>> Changes since v8:
>>  * Introduce free_rtt() wrapper which calls free_delegated_granule()
>>    followed by kvm_account_pgtable_pages(). This makes it clear where an
>>    RTT is being freed rather than just a delegated granule.
>> Changes since v6:
>>  * Move rme_rtt_level_mapsize() and supporting defines from kvm_rme.h
>>    into rme.c as they are only used in that file.
>> Changes since v5:
>>  * Rename some RME_xxx defines to do with page sizes as RMM_xxx - they are
>>    a property of the RMM specification not the RME architecture.
>> Changes since v2:
>>  * Moved {alloc,free}_delegated_page() and ensure_spare_page() to a
>>    later patch when they are actually used.
>>  * Some simplifications now rmi_xxx() functions allow NULL as an output
>>    parameter.
>>  * Improved comments and code layout.
>> ---
>>  arch/arm64/include/asm/kvm_rmi.h |   7 ++
>>  arch/arm64/kvm/mmu.c             |  21 ++++-
>>  arch/arm64/kvm/rmi.c             | 148 +++++++++++++++++++++++++++++++
>>  3 files changed, 174 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>> index 9de34983ee52..06ba0d4745c6 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -64,5 +64,12 @@ u32 kvm_realm_ipa_limit(void);
>>  
>>  int kvm_init_realm(struct kvm *kvm);
>>  void kvm_destroy_realm(struct kvm *kvm);
>> +void kvm_realm_destroy_rtts(struct kvm *kvm);
>> +
>> +static inline bool kvm_realm_is_private_address(struct realm *realm,
>> +						unsigned long addr)
>> +{
>> +	return !(addr & BIT(realm->ia_bits - 1));
>> +}
>>  
>>  #endif /* __ASM_KVM_RMI_H */
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index ba8286472286..eb56d4e7f21a 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1024,9 +1024,26 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>>  	return err;
>>  }
>>  
>> +static void kvm_realm_uninit_stage2(struct kvm_s2_mmu *mmu)
>> +{
>> +	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>> +	struct realm *realm = &kvm->arch.realm;
>> +
>> +	if (kvm_realm_state(kvm) != REALM_STATE_ACTIVE)
>> +		return;
>> +
>> +	write_lock(&kvm->mmu_lock);
>> +	kvm_stage2_unmap_range(mmu, 0, BIT(realm->ia_bits - 1), true);
>> +	write_unlock(&kvm->mmu_lock);
>> +	kvm_realm_destroy_rtts(kvm);
>> +}
>> +
>>  void kvm_uninit_stage2_mmu(struct kvm *kvm)
>>  {
>> -	kvm_free_stage2_pgd(&kvm->arch.mmu);
>> +	if (kvm_is_realm(kvm))
>> +		kvm_realm_uninit_stage2(&kvm->arch.mmu);
>> +	else
>> +		kvm_free_stage2_pgd(&kvm->arch.mmu);
>>  	kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
>>  }
>>  
>> @@ -1103,7 +1120,7 @@ void stage2_unmap_vm(struct kvm *kvm)
>>  void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>>  {
>>  	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>> -	struct kvm_pgtable *pgt = NULL;
>> +	struct kvm_pgtable *pgt;
> 
> Is this included by accident?

Thanks for spotting that. Yes that change shouldn't have sneaked in
here. The original code before this series had the redundant assignment
to NULL. But it's unrelated to this patch so I'll drop the change.

Thanks,
Steve

> 
>>  
>>  	write_lock(&kvm->mmu_lock);
>>  	pgt = mmu->pgt;
> 
> [...]
> 
> Thanks,
> Wei-Lin Chang


^ permalink raw reply

* Re: [PATCH v14 17/44] arm64: RMI: RTT tear down
From: Steven Price @ 2026-06-05 15:01 UTC (permalink / raw)
  To: Wei-Lin Chang, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, Lorenzo.Pieralisi2
In-Reply-To: <mpesc2j3czpunbg3pvgwbotvfn7vahaabvoiu77vd2g5uervho@255lwycekmxh>

On 26/05/2026 23:32, Wei-Lin Chang wrote:
> Hi,
> 
> On Wed, May 13, 2026 at 02:17:25PM +0100, Steven Price wrote:
>> The RMM owns the stage 2 page tables for a realm, and KVM must request
>> that the RMM creates/destroys entries as necessary. The physical pages
>> to store the page tables are delegated to the realm as required, and can
>> be undelegated when no longer used.
>>
>> Creating new RTTs is the easy part, tearing down is a little more
>> tricky. The result of realm_rtt_destroy() can be used to effectively
>> walk the tree and destroy the entries (undelegating pages that were
>> given to the realm).
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>  * Avoid the double call of kvm_free_stage2_pgd() by splitting the work
>>    across that and a new function kvm_realm_uninit_stage2() which is
>>    only called for realm guests.
>> Changes since v12:
>>  * Simplify some functions now we know RMM page size is the same as the
>>    host's.
>> Changes since v11:
>>  * Moved some code from earlier in the series to this one so that it's
>>    added when it's first used.
>> Changes since v10:
>>  * RME->RMI rename.
>>  * Some code to handle freeing stage 2 PGD moved into this patch where
>>    it belongs.
>> Changes since v9:
>>  * Add a comment clarifying that root level RTTs are not destroyed until
>>    after the RD is destroyed.
>> Changes since v8:
>>  * Introduce free_rtt() wrapper which calls free_delegated_granule()
>>    followed by kvm_account_pgtable_pages(). This makes it clear where an
>>    RTT is being freed rather than just a delegated granule.
>> Changes since v6:
>>  * Move rme_rtt_level_mapsize() and supporting defines from kvm_rme.h
>>    into rme.c as they are only used in that file.
>> Changes since v5:
>>  * Rename some RME_xxx defines to do with page sizes as RMM_xxx - they are
>>    a property of the RMM specification not the RME architecture.
>> Changes since v2:
>>  * Moved {alloc,free}_delegated_page() and ensure_spare_page() to a
>>    later patch when they are actually used.
>>  * Some simplifications now rmi_xxx() functions allow NULL as an output
>>    parameter.
>>  * Improved comments and code layout.
>> ---
>>  arch/arm64/include/asm/kvm_rmi.h |   7 ++
>>  arch/arm64/kvm/mmu.c             |  21 ++++-
>>  arch/arm64/kvm/rmi.c             | 148 +++++++++++++++++++++++++++++++
>>  3 files changed, 174 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>> index 9de34983ee52..06ba0d4745c6 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -64,5 +64,12 @@ u32 kvm_realm_ipa_limit(void);
>>  
>>  int kvm_init_realm(struct kvm *kvm);
>>  void kvm_destroy_realm(struct kvm *kvm);
>> +void kvm_realm_destroy_rtts(struct kvm *kvm);
>> +
>> +static inline bool kvm_realm_is_private_address(struct realm *realm,
>> +						unsigned long addr)
>> +{
>> +	return !(addr & BIT(realm->ia_bits - 1));
>> +}
>>  
>>  #endif /* __ASM_KVM_RMI_H */
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index ba8286472286..eb56d4e7f21a 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1024,9 +1024,26 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>>  	return err;
>>  }
>>  
>> +static void kvm_realm_uninit_stage2(struct kvm_s2_mmu *mmu)
>> +{
>> +	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>> +	struct realm *realm = &kvm->arch.realm;
>> +
>> +	if (kvm_realm_state(kvm) != REALM_STATE_ACTIVE)
>> +		return;
>> +
>> +	write_lock(&kvm->mmu_lock);
>> +	kvm_stage2_unmap_range(mmu, 0, BIT(realm->ia_bits - 1), true);
>> +	write_unlock(&kvm->mmu_lock);
>> +	kvm_realm_destroy_rtts(kvm);
>> +}
>> +
>>  void kvm_uninit_stage2_mmu(struct kvm *kvm)
>>  {
>> -	kvm_free_stage2_pgd(&kvm->arch.mmu);
>> +	if (kvm_is_realm(kvm))
>> +		kvm_realm_uninit_stage2(&kvm->arch.mmu);
>> +	else
>> +		kvm_free_stage2_pgd(&kvm->arch.mmu);
>>  	kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
>>  }
>>  
>> @@ -1103,7 +1120,7 @@ void stage2_unmap_vm(struct kvm *kvm)
>>  void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>>  {
>>  	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>> -	struct kvm_pgtable *pgt = NULL;
>> +	struct kvm_pgtable *pgt;
>>  
>>  	write_lock(&kvm->mmu_lock);
>>  	pgt = mmu->pgt;
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index f51ec667445e..5b00ccca4af3 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -11,6 +11,14 @@
>>  #include <asm/rmi_cmds.h>
>>  #include <asm/virt.h>
>>  
>> +static inline unsigned long rmi_rtt_level_mapsize(int level)
>> +{
>> +	if (WARN_ON(level > KVM_PGTABLE_LAST_LEVEL))
>> +		return PAGE_SIZE;
>> +
>> +	return (1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(level));
>> +}
>> +
>>  static bool rmi_has_feature(unsigned long feature)
>>  {
>>  	return !!u64_get_bits(rmm_feat_reg0, feature);
>> @@ -21,6 +29,144 @@ u32 kvm_realm_ipa_limit(void)
>>  	return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
>>  }
>>  
>> +static int get_start_level(struct realm *realm)
>> +{
>> +	return 4 - stage2_pgtable_levels(realm->ia_bits);
>> +}
>> +
>> +static void free_rtt(phys_addr_t phys)
>> +{
>> +	if (free_delegated_page(phys))
>> +		return;
>> +
>> +	kvm_account_pgtable_pages(phys_to_virt(phys), -1);
>> +}
>> +
>> +/*
>> + * realm_rtt_destroy - Destroy an RTT at @level for @addr.
>> + *
>> + * Returns - Result of the RMI_RTT_DESTROY call, and:
>> + * @rtt_granule:	RTT granule, if the RTT was destroyed.
>> + * @next_addr:		IPA corresponding to the next possible valid entry we
>> + *			can target
>> + */
>> +static int realm_rtt_destroy(struct realm *realm, unsigned long addr,
>> +			     int level, phys_addr_t *rtt_granule,
>> +			     unsigned long *next_addr)
>> +{
>> +	unsigned long out_rtt;
>> +	int ret;
>> +
>> +	ret = rmi_rtt_destroy(virt_to_phys(realm->rd), addr, level,
>> +			      &out_rtt, next_addr);
>> +
>> +	*rtt_granule = out_rtt;
>> +
>> +	return ret;
>> +}
> 
> Looks like out_rtt can be simplified out.

The issue here is there's a type conversion going on. rmi_rtt_destroy()
takes an "unsigned long *" to match the general approach of using
"unsigned long" for the inputs/outputs of SMCCC calls. But rtt_granule
is a "phys_addr_t". While we know these are (currently) the same size,
they are not the same type according to the compiler - phys_addr_t is
"long long unsigned int".

Thanks,
Steve

> [...]
> 
> Thanks,
> Wei-Lin Chang


^ permalink raw reply

* Re: [PATCH v14 19/44] arm64: RMI: Allocate/free RECs to match vCPUs
From: Steven Price @ 2026-06-05 15:02 UTC (permalink / raw)
  To: Wei-Lin Chang, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, Lorenzo.Pieralisi2
In-Reply-To: <2uvtjhncf57yek5i4fupdefunukmidzw452mcavnmixpr5u3qd@uoaktzpak3nl>

On 26/05/2026 23:39, Wei-Lin Chang wrote:
> Hi,
> 
> On Wed, May 13, 2026 at 02:17:27PM +0100, Steven Price wrote:
>> The RMM maintains a data structure known as the Realm Execution Context
>> (or REC). It is similar to struct kvm_vcpu and tracks the state of the
>> virtual CPUs. KVM must delegate memory and request the structures are
>> created when vCPUs are created, and suitably tear down on destruction.
>>
>> RECs may require additional pages (e.g. for storing larger register
>> state for SVE). The RMM can request extra pages for this purpose using
>> the Stateful RMI Operations (SRO) functionality to request pages during
>> REC creation. These pages are then passed back to the host from the RMM
>> ('reclaimed') when the REC is destroyed. The kernel tracking object
>> (struct rmi_sro_state) is stored in the realm_rec structure to avoid
>> memory allocation during the destruction path.
>>
>> Note that only some of register state for the REC can be set by KVM, the
>> rest is defined by the RMM (zeroed). The register state then cannot be
>> changed by KVM after the REC is created (except when the guest
>> explicitly requests this e.g. by performing a PSCI call).
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>  * Support SRO for REC creation/destruction instead of auxiliary
>>    granules.
>> Changes since v12:
>>  * Use the new range-based delegation RMI.
>> Changes since v11:
>>  * Remove the KVM_ARM_VCPU_REC feature. User space no longer needs to
>>    configure each VCPU separately, RECs are created on the first VCPU
>>    run of the guest.
>> Changes since v9:
>>  * Size the aux_pages array according to the PAGE_SIZE of the host.
>> Changes since v7:
>>  * Add comment explaining the aux_pages array.
>>  * Rename "undeleted_failed" variable to "should_free" to avoid a
>>    confusing double negative.
>> Changes since v6:
>>  * Avoid reporting the KVM_ARM_VCPU_REC feature if the guest isn't a
>>    realm guest.
>>  * Support host page size being larger than RMM's granule size when
>>    allocating/freeing aux granules.
>> Changes since v5:
>>  * Separate the concept of vcpu_is_rec() and
>>    kvm_arm_vcpu_rec_finalized() by using the KVM_ARM_VCPU_REC feature as
>>    the indication that the VCPU is a REC.
>> Changes since v2:
>>  * Free rec->run earlier in kvm_destroy_realm() and adapt to previous patches.
>> ---
>>  arch/arm64/include/asm/kvm_emulate.h |   2 +-
>>  arch/arm64/include/asm/kvm_host.h    |   3 +
>>  arch/arm64/include/asm/kvm_rmi.h     |  17 +++++
>>  arch/arm64/kvm/arm.c                 |   6 ++
>>  arch/arm64/kvm/reset.c               |   1 +
>>  arch/arm64/kvm/rmi.c                 | 105 +++++++++++++++++++++++++++
>>  6 files changed, 133 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 82fd777bd9bb..2e69fe494716 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -714,7 +714,7 @@ static inline bool kvm_realm_is_created(struct kvm *kvm)
>>  
>>  static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
>>  {
>> -	return false;
>> +	return kvm_is_realm(vcpu->kvm);
>>  }
>>  
>>  #endif /* __ARM64_KVM_EMULATE_H__ */
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 3512696ed506..39b5de03d0fe 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -969,6 +969,9 @@ struct kvm_vcpu_arch {
>>  
>>  	/* Hyp-readable copy of kvm_vcpu::pid */
>>  	pid_t pid;
>> +
>> +	/* Realm meta data */
>> +	struct realm_rec rec;
>>  };
>>  
>>  /*
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>> index 8bd743093ccf..d99bf4fc3c39 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -59,6 +59,22 @@ struct realm {
>>  	unsigned int ia_bits;
>>  };
>>  
>> +/**
>> + * struct realm_rec - Additional per VCPU data for a Realm
>> + *
>> + * @mpidr: MPIDR (Multiprocessor Affinity Register) value to identify this VCPU
>> + * @rec_page: Kernel VA of the RMM's private page for this REC
>> + * @aux_pages: Additional pages private to the RMM for this REC
>> + * @run: Kernel VA of the RmiRecRun structure shared with the RMM
>> + * @sro: A preallocated SRO state context
>> + */
>> +struct realm_rec {
>> +	unsigned long mpidr;
>> +	void *rec_page;
>> +	struct rec_run *run;
>> +	struct rmi_sro_state *sro;
>> +};
>> +
>>  void kvm_init_rmi(void);
>>  u32 kvm_realm_ipa_limit(void);
>>  
>> @@ -66,6 +82,7 @@ int kvm_init_realm(struct kvm *kvm);
>>  int kvm_activate_realm(struct kvm *kvm);
>>  void kvm_destroy_realm(struct kvm *kvm);
>>  void kvm_realm_destroy_rtts(struct kvm *kvm);
>> +void kvm_destroy_rec(struct kvm_vcpu *vcpu);
>>  
>>  static inline bool kvm_realm_is_private_address(struct realm *realm,
>>  						unsigned long addr)
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index eb2b61fe1f0a..93d34762db91 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -586,6 +586,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>>  	/* Force users to call KVM_ARM_VCPU_INIT */
>>  	vcpu_clear_flag(vcpu, VCPU_INITIALIZED);
>>  
>> +	vcpu->arch.rec.mpidr = INVALID_HWID;
>> +
>>  	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>>  
>>  	/* Set up the timer */
>> @@ -1651,6 +1653,10 @@ static int kvm_vcpu_init_check_features(struct kvm_vcpu *vcpu,
>>  	if (test_bit(KVM_ARM_VCPU_HAS_EL2, &features))
>>  		return -EINVAL;
>>  
>> +	/* Realms are incompatible with AArch32 */
>> +	if (vcpu_is_rec(vcpu))
>> +		return -EINVAL;
>> +
>>  	return 0;
>>  }
>>  
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index b963fd975aac..c18cdca7d125 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -161,6 +161,7 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
>>  	free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
>>  	kfree(vcpu->arch.vncr_tlb);
>>  	kfree(vcpu->arch.ccsidr);
>> +	kvm_destroy_rec(vcpu);
>>  }
>>  
>>  static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index 849111817af7..353a5ca45e78 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -173,9 +173,108 @@ static int realm_ensure_created(struct kvm *kvm)
>>  	return -ENXIO;
>>  }
>>  
>> +static int kvm_create_rec(struct kvm_vcpu *vcpu)
>> +{
>> +	struct user_pt_regs *vcpu_regs = vcpu_gp_regs(vcpu);
>> +	unsigned long mpidr = kvm_vcpu_get_mpidr_aff(vcpu);
>> +	struct realm *realm = &vcpu->kvm->arch.realm;
>> +	struct realm_rec *rec = &vcpu->arch.rec;
>> +	unsigned long rec_page_phys;
>> +	struct rec_params *params;
>> +	int r, i;
>> +
>> +	if (rec->run)
>> +		return -EBUSY;
>> +
>> +	/*
>> +	 * The RMM will report PSCI v1.0 to Realms and the KVM_ARM_VCPU_PSCI_0_2
>> +	 * flag covers v0.2 and onwards.
>> +	 */
>> +	if (!vcpu_has_feature(vcpu, KVM_ARM_VCPU_PSCI_0_2))
>> +		return -EINVAL;
>> +
>> +	BUILD_BUG_ON(sizeof(*params) > PAGE_SIZE);
>> +	BUILD_BUG_ON(sizeof(*rec->run) > PAGE_SIZE);
>> +
>> +	params = (struct rec_params *)get_zeroed_page(GFP_KERNEL);
>> +	rec->rec_page = (void *)__get_free_page(GFP_KERNEL);
>> +	rec->run = (void *)get_zeroed_page(GFP_KERNEL);
> 
> Should this be cast to (struct rec_run *) ?

Yes it probably should - I'll update. Although IMHO get_zeroed_page()
should really return void * - but I know that would be a contentious change.

Thanks,
Steve

>> +	rec->sro = kmalloc_obj(*rec->sro);
>> +	if (!params || !rec->rec_page || !rec->run || !rec->sro) {
>> +		r = -ENOMEM;
>> +		goto out_free_pages;
>> +	}
>> +
>> +	for (i = 0; i < ARRAY_SIZE(params->gprs); i++)
>> +		params->gprs[i] = vcpu_regs->regs[i];
>> +
>> +	params->pc = vcpu_regs->pc;
>> +
>> +	if (vcpu->vcpu_id == 0)
>> +		params->flags |= REC_PARAMS_FLAG_RUNNABLE;
>> +
>> +	rec_page_phys = virt_to_phys(rec->rec_page);
>> +
>> +	if (rmi_delegate_page(rec_page_phys)) {
>> +		r = -ENXIO;
>> +		goto out_free_pages;
>> +	}
>> +
>> +	params->mpidr = mpidr;
>> +
>> +	if (rmi_rec_create(virt_to_phys(realm->rd), rec_page_phys,
>> +			   virt_to_phys(params), rec->sro)) {
>> +		r = -ENXIO;
>> +		goto out_undelegate_rmm_rec;
>> +	}
>> +
>> +	rec->mpidr = mpidr;
>> +
>> +	free_page((unsigned long)params);
>> +	return 0;
>> +
>> +out_undelegate_rmm_rec:
>> +	if (WARN_ON(rmi_undelegate_page(rec_page_phys)))
>> +		rec->rec_page = NULL;
>> +out_free_pages:
>> +	free_page((unsigned long)rec->run);
>> +	free_page((unsigned long)rec->rec_page);
>> +	free_page((unsigned long)params);
>> +	kfree(rec->sro);
>> +	rec->run = NULL;
>> +	return r;
>> +}
>> +
> 
> [...]
> 
> Thanks,
> Wei-Lin Chang


^ permalink raw reply

* Re: [PATCH v14 20/44] arm64: RMI: Support for the VGIC in realms
From: Steven Price @ 2026-06-05 15:02 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <5ea74b6a-a51a-415e-b53f-5ece9829dee8@redhat.com>

On 28/05/2026 05:07, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> The RMM provides emulation of a VGIC to the realm guest. With RMM v2.0
>> the registers are passed in the system registers so this works similar
>> to a normal guest, but kvm_arch_vcpu_put() need reordering to early out,
>> and realm guests don't support GICv2 even if the host does.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes from v12:
>>   * GIC registers are now passed in the system registers rather than via
>>     rec_entry/rec_exit which removes most of the changes.
>> Changes from v11:
>>   * Minor changes to align with the previous patches. Note that the VGIC
>>     handling will change with RMM v2.0.
>> Changes from v10:
>>   * Make sure we sync the VGIC v4 state, and only populate valid lrs from
>>     the list.
>> Changes from v9:
>>   * Copy gicv3_vmcr from the RMM at the same time as gicv3_hcr rather
>>     than having to handle that as a special case.
>> Changes from v8:
>>   * Propagate gicv3_hcr to from the RMM.
>> Changes from v5:
>>   * Handle RMM providing fewer GIC LRs than the hardware supports.
>> ---
>>   arch/arm64/kvm/arm.c            | 11 ++++++++---
>>   arch/arm64/kvm/vgic/vgic-init.c |  2 +-
>>   2 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 93d34762db91..21d9dfdb1ea0 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -786,19 +786,24 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>           kvm_call_hyp_nvhe(__pkvm_vcpu_put);
>>       }
>>   +    kvm_timer_vcpu_put(vcpu);
>> +    kvm_vgic_put(vcpu);
>> +
>> +    vcpu->cpu = -1;
>> +
>> +    if (vcpu_is_rec(vcpu))
>> +        return;
>> +
> 
> For a REC, kvm_vcpu_{load, put}_debug() becomes unbalanced in
> kvm_arch_vcpu_{load, put}().
> kvm_vcpu_load_debug() is called in kvm_arch_vcpu_load(), but
> kvm_vcpu_put_debug() won't
> be called in kvm_arch_vcpu_put() after this whole series is applied.

Good catch. Yes that's not quite right.

Thanks,
Steve

>>       kvm_vcpu_put_debug(vcpu);
>>       kvm_arch_vcpu_put_fp(vcpu);
>>       if (has_vhe())
>>           kvm_vcpu_put_vhe(vcpu);
>> -    kvm_timer_vcpu_put(vcpu);
>> -    kvm_vgic_put(vcpu);
>>       kvm_vcpu_pmu_restore_host(vcpu);
>>       if (vcpu_has_nv(vcpu))
>>           kvm_vcpu_put_hw_mmu(vcpu);
>>       kvm_arm_vmid_clear_active();
>>         vcpu_clear_on_unsupported_cpu(vcpu);
>> -    vcpu->cpu = -1;
>>   }
>>     static void __kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu)
>> diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/
>> vgic-init.c
>> index 933983bb2005..a9db963dfd23 100644
>> --- a/arch/arm64/kvm/vgic/vgic-init.c
>> +++ b/arch/arm64/kvm/vgic/vgic-init.c
>> @@ -81,7 +81,7 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
>>        * the proper checks already.
>>        */
>>       if (type == KVM_DEV_TYPE_ARM_VGIC_V2 &&
>> -        !kvm_vgic_global_state.can_emulate_gicv2)
>> +        (!kvm_vgic_global_state.can_emulate_gicv2 || kvm_is_realm(kvm)))
>>           return -ENODEV;
>>         /*
> 
> Thanks,
> Gavin
> 


^ permalink raw reply

* Re: [PATCH v14 22/44] arm64: RMI: Handle realm enter/exit
From: Steven Price @ 2026-06-05 15:02 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <e6cb82fc-a8be-49d3-8fa3-0107c8ab97f7@redhat.com>

On 28/05/2026 05:38, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> Entering a realm is done using a SMC call to the RMM. On exit the
>> exit-codes need to be handled slightly differently to the normal KVM
>> path so define our own functions for realm enter/exit and hook them
>> in if the guest is a realm guest.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> Reviewed-by: Gavin Shan <gshan@redhat.com>
>> ---
>> Chanegs since v13:
>>   * The RMM is now required to provide an ESR value with the correct
>>     information to emulate MMIO, so we no longer need to hardcode 0s in
>>     rec_exit_sys_reg().
>>   * The PSCI changes mean that there is a potential race when turning on
>>     a VCPU which can cause a RMI_ERROR_REC return. Exit to user space
>>     with -EAGAIN in this case.
>> Changes since v12:
>>   * Call guest_state_{enter,exit}_irqoff() around rmi_rec_enter().
>>   * Add handling of the IRQ exception case where IRQs need to be briefly
>>     enabled before exiting guest timing.
>> Changes since v8:
>>   * Introduce kvm_rec_pre_enter() called before entering an atomic
>>     section to handle operations that might require memory allocation
>>     (specifically completing a RIPAS change introduced in a later patch).
>>   * Updates to align with upstream changes to hpfar_el2 which now
>> (ab)uses
>>     HPFAR_EL2_NS as a valid flag.
>>   * Fix exit reason when racing with PSCI shutdown to return
>>     KVM_EXIT_SHUTDOWN rather than KVM_EXIT_UNKNOWN.
>> Changes since v7:
>>   * A return of 0 from kvm_handle_sys_reg() doesn't mean the register has
>>     been read (although that can never happen in the current code). Tidy
>>     up the condition to handle any future refactoring.
>> Changes since v6:
>>   * Use vcpu_err() rather than pr_err/kvm_err when there is an associated
>>     vcpu to the error.
>>   * Return -EFAULT for KVM_EXIT_MEMORY_FAULT as per the documentation for
>>     this exit type.
>>   * Split code handling a RIPAS change triggered by the guest to the
>>     following patch.
>> Changes since v5:
>>   * For a RIPAS_CHANGE request from the guest perform the actual RIPAS
>>     change on next entry rather than immediately on the exit. This allows
>>     the VMM to 'reject' a RIPAS change by refusing to continue
>>     scheduling.
>> Changes since v4:
>>   * Rename handle_rme_exit() to handle_rec_exit()
>>   * Move the loop to copy registers into the REC enter structure from the
>>     to rec_exit_handlers callbacks to kvm_rec_enter(). This fixes a bug
>>     where the handler exits to user space and user space wants to modify
>>     the GPRS.
>>   * Some code rearrangement in rec_exit_ripas_change().
>> Changes since v2:
>>   * realm_set_ipa_state() now provides an output parameter for the
>>     top_iap that was changed. Use this to signal the VMM with the correct
>>     range that has been transitioned.
>>   * Adapt to previous patch changes.
>> ---
>>   arch/arm64/include/asm/kvm_rmi.h |   4 +
>>   arch/arm64/kvm/Makefile          |   2 +-
>>   arch/arm64/kvm/arm.c             |  26 ++++-
>>   arch/arm64/kvm/rmi-exit.c        | 186 +++++++++++++++++++++++++++++++
>>   arch/arm64/kvm/rmi.c             |  42 +++++++
>>   5 files changed, 254 insertions(+), 6 deletions(-)
>>   create mode 100644 arch/arm64/kvm/rmi-exit.c
>>
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/
>> asm/kvm_rmi.h
>> index d99bf4fc3c39..feb534a6678e 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -84,6 +84,10 @@ void kvm_destroy_realm(struct kvm *kvm);
>>   void kvm_realm_destroy_rtts(struct kvm *kvm);
>>   void kvm_destroy_rec(struct kvm_vcpu *vcpu);
>>   +int kvm_rec_enter(struct kvm_vcpu *vcpu);
>> +int kvm_rec_pre_enter(struct kvm_vcpu *vcpu);
>> +int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_status);
>> +
>>   static inline bool kvm_realm_is_private_address(struct realm *realm,
>>                           unsigned long addr)
>>   {
>> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
>> index ed3cf30eb06e..4a2d52fdb6a2 100644
>> --- a/arch/arm64/kvm/Makefile
>> +++ b/arch/arm64/kvm/Makefile
>> @@ -16,7 +16,7 @@ CFLAGS_handle_exit.o += -Wno-override-init
>>   kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
>>        inject_fault.o va_layout.o handle_exit.o config.o \
>>        guest.o debug.o reset.o sys_regs.o stacktrace.o \
>> -     vgic-sys-reg-v3.o fpsimd.o pkvm.o rmi.o \
>> +     vgic-sys-reg-v3.o fpsimd.o pkvm.o rmi.o rmi-exit.o \
>>        arch_timer.o trng.o vmid.o emulate-nested.o nested.o at.o \
>>        vgic/vgic.o vgic/vgic-init.o \
>>        vgic/vgic-irqfd.o vgic/vgic-v2.o \
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 21d9dfdb1ea0..ed88a203b892 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -1331,6 +1331,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>           if (ret > 0)
>>               ret = check_vcpu_requests(vcpu);
>>   +        if (ret > 0 && vcpu_is_rec(vcpu))
>> +            ret = kvm_rec_pre_enter(vcpu);
>> +
>>           /*
>>            * Preparing the interrupts to be injected also
>>            * involves poking the GIC, which must be done in a
>> @@ -1378,7 +1381,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>           trace_kvm_entry(*vcpu_pc(vcpu));
>>           guest_timing_enter_irqoff();
>>   -        ret = kvm_arm_vcpu_enter_exit(vcpu);
>> +        if (vcpu_is_rec(vcpu))
>> +            ret = kvm_rec_enter(vcpu);
>> +        else
>> +            ret = kvm_arm_vcpu_enter_exit(vcpu);
>>             vcpu->mode = OUTSIDE_GUEST_MODE;
>>           vcpu->stat.exits++;
>> @@ -1424,7 +1430,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>            * context synchronization event) is necessary to ensure that
>>            * pending interrupts are taken.
>>            */
>> -        if (ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ) {
>> +        if (ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ ||
>> +            (vcpu_is_rec(vcpu) &&
>> +             vcpu->arch.rec.run->exit.exit_reason == RMI_EXIT_IRQ)) {
>>               local_irq_enable();
>>               isb();
>>               local_irq_disable();
> 
> The condition could be posssibly imprecise because ARM_EXCEPTION_CODE(ret)
> can be ARM_EXCEPTION_IRQ even for a REC. So the precise condition would be:
> 
>         if ((!vcpu_is_rec(vcpu) && ARM_EXCEPTION_CODE(ret) ==
> ARM_EXCEPTION_IRQ) ||
>             (vcpu_is_rec(vcpu) && vcpu->arch.rec.run->exit.exit_reason
> == RMI_EXIT_IRQ)) {

Good point - I guess this wouldn't have shown up in testing because
there's no harm (other than performance) in the ISB.

>> @@ -1436,8 +1444,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>             trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu),
>> *vcpu_pc(vcpu));
>>   -        /* Exit types that need handling before we can be preempted */
>> -        handle_exit_early(vcpu, ret);
>> +        if (!vcpu_is_rec(vcpu)) {
>> +            /*
>> +             * Exit types that need handling before we can be
>> +             * preempted
>> +             */
>> +            handle_exit_early(vcpu, ret);
>> +        }
>>             kvm_nested_sync_hwstate(vcpu);
>>   @@ -1462,7 +1475,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu
>> *vcpu)
>>               ret = ARM_EXCEPTION_IL;
>>           }
>>   -        ret = handle_exit(vcpu, ret);
>> +        if (vcpu_is_rec(vcpu))
>> +            ret = handle_rec_exit(vcpu, ret);
>> +        else
>> +            ret = handle_exit(vcpu, ret);
>>       }
>>         /* Tell userspace about in-kernel device output levels */
>> diff --git a/arch/arm64/kvm/rmi-exit.c b/arch/arm64/kvm/rmi-exit.c
>> new file mode 100644
>> index 000000000000..e7c51b6cf6ce
>> --- /dev/null
>> +++ b/arch/arm64/kvm/rmi-exit.c
>> @@ -0,0 +1,186 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#include <linux/kvm_host.h>
>> +#include <kvm/arm_hypercalls.h>
>> +#include <kvm/arm_psci.h>
>> +
>> +#include <asm/rmi_smc.h>
>> +#include <asm/kvm_emulate.h>
>> +#include <asm/kvm_rmi.h>
>> +#include <asm/kvm_mmu.h>
>> +
>> +typedef int (*exit_handler_fn)(struct kvm_vcpu *vcpu);
>> +
>> +static int rec_exit_reason_notimpl(struct kvm_vcpu *vcpu)
>> +{
>> +    struct realm_rec *rec = &vcpu->arch.rec;
>> +
>> +    vcpu_err(vcpu, "Unhandled exit reason from realm (ESR: %#llx)\n",
>> +         rec->run->exit.esr);
>> +    return -ENXIO;
>> +}
>> +
> 
> s/rec->run->exit.esr/kvm_vcpu_get_esr(vcpu), rec->run->exit.esr has been
> copied to the storage space pointed by kvm_vcpu_get_esr() in its caller.

Ack

>> +static int rec_exit_sync_dabt(struct kvm_vcpu *vcpu)
>> +{
>> +    return kvm_handle_guest_abort(vcpu);
>> +}
>> +
>> +static int rec_exit_sync_iabt(struct kvm_vcpu *vcpu)
>> +{
>> +    struct realm_rec *rec = &vcpu->arch.rec;
>> +
>> +    vcpu_err(vcpu, "Unhandled instruction abort (ESR: %#llx).\n",
>> +         rec->run->exit.esr);
>> +    return -ENXIO;
>> +}
>> +
> 
> s/rec->run->exit.esr/kvm_vcpu_get_esr(vcpu)

Ack

>> +static int rec_exit_sys_reg(struct kvm_vcpu *vcpu)
>> +{
>> +    struct realm_rec *rec = &vcpu->arch.rec;
>> +    unsigned long esr = kvm_vcpu_get_esr(vcpu);
>> +    int rt = kvm_vcpu_sys_get_rt(vcpu);
>> +    bool is_write = (esr & ESR_ELx_SYS64_ISS_DIR_MASK) ==
>> ESR_ELx_SYS64_ISS_DIR_WRITE;
>> +    int ret;
>> +
>> +    if (is_write)
>> +        vcpu_set_reg(vcpu, rt, rec->run->exit.gprs[rt]);
>> +
>> +    ret = kvm_handle_sys_reg(vcpu);
>> +    if (!is_write)
>> +        rec->run->enter.gprs[rt] = vcpu_get_reg(vcpu, rt);
>> +
>> +    return ret;
>> +}
>> +
>> +static exit_handler_fn rec_exit_handlers[] = {
>> +    [0 ... ESR_ELx_EC_MAX]    = rec_exit_reason_notimpl,
>> +    [ESR_ELx_EC_SYS64]    = rec_exit_sys_reg,
>> +    [ESR_ELx_EC_DABT_LOW]    = rec_exit_sync_dabt,
>> +    [ESR_ELx_EC_IABT_LOW]    = rec_exit_sync_iabt
>> +};
>> +
>> +static int rec_exit_psci(struct kvm_vcpu *vcpu)
>> +{
>> +    struct realm_rec *rec = &vcpu->arch.rec;
>> +    int i;
>> +
>> +    for (i = 0; i < REC_RUN_GPRS; i++)
>> +        vcpu_set_reg(vcpu, i, rec->run->exit.gprs[i]);
>> +
>> +    return kvm_smccc_call_handler(vcpu);
>> +}
>> +
>> +static int rec_exit_ripas_change(struct kvm_vcpu *vcpu)
>> +{
>> +    struct kvm *kvm = vcpu->kvm;
>> +    struct realm *realm = &kvm->arch.realm;
>> +    struct realm_rec *rec = &vcpu->arch.rec;
>> +    unsigned long base = rec->run->exit.ripas_base;
>> +    unsigned long top = rec->run->exit.ripas_top;
>> +    unsigned long ripas = rec->run->exit.ripas_value;
>> +
>> +    if (!kvm_realm_is_private_address(realm, base) ||
>> +        !kvm_realm_is_private_address(realm, top - 1)) {
>> +        vcpu_err(vcpu, "Invalid RIPAS_CHANGE for %#lx - %#lx, ripas:
>> %#lx\n",
>> +             base, top, ripas);
>> +        /* Set RMI_REJECT bit */
>> +        rec->run->enter.flags = REC_ENTER_FLAG_RIPAS_RESPONSE;
>> +        return -EINVAL;
>> +    }
> 
> I doubt if the flag (REC_ENTER_FLAG_RIPAS_RESPONSE) will be handed over
> to RMM
> since the negative return value forces we're exiting to VMM like QEMU where
> how this problematic case can be handled is TBD.

It's perhaps a bit non-obvious but enter.flags is cleared on the exit.
So even if we return to the VMM the flags will be kept for the next entry.

I agree it is somewhat TBD exactly how this case should be handled -
there's a bunch of "VM did something stupid" cases like this that are a
bit problematic.

Thanks,
Steve

>> +
>> +    /* Exit to VMM, the actual RIPAS change is done on next entry */
>> +    kvm_prepare_memory_fault_exit(vcpu, base, top - base, false, false,
>> +                      ripas == RMI_RAM);
>> +
>> +    /*
>> +     * KVM_EXIT_MEMORY_FAULT requires an return code of -EFAULT, see the
>> +     * API documentation
>> +     */
>> +    return -EFAULT;
>> +}
>> +
>> +static void update_arch_timer_irq_lines(struct kvm_vcpu *vcpu)
>> +{
>> +    struct realm_rec *rec = &vcpu->arch.rec;
>> +
>> +    __vcpu_assign_sys_reg(vcpu, CNTV_CTL_EL0, rec->run->exit.cntv_ctl);
>> +    __vcpu_assign_sys_reg(vcpu, CNTV_CVAL_EL0, rec->run-
>> >exit.cntv_cval);
>> +    __vcpu_assign_sys_reg(vcpu, CNTP_CTL_EL0, rec->run->exit.cntp_ctl);
>> +    __vcpu_assign_sys_reg(vcpu, CNTP_CVAL_EL0, rec->run-
>> >exit.cntp_cval);
>> +
>> +    kvm_realm_timers_update(vcpu);
>> +}
>> +
>> +/*
>> + * Return > 0 to return to guest, < 0 on error, 0 (and set
>> exit_reason) on
>> + * proper exit to userspace.
>> + */
>> +int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_ret)
>> +{
>> +    struct realm_rec *rec = &vcpu->arch.rec;
>> +    u8 esr_ec = ESR_ELx_EC(rec->run->exit.esr);
>> +    unsigned long status, index;
>> +
>> +    status = RMI_RETURN_STATUS(rec_run_ret);
>> +    index = RMI_RETURN_INDEX(rec_run_ret);
>> +
>> +    /*
>> +     * If a PSCI_SYSTEM_OFF request raced with a vcpu executing, we
>> might
>> +     * see the following status code and index indicating an attempt
>> to run
>> +     * a REC when the RD state is SYSTEM_OFF.  In this case, we just
>> need to
>> +     * return to user space which can deal with the system event or
>> will try
>> +     * to run the KVM VCPU again, at which point we will no longer
>> attempt
>> +     * to enter the Realm because we will have a sleep request
>> pending on
>> +     * the VCPU as a result of KVM's PSCI handling.
>> +     */
>> +    if (status == RMI_ERROR_REALM) {
>> +        vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
>> +        return 0;
>> +    }
>> +
>> +    /*
>> +     * If a VCPU has been turned on, but the REC state hasn't been
>> updated
>> +     * we may experience RMI_ERROR_REC. Exit to the userspace with -
>> EAGAIN
>> +     * for a retry.
>> +     */
>> +    if (status == RMI_ERROR_REC)
>> +        return -EAGAIN;
>> +    if (rec_run_ret)
>> +        return -ENXIO;
>> +
>> +    vcpu->arch.fault.esr_el2 = rec->run->exit.esr;
>> +    vcpu->arch.fault.far_el2 = rec->run->exit.far;
>> +    /* HPFAR_EL2 is only valid for RMI_EXIT_SYNC */
>> +    vcpu->arch.fault.hpfar_el2 = 0;
>> +
>> +    update_arch_timer_irq_lines(vcpu);
>> +
>> +    /* Reset the emulation flags for the next run of the REC */
>> +    rec->run->enter.flags = 0;
>> +
>> +    switch (rec->run->exit.exit_reason) {
>> +    case RMI_EXIT_SYNC:
>> +        /*
>> +         * HPFAR_EL2_NS is hijacked to indicate a valid HPFAR value,
>> +         * see __get_fault_info()
>> +         */
>> +        vcpu->arch.fault.hpfar_el2 = rec->run->exit.hpfar |
>> HPFAR_EL2_NS;
>> +        return rec_exit_handlers[esr_ec](vcpu);
>> +    case RMI_EXIT_IRQ:
>> +    case RMI_EXIT_FIQ:
>> +    case RMI_EXIT_SERROR:
>> +        return 1;
>> +    case RMI_EXIT_PSCI:
>> +        return rec_exit_psci(vcpu);
>> +    case RMI_EXIT_RIPAS_CHANGE:
>> +        return rec_exit_ripas_change(vcpu);
>> +    }
>> +
>> +    kvm_pr_unimpl("Unsupported exit reason: %u\n",
>> +              rec->run->exit.exit_reason);
>> +    vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>> +    return 0;
>> +}
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index 353a5ca45e78..d8a5fb12db2d 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -173,6 +173,48 @@ static int realm_ensure_created(struct kvm *kvm)
>>       return -ENXIO;
>>   }
>>   +/*
>> + * kvm_rec_pre_enter - Complete operations before entering a REC
>> + *
>> + * Some operations require work to be completed before entering a
>> realm. That
>> + * work may require memory allocation so cannot be done in the
>> kvm_rec_enter()
>> + * call.
>> + *
>> + * Return: 1 if we should enter the guest
>> + *       0 if we should exit to userspace
>> + *       < 0 if we should exit to userspace, where the return value
>> indicates
>> + *       an error
>> + */
>> +int kvm_rec_pre_enter(struct kvm_vcpu *vcpu)
>> +{
>> +    struct realm_rec *rec = &vcpu->arch.rec;
>> +
>> +    if (kvm_realm_state(vcpu->kvm) != REALM_STATE_ACTIVE)
>> +        return -EINVAL;
>> +
>> +    switch (rec->run->exit.exit_reason) {
>> +    case RMI_EXIT_HOST_CALL:
>> +        for (int i = 0; i < REC_RUN_GPRS; i++)
>> +            rec->run->enter.gprs[i] = vcpu_get_reg(vcpu, i);
>> +        break;
>> +    }
>> +
>> +    return 1;
>> +}
>> +
>> +int noinstr kvm_rec_enter(struct kvm_vcpu *vcpu)
>> +{
>> +    struct realm_rec *rec = &vcpu->arch.rec;
>> +    int ret;
>> +
>> +    guest_state_enter_irqoff();
>> +    ret = rmi_rec_enter(virt_to_phys(rec->rec_page),
>> +                virt_to_phys(rec->run));
>> +    guest_state_exit_irqoff();
>> +
>> +    return ret;
>> +}
>> +
>>   static int kvm_create_rec(struct kvm_vcpu *vcpu)
>>   {
>>       struct user_pt_regs *vcpu_regs = vcpu_gp_regs(vcpu);
> 
> Thanks,
> Gavin
> 


^ permalink raw reply

* Re: [PATCH v14 23/44] arm64: RMI: Handle RMI_EXIT_RIPAS_CHANGE
From: Steven Price @ 2026-06-05 15:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <yq5ay0hfspok.fsf@kernel.org>

On 19/05/2026 10:40, Aneesh Kumar K.V wrote:
> Steven Price <steven.price@arm.com> writes:
> 
> ...
> 
>> +void kvm_realm_unmap_range(struct kvm *kvm, unsigned long start,
>> +			   unsigned long size, bool unmap_private,
>> +			   bool may_block)
>> +{
>> +	unsigned long end = start + size;
>> +	struct realm *realm = &kvm->arch.realm;
>> +
>> +	if (!kvm_realm_is_created(kvm))
>> +		return;
>> +
>> +	end = min(BIT(realm->ia_bits - 1), end);
>> +
>> +	realm_unmap_shared_range(kvm, start, end, may_block);
>> +	if (unmap_private)
>> +		realm_unmap_private_range(kvm, start, end, may_block);
>> +}
>> +
>  
> kvm_gmem_invalidate_begin() indicates a private-only invalidation. How
> is that supported?

Because we treat the private and shared spaces are aliasing we don't
really support a "private-only" invalidation. So the shared space will
be invalidated as well. Something has gone wrong if we've ended up with
the 'same' IPA being used in both the private and shared spaces.

Private has to be treated slightly specially because removing a private
mapping is observable by the guest (the page can't be reinserted without
the guest agreeing and the contents being wiped). For shared mappings
the page can simply be refaulted.

That said, I'll look into Wei-Lin's suggestion to use
kvm_gfn_range_filter which would allow all three combinations of
private-only, shared-only and private+shared.

Thanks,
Steve

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox