* Re: [PATCH v14 26/44] arm64: RMI: Allow populating initial contents
From: Steven Price @ 2026-06-08 9:36 UTC (permalink / raw)
To: Gavin Shan, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <ea4be6c5-9506-4253-80c5-c76c9ac3b77d@redhat.com>
On 28/05/2026 06:30, Gavin Shan wrote:
> Hi Steve,
>
> On 5/13/26 11:17 PM, Steven Price wrote:
>> The VMM needs to populate the realm with some data before starting (e.g.
>> a kernel and initrd). This is measured by the RMM and used as part of
>> the attestation later on.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>> * Rename realm_create_protected_data_page() to realm_data_map_init().
>> Changes since v12:
>> * The ioctl now updates the structure with the amount populated rather
>> than returning this through the ioctl return code.
>> * Use the new RMM v2.0 range based RMI calls.
>> * Adapt to upstream changes in kvm_gmem_populate().
>> Changes since v11:
>> * The multiplex CAP is gone and there's a new ioctl which makes use of
>> the generic kvm_gmem_populate() functionality.
>> Changes since v7:
>> * Improve the error codes.
>> * Other minor changes from review.
>> Changes since v6:
>> * Handle host potentially having a larger page size than the RMM
>> granule.
>> * Drop historic "par" (protected address range) from
>> populate_par_region() - it doesn't exist within the current
>> architecture.
>> * Add a cond_resched() call in kvm_populate_realm().
>> Changes since v5:
>> * Refactor to use PFNs rather than tracking struct page in
>> realm_create_protected_data_page().
>> * Pull changes from a later patch (in the v5 series) for accessing
>> pages from a guest memfd.
>> * Do the populate in chunks to avoid holding locks for too long and
>> triggering RCU stall warnings.
>> ---
>> arch/arm64/include/asm/kvm_rmi.h | 4 ++
>> arch/arm64/kvm/Kconfig | 1 +
>> arch/arm64/kvm/arm.c | 13 ++++
>> arch/arm64/kvm/rmi.c | 106 +++++++++++++++++++++++++++++++
>> 4 files changed, 124 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>> index 007249a13dbc..a2b6bc412a22 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -88,6 +88,10 @@ int kvm_rec_enter(struct kvm_vcpu *vcpu);
>> int kvm_rec_pre_enter(struct kvm_vcpu *vcpu);
>> int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_status);
>> +struct kvm_arm_rmi_populate;
>> +
>> +int kvm_arm_rmi_populate(struct kvm *kvm,
>> + struct kvm_arm_rmi_populate *arg);
>> void kvm_realm_unmap_range(struct kvm *kvm,
>> unsigned long ipa,
>> unsigned long size,
>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>> index 4e16719fda22..d0cd011cf672 100644
>> --- a/arch/arm64/kvm/Kconfig
>> +++ b/arch/arm64/kvm/Kconfig
>> @@ -38,6 +38,7 @@ menuconfig KVM
>> select GUEST_PERF_EVENTS if PERF_EVENTS
>> select KVM_GUEST_MEMFD
>> select KVM_GENERIC_MEMORY_ATTRIBUTES
>> + select HAVE_KVM_ARCH_GMEM_POPULATE
>> help
>> Support hosting virtualized guest machines.
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index ed88a203b892..073ba9181da9 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -2131,6 +2131,19 @@ int kvm_arch_vm_ioctl(struct file *filp,
>> unsigned int ioctl, unsigned long arg)
>> return -EFAULT;
>> return kvm_vm_ioctl_get_reg_writable_masks(kvm, &range);
>> }
>> + case KVM_ARM_RMI_POPULATE: {
>> + struct kvm_arm_rmi_populate req;
>> + int ret;
>> +
>> + if (!kvm_is_realm(kvm))
>> + return -ENXIO;
>> + if (copy_from_user(&req, argp, sizeof(req)))
>> + return -EFAULT;
>> + ret = kvm_arm_rmi_populate(kvm, &req);
>> + if (copy_to_user(argp, &req, sizeof(req)))
>> + return -EFAULT;
>> + return ret;
>> + }
>
> s/return ret/return 0; The variable 'ret' can be dropped.
kvm_arm_rmi_populate() may return an error though. E.g. if the
"reserved" field is set then it's kvm_arm_rmi_populate() that detects
that and returns -EINVAL.
>> default:
>> return -EINVAL;
>> }
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index a89873a5eb77..209087bcf399 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -486,6 +486,75 @@ void kvm_realm_unmap_range(struct kvm *kvm,
>> unsigned long start,
>> realm_unmap_private_range(kvm, start, end, may_block);
>> }
>> +static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
>> + kvm_pfn_t dst_pfn, kvm_pfn_t src_pfn,
>> + unsigned long flags)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> + phys_addr_t rd = virt_to_phys(realm->rd);
>> + phys_addr_t dst_phys, src_phys;
>> + int ret;
>> +
>> + dst_phys = __pfn_to_phys(dst_pfn);
>> + src_phys = __pfn_to_phys(src_pfn);
>> +
>> + if (rmi_delegate_page(dst_phys))
>> + return -ENXIO;
>> +
>> + ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys, flags);
>> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> + /* Create missing RTTs and retry */
>> + int level = RMI_RETURN_INDEX(ret);
>> +
>> + KVM_BUG_ON(level == KVM_PGTABLE_LAST_LEVEL, kvm);
>
> KVM_BUG_ON(level >= KVM_PGTABLE_LAST_LEVEL, kvm);
Ack.
>> + ret = realm_create_rtt_levels(realm, ipa, level,
>> + KVM_PGTABLE_LAST_LEVEL, NULL);
>> + if (!ret) {
>> + ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys,
>> + flags);
>> + }
>> + }
>> +
>> + if (ret) {
>> + if (WARN_ON(rmi_undelegate_page(dst_phys))) {
>> + /* Undelegate failed, so we leak the page */
>> + get_page(pfn_to_page(dst_pfn));
>> + }
>> + }
>> +
>
> if (ret && WARN_ON(rmi_undelegate_page(dst_phys))) {
> /* Leak the page that fails to be undelegated */
> get_page(pfn_to_page(dst_pfn));
> }
Ack
>> + return ret;
>> +}
>> +
>> +static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> + struct page *src_page, void *opaque)
>> +{
>> + unsigned long data_flags = *(unsigned long *)opaque;
>> + phys_addr_t ipa = gfn_to_gpa(gfn);
>> +
>> + if (!src_page)
>> + return -EOPNOTSUPP;
>> +
>> + return realm_data_map_init(kvm, ipa, pfn, page_to_pfn(src_page),
>> + data_flags);
>> +}
>> +
>> +static long populate_region(struct kvm *kvm,
>> + gfn_t base_gfn,
>> + unsigned long pages,
>> + u64 uaddr,
>> + unsigned long data_flags)
>> +{
>> + long ret = 0;
>> +
>> + mutex_lock(&kvm->slots_lock);
>> + ret = kvm_gmem_populate(kvm, base_gfn, u64_to_user_ptr(uaddr),
>> pages,
>> + populate_region_cb, &data_flags);
>> + mutex_unlock(&kvm->slots_lock);
>> +
>> + return ret;
>> +}
>> +
>> enum ripas_action {
>> RIPAS_INIT,
>> RIPAS_SET,
>> @@ -574,6 +643,43 @@ static int realm_ensure_created(struct kvm *kvm)
>> return -ENXIO;
>> }
>> +int kvm_arm_rmi_populate(struct kvm *kvm,
>> + struct kvm_arm_rmi_populate *args)
>> +{
>> + unsigned long data_flags = 0;
>> + unsigned long ipa_start = args->base;
>> + unsigned long ipa_end = ipa_start + args->size;
>> + long pages_populated;
>> + int ret;
>> +
>> + if (args->reserved ||
>> + (args->flags & ~KVM_ARM_RMI_POPULATE_FLAGS_MEASURE) ||
>> + !IS_ALIGNED(ipa_start, PAGE_SIZE) ||
>> + !IS_ALIGNED(ipa_end, PAGE_SIZE) ||
>> + !IS_ALIGNED(args->source_uaddr, PAGE_SIZE))
>> + return -EINVAL;
>> +
>
> There are more conditions missed here:
>
> args->size == 0, return 0;
> args->base + args->size < args->base, return -EINVAL; // wrapped range
Good catch. args->size == 0 can trigger a WARN_ON currently. I'll put
the "return 0" after the realm_ensure_created() call so the behaviour
matches.
I don't think the wrapped range is quite such a problem - but detecting
it and rejecting it early seems like a good idea.
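As a rough illustration of the ordering discussed above (alignment and
wrap rejected up front, an empty range treated as a successful no-op),
here is a minimal user-space sketch. It is not the patch's code: the
names check_populate_range(), enum range_check and SKETCH_PAGE_SIZE are
made up for the example, and a 4K page size is assumed.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's PAGE_SIZE; 4K assumed here. */
#define SKETCH_PAGE_SIZE 4096ULL

enum range_check {
	RANGE_OK,      /* non-empty, aligned, no wrap: go on and populate */
	RANGE_EMPTY,   /* size == 0: nothing to do, report success */
	RANGE_INVALID, /* misaligned or wrapped: reject (-EINVAL in the ioctl) */
};

/* Classify a [base, base + size) request before any state is touched. */
static enum range_check check_populate_range(uint64_t base, uint64_t size)
{
	uint64_t end = base + size; /* unsigned wrap is well defined in C */

	/* Both ends of the range must sit on a page boundary. */
	if ((base & (SKETCH_PAGE_SIZE - 1)) || (size & (SKETCH_PAGE_SIZE - 1)))
		return RANGE_INVALID;

	/* A wrapped range ends below where it started. */
	if (end < base)
		return RANGE_INVALID;

	/* An empty range is valid, but there is nothing to populate. */
	if (size == 0)
		return RANGE_EMPTY;

	return RANGE_OK;
}
```

In the real ioctl the RANGE_EMPTY case would only be reported after
realm_ensure_created() has run, so that the side effects match the
non-empty path.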
>> + ret = realm_ensure_created(kvm);
>> + if (ret)
>> + return ret;
>> +
>> + if (args->flags & KVM_ARM_RMI_POPULATE_FLAGS_MEASURE)
>> + data_flags |= RMI_MEASURE_CONTENT;
>> +
>> + pages_populated = populate_region(kvm, gpa_to_gfn(ipa_start),
>> + args->size >> PAGE_SHIFT,
>> + args->source_uaddr, data_flags);
>> +
>> + if (pages_populated < 0)
>> + return pages_populated;
>
> pages_populated is 'unsigned long', this function returns an 'int' value.
pages_populated is *signed* long. This is handling an error code - so if
it's negative we expect the error code to be between -1 and -MAX_ERRNO
which should easily fit within the 'int' return.
For positive values we continue below (encoding the potentially larger
number in the args outputs) and return 0.
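To make the narrowing argument concrete, here is a small stand-alone
sketch. The helper split_populate_result() and the SKETCH_* constants
are invented for the example (4K pages assumed); only the value of
MAX_ERRNO (4095) is taken from the kernel.

```c
/* Same value as the kernel's MAX_ERRNO in include/linux/err.h. */
#define SKETCH_MAX_ERRNO 4095
/* 4K pages assumed for the sketch. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Split a "pages populated, or negative errno" result into an int status
 * and a byte count. Error codes lie in [-SKETCH_MAX_ERRNO, -1], so the
 * narrowing cast cannot truncate; positive counts never pass through it
 * and are reported via @bytes_done instead.
 */
static int split_populate_result(long pages_populated,
				 unsigned long long *bytes_done)
{
	if (pages_populated < 0)
		return (int)pages_populated;

	*bytes_done = (unsigned long long)pages_populated << SKETCH_PAGE_SHIFT;
	return 0;
}
```

A count of 0x100000 pages gives 4 GiB of progress, which would not fit
in an 'int' byte count, yet the function's return value is still just 0.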
Thanks,
Steve
>> +
>> + args->size -= pages_populated << PAGE_SHIFT;
>> + args->source_uaddr += pages_populated << PAGE_SHIFT;
>> + args->base += pages_populated << PAGE_SHIFT;
>> +
>> + return 0;
>> +}
>> +
>> static void kvm_complete_ripas_change(struct kvm_vcpu *vcpu)
>> {
>> struct kvm *kvm = vcpu->kvm;
>
> Thanks,
> Gavin
>
^ permalink raw reply
* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Suzuki K Poulose @ 2026-06-08 9:30 UTC (permalink / raw)
To: Gavin Shan, Steven Price, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
linux-coco, Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
Lorenzo.Pieralisi2
In-Reply-To: <3359f788-07fa-41a1-9ac7-45c58577c1fa@redhat.com>
On 05/06/2026 07:23, Gavin Shan wrote:
> Hi Steve,
>
> On 5/13/26 11:17 PM, Steven Price wrote:
>> At runtime if the realm guest accesses memory which hasn't yet been
>> mapped then KVM needs to either populate the region or fault the guest.
>>
>> For memory in the lower (protected) region of IPA a fresh page is
>> provided to the RMM which will zero the contents. For memory in the
>> upper (shared) region of IPA, the memory from the memslot is mapped
>> into the realm VM non secure.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>> * Numerous changes due to rebasing.
>> * Fix addr_range_desc() to encode the correct block size.
>> Changes since v12:
>> * Switch to RMM v2.0 range based APIs.
>> Changes since v11:
>> * Adapt to upstream changes.
>> Changes since v10:
>> * RME->RMI renaming.
>> * Adapt to upstream gmem changes.
>> Changes since v9:
>> * Fix call to kvm_stage2_unmap_range() in kvm_free_stage2_pgd() to set
>> may_block to avoid stall warnings.
>> * Minor coding style fixes.
>> Changes since v8:
>> * Propagate the may_block flag.
>> * Minor comments and coding style changes.
>> Changes since v7:
>> * Remove redundant WARN_ONs for realm_create_rtt_levels() - it will
>> internally WARN when necessary.
>> Changes since v6:
>> * Handle PAGE_SIZE being larger than RMM granule size.
>> * Some minor renaming following review comments.
>> Changes since v5:
>> * Reduce use of struct page in preparation for supporting the RMM
>> having a different page size to the host.
>> * Handle a race when delegating a page where another CPU has faulted on
>> the same page (and already delegated the physical page) but not yet
>> mapped it. In this case simply return to the guest to either use the
>> mapping from the other CPU (or refault if the race is lost).
>> * The changes to populate_par_region() are moved into the previous
>> patch where they belong.
>> Changes since v4:
>> * Code cleanup following review feedback.
>> * Drop the PTE_SHARED bit when creating unprotected page table entries.
>> This is now set by the RMM and the host has no control of it and the
>> spec requires the bit to be set to zero.
>> Changes since v2:
>> * Avoid leaking memory if failing to map it in the realm.
>> * Correctly mask RTT based on LPA2 flag (see rtt_get_phys()).
>> * Adapt to changes in previous patches.
>> ---
>> arch/arm64/include/asm/kvm_emulate.h | 8 ++
>> arch/arm64/include/asm/kvm_rmi.h | 12 ++
>> arch/arm64/kvm/mmu.c | 128 ++++++++++++++++----
>> arch/arm64/kvm/rmi.c | 173 +++++++++++++++++++++++++++
>> 4 files changed, 301 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 2e69fe494716..8b6f9d26b5d8 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -712,6 +712,14 @@ static inline bool kvm_realm_is_created(struct
>> kvm *kvm)
>> return kvm_is_realm(kvm) && kvm_realm_state(kvm) !=
>> REALM_STATE_NONE;
>> }
>> +static inline gpa_t kvm_gpa_from_fault(struct kvm *kvm, phys_addr_t ipa)
>> +{
>> + if (!kvm_is_realm(kvm))
>> + return ipa;
>> +
>> + return ipa & ~BIT(kvm->arch.realm.ia_bits - 1);
>> +}
>> +
>> static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
>> {
>> return kvm_is_realm(vcpu->kvm);
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>> index a2b6bc412a22..b65cfec10dee 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -6,6 +6,7 @@
>> #ifndef __ASM_KVM_RMI_H
>> #define __ASM_KVM_RMI_H
>> +#include <asm/kvm_pgtable.h>
>> #include <asm/rmi_smc.h>
>> /**
>> @@ -97,6 +98,17 @@ void kvm_realm_unmap_range(struct kvm *kvm,
>> unsigned long size,
>> bool unmap_private,
>> bool may_block);
>> +int realm_map_protected(struct kvm *kvm,
>> + unsigned long base_ipa,
>> + kvm_pfn_t pfn,
>> + unsigned long size,
>> + struct kvm_mmu_memory_cache *memcache);
>> +int realm_map_non_secure(struct realm *realm,
>> + unsigned long ipa,
>> + kvm_pfn_t pfn,
>> + unsigned long size,
>> + enum kvm_pgtable_prot prot,
>> + struct kvm_mmu_memory_cache *memcache);
>> static inline bool kvm_realm_is_private_address(struct realm *realm,
>> unsigned long addr)
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index ac2a0f0106b0..776ffe56d17e 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -334,8 +334,15 @@ static void __unmap_stage2_range(struct
>> kvm_s2_mmu *mmu, phys_addr_t start, u64
>> lockdep_assert_held_write(&kvm->mmu_lock);
>> WARN_ON(size & ~PAGE_MASK);
>> - WARN_ON(stage2_apply_range(mmu, start, end,
>> KVM_PGT_FN(kvm_pgtable_stage2_unmap),
>> - may_block));
>> +
>> + if (kvm_is_realm(kvm)) {
>> + kvm_realm_unmap_range(kvm, start, size, !only_shared,
>> + may_block);
>> + } else {
>> + WARN_ON(stage2_apply_range(mmu, start, end,
>> + KVM_PGT_FN(kvm_pgtable_stage2_unmap),
>> + may_block));
>> + }
>> }
>> void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
>> @@ -358,7 +365,10 @@ static void stage2_flush_memslot(struct kvm *kvm,
>> phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
>> phys_addr_t end = addr + PAGE_SIZE * memslot->npages;
>> - kvm_stage2_flush_range(&kvm->arch.mmu, addr, end);
>> + if (kvm_is_realm(kvm))
>> + kvm_realm_unmap_range(kvm, addr, end - addr, false, true);
>> + else
>> + kvm_stage2_flush_range(&kvm->arch.mmu, addr, end);
>> }
>> /**
>> @@ -1103,6 +1113,10 @@ void stage2_unmap_vm(struct kvm *kvm)
>> struct kvm_memory_slot *memslot;
>> int idx, bkt;
>> + /* For realms this is handled by the RMM so nothing to do here */
>> + if (kvm_is_realm(kvm))
>> + return;
>> +
>> idx = srcu_read_lock(&kvm->srcu);
>> mmap_read_lock(current->mm);
>> write_lock(&kvm->mmu_lock);
>> @@ -1528,6 +1542,29 @@ static bool kvm_vma_mte_allowed(struct
>> vm_area_struct *vma)
>> return vma->vm_flags & VM_MTE_ALLOWED;
>> }
>> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
>> + kvm_pfn_t pfn, unsigned long map_size,
>> + enum kvm_pgtable_prot prot,
>> + struct kvm_mmu_memory_cache *memcache)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> +
>> + /*
>> + * Write permission is required for now even though it's possible to
>> + * map unprotected pages (granules) as read-only. It's impossible to
>> + * map protected pages (granules) as read-only.
>> + */
>> + if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
>> + return -EFAULT;
>> +
>
> I'm a bit concerned with this. We don't have KVM_PGTABLE_PROT_W set in
> @prot if the stage2 fault is raised due to a memory read. With -EFAULT
> returned to the VMM (e.g. QEMU), the vCPU stops executing and the system
> won't work any more.
>
>> + ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
>> + if (!kvm_realm_is_private_address(realm, ipa))
>> + return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
>> + memcache);
>> +
>> + return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
>> +}
>> +
>> static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
>> {
>> switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
>> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct
>> kvm_s2_fault_desc *s2fd)
>> bool write_fault, exec_fault;
>> enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>> - struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
>> + struct kvm_vcpu *vcpu = s2fd->vcpu;
>> + struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
>> + gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>> unsigned long mmu_seq;
>> struct page *page;
>> - struct kvm *kvm = s2fd->vcpu->kvm;
>> + struct kvm *kvm = vcpu->kvm;
>> void *memcache;
>> kvm_pfn_t pfn;
>> gfn_t gfn;
>> int ret;
>> - memcache = get_mmu_memcache(s2fd->vcpu);
>> - ret = topup_mmu_memcache(s2fd->vcpu, memcache);
>> + if (kvm_is_realm(vcpu->kvm)) {
>> + /* check for memory attribute mismatch */
>> + bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
>> + /*
>> + * For Realms, the shared address is an alias of the private
>> + * PA with the top bit set. Thus if the fault address matches
>> + * the GPA then it is the private alias.
>> + */
>> + bool is_priv_fault = (gpa == s2fd->fault_ipa);
>> +
>> + if (is_priv_gfn != is_priv_fault) {
>> + kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>> + kvm_is_write_fault(vcpu),
>> + false,
>> + is_priv_fault);
>> + /*
>> + * KVM_EXIT_MEMORY_FAULT requires a return code of
>> + * -EFAULT, see the API documentation
>> + */
>> + return -EFAULT;
>> + }
>> + }
>> +
>> + memcache = get_mmu_memcache(vcpu);
>> + ret = topup_mmu_memcache(vcpu, memcache);
>> if (ret)
>> return ret;
>> if (s2fd->nested)
>> gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
>> else
>> - gfn = s2fd->fault_ipa >> PAGE_SHIFT;
>> + gfn = gpa >> PAGE_SHIFT;
>> - write_fault = kvm_is_write_fault(s2fd->vcpu);
>> - exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
>> + write_fault = kvm_is_write_fault(vcpu);
>> + exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>> VM_WARN_ON_ONCE(write_fault && exec_fault);
>> @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct
>> kvm_s2_fault_desc *s2fd)
>> ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
>> if (ret) {
>> - kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa,
>> PAGE_SIZE,
>> + kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>> write_fault, exec_fault, false);
>> return ret;
>> }
>> @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct
>> kvm_s2_fault_desc *s2fd)
>> kvm_fault_lock(kvm);
>> if (mmu_invalidate_retry(kvm, mmu_seq)) {
>> ret = -EAGAIN;
>> - goto out_unlock;
>> + goto out_release_page;
>> + }
>> +
>> + if (kvm_is_realm(kvm)) {
>> + ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
>> + PAGE_SIZE, KVM_PGTABLE_PROT_R |
>> KVM_PGTABLE_PROT_W, memcache);
>> + goto out_release_page;
>> }
>> ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa,
>> PAGE_SIZE,
>> __pfn_to_phys(pfn), prot,
>> memcache, flags);
>> -out_unlock:
>> +out_release_page:
>> kvm_release_faultin_page(kvm, page, !!ret, prot &
>> KVM_PGTABLE_PROT_W);
>> kvm_fault_unlock(kvm);
>> @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const
>> struct kvm_s2_fault_desc *s2fd,
>> * mapping size to ensure we find the right PFN and lay down the
>> * mapping in the right place.
>> */
>> - s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >>
>> PAGE_SHIFT;
>> + s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd->fault_ipa,
>> s2vi->vma_pagesize)) >> PAGE_SHIFT;
>> s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
>> @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct
>> kvm_s2_fault_desc *s2fd,
>> prot &= ~KVM_NV_GUEST_MAP_SZ;
>> ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt,
>> gfn_to_gpa(gfn),
>> prot, flags);
>> + } else if (kvm_is_realm(kvm)) {
>> + ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
>> + prot, memcache);
>> } else {
>> ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt,
>> gfn_to_gpa(gfn), mapping_size,
>> __pfn_to_phys(pfn), prot,
>
> For the kvm_is_realm() case, do we need to adjust 's2fd->fault_ipa' for
> the sake of huge pages? In kvm_s2_fault_map(), @gfn and @pfn may have been
> adjusted by transparent_hugepage_adjust() to be aligned with the huge page
> size. If that adjustment happened, we need to align s2fd->fault_ipa down
> to the huge page size as well.
>
>
>> @@ -2214,6 +2285,13 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>> return 0;
>> }
>> +static bool shared_ipa_fault(struct kvm *kvm, phys_addr_t fault_ipa)
>> +{
>> + gpa_t gpa = kvm_gpa_from_fault(kvm, fault_ipa);
>> +
>> + return (gpa != fault_ipa);
>> +}
>> +
>> /**
>> * kvm_handle_guest_abort - handles all 2nd stage aborts
>> * @vcpu: the VCPU pointer
>> @@ -2324,8 +2402,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>> nested = &nested_trans;
>> }
>> - gfn = ipa >> PAGE_SHIFT;
>> + gfn = kvm_gpa_from_fault(vcpu->kvm, ipa) >> PAGE_SHIFT;
>> memslot = gfn_to_memslot(vcpu->kvm, gfn);
>> +
>> hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
>> write_fault = kvm_is_write_fault(vcpu);
>> if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
>> @@ -2368,7 +2447,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>> * of the page size.
>> */
>> ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(vcpu));
>> - ret = io_mem_abort(vcpu, ipa);
>> + ret = io_mem_abort(vcpu, kvm_gpa_from_fault(vcpu->kvm, ipa));
>> goto out_unlock;
>> }
>> @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>> !write_fault &&
>> !kvm_vcpu_trap_is_exec_fault(vcpu));
>> - if (kvm_slot_has_gmem(memslot))
>> + if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
>> ret = gmem_abort(&s2fd);
>> else
>> ret = user_mem_abort(&s2fd);
>> @@ -2433,6 +2512,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct
>> kvm_gfn_range *range)
>> if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
>> return false;
>> + /* We don't support aging for Realms */
>> + if (kvm_is_realm(kvm))
>> + return true;
>> +
>> return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
>> range->start << PAGE_SHIFT,
>> size, true);
>> @@ -2449,6 +2532,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct
>> kvm_gfn_range *range)
>> if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
>> return false;
>> + /* We don't support aging for Realms */
>> + if (kvm_is_realm(kvm))
>> + return true;
>> +
>> return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
>> range->start << PAGE_SHIFT,
>> size, false);
>> @@ -2628,10 +2715,11 @@ int kvm_arch_prepare_memory_region(struct kvm
>> *kvm,
>> return -EFAULT;
>> /*
>> - * Only support guest_memfd backed memslots with mappable memory,
>> since
>> - * there aren't any CoCo VMs that support only private memory on
>> arm64.
>> + * Only support guest_memfd backed memslots with mappable memory,
>> + * unless the guest is a CCA realm guest.
>> */
>> - if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
>> + if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new) &&
>> + !kvm_is_realm(kvm))
>> return -EINVAL;
>> hva = new->userspace_addr;
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index cae29fd3353c..761b38a4071c 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm,
>> unsigned long ipa,
>> return ret;
>> }
>> +static unsigned long addr_range_desc(unsigned long phys, unsigned
>> long size)
>> +{
>> + unsigned long out = 0;
>> +
>> + switch (size) {
>> + case P4D_SIZE:
>> + out = 3 | (1 << 2);
>> + break;
>> + case PUD_SIZE:
>> + out = 2 | (1 << 2);
>> + break;
>> + case PMD_SIZE:
>> + out = 1 | (1 << 2);
>> + break;
>> + case PAGE_SIZE:
>> + out = 0 | (1 << 2);
>> + break;
>> + default:
>> + /*
>> + * Only support mapping at the page level granularity when
>> + * it's an unusual length. This should get us back onto a larger
>> + * block size for the subsequent mappings.
>> + */
>> + out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
>> + break;
>> + }
>> +
>> + WARN_ON(phys & ~PAGE_MASK);
>> +
>> + out |= phys & PAGE_MASK;
>> +
>> + return out;
>> +}
>> +
>> +int realm_map_protected(struct kvm *kvm,
>> + unsigned long ipa,
>> + kvm_pfn_t pfn,
>> + unsigned long map_size,
>> + struct kvm_mmu_memory_cache *memcache)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> + phys_addr_t phys = __pfn_to_phys(pfn);
>> + phys_addr_t base_phys = phys;
>> + phys_addr_t rd = virt_to_phys(realm->rd);
>> + unsigned long base_ipa = ipa;
>> + unsigned long ipa_top = ipa + map_size;
>> + int ret = 0;
>> +
>> + if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
>> + !IS_ALIGNED(ipa, map_size)))
>> + return -EINVAL;
>> +
>> + if (rmi_delegate_range(phys, map_size)) {
>> + /*
>> + * It's likely we raced with another VCPU on the same
>> + * fault. Assume the other VCPU has handled the fault
>> + * and return to the guest.
>> + */
>> + return 0;
>> + }
>> +
>> + while (ipa < ipa_top) {
>> + unsigned long flags = RMI_ADDR_TYPE_SINGLE;
>> + unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
>> + unsigned long out_top;
>> +
>> + ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
>> + &out_top);
>> +
>> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> + /* Create missing RTTs and retry */
>> + int level = RMI_RETURN_INDEX(ret);
>> +
>> + WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
>> + ret = realm_create_rtt_levels(realm, ipa, level,
>> + KVM_PGTABLE_LAST_LEVEL,
>> + memcache);
Could we give the RMM a chance to use block mappings by creating the
missing RTTs only down to the level that may work for the current
range_desc? I.e., if the range_desc is a 2M block size, we could create
tables up to L2 on the first go, and if the RMM still needs an RTT, we
could go further down to KVM_PGTABLE_LAST_LEVEL. I understand this is
kind of an optimisation, so maybe we could defer it. (The same applies
to the non_secure map below.)
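A rough sketch of how the first-attempt level could be chosen, assuming
a 4K granule (so L3 maps 4K, L2 maps 2M and L1 maps 1G). The names
sketch_level_size() and sketch_target_level() are hypothetical and not
part of the series; the idea is only to build RTTs down to the returned
level first and fall back to KVM_PGTABLE_LAST_LEVEL if the RMM still
reports RMI_ERROR_RTT.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* 4K granule assumed */
#define SKETCH_LAST_LEVEL 3  /* leaf level, like KVM_PGTABLE_LAST_LEVEL */

/* Bytes mapped by one entry at @level: 4K at L3, 2M at L2, 1G at L1. */
static uint64_t sketch_level_size(int level)
{
	return 1ULL << (SKETCH_PAGE_SHIFT + 9 * (SKETCH_LAST_LEVEL - level));
}

/*
 * Pick the shallowest level (1..3) whose block size fits in the remaining
 * @size and to which @ipa is aligned. Tables would be created down to this
 * level on the first attempt.
 */
static int sketch_target_level(uint64_t ipa, uint64_t size)
{
	int level;

	for (level = 1; level < SKETCH_LAST_LEVEL; level++) {
		uint64_t block = sketch_level_size(level);

		if (size >= block && (ipa & (block - 1)) == 0)
			return level;
	}

	/* Nothing larger fits, so map at page granularity. */
	return SKETCH_LAST_LEVEL;
}
```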
>> + if (ret)
>> + goto err_undelegate;
>> +
>> + ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
>> + range_desc, &out_top);
>> + }
>> +
>> + if (WARN_ON(ret))
>> + goto err_undelegate;
>> +
>> + phys += out_top - ipa;
>> + ipa = out_top;
>> + }
>> +
>> + return 0;
>> +
>> +err_undelegate:
>> + realm_unmap_private_range(kvm, base_ipa, ipa, true);
>> + if (WARN_ON(rmi_undelegate_range(base_phys, map_size))) {
>> + /* Page can't be returned to NS world so is lost */
>> + get_page(phys_to_page(base_phys));
>> + }
>> + return -ENXIO;
>> +}
>> +
>> +int realm_map_non_secure(struct realm *realm,
>> + unsigned long ipa,
>> + kvm_pfn_t pfn,
>> + unsigned long size,
>> + enum kvm_pgtable_prot prot,
>> + struct kvm_mmu_memory_cache *memcache)
>> +{
>> + unsigned long attr, flags = 0;
>> + phys_addr_t rd = virt_to_phys(realm->rd);
>> + phys_addr_t phys = __pfn_to_phys(pfn);
>> + unsigned long ipa_top = ipa + size;
>> + int ret;
>> +
>> + if (WARN_ON(!IS_ALIGNED(size, PAGE_SIZE) ||
>> + !IS_ALIGNED(ipa, size)))
>> + return -EINVAL;
>> +
>> + switch (prot & (KVM_PGTABLE_PROT_DEVICE |
>> KVM_PGTABLE_PROT_NORMAL_NC)) {
>> + case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
>> + return -EINVAL;
>> + case KVM_PGTABLE_PROT_DEVICE:
>> + attr = MT_S2_FWB_DEVICE_nGnRE;
>> + break;
>> + case KVM_PGTABLE_PROT_NORMAL_NC:
>> + attr = MT_S2_FWB_NORMAL_NC;
>> + break;
>> + default:
>> + attr = MT_S2_FWB_NORMAL;
>> + }
>> +
>> + flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_MEMATTR, attr);
>> +
>> + if (prot & KVM_PGTABLE_PROT_R)
>> + flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP,
>> RMI_S2AP_DIRECT_READ);
>> + if (prot & KVM_PGTABLE_PROT_W)
>> + flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP,
>> RMI_S2AP_DIRECT_WRITE);
>> +
>> + flags |= RMI_ADDR_TYPE_SINGLE;
>> +
>> + while (ipa < ipa_top) {
>> + unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
>> + unsigned long out_top;
>> +
>> + ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags, range_desc,
>> + &out_top);
>> +
>> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> + /* Create missing RTTs and retry */
>> + int level = RMI_RETURN_INDEX(ret);
>> +
>> + WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
>> + ret = realm_create_rtt_levels(realm, ipa, level,
>> + KVM_PGTABLE_LAST_LEVEL,
^^ Same as above.
Suzuki
>> + memcache);
>> + if (ret)
>> + return ret;
>> +
>> + ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags,
>> + range_desc, &out_top);
>> + }
>> +
>> + if (WARN_ON(ret))
>> + return ret;
>> +
>> + phys += out_top - ipa;
>> + ipa = out_top;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t
>> pfn,
>> struct page *src_page, void *opaque)
>> {
>
> Thanks,
> Gavin
>
^ permalink raw reply
* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Kiryl Shutsemau @ 2026-06-08 9:14 UTC (permalink / raw)
To: Dave Hansen
Cc: Chao Gao, Edgecombe, Rick P, kvm@vger.kernel.org,
linux-coco@lists.linux.dev, Huang, Kai, Zhao, Yan Y,
seanjc@google.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
pbonzini@redhat.com, nik.borisov@suse.com,
linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
x86@kernel.org
In-Reply-To: <572868d7-4794-4fec-b80f-97d8434d5fb6@intel.com>
On Fri, Jun 05, 2026 at 09:23:21AM -0700, Dave Hansen wrote:
> On 6/5/26 04:42, Kiryl Shutsemau wrote:
> >>> I don't see a reason why we can't keep the scoped_guard() on get side.
> >> One additional reason to drop scoped_guard() is that it mixes cleanup helpers
> >> with goto, which is discouraged. See [*]
> >>
> >> :Lastly, given that the benefit of cleanup helpers is removal of “goto”, and
> >> :that the “goto” statement can jump between scopes, the expectation is that
> >> :usage of “goto” and cleanup helpers is never mixed in the same function.
> > Fair enough.
> >
> > But it can also be addressed if we free the PAMT page array with the guard
> > too :P
>
> How important is this patch? I see "Optimize" but I read "Optional".
>
> If we're arguing about it, maybe we should just kick it out and focus on
> the more important bits.
I don't think it is optional for anything outside of test setup.
Without the optimization, we have all KVM memory allocations serialized
on a single spinlock. And we do alloc_pamt_array()/free_pamt_array() all
the time too.
And since the lock is global, it is an easy DoS attack vector: one guest
can do a shared->private->shared conversion loop and make every guest on
the host suffer.
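For readers outside the thread, the general shape of such a lock-avoiding
fast path can be sketched in user-space C11 atomics. This is an
illustration of the technique only, not the code from the patch; the
names sketch_inc_not_zero() and sketch_dec_not_one() are made up here.
The 0 -> 1 and 1 -> 0 transitions are deliberately left to a slow path
that would take the global lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if one is already held; never performs 0 -> 1. */
static bool sketch_inc_not_zero(atomic_int *refcount)
{
	int old = atomic_load(refcount);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(refcount, &old, old + 1))
			return true;
	}

	/* Count is zero: the caller must take the lock and allocate. */
	return false;
}

/* Drop a reference unless it is the last one; never performs 1 -> 0. */
static bool sketch_dec_not_one(atomic_int *refcount)
{
	int old = atomic_load(refcount);

	while (old > 1) {
		if (atomic_compare_exchange_weak(refcount, &old, old - 1))
			return true;
	}

	/* Last reference: the caller must take the lock and free. */
	return false;
}
```

With helpers like these, only the first get and the last put for a given
region would contend on the global lock; every other get/put pair stays
lock-free.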
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCH v6 3/4] firmware: smccc: arm-cca-guest: Bind the TSM provider to an SMCCC device
From: Suzuki K Poulose @ 2026-06-08 9:03 UTC (permalink / raw)
To: Aneesh Kumar K.V, Sudeep Holla
Cc: linux-coco, linux-arm-kernel, linux-kernel, Catalin Marinas,
Greg KH, Jeremy Linton, Jonathan Cameron, Lorenzo Pieralisi,
Mark Rutland, Will Deacon, Steven Price
In-Reply-To: <yq5ao6hlzbpa.fsf@kernel.org>
On 08/06/2026 09:19, Aneesh Kumar K.V wrote:
> Sudeep Holla <sudeep.holla@kernel.org> writes:
>
>> On Thu, Jun 04, 2026 at 06:56:28PM +0530, Aneesh Kumar K.V wrote:
>>> Sudeep Holla <sudeep.holla@kernel.org> writes:
>>>
>>> ...
>>>
>>>> +static const struct smccc_device_info smccc_devices[] __initconst = {
>>>> + {
>>>> + .func_id = ARM_SMCCC_TRNG_VERSION,
>>>> + .requires_smc = false,
>>>> + .min_return = ARM_SMCCC_TRNG_MIN_VERSION,
>>>> + .device_name = "arm-smccc-trng",
>>>> + },
>>>> +};
>>>> +
>>>> +static bool __init
>>>> +smccc_probe_smccc_device(const struct smccc_device_info *smccc_dev)
>>>> +{
>>>> + struct arm_smccc_res res;
>>>> + unsigned long ret;
>>>> +
>>>> + if (!IS_ENABLED(CONFIG_ARM64))
>>>> + return false;
>>>> +
>>>> + if (smccc_conduit == SMCCC_CONDUIT_NONE)
>>>> + return false;
>>>> +
>>>> + if (smccc_dev->requires_smc && smccc_conduit != SMCCC_CONDUIT_SMC)
>>>> + return false;
>>>> +
>>>> + arm_smccc_1_1_invoke(smccc_dev->func_id, &res);
>>>> + ret = res.a0;
>>>> +
>>>> + if ((s32)ret < 0)
>>>> + return false;
>>>> +
>>>> + return ret >= smccc_dev->min_return;
>>>> +}
>>>> +
>>>>
>>>
>>> I am not sure we want the check to be as simple as ret < 0. Some
>>> function IDs may return input errors based on the supplied arguments
>>> (for example, RMI_ERROR_INPUT). In those cases, we would likely want
>>> this to be handled via a callback.
>>>
>>
>> As I mentioned in response to Suzuki, we can defer that to probe of
>> that device. If *_VERSION succeeds, the SMCCC core can add that device and
>> leave the rest to the core keeping the core and bus layer simple IMO.
>>
>>> We also want to use conditional compilation for some function IDs.
>>> Given the callback approach and the #ifdefs, I wonder whether what we
>>> currently have is actually simpler and more flexible.
>>>
>>
>> I was trying to avoid conditional compilation altogether and hence the
>> reason for keeping it as simple as possible. Also IS_ENABLED(CONFIG_ARM64)
>> in above snippet must come as some condition to this generic probe.
>>
>> Adding any more logic or callback defeats the bus idea here if we need
>> to rely/depend on multiple conditional compilation or callbacks IMO.
>>
>> Let's see if it can work with what we are adding now and may add in
>> near future and then decide.
>>
>
> If we move all the conditional checks to the driver probe path, then I
> think this can work. Something like the below:
>
> struct smccc_device_info {
> u32 func_id;
> bool requires_smc;
> const char *device_name;
> };
>
> static const struct smccc_device_info smccc_devices[] __initconst = {
> {
> .func_id = ARM_SMCCC_TRNG_VERSION,
> .requires_smc = false,
> .device_name = "arm-smccc-trng",
> },
>
> {
> .func_id = RSI_ABI_VERSION,
Don't we need to pass parameters to this (e.g. the requested interface
version)? See more below.
> .requires_smc = true,
> .device_name = RSI_DEV_NAME,
> },
> };
>
> static bool __init smccc_probe_smccc_device(const struct smccc_device_info *smccc_dev)
> {
> unsigned long ret;
> struct arm_smccc_res res;
>
> if (smccc_conduit == SMCCC_CONDUIT_NONE)
> return false;
>
> if (smccc_dev->requires_smc && smccc_conduit != SMCCC_CONDUIT_SMC)
> return false;
>
> arm_smccc_1_1_invoke(smccc_dev->func_id, &res);
> ret = res.a0;
>
> if ((s32)ret == SMCCC_RET_NOT_SUPPORTED)
Is this a reliable check for all possible SMCCC services ? i.e., Are we
expected to get RET_NOT_SUPPORTED for any service for which the backend
is not available ?
Also, as pointed out RSI_ABI_VERSION may return other errors based on
the input (requested version, e.g., RSI_ERROR_INPUT) and we may still go
ahead and register the device ?
> return false;
>
> return true;
> }
>
> static int __init smccc_devices_init(void)
> {
> struct arm_smccc_device *sdev;
> const struct smccc_device_info *smccc_dev;
>
> for (int i = 0; i < ARRAY_SIZE(smccc_devices); i++) {
> smccc_dev = &smccc_devices[i];
>
> if (!smccc_probe_smccc_device(smccc_dev))
> continue;
>
> sdev = arm_smccc_device_register(smccc_dev->device_name);
> if (IS_ERR(sdev))
> pr_err("%s: could not register device: %ld\n",
> smccc_dev->device_name, PTR_ERR(sdev));
>
> }
>
> return 0;
> }
> device_initcall(smccc_devices_init);
>
> with the diff to hw_random/smccc_trng
>
> modified arch/arm64/include/asm/archrandom.h
> @@ -12,7 +12,7 @@
>
> extern bool smccc_trng_available;
>
> -static inline bool __init smccc_probe_trng(void)
> +static inline bool smccc_probe_trng(void)
> {
> struct arm_smccc_res res;
>
> modified drivers/char/hw_random/arm_smccc_trng.c
> @@ -19,6 +19,8 @@
> #include <linux/arm-smccc.h>
> #include <linux/arm-smccc-bus.h>
>
> +#include <asm/archrandom.h>
> +
> #ifdef CONFIG_ARM64
> #define ARM_SMCCC_TRNG_RND ARM_SMCCC_TRNG_RND64
> #define MAX_BITS_PER_CALL (3 * 64UL)
> @@ -98,6 +100,10 @@ static int smccc_trng_probe(struct arm_smccc_device *sdev)
> {
> struct hwrng *trng;
>
> + /* validate the minimum version requirement */
> + if (!smccc_probe_trng())
> + return -ENODEV;
> +
> trng = devm_kzalloc(&sdev->dev, sizeof(*trng), GFP_KERNEL);
> if (!trng)
> return -ENOMEM;
>
> We can also move arch/arm64/include/asm/rsi_smc.h to
> include/linux/arm-rsi-smccc.h. There was a suggestion to move these
super minor nit: arm-smccc-rsi.h ?
Cheers
Suzuki
> firmware interfaces out of architecture-specific code:
>
> https://lore.kernel.org/all/agsNO9cc7H-b0H8L@willie-the-truck
>
> This will also avoid the #ifdef CONFIG_ARM64
>
> -aneesh
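To make the two return-value checks discussed above concrete, here is a minimal userspace sketch in C. It is not the kernel code: the helper names are hypothetical, and DEMO_ERROR_INPUT is an assumed small positive error code standing in for a service-specific input error such as the RSI_ERROR_INPUT mentioned in the thread. Only the value -1 for NOT_SUPPORTED is taken from include/linux/arm-smccc.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as SMCCC_RET_NOT_SUPPORTED in include/linux/arm-smccc.h. */
#define DEMO_RET_NOT_SUPPORTED	(-1)
/* Assumption: a service-specific input error encoded as a small positive value. */
#define DEMO_ERROR_INPUT	1UL

/* First variant from the thread: any negative 32-bit result means "absent". */
static bool demo_present_unless_negative(unsigned long a0)
{
	return (int32_t)a0 >= 0;
}

/* Second variant: only NOT_SUPPORTED means "absent". */
static bool demo_present_unless_not_supported(unsigned long a0)
{
	return (int32_t)a0 != DEMO_RET_NOT_SUPPORTED;
}
```

Under these assumptions both helpers report the service as present when the call returns DEMO_ERROR_INPUT, and the second one also accepts other negative codes such as -3. Neither check alone can tell a usable service apart from one that merely rejected the probe arguments, which would leave that decision to the driver's probe path.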
^ permalink raw reply
* Re: [PATCH v7 10/42] KVM: guest_memfd: Ensure pages are not in use before conversion
From: Vlastimil Babka (SUSE) @ 2026-06-08 8:55 UTC (permalink / raw)
To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-10-2f0fae496530@google.com>
On 5/23/26 02:17, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> When converting memory to private in guest_memfd, it is necessary to ensure
> that the pages are not currently being accessed by any other part of the
> kernel or userspace to avoid any current user writing to guest private
> memory.
>
> guest_memfd checks for unexpected refcounts to determine whether a page is
> still in use. The only expected refcounts after unmapping the range
> requested for conversion are those that are held by guest_memfd itself.
Is it sufficient to only check, and not also freeze the refcount? (i.e.
using folio_ref_freeze()), because without freezing, anything (e.g.
compaction's pfn-based scanner) could do a speculative folio_try_get() and
the checked refcount becomes stale.
Might be ok if we know that no such speculative increment can result in
actually touching the page contents, and the extra refcount and something
inspecting the struct folio won't interfere with anything else. Then it
could be just a comment mentioning why it's safe.
IIRC the compaction's scanning can result in a migration here so it's
probably ok?
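For illustration, the difference between checking and freezing a refcount can be modelled in userspace with C11 atomics. This is only a sketch of the general technique, not guest_memfd or mm code: struct demo_folio and all demo_* helpers are hypothetical, and the freeze is modelled as a compare-and-exchange from the expected count to zero, with the speculative get modelled as "increment unless zero".

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio; only the refcount is modelled. */
struct demo_folio {
	atomic_int refcount;
};

/* Check only: a snapshot that may be stale as soon as it returns. */
static bool demo_ref_check(struct demo_folio *f, int expected)
{
	return atomic_load(&f->refcount) == expected;
}

/* Freeze: atomically replace the expected count with 0, or fail. */
static bool demo_ref_freeze(struct demo_folio *f, int expected)
{
	int old = expected;

	return atomic_compare_exchange_strong(&f->refcount, &old, 0);
}

/* Unfreeze: restore the count once the critical work is done. */
static void demo_ref_unfreeze(struct demo_folio *f, int count)
{
	atomic_store(&f->refcount, count);
}

/* Speculative get: take a reference unless the count is 0 (frozen or free). */
static bool demo_try_get(struct demo_folio *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}
```

In this model a successful demo_ref_check() says nothing about the next instant, because demo_try_get() can still succeed afterwards. After a successful demo_ref_freeze(), demo_try_get() fails until demo_ref_unfreeze() is called.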
> Update the kvm_memory_attributes2 structure to include an error_offset
> field. This allows KVM to report the exact offset where a conversion
> failed to userspace. If the safety check fails, return -EAGAIN and copy
> the error_offset back to userspace so that it can potentially retry the
> operation or handle the failure gracefully.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> include/uapi/linux/kvm.h | 3 ++-
> virt/kvm/guest_memfd.c | 68 ++++++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 65 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6bbf68a83813..0b55258573d3d 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1658,7 +1658,8 @@ struct kvm_memory_attributes2 {
> __u64 size;
> __u64 attributes;
> __u64 flags;
> - __u64 reserved[12];
> + __u64 error_offset;
> + __u64 reserved[11];
> };
>
> #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 426917d22a2b6..2767992955752 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -572,9 +572,45 @@ static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
> return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
> }
>
> +static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> + size_t nr_pages, pgoff_t *err_index)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + const int filemap_get_folios_refcount = 1;
> + pgoff_t last = start + nr_pages - 1;
> + struct folio_batch fbatch;
> + bool safe = true;
> + pgoff_t next;
> + int i;
> +
> + folio_batch_init(&fbatch);
> +
> + next = start;
> + while (safe && filemap_get_folios(mapping, &next, last, &fbatch)) {
> +
> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + struct folio *folio = fbatch.folios[i];
> +
> + if (folio_ref_count(folio) !=
> + folio_nr_pages(folio) + filemap_get_folios_refcount) {
> + safe = false;
> + *err_index = max(start, folio->index);
> + break;
> + }
> + }
> +
> + folio_batch_release(&fbatch);
> + cond_resched();
> + }
> +
> + return safe;
> +}
> +
> static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> - size_t nr_pages, uint64_t attrs)
> + size_t nr_pages, uint64_t attrs,
> + pgoff_t *err_index)
> {
> + bool to_private = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> struct address_space *mapping = inode->i_mapping;
> struct gmem_inode *gi = GMEM_I(inode);
> pgoff_t end = start + nr_pages;
> @@ -588,8 +624,21 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>
> mas_init(&mas, mt, start);
> r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> - if (r)
> + if (r) {
> + *err_index = start;
> goto out;
> + }
> +
> + if (to_private) {
> + unmap_mapping_pages(mapping, start, nr_pages, false);
> +
> + if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages,
> + err_index)) {
> + mas_destroy(&mas);
> + r = -EAGAIN;
> + goto out;
> + }
> + }
>
> /*
> * From this point on guest_memfd has performed necessary
> @@ -609,9 +658,10 @@ static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
> struct gmem_file *f = file->private_data;
> struct inode *inode = file_inode(file);
> struct kvm_memory_attributes2 attrs;
> + pgoff_t err_index;
> size_t nr_pages;
> pgoff_t index;
> - int i;
> + int i, r;
>
> if (copy_from_user(&attrs, argp, sizeof(attrs)))
> return -EFAULT;
> @@ -635,8 +685,16 @@ static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
>
> nr_pages = attrs.size >> PAGE_SHIFT;
> index = attrs.offset >> PAGE_SHIFT;
> - return __kvm_gmem_set_attributes(inode, index, nr_pages,
> - attrs.attributes);
> + r = __kvm_gmem_set_attributes(inode, index, nr_pages, attrs.attributes,
> + &err_index);
> + if (r) {
> + attrs.error_offset = ((uint64_t)err_index) << PAGE_SHIFT;
> +
> + if (copy_to_user(argp, &attrs, sizeof(attrs)))
> + return -EFAULT;
> + }
> +
> + return r;
> }
>
> static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
>
^ permalink raw reply
* Re: [PATCH v14 24/44] KVM: arm64: Handle realm MMIO emulation
From: Steven Price @ 2026-06-08 8:49 UTC (permalink / raw)
To: Gavin Shan, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <8b648b59-c411-4126-be18-686d2927f24a@redhat.com>
On 28/05/2026 06:03, Gavin Shan wrote:
> Hi Steve,
>
> On 5/13/26 11:17 PM, Steven Price wrote:
>> MMIO emulation for a realm cannot be done directly with the VM's
>> registers as they are protected from the host. However, for emulatable
>> data aborts, the RMM uses GPRS[0] to provide the read/written value.
>> We can transfer this from/to the equivalent VCPU's register entry and
>> then depend on the generic MMIO handling code in KVM.
>>
>> For a MMIO read, the value is placed in the shared RecExit structure
>> during kvm_handle_mmio_return() rather than in the VCPU's register
>> entry.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> Reviewed-by: Gavin Shan <gshan@redhat.com>
>> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>> Changes since v7:
>> * New comment for rec_exit_sync_dabt() explaining the call to
>> vcpu_set_reg().
>> Changes since v5:
>> * Inject SEA to the guest if an emulatable MMIO access triggers a data
>> abort.
>> * kvm_handle_mmio_return() - disable kvm_incr_pc() for a REC (as the PC
>> isn't under the host's control) and move the REC_ENTER_EMULATED_MMIO
>> flag setting to this location (as that tells the RMM to skip the
>> instruction).
>> ---
>> arch/arm64/kvm/inject_fault.c | 4 +++-
>> arch/arm64/kvm/mmio.c | 16 ++++++++++++----
>> arch/arm64/kvm/rmi-exit.c | 14 ++++++++++++++
>> 3 files changed, 29 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
>> index 89982bd3345f..6492397b73d7 100644
>> --- a/arch/arm64/kvm/inject_fault.c
>> +++ b/arch/arm64/kvm/inject_fault.c
>> @@ -228,7 +228,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt, u32 addr)
>> static void __kvm_inject_sea(struct kvm_vcpu *vcpu, bool iabt, u64 addr)
>> {
>> - if (vcpu_el1_is_32bit(vcpu))
>> + if (unlikely(vcpu_is_rec(vcpu)))
>> + vcpu->arch.rec.run->enter.flags |= REC_ENTER_FLAG_INJECT_SEA;
>> + else if (vcpu_el1_is_32bit(vcpu))
>> inject_abt32(vcpu, iabt, addr);
>> else
>> inject_abt64(vcpu, iabt, addr);
>> diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
>> index e2285ed8c91d..6a8cb927fcca 100644
>> --- a/arch/arm64/kvm/mmio.c
>> +++ b/arch/arm64/kvm/mmio.c
>> @@ -6,6 +6,7 @@
>> #include <linux/kvm_host.h>
>> #include <asm/kvm_emulate.h>
>> +#include <asm/rmi_smc.h>
>> #include <trace/events/kvm.h>
>> #include "trace.h"
>> @@ -138,14 +139,21 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu)
>> trace_kvm_mmio(KVM_TRACE_MMIO_READ, len, run->mmio.phys_addr,
>> &data);
>> data = vcpu_data_host_to_guest(vcpu, data, len);
>> - vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
>> +
>> + if (vcpu_is_rec(vcpu))
>> + vcpu->arch.rec.run->enter.gprs[0] = data;
>> + else
>> + vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
>> }
>> /*
>> * The MMIO instruction is emulated and should not be re-executed
>> * in the guest.
>> */
>> - kvm_incr_pc(vcpu);
>> + if (vcpu_is_rec(vcpu))
>> + vcpu->arch.rec.run->enter.flags |= REC_ENTER_FLAG_EMULATED_MMIO;
>> + else
>> + kvm_incr_pc(vcpu);
>> return 1;
>> }
>> @@ -167,14 +175,14 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
>> * No valid syndrome? Ask userspace for help if it has
>> * volunteered to do so, and bail out otherwise.
>> *
>> - * In the protected VM case, there isn't much userspace can do
>> + * In the protected/realm VM case, there isn't much userspace can do
>> * though, so directly deliver an exception to the guest.
>> */
>> if (!kvm_vcpu_dabt_isvalid(vcpu)) {
>> trace_kvm_mmio_nisv(*vcpu_pc(vcpu), esr,
>> kvm_vcpu_get_hfar(vcpu), fault_ipa);
>> - if (vcpu_is_protected(vcpu))
>> + if (vcpu_is_protected(vcpu) || vcpu_is_rec(vcpu))
>> return kvm_inject_sea_dabt(vcpu, kvm_vcpu_get_hfar(vcpu));
>> if (test_bit(KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER,
>> diff --git a/arch/arm64/kvm/rmi-exit.c b/arch/arm64/kvm/rmi-exit.c
>> index e7c51b6cf6ce..8ec0d179eba2 100644
>> --- a/arch/arm64/kvm/rmi-exit.c
>> +++ b/arch/arm64/kvm/rmi-exit.c
>> @@ -25,6 +25,20 @@ static int rec_exit_reason_notimpl(struct kvm_vcpu *vcpu)
>> static int rec_exit_sync_dabt(struct kvm_vcpu *vcpu)
>> {
>> + struct realm_rec *rec = &vcpu->arch.rec;
>> +
>> + /*
>> + * In the case of a write, copy over gprs[0] to the target GPR,
>> + * preparing to handle MMIO write fault. The content to be written has
>> + * been saved to gprs[0] by the RMM (even if another register was used
>> + * by the guest). In the case of normal memory access this is redundant
>> + * (the guest will replay the instruction), but the overhead is
>> + * minimal.
>> + */
>> + if (kvm_vcpu_dabt_iswrite(vcpu) && kvm_vcpu_dabt_isvalid(vcpu))
>> + vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu),
>> + rec->run->exit.gprs[0]);
>> +
>
> { } is needed here.
Indeed - I'm surprised checkpatch didn't manage to flag that. I'll fix.
Thanks,
Steve
>> return kvm_handle_guest_abort(vcpu);
>> }
>>
>
> Thanks,
> Gavin
>
^ permalink raw reply
* Re: [PATCH v7 14/42] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Vlastimil Babka (SUSE) @ 2026-06-08 8:45 UTC (permalink / raw)
To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-14-2f0fae496530@google.com>
On 5/23/26 02:17, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> When checking if a guest_memfd folio is safe for conversion, its refcount
> is examined. A folio may be present in a per-CPU lru_add fbatch, which
> temporarily increases its refcount. This can lead to a false positive,
> incorrectly indicating that the folio is in use and preventing the
> conversion, even if it is otherwise safe. The conversion process might not
> be on the same CPU that holds the folio in its fbatch, making a simple
> per-CPU check insufficient.
>
> To address this, drain all CPUs' lru_add fbatches if an unexpectedly high
> refcount is encountered during the safety check. This is performed at most
> once per conversion request. Drain only if the folio in question may be
> lru cached.
>
> guest_memfd folios are unevictable, so they can only reside in the lru_add
> fbatch. If the folio's refcount is still unsafe after draining, then the
> conversion is truly deemed unsafe.
>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
^ permalink raw reply
* Re: [PATCH v6 3/4] firmware: smccc: arm-cca-guest: Bind the TSM provider to an SMCCC device
From: Sudeep Holla @ 2026-06-08 8:39 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: linux-coco, linux-arm-kernel, linux-kernel, Sudeep Holla,
Catalin Marinas, Greg KH, Jeremy Linton, Jonathan Cameron,
Lorenzo Pieralisi, Mark Rutland, Will Deacon, Steven Price,
Suzuki K Poulose
In-Reply-To: <yq5ao6hlzbpa.fsf@kernel.org>
On Mon, Jun 08, 2026 at 01:49:13PM +0530, Aneesh Kumar K.V wrote:
> Sudeep Holla <sudeep.holla@kernel.org> writes:
>
> > On Thu, Jun 04, 2026 at 06:56:28PM +0530, Aneesh Kumar K.V wrote:
> >> Sudeep Holla <sudeep.holla@kernel.org> writes:
> >>
> >> ...
> >>
> >> > +static const struct smccc_device_info smccc_devices[] __initconst = {
> >> > + {
> >> > + .func_id = ARM_SMCCC_TRNG_VERSION,
> >> > + .requires_smc = false,
> >> > + .min_return = ARM_SMCCC_TRNG_MIN_VERSION,
> >> > + .device_name = "arm-smccc-trng",
> >> > + },
> >> > +};
> >> > +
> >> > +static bool __init
> >> > +smccc_probe_smccc_device(const struct smccc_device_info *smccc_dev)
> >> > +{
> >> > + struct arm_smccc_res res;
> >> > + unsigned long ret;
> >> > +
> >> > + if (!IS_ENABLED(CONFIG_ARM64))
> >> > + return false;
> >> > +
> >> > + if (smccc_conduit == SMCCC_CONDUIT_NONE)
> >> > + return false;
> >> > +
> >> > + if (smccc_dev->requires_smc && smccc_conduit != SMCCC_CONDUIT_SMC)
> >> > + return false;
> >> > +
> >> > + arm_smccc_1_1_invoke(smccc_dev->func_id, &res);
> >> > + ret = res.a0;
> >> > +
> >> > + if ((s32)ret < 0)
> >> > + return false;
> >> > +
> >> > + return ret >= smccc_dev->min_return;
> >> > +}
> >> > +
> >> >
> >>
> >> I am not sure we want the check to be as simple as ret < 0. Some
> >> function IDs may return input errors based on the supplied arguments
> >> (for example, RMI_ERROR_INPUT). In those cases, we would likely want
> >> this to be handled via a callback.
> >>
> >
> > As I mentioned in response to Suzuki, we can defer that to probe of
> > that device. If *_VERSION, succeeds SMCCC core can add that device and
> > leave the rest to the core keeping the core and bus layer simple IMO.
> >
> >> We also want to use conditional compilation for some function IDs.
> >> Given the callback approach and the #ifdefs, I wonder whether what we
> >> currently have is actually simpler and more flexible.
> >>
> >
> > I was trying to avoid conditional compilation altogether and hence the
> > reason for keeping it as simple as possible. Also IS_ENABLED(CONFIG_ARM64)
> > in above snippet must come as some condition to this generic probe.
> >
> > Adding any more logic or callback defeats the bus idea here if we need
> > to rely/depend on multiple conditional compilation or callbacks IMO.
> >
> > Let's see if it can work with what we are adding now and may add in
> > near future and then decide.
> >
>
> If we move all the conditional checks to the driver probe path, then I
> think this can work. Something like the below:
>
Sounds good to me.
[...]
> We can also move arch/arm64/include/asm/rsi_smc.h to
> include/linux/arm-rsi-smccc.h. There was a suggestion to move these
> firmware interfaces out of architecture-specific code:
>
> https://lore.kernel.org/all/agsNO9cc7H-b0H8L@willie-the-truck
>
Ah OK, sorry I had missed this.
--
Regards,
Sudeep
^ permalink raw reply
* Re: [PATCH v13 15/22] KVM: selftests: Call KVM_TDX_INIT_VCPU when creating a new TDX vcpu
From: Binbin Wu @ 2026-06-08 8:34 UTC (permalink / raw)
To: Lisa Wang
Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
Sean Christopherson, Shuah Khan, Oliver Upton,
Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-15-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
[...]
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> index 9660ea9d2f31..4d01f806b37d 100644
> --- a/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> @@ -39,6 +39,30 @@ static inline bool is_tdx_vm(struct kvm_vm *vm)
> __TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd, ret, vm); \
> })
>
> +#define __tdx_vcpu_ioctl(vcpu, cmd, _flags, arg) \
> +({ \
> + int r; \
> + \
> + union { \
> + struct kvm_tdx_cmd c; \
> + unsigned long raw; \
> + } tdx_cmd = { .c = { \
> + .id = (cmd), \
> + .flags = (u32)(_flags), \
> + .data = (u64)(arg), \
Nit:
The two lines' backslashes are misaligned.
> + } }; \
> + \
> + r = __vcpu_ioctl(vcpu, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd.raw); \
> + r ?: tdx_cmd.c.hw_error; \
Similar issue of the truncation of upper bits.
Though TDX KVM code never sets hw_error currently for vcpu version.
> +})
> +
> +#define tdx_vcpu_ioctl(vcpu, cmd, flags, arg) \
> +({ \
> + int ret = __tdx_vcpu_ioctl(vcpu, cmd, flags, arg); \
> + \
> + __TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd, ret, (vcpu)->vm); \
> +})
> +
> void tdx_init_vm(struct kvm_vm *vm, u64 attributes);
> void tdx_vm_setup_boot_code_region(struct kvm_vm *vm);
> void tdx_vm_setup_boot_parameters_region(struct kvm_vm *vm, u32 nr_runnable_vcpus);
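The truncation concern can be reproduced in isolation. The sketch below assumes the issue is that the 64-bit hw_error value ends up in an int, as it does when the result of the statement expression is assigned to "int ret". The helper names and the sample error value are hypothetical and chosen only so that the low 32 bits are zero. The narrowing conversion is implementation-defined in C; GCC and Clang reduce the value modulo 2^32.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Pattern under review: the ioctl return code and the 64-bit hw_error are
 * merged into one value that is then stored in an int.
 */
static int demo_status_narrowed(int r, uint64_t hw_error)
{
	uint64_t status = r ? (uint64_t)(int64_t)r : hw_error;

	/* Upper 32 bits are dropped here (modulo 2^32 on GCC and Clang). */
	return (int)status;
}

/* Alternative sketch: test both fields at full width and never narrow. */
static bool demo_status_failed(int r, uint64_t hw_error)
{
	return r != 0 || hw_error != 0;
}
```

With a hypothetical hw_error of 0x8000000000000000 and r == 0, demo_status_narrowed() returns 0, so an assertion on "!ret" would pass even though an error was reported, while demo_status_failed() still flags it.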
^ permalink raw reply
* [PATCH 3/4] x86/msr: Switch wrmsrl() users to wrmsrq()
From: Juergen Gross @ 2026-06-08 8:28 UTC (permalink / raw)
To: linux-kernel, x86, linux-perf-users, kvm, linux-coco,
linux-hyperv, linux-pm
Cc: Juergen Gross, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Tony Luck, Reinette Chatre, Dave Martin,
James Morse, Babu Moger, Sean Christopherson, Paolo Bonzini,
Kiryl Shutsemau, Rick Edgecombe, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Rafael J. Wysocki, Artem Bityutskiy,
Artem Bityutskiy, Len Brown
In-Reply-To: <20260608082809.3492719-1-jgross@suse.com>
wrmsrl() is a deprecated synonym for wrmsrq(). Switch its users to
wrmsrq().
Signed-off-by: Juergen Gross <jgross@suse.com>
---
arch/x86/events/amd/uncore.c | 2 +-
arch/x86/events/intel/core.c | 4 ++--
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
arch/x86/kvm/pmu.c | 6 +++---
arch/x86/kvm/vmx/tdx.c | 6 +++---
drivers/hv/mshv_vtl_main.c | 2 +-
drivers/idle/intel_idle.c | 2 +-
8 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index 98ef4bf9911a..7dc6af4231cc 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -975,7 +975,7 @@ static void amd_uncore_umc_read(struct perf_event *event)
* that the counter never gets a chance to saturate.
*/
if (new & BIT_ULL(63 - COUNTER_SHIFT)) {
- wrmsrl(hwc->event_base, 0);
+ wrmsrq(hwc->event_base, 0);
local64_set(&hwc->prev_count, 0);
} else {
local64_set(&hwc->prev_count, new);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index dd1e3aa75ee9..e9baa64dc962 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3166,12 +3166,12 @@ static void intel_pmu_config_acr(int idx, u64 mask, u32 reload)
}
if (cpuc->acr_cfg_b[idx] != mask) {
- wrmsrl(msr_b + msr_offset, mask);
+ wrmsrq(msr_b + msr_offset, mask);
cpuc->acr_cfg_b[idx] = mask;
}
/* Only need to update the reload value when there is a valid config value. */
if (mask && cpuc->acr_cfg_c[idx] != reload) {
- wrmsrl(msr_c + msr_offset, reload);
+ wrmsrq(msr_c + msr_offset, reload);
cpuc->acr_cfg_c[idx] = reload;
}
}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c5ed0bc1f831..e4918c32a822 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -532,7 +532,7 @@ static void resctrl_abmc_config_one_amd(void *info)
{
union l3_qos_abmc_cfg *abmc_cfg = info;
- wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg->full);
+ wrmsrq(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg->full);
}
/*
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b85e715ebb30..d44afbe005bb 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -708,7 +708,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Reset hw history on AMD CPUs */
if (cpu_feature_enabled(X86_FEATURE_AMD_WORKLOAD_CLASS))
- wrmsrl(MSR_AMD_WORKLOAD_HRST, 0x1);
+ wrmsrq(MSR_AMD_WORKLOAD_HRST, 0x1);
return prev_p;
}
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e218352e3423..aee70e5dc15d 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1313,14 +1313,14 @@ static void kvm_pmu_load_guest_pmcs(struct kvm_vcpu *vcpu)
pmc = &pmu->gp_counters[i];
if (pmc->counter != rdpmc(i))
- wrmsrl(gp_counter_msr(i), pmc->counter);
- wrmsrl(gp_eventsel_msr(i), pmc->eventsel_hw);
+ wrmsrq(gp_counter_msr(i), pmc->counter);
+ wrmsrq(gp_eventsel_msr(i), pmc->eventsel_hw);
}
for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
pmc = &pmu->fixed_counters[i];
if (pmc->counter != rdpmc(INTEL_PMC_FIXED_RDPMC_BASE | i))
- wrmsrl(fixed_counter_msr(i), pmc->counter);
+ wrmsrq(fixed_counter_msr(i), pmc->counter);
}
}
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 04ce321ebdf3..cb50e23c39ca 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -823,7 +823,7 @@ static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
return;
++vcpu->stat.host_state_reload;
- wrmsrl(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base);
+ wrmsrq(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base);
vt->guest_state_loaded = false;
}
@@ -1048,10 +1048,10 @@ static void tdx_load_host_xsave_state(struct kvm_vcpu *vcpu)
/*
* Likewise, even if a TDX hosts didn't support XSS both arms of
- * the comparison would be 0 and the wrmsrl would be skipped.
+ * the comparison would be 0 and the wrmsrq would be skipped.
*/
if (kvm_host.xss != (kvm_tdx->xfam & kvm_caps.supported_xss))
- wrmsrl(MSR_IA32_XSS, kvm_host.xss);
+ wrmsrq(MSR_IA32_XSS, kvm_host.xss);
}
#define TDX_DEBUGCTL_PRESERVED (DEBUGCTLMSR_BTF | \
diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
index f5d27f28d6ad..0d3d4161974f 100644
--- a/drivers/hv/mshv_vtl_main.c
+++ b/drivers/hv/mshv_vtl_main.c
@@ -596,7 +596,7 @@ static int mshv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set)
} else {
/* Handle MSRs */
if (set)
- wrmsrl(reg_table[i].msr_addr, *reg64);
+ wrmsrq(reg_table[i].msr_addr, *reg64);
else
rdmsrq(reg_table[i].msr_addr, *reg64);
}
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 15c698291b32..67d5993c7387 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -2379,7 +2379,7 @@ static void intel_c1_demotion_toggle(void *enable)
msr_val |= NHM_C1_AUTO_DEMOTE | SNB_C1_AUTO_UNDEMOTE;
else
msr_val &= ~(NHM_C1_AUTO_DEMOTE | SNB_C1_AUTO_UNDEMOTE);
- wrmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr_val);
+ wrmsrq(MSR_PKG_CST_CONFIG_CONTROL, msr_val);
}
static ssize_t intel_c1_demotion_store(struct device *dev,
--
2.54.0
^ permalink raw reply related
* [PATCH 0/4] x86/msr: Get rid of rdmsrl() and wrmsrl()
From: Juergen Gross @ 2026-06-08 8:28 UTC (permalink / raw)
To: linux-kernel, x86, linux-perf-users, linux-hyperv, linux-pm, kvm,
linux-coco
Cc: Juergen Gross, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Tony Luck, Reinette Chatre, Dave Martin,
James Morse, Babu Moger, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Rafael J. Wysocki, Artem Bityutskiy,
Artem Bityutskiy, Len Brown, Sean Christopherson, Paolo Bonzini,
Kiryl Shutsemau, Rick Edgecombe
rdmsrl() and wrmsrl() are deprecated aliases of rdmsrq() and wrmsrq().
Switch all users and remove the deprecated variants.
Juergen Gross (4):
x86/msr: Switch rdmsrl() users to rdmsrq()
x86/msr: Remove rdmsrl()
x86/msr: Switch wrmsrl() users to wrmsrq()
x86/msr: Remove wrmsrl()
arch/x86/events/amd/uncore.c | 4 ++--
arch/x86/events/intel/core.c | 4 ++--
arch/x86/include/asm/msr.h | 5 -----
arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
arch/x86/kernel/process_64.c | 2 +-
arch/x86/kvm/pmu.c | 6 +++---
arch/x86/kvm/vmx/tdx.c | 6 +++---
drivers/hv/mshv_vtl_main.c | 4 ++--
drivers/idle/intel_idle.c | 6 +++---
9 files changed, 18 insertions(+), 23 deletions(-)
--
2.54.0
^ permalink raw reply
* Re: [PATCH v6 3/4] firmware: smccc: arm-cca-guest: Bind the TSM provider to an SMCCC device
From: Aneesh Kumar K.V @ 2026-06-08 8:19 UTC (permalink / raw)
To: Sudeep Holla
Cc: linux-coco, linux-arm-kernel, linux-kernel, Catalin Marinas,
Sudeep Holla, Greg KH, Jeremy Linton, Jonathan Cameron,
Lorenzo Pieralisi, Mark Rutland, Will Deacon, Steven Price,
Suzuki K Poulose
In-Reply-To: <20260604-juicy-daft-starling-3eec1f@sudeepholla>
Sudeep Holla <sudeep.holla@kernel.org> writes:
> On Thu, Jun 04, 2026 at 06:56:28PM +0530, Aneesh Kumar K.V wrote:
>> Sudeep Holla <sudeep.holla@kernel.org> writes:
>>
>> ...
>>
>> > +static const struct smccc_device_info smccc_devices[] __initconst = {
>> > + {
>> > + .func_id = ARM_SMCCC_TRNG_VERSION,
>> > + .requires_smc = false,
>> > + .min_return = ARM_SMCCC_TRNG_MIN_VERSION,
>> > + .device_name = "arm-smccc-trng",
>> > + },
>> > +};
>> > +
>> > +static bool __init
>> > +smccc_probe_smccc_device(const struct smccc_device_info *smccc_dev)
>> > +{
>> > + struct arm_smccc_res res;
>> > + unsigned long ret;
>> > +
>> > + if (!IS_ENABLED(CONFIG_ARM64))
>> > + return false;
>> > +
>> > + if (smccc_conduit == SMCCC_CONDUIT_NONE)
>> > + return false;
>> > +
>> > + if (smccc_dev->requires_smc && smccc_conduit != SMCCC_CONDUIT_SMC)
>> > + return false;
>> > +
>> > + arm_smccc_1_1_invoke(smccc_dev->func_id, &res);
>> > + ret = res.a0;
>> > +
>> > + if ((s32)ret < 0)
>> > + return false;
>> > +
>> > + return ret >= smccc_dev->min_return;
>> > +}
>> > +
>> >
>>
>> I am not sure we want the check to be as simple as ret < 0. Some
>> function IDs may return input errors based on the supplied arguments
>> (for example, RMI_ERROR_INPUT). In those cases, we would likely want
>> this to be handled via a callback.
>>
>
> As I mentioned in response to Suzuki, we can defer that to probe of
> that device. If *_VERSION, succeeds SMCCC core can add that device and
> leave the rest to the core keeping the core and bus layer simple IMO.
>
>> We also want to use conditional compilation for some function IDs.
>> Given the callback approach and the #ifdefs, I wonder whether what we
>> currently have is actually simpler and more flexible.
>>
>
> I was trying to avoid conditional compilation altogether and hence the
> reason for keeping it as simple as possible. Also IS_ENABLED(CONFIG_ARM64)
> in above snippet must come as some condition to this generic probe.
>
> Adding any more logic or callback defeats the bus idea here if we need
> to rely/depend on multiple conditional compilation or callbacks IMO.
>
> Let's see if it can work with what we are adding now and may add in
> the near future, and then decide.
>
If we move all the conditional checks to the driver probe path, then I
think this can work. Something like the below:
struct smccc_device_info {
	u32 func_id;
	bool requires_smc;
	const char *device_name;
};

static const struct smccc_device_info smccc_devices[] __initconst = {
	{
		.func_id = ARM_SMCCC_TRNG_VERSION,
		.requires_smc = false,
		.device_name = "arm-smccc-trng",
	},
	{
		.func_id = RSI_ABI_VERSION,
		.requires_smc = true,
		.device_name = RSI_DEV_NAME,
	},
};

static bool __init smccc_probe_smccc_device(const struct smccc_device_info *smccc_dev)
{
	unsigned long ret;
	struct arm_smccc_res res;

	if (smccc_conduit == SMCCC_CONDUIT_NONE)
		return false;

	if (smccc_dev->requires_smc && smccc_conduit != SMCCC_CONDUIT_SMC)
		return false;

	arm_smccc_1_1_invoke(smccc_dev->func_id, &res);
	ret = res.a0;

	if ((s32)ret == SMCCC_RET_NOT_SUPPORTED)
		return false;

	return true;
}

static int __init smccc_devices_init(void)
{
	struct arm_smccc_device *sdev;
	const struct smccc_device_info *smccc_dev;

	for (int i = 0; i < ARRAY_SIZE(smccc_devices); i++) {
		smccc_dev = &smccc_devices[i];

		if (!smccc_probe_smccc_device(smccc_dev))
			continue;

		sdev = arm_smccc_device_register(smccc_dev->device_name);
		if (IS_ERR(sdev))
			pr_err("%s: could not register device: %ld\n",
			       smccc_dev->device_name, PTR_ERR(sdev));
	}

	return 0;
}
device_initcall(smccc_devices_init);
with the diff to hw_random/smccc_trng
modified arch/arm64/include/asm/archrandom.h
@@ -12,7 +12,7 @@
extern bool smccc_trng_available;
-static inline bool __init smccc_probe_trng(void)
+static inline bool smccc_probe_trng(void)
{
struct arm_smccc_res res;
modified drivers/char/hw_random/arm_smccc_trng.c
@@ -19,6 +19,8 @@
#include <linux/arm-smccc.h>
#include <linux/arm-smccc-bus.h>
+#include <asm/archrandom.h>
+
#ifdef CONFIG_ARM64
#define ARM_SMCCC_TRNG_RND ARM_SMCCC_TRNG_RND64
#define MAX_BITS_PER_CALL (3 * 64UL)
@@ -98,6 +100,10 @@ static int smccc_trng_probe(struct arm_smccc_device *sdev)
{
struct hwrng *trng;
+ /* validate the minimum version requirement */
+ if (!smccc_probe_trng())
+ return -ENODEV;
+
trng = devm_kzalloc(&sdev->dev, sizeof(*trng), GFP_KERNEL);
if (!trng)
return -ENOMEM;
We can also move arch/arm64/include/asm/rsi_smc.h to
include/linux/arm-rsi-smccc.h. There was a suggestion to move these
firmware interfaces out of architecture-specific code:
https://lore.kernel.org/all/agsNO9cc7H-b0H8L@willie-the-truck
This will also avoid the #ifdef CONFIG_ARM64
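
To make the difference between the two probe checks discussed in this thread concrete, here is a minimal userspace sketch. It only models the decision made on the first return register; the demo_* names and the min_return value used in the usage note are illustrative assumptions, not part of the proposed patch. The only fact taken from SMCCC itself is that NOT_SUPPORTED is -1.

```c
#include <stdbool.h>
#include <stdint.h>

/* SMCCC reports an unimplemented function as -1 in the first return register. */
#define DEMO_SMCCC_RET_NOT_SUPPORTED	(-1)

/*
 * Earlier variant: any negative return value rejects the device, and a
 * minimum version (hypothetical min_return) is enforced in the core.
 */
static bool demo_probe_strict(unsigned long a0, unsigned long min_return)
{
	if ((int32_t)a0 < 0)
		return false;

	return a0 >= min_return;
}

/*
 * Revised variant: only NOT_SUPPORTED rejects the device. Other error
 * codes and version checks are left to the driver's probe callback.
 */
static bool demo_probe_lenient(unsigned long a0)
{
	return (int32_t)a0 != DEMO_SMCCC_RET_NOT_SUPPORTED;
}
```

With a return value such as -3 (standing in for an input error from a call that expects arguments), demo_probe_strict() rejects the device while demo_probe_lenient() still registers it and defers the decision to the driver.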
-aneesh
^ permalink raw reply
* Re: [PATCH v13 13/22] KVM: selftests: Set first memory region as shared if guest_memfd
From: Binbin Wu @ 2026-06-08 8:03 UTC (permalink / raw)
To: Lisa Wang
Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
Sean Christopherson, Shuah Khan, Oliver Upton,
Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-13-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
> Set the initial state of the first memory region as shared if it is
> backed by guest_memfd, so that the KVM selftest framework functions can
> populate mmap()-ed guest_memfd memory the same way memory from other
> memory providers are populated.
>
> For CoCo VMs, pages that need to be private are explicitly set to
> private before executing the VM.
>
> Signed-off-by: Lisa Wang <wyihan@google.com>
> ---
> tools/testing/selftests/kvm/lib/kvm_util.c | 16 ++++++++++------
> 1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 9a29540fff40..1bab7d76a59c 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -484,8 +484,10 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
> u64 nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
> nr_extra_pages);
> struct userspace_mem_region *slot0;
> + u64 gmem_flags = 0;
> struct kvm_vm *vm;
> - int i, flags;
> + int flags = 0;
> + int i;
>
> kvm_set_files_rlimit(nr_runnable_vcpus);
>
> @@ -495,14 +497,16 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
> vm = ____vm_create(shape);
>
> /*
> - * Force GUEST_MEMFD for the primary memory region if necessary, e.g.
> - * for CoCo VMs that require GUEST_MEMFD backed private memory.
> + * Force GUEST_MEMFD for the primary memory region if necessary, and
> + * initialize it as shared so the selftest framework can populate it
> + * exactly like other memory providers.
> */
> - flags = 0;
> - if (is_guest_memfd_required(shape))
> + if (is_guest_memfd_required(shape)) {
> flags |= KVM_MEM_GUEST_MEMFD;
> + gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> + }
>
> - vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
> + vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
The build failed due to this:
lib/kvm_util.c: In function ‘__vm_create’:
lib/kvm_util.c:507:9: error: too many arguments to function ‘vm_mem_add’
507 | vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
| ^~~~~~~~~~
In file included from lib/kvm_util.c:9:
include/kvm_util.h:714:6: note: declared here
714 | void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
| ^~~~~~~~~~
lib/kvm_util.c: At top level:
cc1: note: unrecognized command-line option ‘-Wno-gnu-variable-sized-type-not-at-end’ may have been intended to silence earlier diagnostics
It seems the patch set doesn't wire a gmem_flags parameter into vm_mem_add().
> for (i = 0; i < NR_MEM_REGIONS; i++)
> vm->memslots[i] = 0;
>
>
^ permalink raw reply
* Re: [PATCH v13 12/22] KVM: selftests: Back the first memory region with guest_memfd for TDX
From: Binbin Wu @ 2026-06-08 7:31 UTC (permalink / raw)
To: Lisa Wang
Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
Sean Christopherson, Shuah Khan, Oliver Upton,
Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-12-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
> Force GUEST_MEMFD for the primary memory region of TDX VMs.
>
> TDX must use guest_memfd for private pages as there is no alternative
> mechanism supported by the TDX architecture.
This sounds a bit confusing to me.
It is the kernel/KVM that only supports guest_memfd for private pages.
The restriction does not come from the "TDX architecture".
The short log is also a bit confusing.
Why describe the "first memory region" here?
>
> Signed-off-by: Lisa Wang <wyihan@google.com>
> ---
> tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index d1befa3f4b30..9a29540fff40 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -472,7 +472,7 @@ void kvm_set_files_rlimit(u32 nr_vcpus)
> static bool is_guest_memfd_required(struct vm_shape shape)
> {
> #ifdef __x86_64__
> - return shape.type == KVM_X86_SNP_VM;
> + return (shape.type == KVM_X86_SNP_VM || shape.type == KVM_X86_TDX_VM);
> #else
> return false;
> #endif
>
^ permalink raw reply
* Re: [PATCH v13 11/22] KVM: selftests: Set up TDX boot parameters region
From: Binbin Wu @ 2026-06-08 7:23 UTC (permalink / raw)
To: Lisa Wang
Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
Sean Christopherson, Shuah Khan, Oliver Upton,
Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-11-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
[...]
> +void tdx_vm_load_common_boot_parameters(struct kvm_vm *vm)
> +{
> + struct td_boot_parameters *params =
> + addr_gpa2hva(vm, TD_BOOT_PARAMETERS_GPA);
> + u32 cr4;
> +
> + TEST_ASSERT_EQ(vm->mode, VM_MODE_PXXVYY_4K);
> +
> + cr4 = kvm_get_default_cr4();
> + if (vm->mmu.pgtable_levels == 5)
> + cr4 |= X86_CR4_LA57;
This should be removed after the 5-level paging code is added back in
kvm_get_default_cr4().
^ permalink raw reply
* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Xu Yilun @ 2026-06-08 6:54 UTC (permalink / raw)
To: Kishen Maloor
Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
zhenzhong.duan, xiaoyao.li
In-Reply-To: <5c06bdbc-c847-4dfa-9c89-afbb52946b8f@intel.com>
On Sat, Jun 06, 2026 at 09:36:41PM -0700, Kishen Maloor wrote:
> On 5/21/26 8:41 PM, Xu Yilun wrote:
> > ...
> > This series has 2 distinct parts:
> >
> > Patches 1-4: TDX Module Extensions enabling
> > Patches 5-15: DICE-based TDX Quoting, primarily Peter's work.
> >
> Perhaps the extensions enabling patches could be organized more simply as
> these three?
>
> 1. Add TDX extensions metadata structure and accessor
> 2. Add TDH.EXT.MEM.ADD
> 3. Add TDH.EXT.INIT and wire extensions init into init_tdx_module()
>
> This introduces the SEAMCALLs and lets the wiring land with the patch
> that completes the init flow, avoiding a separate "enable" patch.
Yes, several comments point to the same concern about patch organization:
there is no need for a separate "enable" patch. A sounder justification, to
me, is that the Extension will not actually be enabled until an add-on
feature is explicitly configured (see patch #15). So we could add steps in
their natural order without worrying that the incomplete flow breaks the kernel.
My reordering is:
1. Add a placeholder for Extension initialization to hook into
init_tdx_module(). This gives a chance to explain the considerations behind
the enable-at-boot-up policy.
2. Detect whether Extension is required based on the metadata; if not, skip.
So there is no side effect for the following steps.
3. Add TDH.EXT.MEM.ADD
4. Add TDH.EXT.INIT
>
^ permalink raw reply
* Re: [PATCH v13 09/22] KVM: selftests: Expose functions to get default sregs values
From: Binbin Wu @ 2026-06-08 6:39 UTC (permalink / raw)
To: Lisa Wang
Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
Sean Christopherson, Shuah Khan, Oliver Upton,
Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-9-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
[...]
> +
> +static inline u64 kvm_get_default_cr4(void)
> +{
> + u64 cr4 = X86_CR4_PAE | X86_CR4_OSFXSR;
> +
> + if (kvm_cpu_has(X86_FEATURE_XSAVE))
> + cr4 |= X86_CR4_OSXSAVE;
> + return cr4;
> +}
> +
[...]
> @@ -647,16 +643,12 @@ static void vcpu_init_sregs(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
> vcpu_sregs_get(vcpu, &sregs);
>
> sregs.idt.base = vm->arch.idt;
> - sregs.idt.limit = NUM_INTERRUPTS * sizeof(struct idt_entry) - 1;
> + sregs.idt.limit = kvm_get_default_idt_limit();
> sregs.gdt.base = vm->arch.gdt;
> - sregs.gdt.limit = getpagesize() - 1;
> -
> - sregs.cr0 = X86_CR0_PE | X86_CR0_NE | X86_CR0_PG;
> - sregs.cr4 |= X86_CR4_PAE | X86_CR4_OSFXSR;
> - if (kvm_cpu_has(X86_FEATURE_XSAVE))
> - sregs.cr4 |= X86_CR4_OSXSAVE;
> - if (vm->mmu.pgtable_levels == 5)
> - sregs.cr4 |= X86_CR4_LA57;
I guess the 5-level paging handling was dropped unintentionally during the rebase?
> + sregs.gdt.limit = kvm_get_default_gdt_limit();
>
> + sregs.cr0 = kvm_get_default_cr0();
> + sregs.cr4 |= kvm_get_default_cr4();
> sregs.efer |= (EFER_LME | EFER_LMA | EFER_NX);
>
> kvm_seg_set_unusable(&sregs.ldt);
>
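
For reference, the CR4 computation that the removed lines performed can be modelled in a small userspace sketch. The bit positions are the architectural ones (PAE bit 5, OSFXSR bit 9, LA57 bit 12, OSXSAVE bit 18); the demo_* names and the choice to pass the XSAVE capability and page-table depth as parameters are assumptions for illustration, since the quoted kvm_get_default_cr4() takes no arguments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR4 bit positions, redefined locally for the sketch. */
#define DEMO_CR4_PAE		(1ULL << 5)
#define DEMO_CR4_OSFXSR		(1ULL << 9)
#define DEMO_CR4_LA57		(1ULL << 12)
#define DEMO_CR4_OSXSAVE	(1ULL << 18)

/*
 * Hypothetical helper: same defaults as the quoted kvm_get_default_cr4(),
 * plus the LA57 bit that the pre-patch code set for 5-level paging.
 */
static uint64_t demo_default_cr4(bool has_xsave, int pgtable_levels)
{
	uint64_t cr4 = DEMO_CR4_PAE | DEMO_CR4_OSFXSR;

	if (has_xsave)
		cr4 |= DEMO_CR4_OSXSAVE;
	if (pgtable_levels == 5)
		cr4 |= DEMO_CR4_LA57;

	return cr4;
}
```

Without the pgtable_levels check, a VM configured for 5-level paging would be started with CR4.LA57 clear, which is the regression the comment above points at.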
^ permalink raw reply
* Re: [PATCH v13 06/22] tools: include: Add kbuild.h for assembly structure offsets
From: Binbin Wu @ 2026-06-08 6:12 UTC (permalink / raw)
To: Lisa Wang
Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
Sean Christopherson, Shuah Khan, Oliver Upton,
Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-6-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
> From: Sagi Shahar <sagis@google.com>
>
> Add the Kbuild macros needed to enable the filechk_offsets mechanism to
> generate C header files containing structure member offset information.
>
> Tools depending on assembly code that operate on structures have to
> hardcode the offsets of structure members. The Kbuild infrastructure
> can instead generate C header files with these offsets automatically,
> allowing them to be included in assembly code as symbolic constants.
>
> For example, the TDX guest boot code requires access to parameters
> passed in the C structure(struct td_boot_parameters). This header
^
Nit: missing a space.
> provides the macros needed to extract these offsets from C code and
> expose them to assembly, ensuring the two remain synchronized.
>
> Signed-off-by: Sagi Shahar <sagis@google.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Lisa Wang <wyihan@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> tools/include/linux/kbuild.h | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/tools/include/linux/kbuild.h b/tools/include/linux/kbuild.h
> new file mode 100644
> index 000000000000..957fd55cd159
> --- /dev/null
> +++ b/tools/include/linux/kbuild.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __TOOLS_LINUX_KBUILD_H
> +#define __TOOLS_LINUX_KBUILD_H
> +
> +#define DEFINE(sym, val) \
> + asm volatile("\n.ascii \"->" #sym " %0 " #val "\"" : : "i" (val))
> +
> +#define OFFSET(sym, str, mem) \
> + DEFINE(sym, __builtin_offsetof(struct str, mem))
> +
> +#endif /* __TOOLS_LINUX_KBUILD_H */
>
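
As a rough illustration of what OFFSET() ultimately exposes, the sketch below computes member offsets for a made-up structure using the standard offsetof(). The structure and the DEMO_* names are hypothetical; the real struct td_boot_parameters layout is not shown in this message, and the real macro emits the value through an inline-asm marker rather than an enum.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a boot-parameter structure shared with assembly. */
struct demo_boot_parameters {
	uint32_t cr0;
	uint32_t cr3;
	uint64_t entry_point;
};

/*
 * These are the kind of constants a generated offsets header would carry,
 * so assembly can say DEMO_BOOT_ENTRY_POINT instead of a hardcoded 8.
 */
enum {
	DEMO_BOOT_CR0		= offsetof(struct demo_boot_parameters, cr0),
	DEMO_BOOT_CR3		= offsetof(struct demo_boot_parameters, cr3),
	DEMO_BOOT_ENTRY_POINT	= offsetof(struct demo_boot_parameters, entry_point),
};
```

If a member is added or reordered in the C structure, the constants follow automatically, which is the synchronization property the commit message describes.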
^ permalink raw reply
* Re: [PATCH v13 03/22] KVM: selftests: Initialize the TDX VM
From: Binbin Wu @ 2026-06-08 5:57 UTC (permalink / raw)
To: Lisa Wang
Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
Sean Christopherson, Shuah Khan, Oliver Upton,
Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-3-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
[...]
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> index f647e6ca6b34..48d4bd36c35b 100644
> --- a/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> @@ -11,4 +11,34 @@ static inline bool is_tdx_vm(struct kvm_vm *vm)
> return vm->type == KVM_X86_TDX_VM;
> }
>
> +/*
> + * TDX ioctls
> + * Use underscores to avoid collisions with struct member names.
> + */
> +#define __tdx_vm_ioctl(vm, cmd, _flags, arg) \
> +({ \
> + int r; \
> + \
> + union { \
> + struct kvm_tdx_cmd c; \
> + unsigned long raw; \
> + } tdx_cmd = { .c = { \
> + .id = (cmd), \
> + .flags = (u32)(_flags), \
> + .data = (u64)(arg), \
Nit:
The two lines' backslashes are misaligned.
> + } }; \
> + \
> + r = __vm_ioctl(vm, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd.raw); \
> + r ?: tdx_cmd.c.hw_error; \
> +})
> +
> +#define tdx_vm_ioctl(vm, cmd, flags, arg) \
> +({ \
> + int ret = __tdx_vm_ioctl(vm, cmd, flags, arg); \
tdx_cmd.c.hw_error is a u64, but here it can end up assigned to ret, which is an int,
so the value is truncated if any of the upper 32 bits are set.
> + \
> + __TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd, ret, vm); \
> +})
> +
> +void tdx_init_vm(struct kvm_vm *vm, u64 attributes);
> +
> #endif /* SELFTESTS_TDX_TDX_UTIL_H */
[...]
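
The truncation concern raised above can be reproduced in a few lines of userspace C. This is a sketch with made-up demo_* names; the error value in the usage note is chosen only because its low 32 bits are zero, and the narrowing conversion behaves as shown on GCC and Clang.

```c
#include <stdint.h>

/* Models `r ?: hw_error`: the result has the wider (64-bit) type. */
static uint64_t demo_status_u64(int r, uint64_t hw_error)
{
	return r ? (uint64_t)r : hw_error;
}

/* Models storing that result in `int ret`, as the wrapper macro does. */
static int demo_status_int(int r, uint64_t hw_error)
{
	return (int)demo_status_u64(r, hw_error);
}
```

With r == 0 and hw_error == 0xC000000000000000, demo_status_u64() is non-zero but demo_status_int() is 0, so a `!ret` assertion would pass even though an error was reported.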
^ permalink raw reply
* Re: [PATCH v6 00/11] Dynamic PAMT
From: Tony Lindgren @ 2026-06-08 5:45 UTC (permalink / raw)
To: Rick Edgecombe
Cc: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
vannapurve, x86, chao.gao, yan.y.zhao, kai.huang
In-Reply-To: <20260526023515.288829-1-rick.p.edgecombe@intel.com>
On Mon, May 25, 2026 at 07:35:04PM -0700, Rick Edgecombe wrote:
> For a simple small server with mostly physical contiguous RAM and no CXL
> complications, the basic implementation should be close to optimal anyway.
> And for big servers, an 8GB allocation is going to have less impact. In
> the end Dynamic PAMT *is* an optimization that we will force on as a
> good default option. Even with all the optimizations we could throw at it,
> if the system is 100% TDs, Dynamic PAMT could come out slightly behind. So
> judgment on good defaults is needed regardless.
From a usage point of view it's not just a memory optimization though.
These patches make it easier to see what gets allocated for TDX IMO.
This is based on rebasing other patches on the dynamic PAMT series a few
times over the past year. So for the series:
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v6 03/11] x86/virt/tdx: Add tdx_alloc/free_control_page() helpers
From: Yan Zhao @ 2026-06-08 2:18 UTC (permalink / raw)
To: Binbin Wu
Cc: Rick Edgecombe, bp, dave.hansen, hpa, kas, kvm, linux-coco,
linux-doc, linux-kernel, mingo, nik.borisov, pbonzini, seanjc,
tglx, vannapurve, x86, chao.gao, kai.huang, Kirill A. Shutemov
In-Reply-To: <50566572-6379-4100-8845-404f695e59cd@linux.intel.com>
On Mon, Jun 08, 2026 at 10:11:58AM +0800, Binbin Wu wrote:
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index 82dc27aecf297..74e75db5728c7 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -37,6 +37,7 @@
> >
> > #include <uapi/asm/mce.h>
> > #include <asm/tdx_global_metadata.h>
> > +#include <linux/mm.h>
>
> I think the header is not needed here.
Right. This version no longer invokes page_address() in tdx.h for
tdx_alloc_control_page().
There is also no need to include mm.h in tdx.c (which already invoked
page_address() before this patch), since tdx.c includes memblock.h, which in
turn includes mm.h.
^ permalink raw reply
* Re: [PATCH v6 03/11] x86/virt/tdx: Add tdx_alloc/free_control_page() helpers
From: Binbin Wu @ 2026-06-08 2:11 UTC (permalink / raw)
To: Rick Edgecombe
Cc: bp, dave.hansen, hpa, kas, kvm, linux-coco, linux-doc,
linux-kernel, mingo, nik.borisov, pbonzini, seanjc, tglx,
vannapurve, x86, chao.gao, yan.y.zhao, kai.huang,
Kirill A. Shutemov
In-Reply-To: <20260526023515.288829-4-rick.p.edgecombe@intel.com>
On 5/26/2026 10:35 AM, Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Add helpers to use when allocating or preparing pages that are handed to
> the TDX-Module for use as control/S-EPT pages, and thus need Dynamic PAMT
> adjustments.
>
> The TDX module tracks some state for each page of physical memory that it
> might use. It calls this state the PAMT. It includes separate state for
> each page size a physical page could be utilized at within the TDX module
> (1GB, 2MB, 4KB). In Dynamic PAMT, only the 4KB page size state is
> allocated dynamically. So for pages that TDX will use as 2MB physically
> contiguous pages, Dynamic PAMT backing is not needed.
>
> KVM will need to hand pages to the TDX module that it will use at 4KB
> granularity. So these pages will need Dynamic PAMT backing added before
> they are used by the TDX module, and removed afterwards.
>
> Add tdx_alloc_control_page() and tdx_free_control_page() to handle both
> page allocation and Dynamic PAMT installation. Make them behave like
> normal alloc/free functions where allocation can fail in the case of no
> memory, but free (with any necessary Dynamic PAMT release) always
> succeeds. Do this so they can support the existing TDX flows that require
> teardowns to succeed.
>
> Also create tdx_pamt_get/put() to handle installing Dynamic PAMT 4KB
> backing for pages that are already allocated (such as KVM's use of S-EPT
> page tables or guest private memory). Have them take a pfn instead of a
> struct page, as future changes will want to use these helpers for guest
> pages which are tracked by PFN.
>
> Don't CLFLUSH the Dynamic PAMT pages handed to the TDX module, as is done
> for some other SEAMCALLs, as the TDX docs specify that this is only
> needed on "TD private memory or TD control structure page".
>
> Since these allocations will be easily user triggerable, account the
> memory.
>
> Leave logic to handle concurrency issues for future changes.
>
> Assisted-by: GitHub Copilot:claude-opus-4-6 Claude:claude-opus-4-7 Sashiko:claude-opus-4-6
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
One comment below.
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 82dc27aecf297..74e75db5728c7 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -37,6 +37,7 @@
>
> #include <uapi/asm/mce.h>
> #include <asm/tdx_global_metadata.h>
> +#include <linux/mm.h>
I think the header is not needed here.
> #include <linux/pgtable.h>
>
> /*
> @@ -160,6 +161,12 @@ void tdx_guest_keyid_free(unsigned int keyid);
>
> void tdx_quirk_reset_paddr(unsigned long base, unsigned long size);
>
> +/* Number PAMT pages to be provided to TDX module per 2MB region of PA */
> +#define TDX_DPAMT_ENTRY_PAGE_CNT 2
> +
> +struct page *tdx_alloc_control_page(void);
> +void tdx_free_control_page(struct page *page);
> +
> struct tdx_td {
> /* TD root structure: */
> struct page *tdr_page;
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 9ebd192cb5c17..9e0812d87ab06 100644
[...]
^ permalink raw reply
* Re: [RFC PATCH 15/15] x86/virt/tdx: Enable TDX Quoting extension
From: Kishen Maloor @ 2026-06-07 4:41 UTC (permalink / raw)
To: Xu Yilun, kas, djbw, rick.p.edgecombe, x86, peter.fang
Cc: linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
zhenzhong.duan, xiaoyao.li
In-Reply-To: <20260522034128.3144354-16-yilun.xu@linux.intel.com>
On 5/21/26 8:41 PM, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
>
> Enable the TDX Quoting feature via TDH.SYS.CONFIG when supported by the
> TDX module.
>
> The TDX Quoting extension generates TDX attestation Quotes via a
> SEAMCALL, without using a discrete Quoting engine.
>
> TDX Module supports add-on TDX features (e.g. TDX Quoting & TDX Module
> Extensions) that should be manually enabled by host. It extends
> TDH.SYS.CONFIG for host to choose to enable them on bootup.
>
> Call TDH.SYS.CONFIG with a new bitmap input parameter to specify which
> features to enable. The bitmap uses the same definitions as
> TDX_FEATURES0. But note not all bits in TDX_FEATURES0 are valid for
> configuration, e.g. TDX Module Extensions is a service that supports TDX
> Quoting, it is implicitly enabled when TDX Quoting is enabled. Setting
> TDX_FEATURES0_EXT in the bitmap has no effect.
>
> TDX Module advances the version of TDH.SYS.CONFIG for the change, so
> use the latest version (v1) for add-on feature enabling. But supporting
> existing Modules which only support v0 is still necessary until they are
> deprecated. In fact, it is unlikely that TDH.SYS.CONFIG ever needs to
> change again and the code would stay in v1. So there is little value
> in worrying about deprecating v0 to save a couple lines of code in 5-7
> years when these original TDX platforms sunset.
>
> TDX Module updates global metadata when add-on features are enabled.
> Host should update the cached tdx_sysinfo to reflect these changes.
>
> Co-developed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Peter Fang <peter.fang@intel.com>
> ---
> arch/x86/virt/vmx/tdx/tdx.h | 4 +++-
> arch/x86/virt/vmx/tdx/tdx.c | 24 ++++++++++++++++++++++--
> 2 files changed, 25 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 10aff23cd01f..524a14c01aa6 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -58,7 +58,8 @@
> #define TDH_PHYMEM_CACHE_WB 40
> #define TDH_PHYMEM_PAGE_WBINVD 41
> #define TDH_VP_WR 43
> -#define TDH_SYS_CONFIG 45
> +#define TDH_SYS_CONFIG_V0 45
Is it necessary to add _Vx macros when multiple versions can co-exist?
Just wondering if it would be cleaner in the following way?
- Leave the macros set at the current (non-deprecated) baseline version.
- Select vX using SEAMCALL_LEAF_VER() in config_tdx_module() when a vX feature
is enabled.
u64 seamcall_fn = TDH_SYS_CONFIG;
...
if (tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_QUOTE) {
...
seamcall_fn = SEAMCALL_LEAF_VER(TDH_SYS_CONFIG, 1);
> +#define TDH_SYS_CONFIG SEAMCALL_LEAF_VER(TDH_SYS_CONFIG_V0, 1)
> #define TDH_EXT_INIT 60
> #define TDH_EXT_MEM_ADD 61
> #define TDH_SYS_DISABLE 69
> @@ -97,6 +98,7 @@ struct tdmr_info {
> /* Bit definitions of TDX_FEATURES0 metadata field */
> #define TDX_FEATURES0_NO_RBP_MOD BIT(18)
> #define TDX_FEATURES0_EXT BIT_ULL(39)
> +#define TDX_FEATURES0_QUOTE BIT_ULL(50)
>
> /*
> * Do not put any hardware-defined TDX structure representations below
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index f7600f930c6e..86e5b7ad19b3 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1049,6 +1049,7 @@ static __init int construct_tdmrs(struct list_head *tmb_list,
> static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
> u64 global_keyid)
> {
> + u64 seamcall_fn = TDH_SYS_CONFIG_V0;
> struct tdx_module_args args = {};
> u64 *tdmr_pa_array;
> size_t array_sz;
> @@ -1074,8 +1075,22 @@ static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
> args.rcx = __pa(tdmr_pa_array);
> args.rdx = tdmr_list->nr_consumed_tdmrs;
> args.r8 = global_keyid;
> - ret = seamcall_prerr(TDH_SYS_CONFIG, &args);
>
> + if (tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_QUOTE) {
> + args.r9 |= TDX_FEATURES0_QUOTE;
> + /* These parameters require version >= 1 */
> + seamcall_fn = TDH_SYS_CONFIG;
> + }
> +
> + ret = seamcall_prerr(seamcall_fn, &args);
> + if (ret)
> + goto free_tdmr;
> +
> + /* enabling TDX Quoting may change tdx_sysinfo, update it */
> + if (tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_QUOTE)
> + ret = get_tdx_sys_info(&tdx_sysinfo);
> +
> +free_tdmr:
> /* Free the array as it is not required anymore. */
> kfree(tdmr_pa_array);
>
> @@ -1384,12 +1399,17 @@ static void tdx_quote_init(void)
> unsigned int nr_quote_pages;
> u64 r;
>
> + if (!(tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_QUOTE))
> + return;
> +
> do {
> r = seamcall(TDH_QUOTE_INIT, &args);
> } while (r == TDX_INTERRUPTED_RESUMABLE);
>
> - if (r)
> + if (r) {
> + pr_err("Failed to enable quoting extension: 0x%llx\n", r);
> return;
> + }
>
> /* Quoting metadata is valid only after initialization */
> if (get_tdx_sys_info_quote(&tdx_sysinfo.quote))
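
The alternative suggested earlier in this reply (keep the leaf macro at its baseline and pick the version at the call site) can be sketched in userspace as follows. The encoding of the version in bits 23:16 of the leaf number is an assumption about how SEAMCALL_LEAF_VER() is defined; all DEMO_* and demo_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: leaf number in the low bits, version starting at bit 16. */
#define DEMO_VERSION_SHIFT	16
#define DEMO_LEAF_VER(leaf, ver) \
	((uint64_t)(leaf) | ((uint64_t)(ver) << DEMO_VERSION_SHIFT))

/* Baseline (v0) leaf number, matching the value in the quoted diff. */
#define DEMO_TDH_SYS_CONFIG	45

/* Select v1 only when a feature that needs the new input bitmap is requested. */
static uint64_t demo_pick_sys_config_leaf(bool want_quote)
{
	uint64_t fn = DEMO_TDH_SYS_CONFIG;

	if (want_quote)
		fn = DEMO_LEAF_VER(DEMO_TDH_SYS_CONFIG, 1);

	return fn;
}
```

This keeps a single TDH_SYS_CONFIG definition and confines the version choice to the one function that knows which features are being enabled.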
^ permalink raw reply
* Re: [PATCH 04/15] x86/virt/tdx: Enable the Extensions right after basic TDX Module init
From: Kishen Maloor @ 2026-06-07 4:38 UTC (permalink / raw)
To: Xu Yilun, kas, djbw, rick.p.edgecombe, x86, peter.fang
Cc: linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
zhenzhong.duan, xiaoyao.li
In-Reply-To: <20260522034128.3144354-5-yilun.xu@linux.intel.com>
On 5/21/26 8:41 PM, Xu Yilun wrote:
> The detailed initialization flow for TDX Module Extensions has been
> fully implemented. Enable the flow after basic TDX Module
> initialization.
>
> Theoretically, the Extensions doesn't need to be enabled right after
> basic TDX initialization. It could be enabled right before the first
> Extension SEAMCALL is issued. That would save or postpone memory usage.
> But it isn't worth the complexity, the needs for the Extensions are vast
> but the savings are little for a typical TDX capable system (about
> 0.001% of memory). So the Linux decision is to just enable it along with
> the basic TDX.
>
> Note that the Extensions initialization flow will still not start if no
> add-on features require Extensions. The enabling of add-on features will
> be in later patches. Until then, the system hasn't consumed extra memory.
>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 16 ++++++++++------
> 1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index ff2b96c20d2b..dad5ec642723 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1180,7 +1180,7 @@ static __init int init_tdmrs(struct tdmr_info_list *tdmr_list)
> return 0;
> }
>
> -static void tdx_clflush_hpa_list(struct page *root, unsigned int nr_pages)
> +static __init void tdx_clflush_hpa_list(struct page *root, unsigned int nr_pages)
> {
> u64 *entries = page_to_virt(root);
> int i;
> @@ -1193,7 +1193,7 @@ static void tdx_clflush_hpa_list(struct page *root, unsigned int nr_pages)
> #define HPA_LIST_INFO_PFN GENMASK_U64(51, 12)
> #define HPA_LIST_INFO_LAST_ENTRY GENMASK_U64(63, 55)
>
> -static u64 to_hpa_list_info(struct page *root, unsigned int nr_pages)
> +static __init u64 to_hpa_list_info(struct page *root, unsigned int nr_pages)
> {
> return FIELD_PREP(HPA_LIST_INFO_FIRST_ENTRY, 0) |
> FIELD_PREP(HPA_LIST_INFO_PFN, page_to_pfn(root)) |
> @@ -1201,7 +1201,7 @@ static u64 to_hpa_list_info(struct page *root, unsigned int nr_pages)
> }
>
> /* Initialize the TDX Module Extensions then Extension-SEAMCALLs can be used */
> -static int tdx_ext_init(void)
> +static __init int tdx_ext_init(void)
> {
> struct tdx_module_args args = {};
> u64 r;
> @@ -1216,7 +1216,7 @@ static int tdx_ext_init(void)
> return 0;
> }
>
> -static int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
> +static __init int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
> {
> struct tdx_module_args args = {
> .rcx = to_hpa_list_info(root, nr_pages),
> @@ -1240,7 +1240,7 @@ static int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
> return 0;
> }
>
> -static int tdx_ext_mem_setup(void)
> +static __init int tdx_ext_mem_setup(void)
> {
> unsigned int nr_pages;
> struct page *page;
> @@ -1301,7 +1301,7 @@ static int tdx_ext_mem_setup(void)
> return ret;
> }
>
> -static int __maybe_unused init_tdx_ext(void)
> +static __init int init_tdx_ext(void)
> {
> int ret;
>
> @@ -1373,6 +1373,10 @@ static __init int init_tdx_module(void)
> if (ret)
> goto err_reset_pamts;
>
> + ret = init_tdx_ext();
> + if (ret)
> + goto err_reset_pamts;
Is it reasonable to fail TDX bringup entirely when initialization of
the Extensions (which are "add-on features") fails?
The handling of tdx_quote_init() in patch 6 suggests a more
best-effort approach.
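To illustrate what best-effort could look like, here is a minimal userspace
sketch. Everything in it is hypothetical (struct bringup, init_ext_stub() and
init_module_best_effort() are stand-ins, not kernel code), and it assumes
that basic TDX remains usable after a failed Extensions init, which would
need confirming against the module's behavior.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of bringup state; not a kernel structure. */
struct bringup {
	int ext_ret;		/* result the stubbed Extensions init returns */
	bool ext_enabled;	/* whether add-on features may be used later */
};

/* Stand-in for init_tdx_ext(); simply reports the configured result. */
static int init_ext_stub(const struct bringup *b)
{
	return b->ext_ret;
}

/*
 * Best-effort policy: an Extensions failure is logged and recorded,
 * but it does not fail basic bringup.
 */
static int init_module_best_effort(struct bringup *b)
{
	int ret = init_ext_stub(b);

	if (ret)
		fprintf(stderr, "Extensions init failed (%d), continuing\n", ret);

	b->ext_enabled = (ret == 0);
	return 0;
}
```

Add-on features would then check the recorded flag instead of relying on
bringup having failed.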
> +
> pr_info("%lu KB allocated for PAMT\n", tdmrs_count_pamt_kb(&tdx_tdmr_list));
>
> out_put_tdxmem:
* Re: [PATCH 02/15] x86/virt/tdx: Add extra memory to TDX Module for Extensions
From: Kishen Maloor @ 2026-06-07 4:38 UTC (permalink / raw)
To: Xu Yilun, kas, djbw, rick.p.edgecombe, x86, peter.fang
Cc: linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
zhenzhong.duan, xiaoyao.li
In-Reply-To: <20260522034128.3144354-3-yilun.xu@linux.intel.com>
On 5/21/26 8:41 PM, Xu Yilun wrote:
> TDX Module introduces a new concept called "TDX Module Extensions" to
> support long running / hard-irq preemptible flows inside. This makes TDX
> Module capable of handling complex tasks through "Extension SEAMCALLs".
> Adding more memory to TDX Module is the first step to enable Extensions.
>
> Currently, TDX Module memory use is relatively static. But, the
> Extensions need to use memory more dynamically. While 'static' here
> means the kernel provides necessary amount of memory to TDX Module for
> its basic functionalities, 'dynamic' means extra memory is needed only
> if new add-on features are to be enabled. So add a new memory feeding
> process backed by a new SEAMCALL TDH.EXT.MEM.ADD.
>
> The process is mostly the same as adding PAMT. The kernel queries TDX
> Module how much memory needed, allocates it, hands it over, and never
> gets it back.
>
> TDH.EXT.MEM.ADD uses a new parameter type HPA_LIST_INFO to provide
> control (private) pages to TDX Module. This type represents a list of
> pages for TDX Module to access. It needs a 'root page' which contains
> the list of HPAs of the pages. It collapses the HPA of the root page
> and the number of valid HPAs into a 64 bit raw value for SEAMCALL
> parameters. The root page is always a medium, TDX Module never keeps
> the root page.
>
> Introduce a tdx_clflush_hpa_list() helper to flush shared cache before
> SEAMCALL, to avoid shared cache writeback damaging these private pages.
>
> For now, TDX Module Extensions consumes relatively large amount of
> memory (~50MB). Use contiguous page allocation to avoid permanently
> fragment too much memory. Print the allocation amount on TDX Module
> Extensions initialization for visibility.
>
> Co-developed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> ---
> arch/x86/virt/vmx/tdx/tdx.h | 1 +
> arch/x86/virt/vmx/tdx/tdx.c | 118 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 119 insertions(+)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index a5eec8e3cc71..2335f88bbb10 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -46,6 +46,7 @@
> #define TDH_PHYMEM_PAGE_WBINVD 41
> #define TDH_VP_WR 43
> #define TDH_SYS_CONFIG 45
> +#define TDH_EXT_MEM_ADD 61
> #define TDH_SYS_DISABLE 69
>
> /*
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index c0c6281b08a5..622399d8da68 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -31,6 +31,7 @@
> #include <linux/syscore_ops.h>
> #include <linux/idr.h>
> #include <linux/kvm_types.h>
> +#include <linux/bitfield.h>
> #include <asm/page.h>
> #include <asm/special_insns.h>
> #include <asm/msr-index.h>
> @@ -1179,6 +1180,123 @@ static __init int init_tdmrs(struct tdmr_info_list *tdmr_list)
> return 0;
> }
>
> +static void tdx_clflush_hpa_list(struct page *root, unsigned int nr_pages)
> +{
> + u64 *entries = page_to_virt(root);
> + int i;
> +
> + for (i = 0; i < nr_pages; i++)
> + clflush_cache_range(__va(entries[i]), PAGE_SIZE);
> +}
> +
> +#define HPA_LIST_INFO_FIRST_ENTRY GENMASK_U64(11, 3)
> +#define HPA_LIST_INFO_PFN GENMASK_U64(51, 12)
> +#define HPA_LIST_INFO_LAST_ENTRY GENMASK_U64(63, 55)
> +
> +static u64 to_hpa_list_info(struct page *root, unsigned int nr_pages)
> +{
> + return FIELD_PREP(HPA_LIST_INFO_FIRST_ENTRY, 0) |
> + FIELD_PREP(HPA_LIST_INFO_PFN, page_to_pfn(root)) |
> + FIELD_PREP(HPA_LIST_INFO_LAST_ENTRY, nr_pages - 1);
> +}
> +
> +static int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
> +{
> + struct tdx_module_args args = {
> + .rcx = to_hpa_list_info(root, nr_pages),
> + };
> + u64 r;
> +
> + tdx_clflush_hpa_list(root, nr_pages);
> +
> + do {
> + /*
> + * TDH_EXT_MEM_ADD is designed to use output parameter RCX to
> + * override/update input parameter RCX, so the caller doesn't
> + * have to do manual parameter update on retry call.
> + */
> + r = seamcall_ret(TDH_EXT_MEM_ADD, &args);
> + } while (r == TDX_INTERRUPTED_RESUMABLE);
The retry loop compares the full return value against
TDX_INTERRUPTED_RESUMABLE. Should it mask with TDX_SEAMCALL_STATUS_MASK
first, in case the module sets any of the lower detail bits?
Ditto for TDH.EXT.INIT in patch 3.
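A minimal userspace sketch of the masked comparison follows. The constant
values are illustrative assumptions (the real definitions live in the
kernel's TDX headers), and fake_seamcall() is a hypothetical stand-in for
seamcall_ret() that reports "interrupted" with a low detail bit set.

```c
#include <stdint.h>

/* Illustrative values only; assumed, not copied from the patch. */
#define STATUS_MASK		0xFFFFFFFF00000000ULL
#define INTERRUPTED_RESUMABLE	0x8000000300000000ULL
#define SUCCESS			0ULL

/* Hypothetical module model: interrupts a few times, then succeeds. */
struct fake_module {
	int interruptions_left;
	int calls;
};

static uint64_t fake_seamcall(struct fake_module *m)
{
	m->calls++;
	if (m->interruptions_left > 0) {
		m->interruptions_left--;
		/* Status class in the upper bits, detail in the lower bits. */
		return INTERRUPTED_RESUMABLE | 0x1;
	}
	return SUCCESS;
}

/* Retry while the masked status class says "interrupted, resumable". */
static int add_with_masked_retry(struct fake_module *m)
{
	uint64_t r;

	do {
		r = fake_seamcall(m);
	} while ((r & STATUS_MASK) == INTERRUPTED_RESUMABLE);

	return r == SUCCESS ? 0 : -1;
}
```

With an unmasked comparison the first interrupted call in this model would
fall out of the loop and be reported as a failure.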
> +
> + if (r != TDX_SUCCESS)
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +static int tdx_ext_mem_setup(void)
> +{
> + unsigned int nr_pages;
> + struct page *page;
> + u64 *root;
> + unsigned int i;
> + int ret;
> +
> + nr_pages = tdx_sysinfo.ext.memory_pool_required_pages;
> + /*
> + * memory_pool_required_pages == 0 means no need to add pages,
> + * skip the memory setup.
> + */
> + if (!nr_pages)
> + return 0;
> +
> + root = kzalloc(PAGE_SIZE, GFP_KERNEL);
> + if (!root)
> + return -ENOMEM;
> +
> + page = alloc_contig_pages(nr_pages, GFP_KERNEL, numa_mem_id(),
> + &node_online_map);
The SEAMCALL takes a scatter list (HPA_LIST_INFO), so the module
doesn't require contiguity. If the goal is just to avoid scattering
pages across many 2MB regions, dense, 2MB-aligned allocations should
achieve that without a single pool-wide contiguous block.
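For reference, the HPA_LIST_INFO encoding can be modelled in userspace as
below. The field positions are taken from the GENMASK_U64() definitions in
the quoted patch; pack_hpa_list_info() and the macro names are hypothetical,
written with plain shifts instead of FIELD_PREP(). Only the root page's PFN
and an entry count are encoded, so the listed pages themselves can sit
anywhere.

```c
#include <stdint.h>

/* Field positions as in the quoted patch's GENMASK_U64() definitions. */
#define FIRST_ENTRY_SHIFT	3	/* bits 11:3  */
#define PFN_SHIFT		12	/* bits 51:12 */
#define LAST_ENTRY_SHIFT	55	/* bits 63:55 */
#define ENTRY_MASK		0x1FFULL	/* 9-bit entry index */
#define PFN_MASK		0xFFFFFFFFFFULL	/* 40-bit PFN */

/* Pack the root page PFN and the index of the last valid entry. */
static uint64_t pack_hpa_list_info(uint64_t root_pfn, unsigned int nr_pages)
{
	uint64_t first = 0;			/* always start at entry 0 */
	uint64_t last = nr_pages - 1;		/* index, not a count */

	return ((first & ENTRY_MASK) << FIRST_ENTRY_SHIFT) |
	       ((root_pfn & PFN_MASK) << PFN_SHIFT) |
	       ((last & ENTRY_MASK) << LAST_ENTRY_SHIFT);
}
```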
> + if (!page) {
> + ret = -ENOMEM;
> + goto out_free_root;
> + }
> +
> + for (i = 0; i < nr_pages;) {
> + unsigned int nents = min(nr_pages - i,
> + PAGE_SIZE / sizeof(*root));
> + int j;
> +
> + for (j = 0; j < nents; j++)
> + root[j] = page_to_phys(page + i + j);
Would it be better to allocate per-batch (i.e. one root page's worth
at a time) rather than the whole pool up front?
That way an intermediate TDH.EXT.MEM.ADD failure wouldn't leak
all nr_pages. Also, a batch is up to 512 pages (= 2MB) and its allocation
could be 2MB-aligned, addressing your fragmentation concern.
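A rough userspace model of the per-batch flow, to show the accounting.
All names here are hypothetical: struct sim, sim_alloc_batch() and
sim_mem_add() stand in for a 2MB-aligned page allocation and for
tdx_ext_mem_add(), and the PFN bump allocator is only there to keep the
sketch self-contained.

```c
#include <stdint.h>

#define BATCH_PAGES 512u	/* u64 entries that fit in one 4KB root page */

/* Hypothetical environment: PFN bump allocator plus a fail-on-demand add. */
struct sim {
	uint64_t next_pfn;		/* next free 2MB-aligned PFN */
	unsigned int fail_batch;	/* batch index whose add fails */
	unsigned int batch;		/* batches attempted so far */
	unsigned int allocated;		/* pages taken from the allocator */
	unsigned int added;		/* pages accepted by the module */
};

static uint64_t sim_alloc_batch(struct sim *s, unsigned int nents)
{
	uint64_t base = s->next_pfn;

	s->next_pfn += BATCH_PAGES;	/* keep every batch 2MB-aligned */
	s->allocated += nents;
	return base;
}

static int sim_mem_add(struct sim *s, uint64_t base_pfn, unsigned int nents)
{
	(void)base_pfn;
	if (s->batch++ == s->fail_batch)
		return -1;
	s->added += nents;
	return 0;
}

/* Allocate and hand over one root page's worth of pages at a time. */
static int ext_mem_setup_batched(struct sim *s, unsigned int nr_pages)
{
	unsigned int i = 0;

	while (i < nr_pages) {
		unsigned int left = nr_pages - i;
		unsigned int nents = left < BATCH_PAGES ? left : BATCH_PAGES;
		uint64_t base = sim_alloc_batch(s, nents);

		/* On failure only the current batch is in an unknown state. */
		if (sim_mem_add(s, base, nents))
			return -1;

		i += nents;
	}
	return 0;
}
```

In this model a failure on the second batch of an 1100-page pool strands
512 pages (allocated minus added) rather than the whole pool, because the
later batches are never allocated.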
> +
> + ret = tdx_ext_mem_add(virt_to_page(root), nents);
> + /*
> + * No SEAMCALLs to reclaim the added pages. For simple error
> + * handling, leak all pages.
> + */
> + WARN_ON_ONCE(ret);
> + if (ret)
> + break;
> +
> + i += nents;
> + }
> +
> + /*
> + * Extensions memory can't be reclaimed once added, print out the
> + * amount, stop tracking it and free the root page, no matter success
> + * or failure.
> + */
> + pr_info("%lu KB allocated for TDX Module Extensions\n",
> + nr_pages * PAGE_SIZE / 1024);
> +
> +out_free_root:
> + kfree(root);
> +
> + return ret;
> +}
> +
> +static int __maybe_unused init_tdx_ext(void)
Could this be named init_tdx_extensions() instead to disambiguate
from tdx_ext_init() in patch 3?
> +{
> + if (!(tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_EXT))
> + return 0;
> +
> + /* No feature requires TDX Module Extensions. */
> + if (!tdx_sysinfo.ext.ext_required)
> + return 0;
> +
> + return tdx_ext_mem_setup();
> +}
> +
> static __init int init_tdx_module(void)
> {
> int ret;