* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Kiryl Shutsemau @ 2026-06-05 11:42 UTC (permalink / raw)
To: Chao Gao
Cc: Edgecombe, Rick P, kvm@vger.kernel.org,
linux-coco@lists.linux.dev, Huang, Kai, Hansen, Dave, Zhao, Yan Y,
seanjc@google.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
pbonzini@redhat.com, nik.borisov@suse.com,
linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
x86@kernel.org
In-Reply-To: <aiJhScChLZkH44eB@intel.com>
On Fri, Jun 05, 2026 at 01:40:25PM +0800, Chao Gao wrote:
> On Thu, Jun 04, 2026 at 05:59:02PM +0100, Kiryl Shutsemau wrote:
> >On Tue, May 26, 2026 at 04:42:24PM +0000, Edgecombe, Rick P wrote:
> >> On Tue, 2026-05-26 at 16:57 +0800, Chao Gao wrote:
> >> > > - scoped_guard(spinlock, &pamt_lock) {
> >> >
> >> > This converts the scoped_guard() added by the previous patch to
> >> > explicit lock/unlock and goto. It would reduce code churn if the
> >> > previous patch used that form directly.
> >>
> >> Yea, it's a good point. I actually debated doing it, but decided not to because
> >> the scoped version is cleaner for the non-optimized version. But for
> >> reviewability, never doing the scoped version is probably better.
> >
> >I don't see a reason why we can't keep the scoped_guard() on get side.
>
> One additional reason to drop scoped_guard() is that it mixes cleanup helpers
> with goto, which is discouraged. See [*]
>
> :Lastly, given that the benefit of cleanup helpers is removal of “goto”, and
> :that the “goto” statement can jump between scopes, the expectation is that
> :usage of “goto” and cleanup helpers is never mixed in the same function.
Fair enough.
But it can also be addressed if we free the PAMT page array with the guard
too :P
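
A minimal userspace sketch of that idea, assuming GCC/Clang's cleanup
attribute (which the kernel's cleanup helpers build on). Every demo_* name
is hypothetical and this is not the PAMT code; it only shows that once both
the lock and the allocation are scope-managed, the error path needs no goto:

```c
#include <stdlib.h>

/* Hypothetical stand-in for a spinlock; counts releases for illustration. */
struct demo_lock {
	int held;
	int releases;
};

static void demo_lock_acquire(struct demo_lock *l)
{
	l->held = 1;
}

static void demo_lock_release(struct demo_lock *l)
{
	l->held = 0;
	l->releases++;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void demo_guard_exit(struct demo_lock **lp)
{
	if (*lp)
		demo_lock_release(*lp);
}

static void demo_free(void **p)
{
	free(*p);
}

/* Take the lock now, drop it automatically when the scope ends. */
#define DEMO_GUARD(l) \
	struct demo_lock *guard_ __attribute__((cleanup(demo_guard_exit), unused)) = \
		(demo_lock_acquire(l), (l))

/* Hypothetical "get" path: both resources are released on every return. */
static int demo_get(struct demo_lock *lock, int fail)
{
	void *pages __attribute__((cleanup(demo_free))) = malloc(64);

	if (!pages)
		return -1;

	DEMO_GUARD(lock);

	if (fail)
		return -2;	/* lock is dropped first, then pages is freed */

	return 0;		/* same automatic cleanup on success */
}
```

Cleanup runs in reverse declaration order, so the lock is released before
the array is freed. A real version would have to hand the array over to its
new owner on success instead of freeing it.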
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v4 3/3] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Kiryl Shutsemau @ 2026-06-05 11:57 UTC (permalink / raw)
To: dave.hansen, Binbin Wu
Cc: tglx, mingo, bp, seanjc, pbonzini, sathyanarayanan.kuppuswamy,
kai.huang, xiaoyao.li, rick.p.edgecombe, david.laight.linux, ak,
djbw, tsyrulnikov.borys, x86, kvm, linux-coco, linux-kernel,
stable
In-Reply-To: <22c789c3-13b1-4c39-898f-2eec3bce98c1@linux.intel.com>
On Fri, Jun 05, 2026 at 03:10:39PM +0800, Binbin Wu wrote:
>
>
> On 6/4/2026 10:47 PM, Kiryl Shutsemau (Meta) wrote:
> > According to x86 architecture rules, 32-bit operations zero-extend the
> > result to 64 bits. The current implementation of handle_in() only masks
> > the lower 32 bits, which preserves the upper 32 bits of RAX when a
> > 32-bit port IN instruction is emulated.
> >
> > Use insn_assign_reg() to write the result back into RAX with proper
> > partial-register-write semantics: 1- and 2-byte forms leave the upper
> > bits untouched, the 4-byte form zero-extends to the full register.
> >
> > Fixes: 03149948832a ("x86/tdx: Port I/O: Add runtime hypercalls")
> > Reported-by: Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
> > Link: https://lore.kernel.org/all/CAKw_Dz96rfSQc6Rn+9QBcUFHhmkK+9zu+P=bxowfZwxrATCBRg@mail.gmail.com/
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > Cc: stable@vger.kernel.org
>
> I think the concern sashiko commented in patch 2 is valid.
Yeah. I guess I'll just use the KVM implementation verbatim.
Dave, any objections?
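
For reference, the partial-register-write rule from the commit message can
be sketched as below. demo_assign_reg() is a hypothetical illustration, not
insn_assign_reg() and not the KVM implementation:

```c
#include <stdint.h>

/*
 * Hypothetical model of an x86-64 register write of 'size' bytes:
 * 1- and 2-byte writes preserve the upper bits of the old value,
 * a 4-byte write zero-extends to the full 64 bits.
 */
static uint64_t demo_assign_reg(uint64_t old, uint64_t val, int size)
{
	switch (size) {
	case 1:
		return (old & ~0xffULL) | (val & 0xffULL);
	case 2:
		return (old & ~0xffffULL) | (val & 0xffffULL);
	case 4:
		return val & 0xffffffffULL;	/* upper 32 bits cleared */
	default:
		return val;			/* full 8-byte write */
	}
}
```

With RAX holding 0x1122334455667788, a 4-byte IN returning 0xdeadbeef leaves
0x00000000deadbeef under this model, whereas masking only the low 32 bits
would have kept the stale 0x11223344 in the upper half.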
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Thomas Gleixner @ 2026-06-05 12:33 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <20260529144435.704127-2-seanjc@google.com>
On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
> frequency.
That's misleading because fixed frequency means that the frequency does
not change, i.e. X86_FEATURE_CONSTANT_TSC is set. But
X86_FEATURE_CONSTANT_TSC does not imply that the frequency can be read
from CPUID/MSRs.
> In practice, this is likely one big nop, as re-calibration is
> used only for SMP=n kernels, and only for hardware that is 20+ years old,
> i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.
recalibrate_cpu_khz() is only invoked from Intel P4 and AMD K7 CPU
frequency drivers, which means that's absolutely not interesting and
neither X86_FEATURE_CONSTANT_TSC nor X86_FEATURE_TSC_KNOWN_FREQ can be
set on those systems.
IOW, this patch is pointless voodoo ware.
Thanks,
tglx
* Re: [PATCH v4 02/47] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: Thomas Gleixner @ 2026-06-05 12:37 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <20260529144435.704127-3-seanjc@google.com>
On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
> - crystal_khz = eax_base_mhz * 1000 *
> - eax_denominator / ebx_numerator;
> + info.crystal_khz = eax_base_mhz * 1000 *
> + info.denominator / info.numerator;
Please get rid of this ugly line break. You have 100 characters.
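
A standalone sketch of the same computation on a single line. The struct and
function names are hypothetical; only the arithmetic is taken from the quoted
hunk:

```c
#include <stdint.h>

/* Hypothetical container for the CPUID.0x15 ratio and derived crystal rate. */
struct demo_tsc_info {
	uint32_t numerator;
	uint32_t denominator;
	uint32_t crystal_khz;
};

/*
 * Derive the crystal frequency from the base frequency in MHz and the
 * TSC/crystal ratio. Returns 0 if the numerator is not enumerated.
 */
static uint32_t demo_crystal_khz(uint32_t base_mhz, const struct demo_tsc_info *info)
{
	if (!info->numerator)
		return 0;

	return base_mhz * 1000 * info->denominator / info->numerator;
}
```

For example, a 2400 MHz base frequency with a ratio of 200/2 gives a
24000 kHz crystal under this formula.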
* [POC] KVM: selftests: Verify conversion works with TDX
From: Ackerley Tng @ 2026-06-05 13:41 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <20260522-gmem-inplace-conversion-v7-0-2f0fae496530@google.com>
This POC shows that conversion works with TDX:
1. Find 2 pages in GVA space, map those twice, once as private and once as
shared. This avoids having to manipulate page tables in the guest.
2. Use memory as private memory in the guest.
3. Request to convert memory to shared.
4. Write shared memory in the guest, check in the host.
5. Write shared memory in the host, check in the guest.
6. Request to convert memory to private.
7. Use memory as private memory in the guest.
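
The double mapping in step 1 relies on the shared alias of a GPA being the
private GPA with the top physical-address bit set, which is how the test
derives conversions_shared_gpa. A rough standalone sketch, with hypothetical
helper names and pa_bits passed in explicitly:

```c
#include <stdint.h>

/* Hypothetical helper: shared alias is the private GPA with the top PA bit set. */
static uint64_t demo_shared_alias(uint64_t private_gpa, unsigned int pa_bits)
{
	return private_gpa | (1ULL << (pa_bits - 1));
}

/* Hypothetical inverse: strip the shared bit to recover the private GPA. */
static uint64_t demo_private_alias(uint64_t gpa, unsigned int pa_bits)
{
	return gpa & ~(1ULL << (pa_bits - 1));
}
```

Both aliases name the same page, so the guest can keep two fixed GVA
mappings and pick one depending on the current state of the page.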
I based this on Lisa's series at [1].
[1] https://lore.kernel.org/all/20260521-tdx-selftests-v13-v13-0-6983ae4c3a4d@google.com/
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
tools/testing/selftests/kvm/x86/tdx_vm_test.c | 154 ++++++++++++++++++
1 file changed, 154 insertions(+)
diff --git a/tools/testing/selftests/kvm/x86/tdx_vm_test.c b/tools/testing/selftests/kvm/x86/tdx_vm_test.c
index 7cdcaf33b585b..093921af7d93e 100644
--- a/tools/testing/selftests/kvm/x86/tdx_vm_test.c
+++ b/tools/testing/selftests/kvm/x86/tdx_vm_test.c
@@ -26,6 +26,160 @@ TEST(verify_td_lifecycle)
kvm_vm_free(vm);
}
+static gva_t conversions_private_gva;
+static gpa_t conversions_private_gpa;
+static gva_t conversions_shared_gva;
+static gpa_t conversions_shared_gpa;
+static size_t conversions_size;
+
+u64 tdx_map_gpa(u64 gpa, u64 size)
+{
+#define TDG_VP_VMCALL 0
+#define TDG_VP_VMCALL_MAP_GPA 0x10001
+#define TDVMCALL_EXPOSE_REGS_MASK 0xFC00
+ register u64 r10_reg asm("r10") = TDG_VP_VMCALL;
+ register u64 r11_reg asm("r11") = TDG_VP_VMCALL_MAP_GPA;
+ register u64 r12_reg asm("r12") = gpa;
+ register u64 r13_reg asm("r13") = size;
+ register u64 rax_reg asm("rax") = TDG_VP_VMCALL;
+ register u64 rcx_reg asm("rcx") = TDVMCALL_EXPOSE_REGS_MASK;
+
+ asm volatile(
+ ".byte 0x66,0x0f,0x01,0xcc" /* tdcall */
+ : "+r" (r10_reg), "+r" (r11_reg)
+ : "r" (r12_reg), "r" (r13_reg), "r" (rax_reg), "r" (rcx_reg)
+ : "cc", "memory"
+ );
+
+ return r10_reg;
+}
+
+enum accept_page_level {
+ PAGE_LEVEL_4K = 0,
+ PAGE_LEVEL_2M,
+};
+
+u64 tdx_accept_page(u64 gpa, enum accept_page_level level)
+{
+#define TDG_MEM_PAGE_ACCEPT 6
+ register u64 rax_reg asm("rax") = TDG_MEM_PAGE_ACCEPT;
+ register u64 rcx_reg asm("rcx") = gpa | level;
+
+ asm volatile(
+ ".byte 0x66,0x0f,0x01,0xcc" /* tdcall */
+ : "+r" (rax_reg)
+ : "r" (rcx_reg)
+ : "cc", "memory"
+ );
+
+ return rax_reg;
+}
+
+static void handle_hypercall_map_gpa(struct kvm_vcpu *vcpu)
+{
+ struct kvm_run *run = vcpu->run;
+ u64 attributes;
+ size_t size;
+ gpa_t gpa;
+
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_HYPERCALL);
+ TEST_ASSERT_EQ(run->hypercall.nr, KVM_HC_MAP_GPA_RANGE);
+ TEST_ASSERT_EQ(run->hypercall.flags, KVM_EXIT_HYPERCALL_LONG_MODE);
+
+ gpa = run->hypercall.args[0];
+ size = run->hypercall.args[1] * PAGE_SIZE;
+ attributes = 0;
+ if (run->hypercall.args[2] & KVM_MAP_GPA_RANGE_ENCRYPTED)
+ attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+ vm_mem_set_memory_attributes(vcpu->vm, gpa, size, attributes);
+}
+
+#define CONVERSIONS_PRIVATE_VAL 'A'
+#define CONVERSIONS_GUEST_SHARED_VAL 'B'
+#define CONVERSIONS_HOST_SHARED_VAL 'C'
+#define CONVERSIONS_STAGE_WROTE_SHARED 0x99
+
+static void guest_code_conversions(void)
+{
+ char *addr;
+
+ addr = (void *)conversions_private_gva;
+ WRITE_ONCE(*addr, CONVERSIONS_PRIVATE_VAL);
+ GUEST_ASSERT_EQ(READ_ONCE(*addr), CONVERSIONS_PRIVATE_VAL);
+
+ GUEST_ASSERT_EQ(tdx_map_gpa(conversions_shared_gpa, conversions_size), 0);
+
+ addr = (void *)conversions_shared_gva;
+ WRITE_ONCE(*addr, CONVERSIONS_GUEST_SHARED_VAL);
+ GUEST_ASSERT_EQ(READ_ONCE(*addr), CONVERSIONS_GUEST_SHARED_VAL);
+
+ GUEST_SYNC(CONVERSIONS_STAGE_WROTE_SHARED);
+
+ GUEST_ASSERT_EQ(READ_ONCE(*addr), CONVERSIONS_HOST_SHARED_VAL);
+
+ GUEST_ASSERT_EQ(tdx_map_gpa(conversions_private_gpa, conversions_size), 0);
+ GUEST_ASSERT_EQ(tdx_accept_page(conversions_private_gpa, PAGE_LEVEL_4K), 0);
+
+ addr = (void *)conversions_private_gva;
+ WRITE_ONCE(*addr, CONVERSIONS_PRIVATE_VAL);
+ GUEST_ASSERT_EQ(READ_ONCE(*addr), CONVERSIONS_PRIVATE_VAL);
+
+ GUEST_DONE();
+}
+
+TEST(verify_conversions)
+{
+ struct kvm_vcpu *vcpu;
+ struct kvm_vm *vm;
+ struct ucall uc;
+ char *test_hva;
+
+ vm = __vm_create(VM_SHAPE_TDX, 1, 0);
+ vcpu = vm_vcpu_add(vm, 0, guest_code_conversions);
+
+ conversions_size = getpagesize();
+
+ conversions_private_gva = vm_alloc_page(vm);
+ conversions_shared_gva = vm_alloc_shared(vm, conversions_size,
+ KVM_UTIL_MIN_VADDR,
+ MEM_REGION_TEST_DATA);
+ conversions_private_gpa = addr_gva2gpa(vm, conversions_private_gva);
+ conversions_shared_gpa = conversions_private_gpa | BIT_ULL(vm->pa_bits - 1);
+
+ vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
+
+ sync_global_to_guest(vm, conversions_size);
+ sync_global_to_guest(vm, conversions_private_gva);
+ sync_global_to_guest(vm, conversions_private_gpa);
+ sync_global_to_guest(vm, conversions_shared_gva);
+ sync_global_to_guest(vm, conversions_shared_gpa);
+
+ kvm_arch_vm_finalize_vcpus(vm);
+
+ test_hva = addr_gva2hva(vm, conversions_shared_gva);
+
+ vcpu_run(vcpu);
+ handle_hypercall_map_gpa(vcpu);
+
+ vcpu_run(vcpu);
+ TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC);
+ TEST_ASSERT_EQ(uc.args[1], CONVERSIONS_STAGE_WROTE_SHARED);
+
+ TEST_ASSERT_EQ(READ_ONCE(*test_hva), CONVERSIONS_GUEST_SHARED_VAL);
+
+ WRITE_ONCE(*test_hva, CONVERSIONS_HOST_SHARED_VAL);
+ TEST_ASSERT_EQ(READ_ONCE(*test_hva), CONVERSIONS_HOST_SHARED_VAL);
+
+ vcpu_run(vcpu);
+ handle_hypercall_map_gpa(vcpu);
+
+ vcpu_run(vcpu);
+ TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_DONE);
+
+ kvm_vm_free(vm);
+}
+
int main(int argc, char **argv)
{
TEST_REQUIRE(is_tdx_supported());
--
2.54.0.1032.g2f8565e1d1-goog
* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Ackerley Tng @ 2026-06-05 13:58 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-19-6983ae4c3a4d@google.com>
Lisa Wang <wyihan@google.com> writes:
> From: Sagi Shahar <sagis@google.com>
>
> Finalize TDX VM after creation to make it runnable.
>
> Signed-off-by: Sagi Shahar <sagis@google.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Lisa Wang <wyihan@google.com>
> ---
> tools/testing/selftests/kvm/lib/x86/processor.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c
> index d84c629a1945..842cac168e99 100644
> --- a/tools/testing/selftests/kvm/lib/x86/processor.c
> +++ b/tools/testing/selftests/kvm/lib/x86/processor.c
> @@ -1479,6 +1479,12 @@ bool kvm_arch_has_default_irqchip(void)
> return true;
> }
>
> +void kvm_arch_vm_finalize_vcpus(struct kvm_vm *vm)
> +{
> + if (is_tdx_vm(vm))
> + tdx_vm_finalize(vm);
> +}
> +
This doesn't necessarily block this series, we could (re)move this
later: I'm not sure if kvm_arch_vm_finalize_vcpus() is the correct place
to be finalizing the VM.
Was kvm_arch_vm_finalize_vcpus() supposed to be for finalizing vCPUs
instead?
The awkward part is that kvm_arch_vm_finalize_vcpus() is called from
__vm_create_with_vcpus().
While building this POC to test conversions [1] I only wanted to create
the vm and vcpus and didn't want to finalize yet, since I still needed
to do more mappings in the guest (and I needed the vm pointer to do
mappings in the guest).
Would calling tdx_vm_finalize() from within vcpu_run(), just once, be
too magical?
It's also possible to have some kvm_vm_finalize() call that can be
explicitly and manually invoked from selftests just for CoCo selftests.
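
To make the two options concrete, here is a rough sketch with hypothetical
demo_* names (none of this is the selftests API). The finalize helper is
idempotent, so it can be called explicitly by a test or lazily from the run
path:

```c
#include <stdbool.h>

/* Hypothetical VM state; finalize_calls only exists to show the effect. */
struct demo_vm {
	bool finalized;
	int finalize_calls;
};

/* Idempotent finalize: safe to call explicitly or from the run path. */
static void demo_vm_finalize(struct demo_vm *vm)
{
	if (vm->finalized)
		return;

	vm->finalize_calls++;	/* the one-time finalization work would go here */
	vm->finalized = true;
}

/* Lazy variant: the first run finalizes, later runs skip it. */
static void demo_vcpu_run(struct demo_vm *vm)
{
	demo_vm_finalize(vm);
	/* entering the guest would happen here */
}
```

With this shape, a test that needs extra guest mappings can do them between
creating vCPUs and the first run without any special-casing.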
[1] https://lore.kernel.org/all/20260605134153.204152-1-ackerleytng@google.com/
> void setup_smram(struct kvm_vm *vm, struct kvm_vcpu *vcpu, u64 smram_gpa,
> const void *smi_handler, size_t handler_size)
> {
>
> --
> 2.54.0.746.g67dd491aae-goog
* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Lorenzo Pieralisi @ 2026-06-05 14:35 UTC (permalink / raw)
To: Gavin Shan
Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
Will Deacon, James Morse, Oliver Upton, Suzuki K Poulose,
Zenghui Yu, linux-arm-kernel, linux-kernel, Joey Gouly,
Alexandru Elisei, Christoffer Dall, Fuad Tabba, linux-coco,
Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
Lorenzo.Pieralisi2
In-Reply-To: <ecef952b-e8c6-4102-933b-c99c46f14431@redhat.com>
On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
> > On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
> >
> > [...]
> >
> > > > +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
> > > > + kvm_pfn_t pfn, unsigned long map_size,
> > > > + enum kvm_pgtable_prot prot,
> > > > + struct kvm_mmu_memory_cache *memcache)
> > > > +{
> > > > + struct realm *realm = &kvm->arch.realm;
> > > > +
> > > > + /*
> > > > + * Write permission is required for now even though it's possible to
> > > > + * map unprotected pages (granules) as read-only. It's impossible to
> > > > + * map protected pages (granules) as read-only.
> > > > + */
> > > > + if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
> > > > + return -EFAULT;
> > > > +
> > >
> > > I'm a bit concerned with this. We don't have KVM_PGTABLE_PROT_W set in @prot
> > > if the stage2 fault is raised due to memory read. With -EFAULT returned to VMM
> > > (e.g. QEMU), the vCPU continuous execution is stopped and system won't be
> > > working any more.
> > >
> > > > + ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
> > > > + if (!kvm_realm_is_private_address(realm, ipa))
> > > > + return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
> > > > + memcache);
> > > > +
> > > > + return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
> > > > +}
> > > > +
> > > > static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
> > > > {
> > > > switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
> > > > @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> > > > bool write_fault, exec_fault;
> > > > enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
> > > > enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > > > - struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
> > > > + struct kvm_vcpu *vcpu = s2fd->vcpu;
> > > > + struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > > > + gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
> > > > unsigned long mmu_seq;
> > > > struct page *page;
> > > > - struct kvm *kvm = s2fd->vcpu->kvm;
> > > > + struct kvm *kvm = vcpu->kvm;
> > > > void *memcache;
> > > > kvm_pfn_t pfn;
> > > > gfn_t gfn;
> > > > int ret;
> > > > - memcache = get_mmu_memcache(s2fd->vcpu);
> > > > - ret = topup_mmu_memcache(s2fd->vcpu, memcache);
> > > > + if (kvm_is_realm(vcpu->kvm)) {
> > > > + /* check for memory attribute mismatch */
> > > > + bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > > > + /*
> > > > + * For Realms, the shared address is an alias of the private
> > > > + * PA with the top bit set. Thus if the fault address matches
> > > > + * the GPA then it is the private alias.
> > > > + */
> > > > + bool is_priv_fault = (gpa == s2fd->fault_ipa);
> > > > +
> > > > + if (is_priv_gfn != is_priv_fault) {
> > > > + kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
> > > > + kvm_is_write_fault(vcpu),
> > > > + false,
> > > > + is_priv_fault);
> > > > + /*
> > > > + * KVM_EXIT_MEMORY_FAULT requires a return code of
> > > > + * -EFAULT, see the API documentation
> > > > + */
> > > > + return -EFAULT;
> > > > + }
> > > > + }
> > > > +
> > > > + memcache = get_mmu_memcache(vcpu);
> > > > + ret = topup_mmu_memcache(vcpu, memcache);
> > > > if (ret)
> > > > return ret;
> > > > if (s2fd->nested)
> > > > gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
> > > > else
> > > > - gfn = s2fd->fault_ipa >> PAGE_SHIFT;
> > > > + gfn = gpa >> PAGE_SHIFT;
> > > > - write_fault = kvm_is_write_fault(s2fd->vcpu);
> > > > - exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
> > > > + write_fault = kvm_is_write_fault(vcpu);
> > > > + exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > > > VM_WARN_ON_ONCE(write_fault && exec_fault);
> > > > @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> > > > ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
> > > > if (ret) {
> > > > - kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE,
> > > > + kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
> > > > write_fault, exec_fault, false);
> > > > return ret;
> > > > }
> > > > @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> > > > kvm_fault_lock(kvm);
> > > > if (mmu_invalidate_retry(kvm, mmu_seq)) {
> > > > ret = -EAGAIN;
> > > > - goto out_unlock;
> > > > + goto out_release_page;
> > > > + }
> > > > +
> > > > + if (kvm_is_realm(kvm)) {
> > > > + ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
> > > > + PAGE_SIZE, KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W, memcache);
> > > > + goto out_release_page;
> > > > }
> > > > ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
> > > > __pfn_to_phys(pfn), prot,
> > > > memcache, flags);
> > > > -out_unlock:
> > > > +out_release_page:
> > > > kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
> > > > kvm_fault_unlock(kvm);
> > > > @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd,
> > > > * mapping size to ensure we find the right PFN and lay down the
> > > > * mapping in the right place.
> > > > */
> > > > - s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT;
> > > > + s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize)) >> PAGE_SHIFT;
> > > > s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
> > > > @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
> > > > prot &= ~KVM_NV_GUEST_MAP_SZ;
> > > > ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
> > > > prot, flags);
> > > > + } else if (kvm_is_realm(kvm)) {
> > > > + ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
> > > > + prot, memcache);
> > > > } else {
> > > > ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
> > > > __pfn_to_phys(pfn), prot,
> > >
> > > For the case kvm_is_realm(), need we adjust 's2fd->fault_ipa' for the sake of
> > > huge pages. In kvm_s2_fault_map(), @gfn and @pfn may have been adjusted by
> > > transparent_hugepage_adjust() to be aligned with huge page size. If the
> > > adjustment happened in transparent_hugepage_adjust(), we need to align
> > > s2fd->fault_ipa down to the huge page size either.
> >
> > All of the above + some RMM changes are needed to get QEmu VMM going
> > with anon pages guest memory backing - currently testing various
> > configurations in the background.
> >
>
> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and found
> that memory slots backed by THP are broken. With THP disabled on the host and
> other fixes (mentioned in my previous replies) applied on the top of this (v14)
> series, I'm able to boot a realm guest with rebased QEMU series [2], plus more
> fixes on the top.
>
> [1] https://git.codelinaro.org/linaro/dcap/qemu.git (branch: cca/latest)
> [2] https://git.qemu.org/git/qemu.git (branch: cca/gavin)
>
> Lorenzo, You may be saying there is someone making QEMU to support ARM/CCA?
Mathieu and I are working on that yes and with Steven/Suzuki to fix the THP
issues you pointed out above.
> If so, I'm not sure if there is a QEMU repository for me to try?
We should be able to submit patches by end of June - we shall let you know
whether we can make something available earlier.
Thanks,
Lorenzo
>
> Thanks,
> Gavin
>
> > Thanks,
> > Lorenzo
> >
> > > > @@ -2214,6 +2285,13 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > > > return 0;
> > > > }
> > > > +static bool shared_ipa_fault(struct kvm *kvm, phys_addr_t fault_ipa)
> > > > +{
> > > > + gpa_t gpa = kvm_gpa_from_fault(kvm, fault_ipa);
> > > > +
> > > > + return (gpa != fault_ipa);
> > > > +}
> > > > +
> > > > /**
> > > > * kvm_handle_guest_abort - handles all 2nd stage aborts
> > > > * @vcpu: the VCPU pointer
> > > > @@ -2324,8 +2402,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > > > nested = &nested_trans;
> > > > }
> > > > - gfn = ipa >> PAGE_SHIFT;
> > > > + gfn = kvm_gpa_from_fault(vcpu->kvm, ipa) >> PAGE_SHIFT;
> > > > memslot = gfn_to_memslot(vcpu->kvm, gfn);
> > > > +
> > > > hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
> > > > write_fault = kvm_is_write_fault(vcpu);
> > > > if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
> > > > @@ -2368,7 +2447,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > > > * of the page size.
> > > > */
> > > > ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(vcpu));
> > > > - ret = io_mem_abort(vcpu, ipa);
> > > > + ret = io_mem_abort(vcpu, kvm_gpa_from_fault(vcpu->kvm, ipa));
> > > > goto out_unlock;
> > > > }
> > > > @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > > > !write_fault &&
> > > > !kvm_vcpu_trap_is_exec_fault(vcpu));
> > > > - if (kvm_slot_has_gmem(memslot))
> > > > + if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
> > > > ret = gmem_abort(&s2fd);
> > > > else
> > > > ret = user_mem_abort(&s2fd);
> > > > @@ -2433,6 +2512,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > > if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
> > > > return false;
> > > > + /* We don't support aging for Realms */
> > > > + if (kvm_is_realm(kvm))
> > > > + return true;
> > > > +
> > > > return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
> > > > range->start << PAGE_SHIFT,
> > > > size, true);
> > > > @@ -2449,6 +2532,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > > if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
> > > > return false;
> > > > + /* We don't support aging for Realms */
> > > > + if (kvm_is_realm(kvm))
> > > > + return true;
> > > > +
> > > > return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
> > > > range->start << PAGE_SHIFT,
> > > > size, false);
> > > > @@ -2628,10 +2715,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > > > return -EFAULT;
> > > > /*
> > > > - * Only support guest_memfd backed memslots with mappable memory, since
> > > > - * there aren't any CoCo VMs that support only private memory on arm64.
> > > > + * Only support guest_memfd backed memslots with mappable memory,
> > > > + * unless the guest is a CCA realm guest.
> > > > */
> > > > - if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
> > > > + if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new) &&
> > > > + !kvm_is_realm(kvm))
> > > > return -EINVAL;
> > > > hva = new->userspace_addr;
> > > > diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> > > > index cae29fd3353c..761b38a4071c 100644
> > > > --- a/arch/arm64/kvm/rmi.c
> > > > +++ b/arch/arm64/kvm/rmi.c
> > > > @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
> > > > return ret;
> > > > }
> > > > +static unsigned long addr_range_desc(unsigned long phys, unsigned long size)
> > > > +{
> > > > + unsigned long out = 0;
> > > > +
> > > > + switch (size) {
> > > > + case P4D_SIZE:
> > > > + out = 3 | (1 << 2);
> > > > + break;
> > > > + case PUD_SIZE:
> > > > + out = 2 | (1 << 2);
> > > > + break;
> > > > + case PMD_SIZE:
> > > > + out = 1 | (1 << 2);
> > > > + break;
> > > > + case PAGE_SIZE:
> > > > + out = 0 | (1 << 2);
> > > > + break;
> > > > + default:
> > > > + /*
> > > > + * Only support mapping at the page level granularity when
> > > > + * it's an unusual length. This should get us back onto a larger
> > > > + * block size for the subsequent mappings.
> > > > + */
> > > > + out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
> > > > + break;
> > > > + }
> > > > +
> > > > + WARN_ON(phys & ~PAGE_MASK);
> > > > +
> > > > + out |= phys & PAGE_MASK;
> > > > +
> > > > + return out;
> > > > +}
> > > > +
> > > > +int realm_map_protected(struct kvm *kvm,
> > > > + unsigned long ipa,
> > > > + kvm_pfn_t pfn,
> > > > + unsigned long map_size,
> > > > + struct kvm_mmu_memory_cache *memcache)
> > > > +{
> > > > + struct realm *realm = &kvm->arch.realm;
> > > > + phys_addr_t phys = __pfn_to_phys(pfn);
> > > > + phys_addr_t base_phys = phys;
> > > > + phys_addr_t rd = virt_to_phys(realm->rd);
> > > > + unsigned long base_ipa = ipa;
> > > > + unsigned long ipa_top = ipa + map_size;
> > > > + int ret = 0;
> > > > +
> > > > + if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
> > > > + !IS_ALIGNED(ipa, map_size)))
> > > > + return -EINVAL;
> > > > +
> > > > + if (rmi_delegate_range(phys, map_size)) {
> > > > + /*
> > > > + * It's likely we raced with another VCPU on the same
> > > > + * fault. Assume the other VCPU has handled the fault
> > > > + * and return to the guest.
> > > > + */
> > > > + return 0;
> > > > + }
> > > > +
> > > > + while (ipa < ipa_top) {
> > > > + unsigned long flags = RMI_ADDR_TYPE_SINGLE;
> > > > + unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
> > > > + unsigned long out_top;
> > > > +
> > > > + ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
> > > > + &out_top);
> > > > +
> > > > + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> > > > + /* Create missing RTTs and retry */
> > > > + int level = RMI_RETURN_INDEX(ret);
> > > > +
> > > > + WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> > > > + ret = realm_create_rtt_levels(realm, ipa, level,
> > > > + KVM_PGTABLE_LAST_LEVEL,
> > > > + memcache);
> > > > + if (ret)
> > > > + goto err_undelegate;
> > > > +
> > > > + ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
> > > > + range_desc, &out_top);
> > > > + }
> > > > +
> > > > + if (WARN_ON(ret))
> > > > + goto err_undelegate;
> > > > +
> > > > + phys += out_top - ipa;
> > > > + ipa = out_top;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +
> > > > +err_undelegate:
> > > > + realm_unmap_private_range(kvm, base_ipa, ipa, true);
> > > > + if (WARN_ON(rmi_undelegate_range(base_phys, map_size))) {
> > > > + /* Page can't be returned to NS world so is lost */
> > > > + get_page(phys_to_page(base_phys));
> > > > + }
> > > > + return -ENXIO;
> > > > +}
> > > > +
> > > > +int realm_map_non_secure(struct realm *realm,
> > > > + unsigned long ipa,
> > > > + kvm_pfn_t pfn,
> > > > + unsigned long size,
> > > > + enum kvm_pgtable_prot prot,
> > > > + struct kvm_mmu_memory_cache *memcache)
> > > > +{
> > > > + unsigned long attr, flags = 0;
> > > > + phys_addr_t rd = virt_to_phys(realm->rd);
> > > > + phys_addr_t phys = __pfn_to_phys(pfn);
> > > > + unsigned long ipa_top = ipa + size;
> > > > + int ret;
> > > > +
> > > > + if (WARN_ON(!IS_ALIGNED(size, PAGE_SIZE) ||
> > > > + !IS_ALIGNED(ipa, size)))
> > > > + return -EINVAL;
> > > > +
> > > > + switch (prot & (KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC)) {
> > > > + case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
> > > > + return -EINVAL;
> > > > + case KVM_PGTABLE_PROT_DEVICE:
> > > > + attr = MT_S2_FWB_DEVICE_nGnRE;
> > > > + break;
> > > > + case KVM_PGTABLE_PROT_NORMAL_NC:
> > > > + attr = MT_S2_FWB_NORMAL_NC;
> > > > + break;
> > > > + default:
> > > > + attr = MT_S2_FWB_NORMAL;
> > > > + }
> > > > +
> > > > + flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_MEMATTR, attr);
> > > > +
> > > > + if (prot & KVM_PGTABLE_PROT_R)
> > > > + flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_READ);
> > > > + if (prot & KVM_PGTABLE_PROT_W)
> > > > + flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, RMI_S2AP_DIRECT_WRITE);
> > > > +
> > > > + flags |= RMI_ADDR_TYPE_SINGLE;
> > > > +
> > > > + while (ipa < ipa_top) {
> > > > + unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
> > > > + unsigned long out_top;
> > > > +
> > > > + ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags, range_desc,
> > > > + &out_top);
> > > > +
> > > > + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> > > > + /* Create missing RTTs and retry */
> > > > + int level = RMI_RETURN_INDEX(ret);
> > > > +
> > > > + WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> > > > + ret = realm_create_rtt_levels(realm, ipa, level,
> > > > + KVM_PGTABLE_LAST_LEVEL,
> > > > + memcache);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags,
> > > > + range_desc, &out_top);
> > > > + }
> > > > +
> > > > + if (WARN_ON(ret))
> > > > + return ret;
> > > > +
> > > > + phys += out_top - ipa;
> > > > + ipa = out_top;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > > struct page *src_page, void *opaque)
> > > > {
> > >
> > > Thanks,
> > > Gavin
> > >
> >
>
* Re: [PATCH v14 17/44] arm64: RMI: RTT tear down
From: Steven Price @ 2026-06-05 15:01 UTC (permalink / raw)
To: Wei-Lin Chang, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve, Lorenzo.Pieralisi2
In-Reply-To: <7egwow26r6sbbtm53mujbhpwyts2utzv2ddth7554kqwbk7k7d@iptjpvvbsc2n>
On 26/05/2026 23:27, Wei-Lin Chang wrote:
> Hi,
>
> On Wed, May 13, 2026 at 02:17:25PM +0100, Steven Price wrote:
>> The RMM owns the stage 2 page tables for a realm, and KVM must request
>> that the RMM creates/destroys entries as necessary. The physical pages
>> to store the page tables are delegated to the realm as required, and can
>> be undelegated when no longer used.
>>
>> Creating new RTTs is the easy part, tearing down is a little more
>> tricky. The result of realm_rtt_destroy() can be used to effectively
>> walk the tree and destroy the entries (undelegating pages that were
>> given to the realm).
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>> * Avoid the double call of kvm_free_stage2_pgd() by splitting the work
>> across that and a new function kvm_realm_uninit_stage2() which is
>> only called for realm guests.
>> Changes since v12:
>> * Simplify some functions now we know RMM page size is the same as the
>> host's.
>> Changes since v11:
>> * Moved some code from earlier in the series to this one so that it's
>> added when it's first used.
>> Changes since v10:
>> * RME->RMI rename.
>> * Some code to handle freeing stage 2 PGD moved into this patch where
>> it belongs.
>> Changes since v9:
>> * Add a comment clarifying that root level RTTs are not destroyed until
>> after the RD is destroyed.
>> Changes since v8:
>> * Introduce free_rtt() wrapper which calls free_delegated_granule()
>> followed by kvm_account_pgtable_pages(). This makes it clear where an
>> RTT is being freed rather than just a delegated granule.
>> Changes since v6:
>> * Move rme_rtt_level_mapsize() and supporting defines from kvm_rme.h
>> into rme.c as they are only used in that file.
>> Changes since v5:
>> * Rename some RME_xxx defines to do with page sizes as RMM_xxx - they are
>> a property of the RMM specification not the RME architecture.
>> Changes since v2:
>> * Moved {alloc,free}_delegated_page() and ensure_spare_page() to a
>> later patch when they are actually used.
>> * Some simplifications now rmi_xxx() functions allow NULL as an output
>> parameter.
>> * Improved comments and code layout.
>> ---
>> arch/arm64/include/asm/kvm_rmi.h | 7 ++
>> arch/arm64/kvm/mmu.c | 21 ++++-
>> arch/arm64/kvm/rmi.c | 148 +++++++++++++++++++++++++++++++
>> 3 files changed, 174 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>> index 9de34983ee52..06ba0d4745c6 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -64,5 +64,12 @@ u32 kvm_realm_ipa_limit(void);
>>
>> int kvm_init_realm(struct kvm *kvm);
>> void kvm_destroy_realm(struct kvm *kvm);
>> +void kvm_realm_destroy_rtts(struct kvm *kvm);
>> +
>> +static inline bool kvm_realm_is_private_address(struct realm *realm,
>> + unsigned long addr)
>> +{
>> + return !(addr & BIT(realm->ia_bits - 1));
>> +}
>>
>> #endif /* __ASM_KVM_RMI_H */
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index ba8286472286..eb56d4e7f21a 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1024,9 +1024,26 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>> return err;
>> }
>>
>> +static void kvm_realm_uninit_stage2(struct kvm_s2_mmu *mmu)
>> +{
>> + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>> + struct realm *realm = &kvm->arch.realm;
>> +
>> + if (kvm_realm_state(kvm) != REALM_STATE_ACTIVE)
>> + return;
>> +
>> + write_lock(&kvm->mmu_lock);
>> + kvm_stage2_unmap_range(mmu, 0, BIT(realm->ia_bits - 1), true);
>> + write_unlock(&kvm->mmu_lock);
>> + kvm_realm_destroy_rtts(kvm);
>> +}
>> +
>> void kvm_uninit_stage2_mmu(struct kvm *kvm)
>> {
>> - kvm_free_stage2_pgd(&kvm->arch.mmu);
>> + if (kvm_is_realm(kvm))
>> + kvm_realm_uninit_stage2(&kvm->arch.mmu);
>> + else
>> + kvm_free_stage2_pgd(&kvm->arch.mmu);
>> kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
>> }
>>
>> @@ -1103,7 +1120,7 @@ void stage2_unmap_vm(struct kvm *kvm)
>> void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>> {
>> struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>> - struct kvm_pgtable *pgt = NULL;
>> + struct kvm_pgtable *pgt;
>
> Is this included by accident?
Thanks for spotting that. Yes that change shouldn't have sneaked in
here. The original code before this series had the redundant assignment
to NULL. But it's unrelated to this patch so I'll drop the change.
Thanks,
Steve
>
>>
>> write_lock(&kvm->mmu_lock);
>> pgt = mmu->pgt;
>
> [...]
>
> Thanks,
> Wei-Lin Chang
* Re: [PATCH v14 17/44] arm64: RMI: RTT tear down
From: Steven Price @ 2026-06-05 15:01 UTC (permalink / raw)
To: Wei-Lin Chang, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve, Lorenzo.Pieralisi2
In-Reply-To: <mpesc2j3czpunbg3pvgwbotvfn7vahaabvoiu77vd2g5uervho@255lwycekmxh>
On 26/05/2026 23:32, Wei-Lin Chang wrote:
> Hi,
>
> On Wed, May 13, 2026 at 02:17:25PM +0100, Steven Price wrote:
>> The RMM owns the stage 2 page tables for a realm, and KVM must request
>> that the RMM creates/destroys entries as necessary. The physical pages
>> to store the page tables are delegated to the realm as required, and can
>> be undelegated when no longer used.
>>
>> Creating new RTTs is the easy part, tearing down is a little more
>> tricky. The result of realm_rtt_destroy() can be used to effectively
>> walk the tree and destroy the entries (undelegating pages that were
>> given to the realm).
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>> * Avoid the double call of kvm_free_stage2_pgd() by splitting the work
>> across that and a new function kvm_realm_uninit_stage2() which is
>> only called for realm guests.
>> Changes since v12:
>> * Simplify some functions now we know RMM page size is the same as the
>> host's.
>> Changes since v11:
>> * Moved some code from earlier in the series to this one so that it's
>> added when it's first used.
>> Changes since v10:
>> * RME->RMI rename.
>> * Some code to handle freeing stage 2 PGD moved into this patch where
>> it belongs.
>> Changes since v9:
>> * Add a comment clarifying that root level RTTs are not destroyed until
>> after the RD is destroyed.
>> Changes since v8:
>> * Introduce free_rtt() wrapper which calls free_delegated_granule()
>> followed by kvm_account_pgtable_pages(). This makes it clear where an
>> RTT is being freed rather than just a delegated granule.
>> Changes since v6:
>> * Move rme_rtt_level_mapsize() and supporting defines from kvm_rme.h
>> into rme.c as they are only used in that file.
>> Changes since v5:
>> * Rename some RME_xxx defines to do with page sizes as RMM_xxx - they are
>> a property of the RMM specification not the RME architecture.
>> Changes since v2:
>> * Moved {alloc,free}_delegated_page() and ensure_spare_page() to a
>> later patch when they are actually used.
>> * Some simplifications now rmi_xxx() functions allow NULL as an output
>> parameter.
>> * Improved comments and code layout.
>> ---
>> arch/arm64/include/asm/kvm_rmi.h | 7 ++
>> arch/arm64/kvm/mmu.c | 21 ++++-
>> arch/arm64/kvm/rmi.c | 148 +++++++++++++++++++++++++++++++
>> 3 files changed, 174 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>> index 9de34983ee52..06ba0d4745c6 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -64,5 +64,12 @@ u32 kvm_realm_ipa_limit(void);
>>
>> int kvm_init_realm(struct kvm *kvm);
>> void kvm_destroy_realm(struct kvm *kvm);
>> +void kvm_realm_destroy_rtts(struct kvm *kvm);
>> +
>> +static inline bool kvm_realm_is_private_address(struct realm *realm,
>> + unsigned long addr)
>> +{
>> + return !(addr & BIT(realm->ia_bits - 1));
>> +}
>>
>> #endif /* __ASM_KVM_RMI_H */
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index ba8286472286..eb56d4e7f21a 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1024,9 +1024,26 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>> return err;
>> }
>>
>> +static void kvm_realm_uninit_stage2(struct kvm_s2_mmu *mmu)
>> +{
>> + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>> + struct realm *realm = &kvm->arch.realm;
>> +
>> + if (kvm_realm_state(kvm) != REALM_STATE_ACTIVE)
>> + return;
>> +
>> + write_lock(&kvm->mmu_lock);
>> + kvm_stage2_unmap_range(mmu, 0, BIT(realm->ia_bits - 1), true);
>> + write_unlock(&kvm->mmu_lock);
>> + kvm_realm_destroy_rtts(kvm);
>> +}
>> +
>> void kvm_uninit_stage2_mmu(struct kvm *kvm)
>> {
>> - kvm_free_stage2_pgd(&kvm->arch.mmu);
>> + if (kvm_is_realm(kvm))
>> + kvm_realm_uninit_stage2(&kvm->arch.mmu);
>> + else
>> + kvm_free_stage2_pgd(&kvm->arch.mmu);
>> kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
>> }
>>
>> @@ -1103,7 +1120,7 @@ void stage2_unmap_vm(struct kvm *kvm)
>> void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>> {
>> struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>> - struct kvm_pgtable *pgt = NULL;
>> + struct kvm_pgtable *pgt;
>>
>> write_lock(&kvm->mmu_lock);
>> pgt = mmu->pgt;
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index f51ec667445e..5b00ccca4af3 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -11,6 +11,14 @@
>> #include <asm/rmi_cmds.h>
>> #include <asm/virt.h>
>>
>> +static inline unsigned long rmi_rtt_level_mapsize(int level)
>> +{
>> + if (WARN_ON(level > KVM_PGTABLE_LAST_LEVEL))
>> + return PAGE_SIZE;
>> +
>> + return (1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(level));
>> +}
>> +
>> static bool rmi_has_feature(unsigned long feature)
>> {
>> return !!u64_get_bits(rmm_feat_reg0, feature);
>> @@ -21,6 +29,144 @@ u32 kvm_realm_ipa_limit(void)
>> return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
>> }
>>
>> +static int get_start_level(struct realm *realm)
>> +{
>> + return 4 - stage2_pgtable_levels(realm->ia_bits);
>> +}
>> +
>> +static void free_rtt(phys_addr_t phys)
>> +{
>> + if (free_delegated_page(phys))
>> + return;
>> +
>> + kvm_account_pgtable_pages(phys_to_virt(phys), -1);
>> +}
>> +
>> +/*
>> + * realm_rtt_destroy - Destroy an RTT at @level for @addr.
>> + *
>> + * Returns - Result of the RMI_RTT_DESTROY call, and:
>> + * @rtt_granule: RTT granule, if the RTT was destroyed.
>> + * @next_addr: IPA corresponding to the next possible valid entry we
>> + * can target
>> + */
>> +static int realm_rtt_destroy(struct realm *realm, unsigned long addr,
>> + int level, phys_addr_t *rtt_granule,
>> + unsigned long *next_addr)
>> +{
>> + unsigned long out_rtt;
>> + int ret;
>> +
>> + ret = rmi_rtt_destroy(virt_to_phys(realm->rd), addr, level,
>> + &out_rtt, next_addr);
>> +
>> + *rtt_granule = out_rtt;
>> +
>> + return ret;
>> +}
>
> Looks like out_rtt can be simplified out.
The issue here is there's a type conversion going on. rmi_rtt_destroy()
takes an "unsigned long *" to match the general approach of using
"unsigned long" for the inputs/outputs of SMCCC calls. But rtt_granule
is a "phys_addr_t". While we know these are (currently) the same size,
they are not the same type according to the compiler - phys_addr_t is
"long long unsigned int".
Thanks,
Steve
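
To illustrate the type mismatch described above, here is a minimal userspace sketch. It is not the kernel code: demo_phys_addr_t, demo_rtt_destroy() and demo_realm_rtt_destroy() are hypothetical stand-ins, and the sketch assumes phys_addr_t is "unsigned long long" as stated above. Passing a "demo_phys_addr_t *" where an "unsigned long *" is expected would be an incompatible pointer type, so the wrapper goes through a local of the right type and relies on the implicit integer conversion on assignment.

```c
#include <assert.h>

/* Hypothetical stand-in for phys_addr_t ("long long unsigned int"). */
typedef unsigned long long demo_phys_addr_t;

/*
 * The two types are distinct to the compiler even where they have the
 * same size, so _Generic does not select the "unsigned long" branch.
 */
static_assert(_Generic((demo_phys_addr_t)0, unsigned long: 0, default: 1),
	      "demo_phys_addr_t is not the same type as unsigned long");

/*
 * Hypothetical stand-in for an SMCCC-style wrapper: all outputs are
 * returned through "unsigned long *". The values are made up.
 */
static int demo_rtt_destroy(unsigned long addr, unsigned long *out_rtt,
			    unsigned long *out_next)
{
	*out_rtt = addr & ~0xfffUL;
	*out_next = addr + 0x1000UL;
	return 0;
}

/* Same shape as the wrapper under discussion: a local bridges the types. */
static int demo_realm_rtt_destroy(unsigned long addr,
				  demo_phys_addr_t *rtt_granule,
				  unsigned long *next_addr)
{
	unsigned long out_rtt;
	int ret;

	ret = demo_rtt_destroy(addr, &out_rtt, next_addr);

	/* Integer conversion is implicit here; a pointer cast is avoided. */
	*rtt_granule = out_rtt;

	return ret;
}
```

Dropping the local and passing rtt_granule straight through would need a pointer cast, which only works for as long as the two types happen to share a size.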
> [...]
>
> Thanks,
> Wei-Lin Chang
* Re: [PATCH v14 19/44] arm64: RMI: Allocate/free RECs to match vCPUs
From: Steven Price @ 2026-06-05 15:02 UTC (permalink / raw)
To: Wei-Lin Chang, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve, Lorenzo.Pieralisi2
In-Reply-To: <2uvtjhncf57yek5i4fupdefunukmidzw452mcavnmixpr5u3qd@uoaktzpak3nl>
On 26/05/2026 23:39, Wei-Lin Chang wrote:
> Hi,
>
> On Wed, May 13, 2026 at 02:17:27PM +0100, Steven Price wrote:
>> The RMM maintains a data structure known as the Realm Execution Context
>> (or REC). It is similar to struct kvm_vcpu and tracks the state of the
>> virtual CPUs. KVM must delegate memory and request the structures are
>> created when vCPUs are created, and suitably tear down on destruction.
>>
>> RECs may require additional pages (e.g. for storing larger register
>> state for SVE). The RMM can request extra pages for this purpose using
>> the Stateful RMI Operations (SRO) functionality to request pages during
>> REC creation. These pages are then passed back to the host from the RMM
>> ('reclaimed') when the REC is destroyed. The kernel tracking object
>> (struct rmi_sro_state) is stored in the realm_rec structure to avoid
>> memory allocation during the destruction path.
>>
>> Note that only some of the register state for the REC can be set by KVM, the
>> rest is defined by the RMM (zeroed). The register state then cannot be
>> changed by KVM after the REC is created (except when the guest
>> explicitly requests this e.g. by performing a PSCI call).
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>> * Support SRO for REC creation/destruction instead of auxiliary
>> granules.
>> Changes since v12:
>> * Use the new range-based delegation RMI.
>> Changes since v11:
>> * Remove the KVM_ARM_VCPU_REC feature. User space no longer needs to
>> configure each VCPU separately, RECs are created on the first VCPU
>> run of the guest.
>> Changes since v9:
>> * Size the aux_pages array according to the PAGE_SIZE of the host.
>> Changes since v7:
>> * Add comment explaining the aux_pages array.
>> * Rename "undeleted_failed" variable to "should_free" to avoid a
>> confusing double negative.
>> Changes since v6:
>> * Avoid reporting the KVM_ARM_VCPU_REC feature if the guest isn't a
>> realm guest.
>> * Support host page size being larger than RMM's granule size when
>> allocating/freeing aux granules.
>> Changes since v5:
>> * Separate the concept of vcpu_is_rec() and
>> kvm_arm_vcpu_rec_finalized() by using the KVM_ARM_VCPU_REC feature as
>> the indication that the VCPU is a REC.
>> Changes since v2:
>> * Free rec->run earlier in kvm_destroy_realm() and adapt to previous patches.
>> ---
>> arch/arm64/include/asm/kvm_emulate.h | 2 +-
>> arch/arm64/include/asm/kvm_host.h | 3 +
>> arch/arm64/include/asm/kvm_rmi.h | 17 +++++
>> arch/arm64/kvm/arm.c | 6 ++
>> arch/arm64/kvm/reset.c | 1 +
>> arch/arm64/kvm/rmi.c | 105 +++++++++++++++++++++++++++
>> 6 files changed, 133 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 82fd777bd9bb..2e69fe494716 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -714,7 +714,7 @@ static inline bool kvm_realm_is_created(struct kvm *kvm)
>>
>> static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
>> {
>> - return false;
>> + return kvm_is_realm(vcpu->kvm);
>> }
>>
>> #endif /* __ARM64_KVM_EMULATE_H__ */
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 3512696ed506..39b5de03d0fe 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -969,6 +969,9 @@ struct kvm_vcpu_arch {
>>
>> /* Hyp-readable copy of kvm_vcpu::pid */
>> pid_t pid;
>> +
>> + /* Realm meta data */
>> + struct realm_rec rec;
>> };
>>
>> /*
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>> index 8bd743093ccf..d99bf4fc3c39 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -59,6 +59,22 @@ struct realm {
>> unsigned int ia_bits;
>> };
>>
>> +/**
>> + * struct realm_rec - Additional per VCPU data for a Realm
>> + *
>> + * @mpidr: MPIDR (Multiprocessor Affinity Register) value to identify this VCPU
>> + * @rec_page: Kernel VA of the RMM's private page for this REC
>> + * @aux_pages: Additional pages private to the RMM for this REC
>> + * @run: Kernel VA of the RmiRecRun structure shared with the RMM
>> + * @sro: A preallocated SRO state context
>> + */
>> +struct realm_rec {
>> + unsigned long mpidr;
>> + void *rec_page;
>> + struct rec_run *run;
>> + struct rmi_sro_state *sro;
>> +};
>> +
>> void kvm_init_rmi(void);
>> u32 kvm_realm_ipa_limit(void);
>>
>> @@ -66,6 +82,7 @@ int kvm_init_realm(struct kvm *kvm);
>> int kvm_activate_realm(struct kvm *kvm);
>> void kvm_destroy_realm(struct kvm *kvm);
>> void kvm_realm_destroy_rtts(struct kvm *kvm);
>> +void kvm_destroy_rec(struct kvm_vcpu *vcpu);
>>
>> static inline bool kvm_realm_is_private_address(struct realm *realm,
>> unsigned long addr)
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index eb2b61fe1f0a..93d34762db91 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -586,6 +586,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>> /* Force users to call KVM_ARM_VCPU_INIT */
>> vcpu_clear_flag(vcpu, VCPU_INITIALIZED);
>>
>> + vcpu->arch.rec.mpidr = INVALID_HWID;
>> +
>> vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>>
>> /* Set up the timer */
>> @@ -1651,6 +1653,10 @@ static int kvm_vcpu_init_check_features(struct kvm_vcpu *vcpu,
>> if (test_bit(KVM_ARM_VCPU_HAS_EL2, &features))
>> return -EINVAL;
>>
>> + /* Realms are incompatible with AArch32 */
>> + if (vcpu_is_rec(vcpu))
>> + return -EINVAL;
>> +
>> return 0;
>> }
>>
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index b963fd975aac..c18cdca7d125 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -161,6 +161,7 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
>> free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
>> kfree(vcpu->arch.vncr_tlb);
>> kfree(vcpu->arch.ccsidr);
>> + kvm_destroy_rec(vcpu);
>> }
>>
>> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index 849111817af7..353a5ca45e78 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -173,9 +173,108 @@ static int realm_ensure_created(struct kvm *kvm)
>> return -ENXIO;
>> }
>>
>> +static int kvm_create_rec(struct kvm_vcpu *vcpu)
>> +{
>> + struct user_pt_regs *vcpu_regs = vcpu_gp_regs(vcpu);
>> + unsigned long mpidr = kvm_vcpu_get_mpidr_aff(vcpu);
>> + struct realm *realm = &vcpu->kvm->arch.realm;
>> + struct realm_rec *rec = &vcpu->arch.rec;
>> + unsigned long rec_page_phys;
>> + struct rec_params *params;
>> + int r, i;
>> +
>> + if (rec->run)
>> + return -EBUSY;
>> +
>> + /*
>> + * The RMM will report PSCI v1.0 to Realms and the KVM_ARM_VCPU_PSCI_0_2
>> + * flag covers v0.2 and onwards.
>> + */
>> + if (!vcpu_has_feature(vcpu, KVM_ARM_VCPU_PSCI_0_2))
>> + return -EINVAL;
>> +
>> + BUILD_BUG_ON(sizeof(*params) > PAGE_SIZE);
>> + BUILD_BUG_ON(sizeof(*rec->run) > PAGE_SIZE);
>> +
>> + params = (struct rec_params *)get_zeroed_page(GFP_KERNEL);
>> + rec->rec_page = (void *)__get_free_page(GFP_KERNEL);
>> + rec->run = (void *)get_zeroed_page(GFP_KERNEL);
>
> Should this be cast to (struct rec_run *) ?
Yes it probably should - I'll update. Although IMHO get_zeroed_page()
should really return void * - but I know that would be a contentious change.
Thanks,
Steve
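
For readers unfamiliar with the idiom: get_zeroed_page() returns the page address as an "unsigned long", so callers cast it to the pointer type they want. The sketch below is a userspace approximation only. demo_get_zeroed_page(), demo_free_page() and struct demo_rec_run are hypothetical stand-ins, and it assumes an LP64 target where a pointer fits in an unsigned long.

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096UL

/* Hypothetical stand-in; the real struct rec_run layout is not shown here. */
struct demo_rec_run {
	unsigned long exit_reason;
};

/*
 * Userspace stand-in for get_zeroed_page(): the address comes back as an
 * integer, with 0 meaning allocation failure.
 */
static unsigned long demo_get_zeroed_page(void)
{
	void *page = aligned_alloc(DEMO_PAGE_SIZE, DEMO_PAGE_SIZE);

	if (page)
		memset(page, 0, DEMO_PAGE_SIZE);

	return (unsigned long)page;
}

static void demo_free_page(unsigned long addr)
{
	free((void *)addr);
}

/* Cast to the destination pointer type rather than to void *. */
static struct demo_rec_run *demo_alloc_run(void)
{
	return (struct demo_rec_run *)demo_get_zeroed_page();
}
```

Both casts compile identically in C; naming the destination type simply documents what the page is used for.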
>> + rec->sro = kmalloc_obj(*rec->sro);
>> + if (!params || !rec->rec_page || !rec->run || !rec->sro) {
>> + r = -ENOMEM;
>> + goto out_free_pages;
>> + }
>> +
>> + for (i = 0; i < ARRAY_SIZE(params->gprs); i++)
>> + params->gprs[i] = vcpu_regs->regs[i];
>> +
>> + params->pc = vcpu_regs->pc;
>> +
>> + if (vcpu->vcpu_id == 0)
>> + params->flags |= REC_PARAMS_FLAG_RUNNABLE;
>> +
>> + rec_page_phys = virt_to_phys(rec->rec_page);
>> +
>> + if (rmi_delegate_page(rec_page_phys)) {
>> + r = -ENXIO;
>> + goto out_free_pages;
>> + }
>> +
>> + params->mpidr = mpidr;
>> +
>> + if (rmi_rec_create(virt_to_phys(realm->rd), rec_page_phys,
>> + virt_to_phys(params), rec->sro)) {
>> + r = -ENXIO;
>> + goto out_undelegate_rmm_rec;
>> + }
>> +
>> + rec->mpidr = mpidr;
>> +
>> + free_page((unsigned long)params);
>> + return 0;
>> +
>> +out_undelegate_rmm_rec:
>> + if (WARN_ON(rmi_undelegate_page(rec_page_phys)))
>> + rec->rec_page = NULL;
>> +out_free_pages:
>> + free_page((unsigned long)rec->run);
>> + free_page((unsigned long)rec->rec_page);
>> + free_page((unsigned long)params);
>> + kfree(rec->sro);
>> + rec->run = NULL;
>> + return r;
>> +}
>> +
>
> [...]
>
> Thanks,
> Wei-Lin Chang
* Re: [PATCH v14 20/44] arm64: RMI: Support for the VGIC in realms
From: Steven Price @ 2026-06-05 15:02 UTC (permalink / raw)
To: Gavin Shan, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <5ea74b6a-a51a-415e-b53f-5ece9829dee8@redhat.com>
On 28/05/2026 05:07, Gavin Shan wrote:
> Hi Steve,
>
> On 5/13/26 11:17 PM, Steven Price wrote:
>> The RMM provides emulation of a VGIC to the realm guest. With RMM v2.0
>> the registers are passed in the system registers so this works similarly
>> to a normal guest, but kvm_arch_vcpu_put() needs reordering to early out,
>> and realm guests don't support GICv2 even if the host does.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes from v12:
>> * GIC registers are now passed in the system registers rather than via
>> rec_entry/rec_exit which removes most of the changes.
>> Changes from v11:
>> * Minor changes to align with the previous patches. Note that the VGIC
>> handling will change with RMM v2.0.
>> Changes from v10:
>> * Make sure we sync the VGIC v4 state, and only populate valid lrs from
>> the list.
>> Changes from v9:
>> * Copy gicv3_vmcr from the RMM at the same time as gicv3_hcr rather
>> than having to handle that as a special case.
>> Changes from v8:
>> * Propagate gicv3_hcr to from the RMM.
>> Changes from v5:
>> * Handle RMM providing fewer GIC LRs than the hardware supports.
>> ---
>> arch/arm64/kvm/arm.c | 11 ++++++++---
>> arch/arm64/kvm/vgic/vgic-init.c | 2 +-
>> 2 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 93d34762db91..21d9dfdb1ea0 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -786,19 +786,24 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>> kvm_call_hyp_nvhe(__pkvm_vcpu_put);
>> }
>> + kvm_timer_vcpu_put(vcpu);
>> + kvm_vgic_put(vcpu);
>> +
>> + vcpu->cpu = -1;
>> +
>> + if (vcpu_is_rec(vcpu))
>> + return;
>> +
>
> For a REC, kvm_vcpu_{load, put}_debug() becomes unbalanced in
> kvm_arch_vcpu_{load, put}(). kvm_vcpu_load_debug() is called in
> kvm_arch_vcpu_load(), but kvm_vcpu_put_debug() won't be called in
> kvm_arch_vcpu_put() after this whole series is applied.
Good catch. Yes that's not quite right.
Thanks,
Steve
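
The imbalance can be modelled with a toy counter. This is only an illustration of the control flow Gavin describes, not kernel code: struct demo_vcpu and both functions are hypothetical, and the counter stands in for whatever state the debug load/put pair manages.

```c
#include <stdbool.h>

/* Hypothetical model of a vCPU: only what the illustration needs. */
struct demo_vcpu {
	bool is_rec;
	int debug_loaded;	/* +1 on load, -1 on put; 0 means balanced */
};

/* Models the load path, where the debug load runs for every vCPU. */
static void demo_vcpu_load(struct demo_vcpu *v)
{
	v->debug_loaded++;
}

/*
 * Models the put path as posted: the REC early return happens before
 * the debug put, so the counter is never decremented for a REC.
 */
static void demo_vcpu_put_as_posted(struct demo_vcpu *v)
{
	if (v->is_rec)
		return;

	v->debug_loaded--;
}
```

After one load/put cycle the counter is back to 0 for a normal vCPU but stays at 1 for a REC.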
>> kvm_vcpu_put_debug(vcpu);
>> kvm_arch_vcpu_put_fp(vcpu);
>> if (has_vhe())
>> kvm_vcpu_put_vhe(vcpu);
>> - kvm_timer_vcpu_put(vcpu);
>> - kvm_vgic_put(vcpu);
>> kvm_vcpu_pmu_restore_host(vcpu);
>> if (vcpu_has_nv(vcpu))
>> kvm_vcpu_put_hw_mmu(vcpu);
>> kvm_arm_vmid_clear_active();
>> vcpu_clear_on_unsupported_cpu(vcpu);
>> - vcpu->cpu = -1;
>> }
>> static void __kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu)
>> diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
>> index 933983bb2005..a9db963dfd23 100644
>> --- a/arch/arm64/kvm/vgic/vgic-init.c
>> +++ b/arch/arm64/kvm/vgic/vgic-init.c
>> @@ -81,7 +81,7 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
>> * the proper checks already.
>> */
>> if (type == KVM_DEV_TYPE_ARM_VGIC_V2 &&
>> - !kvm_vgic_global_state.can_emulate_gicv2)
>> + (!kvm_vgic_global_state.can_emulate_gicv2 || kvm_is_realm(kvm)))
>> return -ENODEV;
>> /*
>
> Thanks,
> Gavin
>
* Re: [PATCH v14 22/44] arm64: RMI: Handle realm enter/exit
From: Steven Price @ 2026-06-05 15:02 UTC (permalink / raw)
To: Gavin Shan, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <e6cb82fc-a8be-49d3-8fa3-0107c8ab97f7@redhat.com>
On 28/05/2026 05:38, Gavin Shan wrote:
> Hi Steve,
>
> On 5/13/26 11:17 PM, Steven Price wrote:
>> Entering a realm is done using a SMC call to the RMM. On exit the
>> exit-codes need to be handled slightly differently to the normal KVM
>> path so define our own functions for realm enter/exit and hook them
>> in if the guest is a realm guest.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> Reviewed-by: Gavin Shan <gshan@redhat.com>
>> ---
>> Changes since v13:
>> * The RMM is now required to provide an ESR value with the correct
>> information to emulate MMIO, so we no longer need to hardcode 0s in
>> rec_exit_sys_reg().
>> * The PSCI changes mean that there is a potential race when turning on
>> a VCPU which can cause a RMI_ERROR_REC return. Exit to user space
>> with -EAGAIN in this case.
>> Changes since v12:
>> * Call guest_state_{enter,exit}_irqoff() around rmi_rec_enter().
>> * Add handling of the IRQ exception case where IRQs need to be briefly
>> enabled before exiting guest timing.
>> Changes since v8:
>> * Introduce kvm_rec_pre_enter() called before entering an atomic
>> section to handle operations that might require memory allocation
>> (specifically completing a RIPAS change introduced in a later patch).
>> * Updates to align with upstream changes to hpfar_el2 which now
>> (ab)uses HPFAR_EL2_NS as a valid flag.
>> * Fix exit reason when racing with PSCI shutdown to return
>> KVM_EXIT_SHUTDOWN rather than KVM_EXIT_UNKNOWN.
>> Changes since v7:
>> * A return of 0 from kvm_handle_sys_reg() doesn't mean the register has
>> been read (although that can never happen in the current code). Tidy
>> up the condition to handle any future refactoring.
>> Changes since v6:
>> * Use vcpu_err() rather than pr_err/kvm_err when there is an associated
>> vcpu to the error.
>> * Return -EFAULT for KVM_EXIT_MEMORY_FAULT as per the documentation for
>> this exit type.
>> * Split code handling a RIPAS change triggered by the guest to the
>> following patch.
>> Changes since v5:
>> * For a RIPAS_CHANGE request from the guest perform the actual RIPAS
>> change on next entry rather than immediately on the exit. This allows
>> the VMM to 'reject' a RIPAS change by refusing to continue
>> scheduling.
>> Changes since v4:
>> * Rename handle_rme_exit() to handle_rec_exit()
>> * Move the loop to copy registers into the REC enter structure from the
>> rec_exit_handlers callbacks to kvm_rec_enter(). This fixes a bug
>> where the handler exits to user space and user space wants to modify
>> the GPRS.
>> * Some code rearrangement in rec_exit_ripas_change().
>> Changes since v2:
>> * realm_set_ipa_state() now provides an output parameter for the
>> top_ipa that was changed. Use this to signal the VMM with the correct
>> range that has been transitioned.
>> * Adapt to previous patch changes.
>> ---
>> arch/arm64/include/asm/kvm_rmi.h | 4 +
>> arch/arm64/kvm/Makefile | 2 +-
>> arch/arm64/kvm/arm.c | 26 ++++-
>> arch/arm64/kvm/rmi-exit.c | 186 +++++++++++++++++++++++++++++++
>> arch/arm64/kvm/rmi.c | 42 +++++++
>> 5 files changed, 254 insertions(+), 6 deletions(-)
>> create mode 100644 arch/arm64/kvm/rmi-exit.c
>>
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>> index d99bf4fc3c39..feb534a6678e 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -84,6 +84,10 @@ void kvm_destroy_realm(struct kvm *kvm);
>> void kvm_realm_destroy_rtts(struct kvm *kvm);
>> void kvm_destroy_rec(struct kvm_vcpu *vcpu);
>> +int kvm_rec_enter(struct kvm_vcpu *vcpu);
>> +int kvm_rec_pre_enter(struct kvm_vcpu *vcpu);
>> +int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_status);
>> +
>> static inline bool kvm_realm_is_private_address(struct realm *realm,
>> unsigned long addr)
>> {
>> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
>> index ed3cf30eb06e..4a2d52fdb6a2 100644
>> --- a/arch/arm64/kvm/Makefile
>> +++ b/arch/arm64/kvm/Makefile
>> @@ -16,7 +16,7 @@ CFLAGS_handle_exit.o += -Wno-override-init
>> kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
>> inject_fault.o va_layout.o handle_exit.o config.o \
>> guest.o debug.o reset.o sys_regs.o stacktrace.o \
>> - vgic-sys-reg-v3.o fpsimd.o pkvm.o rmi.o \
>> + vgic-sys-reg-v3.o fpsimd.o pkvm.o rmi.o rmi-exit.o \
>> arch_timer.o trng.o vmid.o emulate-nested.o nested.o at.o \
>> vgic/vgic.o vgic/vgic-init.o \
>> vgic/vgic-irqfd.o vgic/vgic-v2.o \
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 21d9dfdb1ea0..ed88a203b892 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -1331,6 +1331,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>> if (ret > 0)
>> ret = check_vcpu_requests(vcpu);
>> + if (ret > 0 && vcpu_is_rec(vcpu))
>> + ret = kvm_rec_pre_enter(vcpu);
>> +
>> /*
>> * Preparing the interrupts to be injected also
>> * involves poking the GIC, which must be done in a
>> @@ -1378,7 +1381,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>> trace_kvm_entry(*vcpu_pc(vcpu));
>> guest_timing_enter_irqoff();
>> - ret = kvm_arm_vcpu_enter_exit(vcpu);
>> + if (vcpu_is_rec(vcpu))
>> + ret = kvm_rec_enter(vcpu);
>> + else
>> + ret = kvm_arm_vcpu_enter_exit(vcpu);
>> vcpu->mode = OUTSIDE_GUEST_MODE;
>> vcpu->stat.exits++;
>> @@ -1424,7 +1430,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>> * context synchronization event) is necessary to ensure that
>> * pending interrupts are taken.
>> */
>> - if (ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ) {
>> + if (ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ ||
>> + (vcpu_is_rec(vcpu) &&
>> + vcpu->arch.rec.run->exit.exit_reason == RMI_EXIT_IRQ)) {
>> local_irq_enable();
>> isb();
>> local_irq_disable();
>
> The condition could possibly be imprecise because ARM_EXCEPTION_CODE(ret)
> can be ARM_EXCEPTION_IRQ even for a REC. So the precise condition would be:
>
> if ((!vcpu_is_rec(vcpu) && ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ) ||
>     (vcpu_is_rec(vcpu) && vcpu->arch.rec.run->exit.exit_reason == RMI_EXIT_IRQ)) {
Good point - I guess this wouldn't have shown up in testing because
there's no harm (other than performance) in the ISB.
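
The difference between the two conditions can be checked with a small stand-alone sketch. The DEMO_* values below are placeholders chosen for the illustration, not the real ARM_EXCEPTION_* or RMI_EXIT_* encodings, and both predicates are hypothetical helpers that only mirror the shape of the expressions quoted above.

```c
#include <stdbool.h>

/* Placeholder encodings for the illustration only. */
enum { DEMO_EXCEPTION_IRQ = 0, DEMO_EXCEPTION_TRAP = 2 };
enum { DEMO_RMI_EXIT_SYNC = 0, DEMO_RMI_EXIT_IRQ = 1 };

/* Shape of the condition as posted in the patch. */
static bool demo_cond_posted(bool is_rec, int exc_code, int rec_exit_reason)
{
	return exc_code == DEMO_EXCEPTION_IRQ ||
	       (is_rec && rec_exit_reason == DEMO_RMI_EXIT_IRQ);
}

/* Shape of the condition suggested in review. */
static bool demo_cond_suggested(bool is_rec, int exc_code, int rec_exit_reason)
{
	return (!is_rec && exc_code == DEMO_EXCEPTION_IRQ) ||
	       (is_rec && rec_exit_reason == DEMO_RMI_EXIT_IRQ);
}
```

The two predicates differ only for a REC whose return value happens to decode as an IRQ exception while the REC exit reason is something else. In that case the posted form takes the IRQ-enable/ISB path and the suggested form does not.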
>> @@ -1436,8 +1444,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>> trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu),
>> *vcpu_pc(vcpu));
>> - /* Exit types that need handling before we can be preempted */
>> - handle_exit_early(vcpu, ret);
>> + if (!vcpu_is_rec(vcpu)) {
>> + /*
>> + * Exit types that need handling before we can be
>> + * preempted
>> + */
>> + handle_exit_early(vcpu, ret);
>> + }
>> kvm_nested_sync_hwstate(vcpu);
>> @@ -1462,7 +1475,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>> ret = ARM_EXCEPTION_IL;
>> }
>> - ret = handle_exit(vcpu, ret);
>> + if (vcpu_is_rec(vcpu))
>> + ret = handle_rec_exit(vcpu, ret);
>> + else
>> + ret = handle_exit(vcpu, ret);
>> }
>> /* Tell userspace about in-kernel device output levels */
>> diff --git a/arch/arm64/kvm/rmi-exit.c b/arch/arm64/kvm/rmi-exit.c
>> new file mode 100644
>> index 000000000000..e7c51b6cf6ce
>> --- /dev/null
>> +++ b/arch/arm64/kvm/rmi-exit.c
>> @@ -0,0 +1,186 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#include <linux/kvm_host.h>
>> +#include <kvm/arm_hypercalls.h>
>> +#include <kvm/arm_psci.h>
>> +
>> +#include <asm/rmi_smc.h>
>> +#include <asm/kvm_emulate.h>
>> +#include <asm/kvm_rmi.h>
>> +#include <asm/kvm_mmu.h>
>> +
>> +typedef int (*exit_handler_fn)(struct kvm_vcpu *vcpu);
>> +
>> +static int rec_exit_reason_notimpl(struct kvm_vcpu *vcpu)
>> +{
>> + struct realm_rec *rec = &vcpu->arch.rec;
>> +
>> + vcpu_err(vcpu, "Unhandled exit reason from realm (ESR: %#llx)\n",
>> + rec->run->exit.esr);
>> + return -ENXIO;
>> +}
>> +
>
> s/rec->run->exit.esr/kvm_vcpu_get_esr(vcpu), rec->run->exit.esr has been
> copied to the storage space pointed by kvm_vcpu_get_esr() in its caller.
Ack
>> +static int rec_exit_sync_dabt(struct kvm_vcpu *vcpu)
>> +{
>> + return kvm_handle_guest_abort(vcpu);
>> +}
>> +
>> +static int rec_exit_sync_iabt(struct kvm_vcpu *vcpu)
>> +{
>> + struct realm_rec *rec = &vcpu->arch.rec;
>> +
>> + vcpu_err(vcpu, "Unhandled instruction abort (ESR: %#llx).\n",
>> + rec->run->exit.esr);
>> + return -ENXIO;
>> +}
>> +
>
> s/rec->run->exit.esr/kvm_vcpu_get_esr(vcpu)
Ack
>> +static int rec_exit_sys_reg(struct kvm_vcpu *vcpu)
>> +{
>> + struct realm_rec *rec = &vcpu->arch.rec;
>> + unsigned long esr = kvm_vcpu_get_esr(vcpu);
>> + int rt = kvm_vcpu_sys_get_rt(vcpu);
>> + bool is_write = (esr & ESR_ELx_SYS64_ISS_DIR_MASK) == ESR_ELx_SYS64_ISS_DIR_WRITE;
>> + int ret;
>> +
>> + if (is_write)
>> + vcpu_set_reg(vcpu, rt, rec->run->exit.gprs[rt]);
>> +
>> + ret = kvm_handle_sys_reg(vcpu);
>> + if (!is_write)
>> + rec->run->enter.gprs[rt] = vcpu_get_reg(vcpu, rt);
>> +
>> + return ret;
>> +}
>> +
>> +static exit_handler_fn rec_exit_handlers[] = {
>> + [0 ... ESR_ELx_EC_MAX] = rec_exit_reason_notimpl,
>> + [ESR_ELx_EC_SYS64] = rec_exit_sys_reg,
>> + [ESR_ELx_EC_DABT_LOW] = rec_exit_sync_dabt,
>> + [ESR_ELx_EC_IABT_LOW] = rec_exit_sync_iabt
>> +};
>> +
>> +static int rec_exit_psci(struct kvm_vcpu *vcpu)
>> +{
>> + struct realm_rec *rec = &vcpu->arch.rec;
>> + int i;
>> +
>> + for (i = 0; i < REC_RUN_GPRS; i++)
>> + vcpu_set_reg(vcpu, i, rec->run->exit.gprs[i]);
>> +
>> + return kvm_smccc_call_handler(vcpu);
>> +}
>> +
>> +static int rec_exit_ripas_change(struct kvm_vcpu *vcpu)
>> +{
>> + struct kvm *kvm = vcpu->kvm;
>> + struct realm *realm = &kvm->arch.realm;
>> + struct realm_rec *rec = &vcpu->arch.rec;
>> + unsigned long base = rec->run->exit.ripas_base;
>> + unsigned long top = rec->run->exit.ripas_top;
>> + unsigned long ripas = rec->run->exit.ripas_value;
>> +
>> + if (!kvm_realm_is_private_address(realm, base) ||
>> + !kvm_realm_is_private_address(realm, top - 1)) {
>> + vcpu_err(vcpu, "Invalid RIPAS_CHANGE for %#lx - %#lx, ripas: %#lx\n",
>> + base, top, ripas);
>> + /* Set RMI_REJECT bit */
>> + rec->run->enter.flags = REC_ENTER_FLAG_RIPAS_RESPONSE;
>> + return -EINVAL;
>> + }
>
> I doubt the flag (REC_ENTER_FLAG_RIPAS_RESPONSE) will be handed over to
> the RMM, since the negative return value forces an exit to the VMM (e.g.
> QEMU), where how this problematic case should be handled is TBD.
It's perhaps a bit non-obvious but enter.flags is cleared on the exit.
So even if we return to the VMM the flags will be kept for the next entry.
I agree it is somewhat TBD exactly how this case should be handled -
there's a bunch of "VM did something stupid" cases like this that are a
bit problematic.
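To make the ordering concrete, a minimal standalone model (not the kernel
code; the mock_/MOCK_ names and the flag bit value are made up for this
sketch): the flags are cleared before the exit reason is dispatched, so
whatever the handler sets afterwards is still there for the next entry.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit; the real flag value comes from the RMI headers. */
#define MOCK_REC_ENTER_FLAG_RIPAS_RESPONSE (1u << 4)

/* Hypothetical stand-in for the enter half of the rec run structure. */
struct mock_rec_run {
	uint64_t enter_flags;
};

/*
 * Models the ordering in handle_rec_exit(): reset the flags first, then
 * let the exit handler run. A flag set by the handler therefore survives
 * the return to the VMM and is seen on the next entry.
 */
static int mock_handle_rec_exit(struct mock_rec_run *run, bool reject_ripas)
{
	run->enter_flags = 0; /* reset for the next run of the REC */

	if (reject_ripas) {
		run->enter_flags = MOCK_REC_ENTER_FLAG_RIPAS_RESPONSE;
		return -EINVAL; /* exits to the VMM, flag stays set */
	}

	return 1; /* back to the guest */
}
```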
Thanks,
Steve
>> +
>> + /* Exit to VMM, the actual RIPAS change is done on next entry */
>> + kvm_prepare_memory_fault_exit(vcpu, base, top - base, false, false,
>> + ripas == RMI_RAM);
>> +
>> + /*
>> + * KVM_EXIT_MEMORY_FAULT requires a return code of -EFAULT, see the
>> + * API documentation
>> + */
>> + return -EFAULT;
>> +}
>> +
>> +static void update_arch_timer_irq_lines(struct kvm_vcpu *vcpu)
>> +{
>> + struct realm_rec *rec = &vcpu->arch.rec;
>> +
>> + __vcpu_assign_sys_reg(vcpu, CNTV_CTL_EL0, rec->run->exit.cntv_ctl);
>> + __vcpu_assign_sys_reg(vcpu, CNTV_CVAL_EL0, rec->run->exit.cntv_cval);
>> + __vcpu_assign_sys_reg(vcpu, CNTP_CTL_EL0, rec->run->exit.cntp_ctl);
>> + __vcpu_assign_sys_reg(vcpu, CNTP_CVAL_EL0, rec->run->exit.cntp_cval);
>> +
>> + kvm_realm_timers_update(vcpu);
>> +}
>> +
>> +/*
>> + * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
>> + * proper exit to userspace.
>> + */
>> +int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_ret)
>> +{
>> + struct realm_rec *rec = &vcpu->arch.rec;
>> + u8 esr_ec = ESR_ELx_EC(rec->run->exit.esr);
>> + unsigned long status, index;
>> +
>> + status = RMI_RETURN_STATUS(rec_run_ret);
>> + index = RMI_RETURN_INDEX(rec_run_ret);
>> +
>> + /*
>> + * If a PSCI_SYSTEM_OFF request raced with a vcpu executing, we might
>> + * see the following status code and index indicating an attempt to run
>> + * a REC when the RD state is SYSTEM_OFF. In this case, we just need to
>> + * return to user space which can deal with the system event or will try
>> + * to run the KVM VCPU again, at which point we will no longer attempt
>> + * to enter the Realm because we will have a sleep request pending on
>> + * the VCPU as a result of KVM's PSCI handling.
>> + */
>> + if (status == RMI_ERROR_REALM) {
>> + vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
>> + return 0;
>> + }
>> +
>> + /*
>> + * If a VCPU has been turned on, but the REC state hasn't been updated
>> + * we may experience RMI_ERROR_REC. Exit to the userspace with -EAGAIN
>> + * for a retry.
>> + */
>> + if (status == RMI_ERROR_REC)
>> + return -EAGAIN;
>> + if (rec_run_ret)
>> + return -ENXIO;
>> +
>> + vcpu->arch.fault.esr_el2 = rec->run->exit.esr;
>> + vcpu->arch.fault.far_el2 = rec->run->exit.far;
>> + /* HPFAR_EL2 is only valid for RMI_EXIT_SYNC */
>> + vcpu->arch.fault.hpfar_el2 = 0;
>> +
>> + update_arch_timer_irq_lines(vcpu);
>> +
>> + /* Reset the emulation flags for the next run of the REC */
>> + rec->run->enter.flags = 0;
>> +
>> + switch (rec->run->exit.exit_reason) {
>> + case RMI_EXIT_SYNC:
>> + /*
>> + * HPFAR_EL2_NS is hijacked to indicate a valid HPFAR value,
>> + * see __get_fault_info()
>> + */
>> + vcpu->arch.fault.hpfar_el2 = rec->run->exit.hpfar | HPFAR_EL2_NS;
>> + return rec_exit_handlers[esr_ec](vcpu);
>> + case RMI_EXIT_IRQ:
>> + case RMI_EXIT_FIQ:
>> + case RMI_EXIT_SERROR:
>> + return 1;
>> + case RMI_EXIT_PSCI:
>> + return rec_exit_psci(vcpu);
>> + case RMI_EXIT_RIPAS_CHANGE:
>> + return rec_exit_ripas_change(vcpu);
>> + }
>> +
>> + kvm_pr_unimpl("Unsupported exit reason: %u\n",
>> + rec->run->exit.exit_reason);
>> + vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>> + return 0;
>> +}
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index 353a5ca45e78..d8a5fb12db2d 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -173,6 +173,48 @@ static int realm_ensure_created(struct kvm *kvm)
>> return -ENXIO;
>> }
>> +/*
>> + * kvm_rec_pre_enter - Complete operations before entering a REC
>> + *
>> + * Some operations require work to be completed before entering a realm. That
>> + * work may require memory allocation so cannot be done in the kvm_rec_enter()
>> + * call.
>> + *
>> + * Return: 1 if we should enter the guest
>> + * 0 if we should exit to userspace
>> + * < 0 if we should exit to userspace, where the return value indicates
>> + * an error
>> + */
>> +int kvm_rec_pre_enter(struct kvm_vcpu *vcpu)
>> +{
>> + struct realm_rec *rec = &vcpu->arch.rec;
>> +
>> + if (kvm_realm_state(vcpu->kvm) != REALM_STATE_ACTIVE)
>> + return -EINVAL;
>> +
>> + switch (rec->run->exit.exit_reason) {
>> + case RMI_EXIT_HOST_CALL:
>> + for (int i = 0; i < REC_RUN_GPRS; i++)
>> + rec->run->enter.gprs[i] = vcpu_get_reg(vcpu, i);
>> + break;
>> + }
>> +
>> + return 1;
>> +}
>> +
>> +int noinstr kvm_rec_enter(struct kvm_vcpu *vcpu)
>> +{
>> + struct realm_rec *rec = &vcpu->arch.rec;
>> + int ret;
>> +
>> + guest_state_enter_irqoff();
>> + ret = rmi_rec_enter(virt_to_phys(rec->rec_page),
>> + virt_to_phys(rec->run));
>> + guest_state_exit_irqoff();
>> +
>> + return ret;
>> +}
>> +
>> static int kvm_create_rec(struct kvm_vcpu *vcpu)
>> {
>> struct user_pt_regs *vcpu_regs = vcpu_gp_regs(vcpu);
>
> Thanks,
> Gavin
>
^ permalink raw reply
* Re: [PATCH v14 23/44] arm64: RMI: Handle RMI_EXIT_RIPAS_CHANGE
From: Steven Price @ 2026-06-05 15:02 UTC (permalink / raw)
To: Aneesh Kumar K.V, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
Shanker Donthineni, Alper Gun, Emi Kisanuki, Vishal Annapurve,
WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <yq5ay0hfspok.fsf@kernel.org>
On 19/05/2026 10:40, Aneesh Kumar K.V wrote:
> Steven Price <steven.price@arm.com> writes:
>
> ...
>
>> +void kvm_realm_unmap_range(struct kvm *kvm, unsigned long start,
>> + unsigned long size, bool unmap_private,
>> + bool may_block)
>> +{
>> + unsigned long end = start + size;
>> + struct realm *realm = &kvm->arch.realm;
>> +
>> + if (!kvm_realm_is_created(kvm))
>> + return;
>> +
>> + end = min(BIT(realm->ia_bits - 1), end);
>> +
>> + realm_unmap_shared_range(kvm, start, end, may_block);
>> + if (unmap_private)
>> + realm_unmap_private_range(kvm, start, end, may_block);
>> +}
>> +
>
> kvm_gmem_invalidate_begin() indicates a private-only invalidation. How
> is that supported?
Because we treat the private and shared spaces as aliasing we don't
really support a "private-only" invalidation. So the shared space will
be invalidated as well. Something has gone wrong if we've ended up with
the 'same' IPA being used in both the private and shared spaces.
Private has to be treated slightly specially because removing a private
mapping is observable by the guest (the page can't be reinserted without
the guest agreeing and the contents being wiped). For shared mappings
the page can simply be refaulted.
That said, I'll look into Wei-Lin's suggestion to use
kvm_gfn_range_filter which would allow all three combinations of
private-only, shared-only and private+shared.
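Roughly what I have in mind, as a standalone model only: the enum below
imitates the shape of a shared/private filter bitmask, and every mock_/MOCK_
name is invented for this sketch rather than taken from the series. The
counters stand in for the real unmap helpers.

```c
/* Hypothetical filter bits, one per address space to invalidate. */
enum mock_range_filter {
	MOCK_FILTER_SHARED  = 1u << 0,
	MOCK_FILTER_PRIVATE = 1u << 1,
};

/* Records which unmap paths ran; stands in for the real unmap helpers. */
struct mock_unmap_stats {
	int shared_calls;
	int private_calls;
};

/*
 * Filter-driven variant of the unmap entry point: each space is only
 * touched when the caller asked for it, which allows private-only,
 * shared-only and private+shared invalidations.
 */
static void mock_realm_unmap_range(struct mock_unmap_stats *st,
				   unsigned int filter)
{
	if (filter & MOCK_FILTER_SHARED)
		st->shared_calls++;  /* would unmap the shared alias */
	if (filter & MOCK_FILTER_PRIVATE)
		st->private_calls++; /* would unmap the private range */
}
```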
Thanks,
Steve
^ permalink raw reply
* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Dave Hansen @ 2026-06-05 16:23 UTC (permalink / raw)
To: Kiryl Shutsemau, Chao Gao
Cc: Edgecombe, Rick P, kvm@vger.kernel.org,
linux-coco@lists.linux.dev, Huang, Kai, Zhao, Yan Y,
seanjc@google.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
pbonzini@redhat.com, nik.borisov@suse.com,
linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
x86@kernel.org
In-Reply-To: <aiK1_q8beMcIEiwO@thinkstation>
On 6/5/26 04:42, Kiryl Shutsemau wrote:
>>> I don't see a reason why we can't keep the scoped_guard() on get side.
>> One additional reason to drop scoped_guard() is that it mixes cleanup helpers
>> with goto, which is discouraged. See [*]
>>
>> :Lastly, given that the benefit of cleanup helpers is removal of “goto”, and
>> :that the “goto” statement can jump between scopes, the expectation is that
>> :usage of “goto” and cleanup helpers is never mixed in the same function.
> Fair enough.
>
> But it can also be addressed if we free the PAMT page array with the guard
> too :P
How important is this patch? I see "Optimize" but I read "Optional".
If we're arguing about it, maybe we should just kick it out and focus on
the more important bits.
^ permalink raw reply
* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Sean Christopherson @ 2026-06-05 17:58 UTC (permalink / raw)
To: Ackerley Tng
Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
linux-coco, linux-kernel, x86
In-Reply-To: <CAEvNRgHEq-hHcTivUw8TYBeMu-RpS=Ho4DXaNXKQKLPL_biTgg@mail.gmail.com>
On Fri, Jun 05, 2026, Ackerley Tng wrote:
> Lisa Wang <wyihan@google.com> writes:
>
> > From: Sagi Shahar <sagis@google.com>
> >
> > Finalize TDX VM after creation to make it runnable.
> >
> > Signed-off-by: Sagi Shahar <sagis@google.com>
> > Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Lisa Wang <wyihan@google.com>
> > ---
> > tools/testing/selftests/kvm/lib/x86/processor.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c
> > index d84c629a1945..842cac168e99 100644
> > --- a/tools/testing/selftests/kvm/lib/x86/processor.c
> > +++ b/tools/testing/selftests/kvm/lib/x86/processor.c
> > @@ -1479,6 +1479,12 @@ bool kvm_arch_has_default_irqchip(void)
> > return true;
> > }
> >
> > +void kvm_arch_vm_finalize_vcpus(struct kvm_vm *vm)
> > +{
> > + if (is_tdx_vm(vm))
> > + tdx_vm_finalize(vm);
> > +}
> > +
>
> This doesn't necessarily block this series, we could (re)move this
> later: I'm not sure if kvm_arch_vm_finalize_vcpus() is the correct place
> to be finalizing the VM.
>
> Was kvm_arch_vm_finalize_vcpus() supposed to be for finalizing vCPUs
> instead?
>
> The awkward part is that kvm_arch_vm_finalize_vcpus() is called from
> __vm_create_with_vcpus().
>
> While building this POC to test conversions [1] I only wanted to create
> the vm and vcpus and didn't want to finalize yet, since I still needed
> to do more mappings in the guest (and I needed the vm pointer to do
> mappings in the guest).
Hmm, I would argue this is a flaw in the selftests infrastructure. IMO, as a
developer, it's quite surprising that the current value of a global variable
doesn't show up in the VM automagically. I totally understand why selftests
work that way, but it's certainly odd and annoying. If _that_ were solved, then
the kludginess of what you're doing goes away.
The other way this could be solved is by adding support for annotating globals
with a __shared flag, a la the kernel's __bss_decrypted, so that loading memory
into the VM can automatically mark the associated globals' pages as shared.
> Would calling tdx_vm_finalize() from within vcpu_run(), just once, be
> too magical?
Yes.
> It's also possible to have some kvm_vm_finalize() call that can be
> explicitly and manually invoked from selftests just for CoCo selftests.
Why bother? It's obviously possible to call kvm_arch_vm_finalize_vcpus() directly.
^ permalink raw reply
* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Sean Christopherson @ 2026-06-05 18:04 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <87fr315fq9.ffs@fw13>
On Fri, Jun 05, 2026, Thomas Gleixner wrote:
> On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> > Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
> > frequency.
>
> That's misleading because fixed frequency means that the frequency does
> not change, i.e. X86_FEATURE_CONSTANT_TSC is set. But
> X86_FEATURE_CONSTANT_TSC does not imply that the frequency can be read
> from CPUID/MSRs.
Sorry, "if the TSC runs at a known, fixed frequency" would be a better way to
phrase this.
> > In practice, this is likely one big nop, as re-calibration is
> > used only for SMP=n kernels, and only for hardware that is 20+ years old,
> > i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.
>
> recalibrate_cpu_khz() is only invoked from Intel P4 and AMD K7 CPU
> frequency drivers, which means that's absolutely not interesting and
> neither X86_FEATURE_CONSTANT_TSC nor X86_FEATURE_TSC_KNOWN_FREQ can be
> set on those systems.
It _shouldn't_ be set on those systems, but in the world of virtualization it's
not completely impossible.
> IOW, this patch is pointless voodoo ware.
Would y'all be opposed to adding a WARN? I don't actually care about P4 or K7
CPUs, but without any reference to X86_FEATURE_TSC_KNOWN_FREQ in
recalibrate_cpu_khz(), the code _looks_ wrong, and so is very confusing for
readers that don't already know that in practice, it's limited to ancient CPUs.
In other words, the point is to document expectations and mutual exclusion, not
to "fix" anything.
^ permalink raw reply
* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Ackerley Tng @ 2026-06-05 18:27 UTC (permalink / raw)
To: Sean Christopherson
Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
linux-coco, linux-kernel, x86
In-Reply-To: <aiMOWkiVBqoQDAPd@google.com>
Sean Christopherson <seanjc@google.com> writes:
>
> [...snip...]
>
>> Was kvm_arch_vm_finalize_vcpus() supposed to be for finalizing vCPUs
>> instead?
>>
>> The awkward part is that kvm_arch_vm_finalize_vcpus() is called from
>> __vm_create_with_vcpus().
>>
>> While building this POC to test conversions [1] I only wanted to create
>> the vm and vcpus and didn't want to finalize yet, since I still needed
>> to do more mappings in the guest (and I needed the vm pointer to do
>> mappings in the guest).
>
> Hmm, I would argue this is a flaw in the selftests infrastructure. IMO, as a
> developer, it's quite surprising that the current value of a global variable
> doesn't show up in the VM automagically. I totally understand why selftests
> work that way, but it's certainly odd and annoying. If _that_ were solved, then
> the kludginess of what you're doing goes away.
>
> The other way this could be solved is by adding support for annotating globals
> with a __shared flag, a la the kernel's __bss_decrypted, so that loading memory
> into the VM can automatically mark the associated globals' pages as shared.
>
More generally, is your opinion that tests should not have to add extra
memslots?
If I wanted a shared page, would I have to do
static __shared test_page[4096] = {0};
and then rely on ELF loading to put that in the guest for me? Are there
some compiler flags/how will I require that test_page be page aligned?
If I mark 10 globals as __shared, would the compiler automatically
consolidate the shared memory together?
I think it's a bit constraining to require that all guest memory be set
up statically. It's nice to have but I'd like another option...
Many tests use vm_userspace_mem_region_add(), CoCo tests that require
finalizing shouldn't be disallowed that option.
>> Would calling tdx_vm_finalize() from within vcpu_run(), just once, be
>> too magical?
>
> Yes.
>
>> It's also possible to have some kvm_vm_finalize() call that can be
>> explicitly and manually invoked from selftests just for CoCo selftests.
>
> Why bother? It's obviously possible to call kvm_arch_vm_finalize_vcpus() directly.
Works for me to call directly. Do you mean kvm_arch_vm_finalize_vcpus()
is the right function where the TD is finalized?
For tests that need to do more setup after creating a vm, is the only
way out to call __vm_create() then vm_vcpu_add() to avoid premature
finalization in __vm_create_with_vcpus() when
kvm_arch_vm_finalize_vcpus() is called?
^ permalink raw reply
* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Sean Christopherson @ 2026-06-05 18:27 UTC (permalink / raw)
To: Ackerley Tng
Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgHz5GDjq0GqRmpQdHc-X45gCNr39VYWZH-T7XhPEtN5CQ@mail.gmail.com>
On Thu, Jun 04, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> >> + KVM: selftests: Test conversion with elevated page refcount
> >> + Askar pointed out that soon vmsplice may not pin pages. Should I
> >> pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
> >> take a dependency on CONFIG_GUP_TEST.
> >
> > I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
> > it probably is the least awful choice. E.g. KVM also pins pages in certain flows,
> > but we're _also_ actively working to remove the need to pin.
> >
> > Hmm, maybe IORING_REGISTER_PBUF_RING? AFAICT, it's almost literally a "pin user
> > memory" syscall.
> >
>
> Hmm that takes a dependency on io_uring, which isn't always compiled
> in. Between CONFIG_IO_URING and CONFIG_GUP_TEST, I'd rather
> CONFIG_GUP_TEST.
Or try both? If it's not a ridiculous amount of work.
^ permalink raw reply
* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Thomas Gleixner @ 2026-06-05 19:51 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <aiMPxl5vkvJDldi9@google.com>
On Fri, Jun 05 2026 at 11:04, Sean Christopherson wrote:
> On Fri, Jun 05, 2026, Thomas Gleixner wrote:
>> On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
>> > Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
>> > frequency.
>>
>> That's misleading because fixed frequency means that the frequency does
>> not change, i.e. X86_FEATURE_CONSTANT_TSC is set. But
>> X86_FEATURE_CONSTANT_TSC does not imply that the frequency can be read
>> from CPUID/MSRs.
>
> Sorry, "if the TSC runs at a known, fixed frequency" would be a better way to
> phrase this.
>
>> > In practice, this is likely one big nop, as re-calibration is
>> > used only for SMP=n kernels, and only for hardware that is 20+ years old,
>> > i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.
>>
>> recalibrate_cpu_khz() is only invoked from Intel P4 and AMD K7 CPU
>> frequency drivers, which means that's absolutely not interesting and
>> neither X86_FEATURE_CONSTANT_TSC nor X86_FEATURE_TSC_KNOWN_FREQ can be
>> set on those systems.
>
> It _shouldn't_ be set on those systems, but in the world of virtualization it's
> not completely impossible.
>
>> IOW, this patch is pointless voodoo ware.
>
> Would y'all be opposed to adding a WARN? I don't actually care about P4 or K7
> CPUs, but without any reference to X86_FEATURE_TSC_KNOWN_FREQ in
> recalibrate_cpu_khz(), the code _looks_ wrong, and so is very confusing for
> readers that don't already know that in practice, it's limited to ancient CPUs.
>
> In other words, the point is to document expectations and mutual exclusion, not
> to "fix" anything.
Fair enough.
So yes, having a check there for X86_FEATURE_TSC_KNOWN_FREQ
(X86_FEATURE_CONSTANT_TSC is not interesting), emitting a warning and
returning early is the right thing to do there.
But we also should have a check in the TSC init code somewhere which
validates that X86_FEATURE_CONSTANT_TSC is set when
X86_FEATURE_TSC_KNOWN_FREQ is set. X86_FEATURE_TSC_KNOWN_FREQ is useless
w/o X86_FEATURE_CONSTANT_TSC.
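Something along these lines, as a standalone model and not a patch: the
feature bits, the warning counter and all mock_/MOCK_ names are made up for
illustration. The first helper models a recalibration path that warns and
bails when the frequency is known, the second models an init-time check
that a known frequency implies a constant TSC.

```c
#include <stdbool.h>

/* Stand-in feature bits; not the real X86_FEATURE_* encodings. */
#define MOCK_CONSTANT_TSC   (1u << 0)
#define MOCK_TSC_KNOWN_FREQ (1u << 1)

static int mock_warn_count;

/* Minimal stand-in for a WARN-style helper: count and return the condition. */
static bool mock_warn_on(bool cond)
{
	if (cond)
		mock_warn_count++;
	return cond;
}

/* Returns true if recalibration would actually run. */
static bool mock_recalibrate_cpu_khz(unsigned int features)
{
	/* A known frequency must never be recalibrated: warn and bail. */
	if (mock_warn_on((features & MOCK_TSC_KNOWN_FREQ) != 0))
		return false;

	return true;
}

/* Init-time sanity check: KNOWN_FREQ without CONSTANT_TSC is useless. */
static bool mock_tsc_features_consistent(unsigned int features)
{
	if (!(features & MOCK_TSC_KNOWN_FREQ))
		return true;

	return !mock_warn_on(!(features & MOCK_CONSTANT_TSC));
}
```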
Thanks,
tglx
^ permalink raw reply
* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Sean Christopherson @ 2026-06-05 20:48 UTC (permalink / raw)
To: Ackerley Tng
Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
linux-coco, linux-kernel, x86
In-Reply-To: <CAEvNRgF09SfDm=OgbrS8-wpfxbNecQkqAQwf1ELq1jWu7NjbUA@mail.gmail.com>
On Fri, Jun 05, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
> >
> > [...snip...]
> >
> >> Was kvm_arch_vm_finalize_vcpus() supposed to be for finalizing vCPUs
> >> instead?
> >>
> >> The awkward part is that kvm_arch_vm_finalize_vcpus() is called from
> >> __vm_create_with_vcpus().
> >>
> >> While building this POC to test conversions [1] I only wanted to create
> >> the vm and vcpus and didn't want to finalize yet, since I still needed
> >> to do more mappings in the guest (and I needed the vm pointer to do
> >> mappings in the guest).
> >
> > Hmm, I would argue this is a flaw in the selftests infrastructure. IMO, as a
> > developer, it's quite surprising that the current value of a global variable
> > doesn't show up in the VM automagically. I totally understand why selftests
> > work that way, but it's certainly odd and annoying. If _that_ were solved, then
> > the kludginess of what you're doing goes away.
> >
> > The other way this could be solved is by adding support for annotating globals
> > with a __shared flag, a la the kernel's __bss_decrypted, so that loading memory
> > into the VM can automatically mark the associated globals' pages as shared.
> >
>
> More generally, is your opinion that tests should not have to add extra
> memslots?
I don't care? What I care about is making it as easy and intuitive as possible
for people to write tests, and to minimize maintenance costs.
> If I wanted a shared page, would I have to do
>
> static __shared test_page[4096] = {0};
>
> and then rely on ELF loading to put that in the guest for me? Are there
> some compiler flags/how will I require that test_page be page aligned?
Compiler and linker shenanigans.
> If I mark 10 globals as __shared, would the compiler automatically
> consolidate the shared memory together?
Yes, follow the __bss_decrypted breadcrumbs.
#define __bss_decrypted __section(".bss..decrypted")
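As a userspace sketch of the same trick (assumptions: an ELF target and a
GNU-compatible linker that synthesizes __start_/__stop_ symbols for sections
whose names are valid C identifiers; MOCK_SHARED, the section name and the
variables are invented here, nothing like this exists in selftests today):

```c
#include <stdint.h>

#define MOCK_PAGE_SIZE 4096

/*
 * Hypothetical annotation: place the object in a dedicated section and
 * force page alignment, so that all annotated globals end up contiguous
 * and page aligned.
 */
#define MOCK_SHARED \
	__attribute__((used, section("mock_shared"), aligned(MOCK_PAGE_SIZE)))

MOCK_SHARED uint8_t test_page_a[MOCK_PAGE_SIZE];
MOCK_SHARED uint8_t test_page_b[MOCK_PAGE_SIZE];

/* Section bounds provided by the linker (assumption, see above). */
extern uint8_t __start_mock_shared[];
extern uint8_t __stop_mock_shared[];

/* True if @p lies inside the consolidated shared section. */
static int mock_in_shared_section(const void *p)
{
	uintptr_t addr = (uintptr_t)p;

	return addr >= (uintptr_t)__start_mock_shared &&
	       addr < (uintptr_t)__stop_mock_shared;
}
```

A loader could then walk [__start_mock_shared, __stop_mock_shared) once and
mark that whole range as shared, instead of tracking each global by hand.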
> I think it's a bit constraining to require that all guest memory be set
> up statically. It's nice to have but I'd like another option...
You do have options, they just require more work.
> Many tests use vm_userspace_mem_region_add(), CoCo tests that require
> finalizing shouldn't be disallowed that option.
What does that have to do with finalizing the VM?
> >> It's also possible to have some kvm_vm_finalize() call that can be
> >> explicitly and manually invoked from selftests just for CoCo selftests.
> >
> > Why bother? It's obviously possible to call kvm_arch_vm_finalize_vcpus() directly.
>
> Works for me to call directly. Do you mean kvm_arch_vm_finalize_vcpus()
> is the right function where the TD is finalized?
>
> For tests that need to do more setup after creating a vm, is the only
> way out to call __vm_create() then vm_vcpu_add() to avoid premature
> finalization in __vm_create_with_vcpus() when
> kvm_arch_vm_finalize_vcpus() is called?
Depends on what you're doing. Sometimes, the answer will be yes. That's why
there are "low level" APIs, so that some tests can do fancy things, while most
tests can leave the details to the infrastructure.
If there's a recurring problem, or we anticipate one, then we can and should
figure out how to minimize the pain so that tests don't have to deal with the
same boilerplate issues over and over. Hence the __shared idea.
^ permalink raw reply
* Re: [PATCH v6 01/20] s390: Expose protected virtualization through cc_platform_has()
From: JAEHOON KIM @ 2026-06-06 0:34 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
linux-coco
Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
Alexey Kardashevskiy, Dan Williams, Xu Yilun, linuxppc-dev,
linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, Halil Pasic,
Matthew Rosato
In-Reply-To: <20260604083959.1265923-2-aneesh.kumar@kernel.org>
On 6/4/2026 3:39 AM, Aneesh Kumar K.V (Arm) wrote:
> Protected virtualization guests use memory encryption, so advertise that to
> the rest of the kernel through cc_platform_has(CC_ATTR_MEM_ENCRYPT).
>
> s390 already forces DMA mappings to be unencrypted for protected
> virtualization guests through force_dma_unencrypted(). Add
> ARCH_HAS_CC_PLATFORM and provide the matching cc_platform_has()
> implementation
>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Tested-by: Jaehoon Kim <jhkim@linux.ibm.com>
Tested on s390 PV guest with swiotlb_dynamic configuration. SWIOTLB
bounce buffer allocation and dynamic pool management work correctly.
Also concurrent I/O stress completed successfully.
Thanks,
Jaehoon.
> ---
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> Cc: Jaehoon Kim <jhkim@linux.ibm.com>
> ---
> arch/s390/Kconfig | 1 +
> arch/s390/mm/init.c | 14 ++++++++++++++
> 2 files changed, 15 insertions(+)
>
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index ecbcbb781e40..9b5e6029e043 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -87,6 +87,7 @@ config S390
> select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
> select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> select ARCH_HAS_CC_CAN_LINK
> + select ARCH_HAS_CC_PLATFORM
> select ARCH_HAS_CPU_FINALIZE_INIT
> select ARCH_HAS_CURRENT_STACK_POINTER
> select ARCH_HAS_DEBUG_VIRTUAL
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 1f72efc2a579..ad3c6d92b801 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -50,6 +50,7 @@
> #include <linux/virtio_anchor.h>
> #include <linux/virtio_config.h>
> #include <linux/execmem.h>
> +#include <linux/cc_platform.h>
>
> pgd_t swapper_pg_dir[PTRS_PER_PGD] __section(".bss..swapper_pg_dir");
> pgd_t invalid_pg_dir[PTRS_PER_PGD] __section(".bss..invalid_pg_dir");
> @@ -140,6 +141,19 @@ bool force_dma_unencrypted(struct device *dev)
> return is_prot_virt_guest();
> }
>
> +
> +bool cc_platform_has(enum cc_attr attr)
> +{
> + switch (attr) {
> + case CC_ATTR_MEM_ENCRYPT:
> + return is_prot_virt_guest();
> +
> + default:
> + return false;
> + }
> +}
> +EXPORT_SYMBOL_GPL(cc_platform_has);
> +
> /* protected virtualization */
> static void __init pv_init(void)
> {
^ permalink raw reply
* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Thomas Gleixner @ 2026-06-06 10:34 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <20260529144435.704127-11-seanjc@google.com>
On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> Now that all paravirt code that explicitly specifies the TSC frequency
> also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
>
> Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> by the user. Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> line parameter"), one of the goals of the param is to allow the refined
> calibration work "to do meaningful error checking".
>
> Note, preferring the user-provided TSC frequency over the frequency from
> the hypervisor or trusted firmware, while simultaneously not treating the
> user-provided frequency as gospel, is obviously incongruous. Sweep the
> problem under the rug for now to avoid opening a big can of worms that
> likely doesn't have a great answer.
There is a good answer I think.
tsc_early_khz exists to cater for the overclocking crowd. On their
modded systems the firmware-supplied TSC frequency (CPUID/MSR) no
longer matches reality. So they work around that by supplying a close
enough tsc_early_khz and then let the refined calibration work
figure it out.
Arguably that's only relevant for bare metal systems and what's worse is
that in virtual environments the refined calibration work can fail,
which renders the TSC unstable.
So I'd rather say we change this logic to:
    if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
        tsc_khz = x86_init.....();
        force(X86_FEATURE_TSC_KNOWN_FREQ);
    } else if (tsc_early_khz) {
        ....
    } else {
        ...
    }
Along with:
    if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
        if (tsc_early_khz)
            pr_warn("Ignoring nonsensical tsc_early_khz command line argument\n");
    }
or something daft like that.
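To make the proposed ordering concrete, here is a minimal userspace sketch of the decision. Everything in it is an assumption for illustration: pick_tsc_khz(), struct tsc_choice and its fields are hypothetical names, not kernel APIs, and the bodies of the two bare-metal branches (elided above) are filled with the simplest plausible behaviour.

```c
#include <stdbool.h>

/* Hypothetical model of the outcome; none of these names exist in the kernel. */
struct tsc_choice {
	unsigned int khz;	/* frequency the kernel would start with */
	bool known_freq;	/* models forcing X86_FEATURE_TSC_KNOWN_FREQ */
	bool warn_ignored;	/* models the pr_warn() about tsc_early_khz */
};

/*
 * native:         true on bare metal (models X86_HYPER_NATIVE)
 * hv_khz:         frequency reported by the hypervisor / trusted firmware
 * early_khz:      value of the tsc_early_khz parameter, 0 if not given
 * calibrated_khz: result of the normal calibration path (assumed fallback)
 */
static struct tsc_choice pick_tsc_khz(bool native, unsigned int hv_khz,
				      unsigned int early_khz,
				      unsigned int calibrated_khz)
{
	struct tsc_choice c = { 0, false, false };

	if (!native) {
		/* Virtualized: trust the hypervisor, ignore the parameter. */
		c.khz = hv_khz;
		c.known_freq = true;
		c.warn_ignored = early_khz != 0;
	} else if (early_khz) {
		/* Bare metal with a user hint: not forced, refinement may correct it. */
		c.khz = early_khz;
	} else {
		/* Bare metal, no hint: assumed to fall back to calibration. */
		c.khz = calibrated_khz;
	}
	return c;
}
```

The point of the sketch is only the priority order: the guest case is checked first, so a user-supplied value can never override the hypervisor-provided frequency.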
The kernel has for various reasons always tried to cater for the needs
of users who are plagued by bonkers firmware, but we have to stop
prioritizing, or treating as equals, ancient and modded out-of-spec hardware.
TBH, I consider that whole KVM clock nonsense to fall into the modded
out of spec hardware realm. Do a reality check:
How many production systems are still out there which run VMs on CPUs
with a broken TSC and without VM TSC scaling?
I'm not saying that we should not support the few remaining systems
anymore, but our tendency to pretend that we can keep all of this
nonsense working and at the same time make progress is just a fallacy.
I would rather have a more fine-grained differentiation and
prioritization of:
1) The actual real world relevant use cases which run on contemporary
hardware.
2) Still relevant use cases on slightly older hardware with fewer
capabilities
3) Broken firmware
4) Modded out of spec nonsense
5) Support for ancient museum pieces
Thanks,
tglx
^ permalink raw reply
* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: David Woodhouse @ 2026-06-06 10:52 UTC (permalink / raw)
To: Thomas Gleixner, Sean Christopherson, Paolo Bonzini, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, Tom Lendacky, Nikunj A Dadhania,
Michael Kelley
In-Reply-To: <877boc554l.ffs@fw13>
On Sat, 2026-06-06 at 12:34 +0200, Thomas Gleixner wrote:
> On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
>
> > Now that all paravirt code that explicitly specifies the TSC frequency
> > also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> > and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
> >
> > Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> > by the user. Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> > line parameter"), one of the goals of the param is to allow the refined
> > calibration work "to do meaningful error checking".
> >
> > Note, preferring the user-provided TSC frequency over the frequency from
> > the hypervisor or trusted firmware, while simultaneously not treating the
> > user-provided frequency as gospel, is obviously incongruous. Sweep the
> > problem under the rug for now to avoid opening a big can of worms that
> > likely doesn't have a great answer.
>
> There is a good answer I think.
>
> tsc_early_khz exists to cater for the overclocking crowd. On their
> modded systems the firmware supplied TSC frequency (CPUID/MSR) is not
> matching reality anymore. So they work around that by supplying a close
> enough tsc_early_khz and then they let the refined calibration work
> figure it out.
>
> Arguably that's only relevant for bare metal systems and what's worse is
> that in virtual environments the refined calibration work can fail,
> which renders the TSC unstable.
>
> So I'd rather say we change this logic to:
>
>     if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
>         tsc_khz = x86_init.....();
>         force(X86_FEATURE_TSC_KNOWN_FREQ);
>     } else if (tsc_early_khz) {
>         ....
>     } else {
>         ...
>     }
>
> Along with:
>
>     if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
>         if (tsc_early_khz)
>             pr_warn("Ignoring nonsensical tsc_early_khz command line argument\n");
>     }
>
> or something daft like that.
>
> The kernel has for various reasons always tried to cater for the needs
> of users who are plagued by bonkers firmware, but we have to stop
> prioritizing, or treating as equals, ancient and modded out-of-spec hardware.
>
> TBH, I consider that whole KVM clock nonsense to fall into the modded
> out of spec hardware realm. Do a reality check:
>
> How many production systems are out there still which run VMs on CPUs
> with a broken TSC and the lack of VM TSC scaling?
>
> I'm not saying that we should not support the few remaining systems
> anymore, but our tendency to pretend that we can keep all of this
> nonsense working and at the same time making progress is just a fallacy.
I don't know that we can take the KVM (and Xen) clock away from guests,
but all of the *horrid* part about it is the way it attempts to cope
with the possibility that the *host* timekeeping might flip away from
TSC-based mode at any point in time. By the end of my outstanding
cleanup series, that is the *only* thing the gtod_notifier remains for.
If we can trust the hardware *and* the host kernel, then KVM could
theoretically hardwire the kvmclock into 'master clock mode' where it
basically just advertises the TSC→kvmclock relationship *once* to all
CPUs and it never changes.
All the nonsense about updating it every time we enter a CPU could just
go away completely.
^ permalink raw reply
* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Kishen Maloor @ 2026-06-07 4:36 UTC (permalink / raw)
To: Xu Yilun, kas, djbw, rick.p.edgecombe, x86, peter.fang
Cc: linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
zhenzhong.duan, xiaoyao.li
In-Reply-To: <20260522034128.3144354-1-yilun.xu@linux.intel.com>
On 5/21/26 8:41 PM, Xu Yilun wrote:
> ...
> This series has 2 distinct parts:
>
> Patches 1-4: TDX Module Extensions enabling
> Patches 5-15: DICE-based TDX Quoting, primarily Peter's work.
>
Perhaps the extensions enabling patches could be organized more simply as
these three?
1. Add TDX extensions metadata structure and accessor
2. Add TDH.EXT.MEM.ADD
3. Add TDH.EXT.INIT and wire extensions init into init_tdx_module()
This introduces the SEAMCALLs and lets the wiring land with the patch
that completes the init flow, avoiding a separate "enable" patch.
^ permalink raw reply
* Re: [PATCH 02/15] x86/virt/tdx: Add extra memory to TDX Module for Extensions
From: Kishen Maloor @ 2026-06-07 4:38 UTC (permalink / raw)
To: Xu Yilun, kas, djbw, rick.p.edgecombe, x86, peter.fang
Cc: linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
zhenzhong.duan, xiaoyao.li
In-Reply-To: <20260522034128.3144354-3-yilun.xu@linux.intel.com>
On 5/21/26 8:41 PM, Xu Yilun wrote:
> TDX Module introduces a new concept called "TDX Module Extensions" to
> support long running / hard-irq preemptible flows inside. This makes TDX
> Module capable of handling complex tasks through "Extension SEAMCALLs".
> Adding more memory to TDX Module is the first step to enable Extensions.
>
> Currently, TDX Module memory use is relatively static. But, the
> Extensions need to use memory more dynamically. While 'static' here
> means the kernel provides necessary amount of memory to TDX Module for
> its basic functionalities, 'dynamic' means extra memory is needed only
> if new add-on features are to be enabled. So add a new memory feeding
> process backed by a new SEAMCALL TDH.EXT.MEM.ADD.
>
> The process is mostly the same as adding PAMT. The kernel queries the TDX
> Module for how much memory is needed, allocates it, hands it over, and never
> gets it back.
>
> TDH.EXT.MEM.ADD uses a new parameter type HPA_LIST_INFO to provide
> control (private) pages to TDX Module. This type represents a list of
> pages for TDX Module to access. It needs a 'root page' which contains
> the list of HPAs of the pages. It collapses the HPA of the root page
> and the number of valid HPAs into a 64 bit raw value for SEAMCALL
> parameters. The root page is always a medium, TDX Module never keeps
> the root page.
>
> Introduce a tdx_clflush_hpa_list() helper to flush shared cache before
> SEAMCALL, to avoid shared cache writeback damaging these private pages.
>
> For now, TDX Module Extensions consume a relatively large amount of
> memory (~50MB). Use contiguous page allocation to avoid permanently
> fragmenting too much memory. Print the allocation amount on TDX Module
> Extensions initialization for visibility.
>
> Co-developed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> ---
> arch/x86/virt/vmx/tdx/tdx.h | 1 +
> arch/x86/virt/vmx/tdx/tdx.c | 118 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 119 insertions(+)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index a5eec8e3cc71..2335f88bbb10 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -46,6 +46,7 @@
> #define TDH_PHYMEM_PAGE_WBINVD 41
> #define TDH_VP_WR 43
> #define TDH_SYS_CONFIG 45
> +#define TDH_EXT_MEM_ADD 61
> #define TDH_SYS_DISABLE 69
>
> /*
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index c0c6281b08a5..622399d8da68 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -31,6 +31,7 @@
> #include <linux/syscore_ops.h>
> #include <linux/idr.h>
> #include <linux/kvm_types.h>
> +#include <linux/bitfield.h>
> #include <asm/page.h>
> #include <asm/special_insns.h>
> #include <asm/msr-index.h>
> @@ -1179,6 +1180,123 @@ static __init int init_tdmrs(struct tdmr_info_list *tdmr_list)
> return 0;
> }
>
> +static void tdx_clflush_hpa_list(struct page *root, unsigned int nr_pages)
> +{
> + u64 *entries = page_to_virt(root);
> + int i;
> +
> + for (i = 0; i < nr_pages; i++)
> + clflush_cache_range(__va(entries[i]), PAGE_SIZE);
> +}
> +
> +#define HPA_LIST_INFO_FIRST_ENTRY GENMASK_U64(11, 3)
> +#define HPA_LIST_INFO_PFN GENMASK_U64(51, 12)
> +#define HPA_LIST_INFO_LAST_ENTRY GENMASK_U64(63, 55)
> +
> +static u64 to_hpa_list_info(struct page *root, unsigned int nr_pages)
> +{
> + return FIELD_PREP(HPA_LIST_INFO_FIRST_ENTRY, 0) |
> + FIELD_PREP(HPA_LIST_INFO_PFN, page_to_pfn(root)) |
> + FIELD_PREP(HPA_LIST_INFO_LAST_ENTRY, nr_pages - 1);
> +}
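For readers without the kernel's FIELD_PREP()/GENMASK_U64() helpers at hand, the packing above can be sketched in plain C. This is an illustrative userspace model only: pack_hpa_list_info() and the HPA_LIST_* shift/mask names are hypothetical, and the bit positions are taken directly from the three masks in the patch (bits 11:3, 51:12 and 63:55).

```c
#include <stdint.h>

/* Field positions copied from the patch's GENMASK_U64() definitions. */
#define HPA_LIST_FIRST_SHIFT	3	/* bits 11:3,  9 bits */
#define HPA_LIST_PFN_SHIFT	12	/* bits 51:12, 40 bits */
#define HPA_LIST_LAST_SHIFT	55	/* bits 63:55, 9 bits */
#define HPA_LIST_IDX_MASK	0x1FFULL
#define HPA_LIST_PFN_MASK	0xFFFFFFFFFFULL

/*
 * Hypothetical equivalent of to_hpa_list_info(): first entry index 0,
 * root page PFN, and index of the last valid entry (nr_pages - 1).
 * The caller must pass nr_pages >= 1, as the patch does.
 */
static uint64_t pack_hpa_list_info(uint64_t root_pfn, unsigned int nr_pages)
{
	uint64_t first = 0;
	uint64_t last = (uint64_t)(nr_pages - 1);

	return ((first & HPA_LIST_IDX_MASK) << HPA_LIST_FIRST_SHIFT) |
	       ((root_pfn & HPA_LIST_PFN_MASK) << HPA_LIST_PFN_SHIFT) |
	       ((last & HPA_LIST_IDX_MASK) << HPA_LIST_LAST_SHIFT);
}
```

With 9 bits for the last-entry index, a single root page can describe at most 512 entries, which matches PAGE_SIZE / sizeof(u64) for 4 KiB pages.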
> +
> +static int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
> +{
> + struct tdx_module_args args = {
> + .rcx = to_hpa_list_info(root, nr_pages),
> + };
> + u64 r;
> +
> + tdx_clflush_hpa_list(root, nr_pages);
> +
> + do {
> + /*
> + * TDH_EXT_MEM_ADD is designed to use output parameter RCX to
> + * override/update input parameter RCX, so the caller doesn't
> + * have to do manual parameter update on retry call.
> + */
> + r = seamcall_ret(TDH_EXT_MEM_ADD, &args);
> + } while (r == TDX_INTERRUPTED_RESUMABLE);
The retry loop compares the full return value against TDX_INTERRUPTED_RESUMABLE. Should
it mask with TDX_SEAMCALL_STATUS_MASK first, in case the module sets any
lower detail bits?
Ditto for TDH.EXT.INIT in patch 3.
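A minimal sketch of what I mean, written as standalone C so it can be compiled outside the kernel. The two constant values are my assumption of what the TDX headers define and should be checked against the tree; tdx_status_is_resumable() is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; verify against the kernel's TDX error-code headers. */
#define TDX_SEAMCALL_STATUS_MASK	0xFFFFFFFF00000000ULL
#define TDX_INTERRUPTED_RESUMABLE	0x8000000300000000ULL

/*
 * Hypothetical helper: compare only the status class in the upper 32 bits,
 * so that any detail bits in the lower half do not defeat the retry check.
 */
static bool tdx_status_is_resumable(uint64_t r)
{
	return (r & TDX_SEAMCALL_STATUS_MASK) == TDX_INTERRUPTED_RESUMABLE;
}
```

The loop condition would then read `while (tdx_status_is_resumable(r))` instead of the direct equality test.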
> +
> + if (r != TDX_SUCCESS)
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +static int tdx_ext_mem_setup(void)
> +{
> + unsigned int nr_pages;
> + struct page *page;
> + u64 *root;
> + unsigned int i;
> + int ret;
> +
> + nr_pages = tdx_sysinfo.ext.memory_pool_required_pages;
> + /*
> + * memory_pool_required_pages == 0 means no need to add pages,
> + * skip the memory setup.
> + */
> + if (!nr_pages)
> + return 0;
> +
> + root = kzalloc(PAGE_SIZE, GFP_KERNEL);
> + if (!root)
> + return -ENOMEM;
> +
> + page = alloc_contig_pages(nr_pages, GFP_KERNEL, numa_mem_id(),
> + &node_online_map);
The SEAMCALL takes a scatter list (HPA_LIST_INFO), so the module
doesn't require contiguity. If the goal is just to avoid scattering
pages across many 2MB regions, dense, 2MB-aligned allocations should
achieve that without a single pool-wide contiguous block.
> + if (!page) {
> + ret = -ENOMEM;
> + goto out_free_root;
> + }
> +
> + for (i = 0; i < nr_pages;) {
> + unsigned int nents = min(nr_pages - i,
> + PAGE_SIZE / sizeof(*root));
> + int j;
> +
> + for (j = 0; j < nents; j++)
> + root[j] = page_to_phys(page + i + j);
Would it be better to allocate per-batch (i.e. one root page's worth
at a time) rather than the whole pool up front?
That way an intermediate TDH.EXT.MEM.ADD failure wouldn't leak
all nr_pages. Also, a batch is up to 512 pages (= 2MB) and its allocation
could be 2MB-aligned, addressing your fragmentation concern.
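The batch arithmetic, as a rough standalone sketch. The helper names are hypothetical, and ENTRIES_PER_ROOT assumes 4 KiB pages holding 8-byte HPA entries, as in the loop above.

```c
/* One root page holds PAGE_SIZE / sizeof(u64) entries: 512 with 4 KiB pages. */
#define ENTRIES_PER_ROOT	512u

/* Hypothetical helper: size of the next batch once 'done' pages were added. */
static unsigned int batch_len(unsigned int nr_pages, unsigned int done)
{
	unsigned int left = nr_pages - done;

	return left < ENTRIES_PER_ROOT ? left : ENTRIES_PER_ROOT;
}

/* Hypothetical helper: number of TDH.EXT.MEM.ADD batches for the whole pool. */
static unsigned int nr_batches(unsigned int nr_pages)
{
	return (nr_pages + ENTRIES_PER_ROOT - 1) / ENTRIES_PER_ROOT;
}
```

For the ~50MB mentioned in the changelog (12800 pages of 4 KiB) that is 25 batches of 2MB each, so a failure partway through would leave only the pages already handed over (plus the failing batch) unaccounted for, rather than the whole pool.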
> +
> + ret = tdx_ext_mem_add(virt_to_page(root), nents);
> + /*
> + * No SEAMCALLs to reclaim the added pages. For simple error
> + * handling, leak all pages.
> + */
> + WARN_ON_ONCE(ret);
> + if (ret)
> + break;
> +
> + i += nents;
> + }
> +
> + /*
> + * Extensions memory can't be reclaimed once added, print out the
> + * amount, stop tracking it and free the root page, no matter success
> + * or failure.
> + */
> + pr_info("%lu KB allocated for TDX Module Extensions\n",
> + nr_pages * PAGE_SIZE / 1024);
> +
> +out_free_root:
> + kfree(root);
> +
> + return ret;
> +}
> +
> +static int __maybe_unused init_tdx_ext(void)
Could this be named init_tdx_extensions() instead to disambiguate
from tdx_ext_init() in patch 3?
> +{
> + if (!(tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_EXT))
> + return 0;
> +
> + /* No feature requires TDX Module Extensions. */
> + if (!tdx_sysinfo.ext.ext_required)
> + return 0;
> +
> + return tdx_ext_mem_setup();
> +}
> +
> static __init int init_tdx_module(void)
> {
> int ret;
^ permalink raw reply