* Re: [PATCH v4 12/47] x86/tsc: Rename pit_hpet_ptimer_calibrate_cpu() => native_calibrate_cpu_late()
From: David Woodhouse @ 2026-06-01 21:52 UTC
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-13-seanjc@google.com>
On Fri, 29 May 2026 07:43:59 -0700, Sean Christopherson wrote:
> Rename the late CPU calibration routine so that its relationship to the
> early routine is more obvious and intuitive.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
* Re: [PATCH v4 14/47] x86/kvmclock: Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz()
From: David Woodhouse @ 2026-06-01 21:53 UTC
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-15-seanjc@google.com>
On Fri, 29 May 2026 07:44:01 -0700, Sean Christopherson wrote:
> Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz() in anticipation of
> adding support for getting TSC info from PV CPUID, i.e. in a KVM specific
> way, but without non-kvmclock.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
* Re: [PATCH v4 17/47] x86/kvm: Mark TSC as reliable when it's constant and nonstop
From: David Woodhouse @ 2026-06-01 22:02 UTC
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
nikunj, dwmw2, mhklinux, tglx, sashiko-reviews
In-Reply-To: <ahnhnjvfIblFxTFX@google.com>
On Fri, 29 May 2026 11:57:34 -0700, Sean Christopherson wrote:
> On Fri, May 29, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index 909d3e5e5bcd5..4fe9c69bf40b3 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> > [ ... ]
> > > @@ -1040,7 +1041,20 @@ static void __init kvm_init_platform(void)
> > [ ... ]
> > > - kvmclock_init();
> > > + /*
> > > + * If the TSC counts at a constant frequency across P/T states, counts
> > > + * in deep C-states, and the TSC hasn't been marked unstable, treat the
> > > + * TSC reliable, as guaranteed by KVM. Note, the TSC unstable check
> > > + * exists purely to honor the TSC being marked unstable via command
> > > + * line, any runtime detection of an unstable will happen after this.
> > > + */
> > > + tsc_is_reliable = boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > > + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > > + !check_tsc_unstable();
> >
> > [Severity: High]
> > Does this evaluate check_tsc_unstable() too early to catch the command line
> > parameter?
>
> Huh, it does indeed.
>
> > It looks like kvm_init_platform() is called from setup_arch(), but the
> > tsc=unstable kernel parameter is parsed via __setup() later during
> > parse_args() in start_kernel().
> >
> > If check_tsc_unstable() evaluates to 0 here because the parameter hasn't
> > been parsed yet, wouldn't it incorrectly force X86_FEATURE_TSC_RELIABLE
> > and set prefer_tsc to true?
>
> Yep, but this is a pre-existing problem that goes all the way back to the original
> commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if invariant TSC is exposed").
>
> We could try to fix that, but I'm _very_ strongly inclined to add (yet another)
> patch to simply drop the check_tsc_unstable() since it has always been dead code.
Yeah, kill it with fire.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
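
For readers following along, the ordering problem discussed above can be modelled with a toy userspace sketch. Every name here (tsc_marked_unstable, parse_cmdline, platform_tsc_reliable, boot) is a hypothetical stand-in, not a kernel symbol; the only point is that a flag read before the command line is parsed always looks clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's "TSC marked unstable" state. */
static bool tsc_marked_unstable;

/* Models the tsc= __setup() handler, which only runs once parse_args() does. */
static void parse_cmdline(const char *cmdline)
{
	assert(cmdline != NULL);
	if (strstr(cmdline, "tsc=unstable"))
		tsc_marked_unstable = true;
}

/* Models the quoted check: it sees the flag as it is at call time. */
static bool platform_tsc_reliable(bool constant_tsc, bool nonstop_tsc)
{
	return constant_tsc && nonstop_tsc && !tsc_marked_unstable;
}

/* Runs the two steps in the requested order and returns the verdict. */
static bool boot(const char *cmdline, bool platform_init_first)
{
	bool reliable;

	tsc_marked_unstable = false;
	if (platform_init_first) {
		/* Order described in the thread: platform init, then parsing. */
		reliable = platform_tsc_reliable(true, true);
		parse_cmdline(cmdline);
	} else {
		/* Hypothetical reversed order, for comparison only. */
		parse_cmdline(cmdline);
		reliable = platform_tsc_reliable(true, true);
	}
	return reliable;
}
```

With platform init first, boot("tsc=unstable", true) still reports the TSC as reliable, which is why the check is effectively dead code in that position.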
* Re: [PATCH v4 18/47] x86/kvm: Get local APIC bus frequency from PV CPUID Timing Info
From: David Woodhouse @ 2026-06-01 22:06 UTC
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
nikunj, dwmw2, mhklinux, tglx, sashiko-reviews
In-Reply-To: <ahnZ8tuHFVFTeofW@google.com>
On Fri, 29 May 2026 11:24:50 -0700, Sean Christopherson wrote:
> On Fri, May 29, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index 4fe9c69bf40b3..c1139182121d5 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> >
> > [ ... ]
> >
> > > @@ -1039,6 +1040,13 @@ static void __init kvm_init_platform(void)
> > > x86_init.hyper.get_tsc_khz = kvm_get_tsc_khz;
> > > x86_init.hyper.get_cpu_khz = kvm_get_tsc_khz;
> > > }
> > > +
> > > +#ifdef CONFIG_X86_LOCAL_APIC
> > > + /* The leaf also includes the local APIC bus/timer frequency.*/
> > > + apic_khz = cpuid_ebx(timing_info_leaf);
> > > + if (apic_khz)
> > > + lapic_timer_period = apic_khz * 1000 / HZ;
> >
> > [Severity: High]
> > Could the intermediate multiplication of apic_khz * 1000 overflow here?
> >
> > Since apic_khz is a u32, the multiplication is evaluated in 32-bit unsigned
> > arithmetic. If the hypervisor reports an APIC timer frequency greater than
> > ~4.29 GHz (apic_khz > 4294967), this calculation will silently overflow.
>
> Hmm, easy enough to use mul_u64_u32_div() (I think that's the right helper for
> this?).
Yep.
> But this problem pre-exists in almost every other path that sets lapic_timer_period.
> So while I tried to avoid doing yet more tangentially related cleanup, it seems
> like adding a helper to set lapic_timer_period is the way to go. That would also
> allow making lapic_timer_period local to arch/x86/kernel/apic/apic.c.
>
> *sigh*
Yay, more patches!
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
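
As an illustration of the overflow discussed above, here is a small userspace sketch comparing the 32-bit expression from the quoted hunk with a widened version. The function names and the hz parameter are assumptions made for the example; the plain 64-bit cast only stands in for whatever helper the eventual patch uses.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit math as in the quoted hunk: apic_khz * 1000 wraps above 4294967. */
static uint32_t lapic_period_u32(uint32_t apic_khz, uint32_t hz)
{
	assert(hz != 0);
	return apic_khz * 1000u / hz;
}

/* Widen to 64 bits before multiplying so the intermediate cannot wrap. */
static uint64_t lapic_period_u64(uint32_t apic_khz, uint32_t hz)
{
	assert(hz != 0);
	return (uint64_t)apic_khz * 1000u / hz;
}
```

For apic_khz = 4294968 and hz = 250, the 32-bit version returns 2 (the product wraps to 704) while the widened version returns 17179872. Below the threshold, e.g. apic_khz = 1000000, both return 4000000.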
* Re: [PATCH v4 31/47] x86/vmware: NOP-ify save/restore hooks when using VMware's sched_clock
From: David Woodhouse @ 2026-06-01 22:09 UTC
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529150753.714296-1-seanjc@google.com>
On Fri, 29 May 2026 08:07:52 -0700, Sean Christopherson wrote:
> NOP-ify the sched_clock save/restore hooks when using VMware's version of
> sched_clock. This will allow extending paravirt_set_sched_clock() to set
> the save/restore hooks, without having to simultaneously change the
> behavior of VMware guests.
>
> Note, it's not at all obvious that it's safe/correct for VMware guests to
> do nothing on suspend/resume, but that's a pre-existing problem. Leave it
> for a VMware expert to sort out.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
* Re: [PATCH v4 30/47] x86/xen/time: NOP-ify x86_platform's sched_clock save/restore hooks
From: David Woodhouse @ 2026-06-01 22:09 UTC
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529150741.714145-1-seanjc@google.com>
On Fri, 29 May 2026 08:07:41 -0700, Sean Christopherson wrote:
> NOP-ify the x86_platform sched_clock save/restore hooks when setting up
> Xen's PV clock to make it somewhat obvious the hooks aren't used when
> running as a Xen guest (Xen uses a paravirtualized suspend/resume flow).
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
* Re: [PATCH v4 46/47] x86/kvmclock: Plumb in AP-online and BSP-resume to kvmlock, for documentation
From: David Woodhouse @ 2026-06-01 22:09 UTC
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529150833.715042-1-seanjc@google.com>
On Fri, 29 May 2026 08:08:33 -0700, Sean Christopherson wrote:
> Invoke kvmclock_cpu_action() with AP_ONLINE and BSP_RESUME, even though
> kvmclock doesn't need to do anything in either case, so that the asymmetry
> of kvmclock is a detail buried in kvmclock, and to explicitly document
> that doing nothing during those phases is intentional and correct.
>
> For all intents and purposes, no functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
* Re: [PATCH v7 09/42] KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-06-01 23:14 UTC
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-9-2f0fae496530@google.com>
On Fri, May 22, 2026 at 05:17:51PM -0700, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce base support for KVM_SET_MEMORY_ATTRIBUTES2 in guest_memfd, which
> just updates attributes tracked by guest_memfd.
>
> Validate input fields in general. Guard usage of KVM_SET_MEMORY_ATTRIBUTES2
> by making sure requested attributes are supported for this instance of kvm.
>
> A new KVM_SET_MEMORY_ATTRIBUTES2 is defined to support writes (unlike
> KVM_SET_MEMORY_ATTRIBUTES) in addition to reads so it can provide error
> details to userspace. This will be used in a later patch.
>
> The two ioctls use their corresponding structs with no overlap, but
> backward compatibility is baked in for future support of
> KVM_SET_MEMORY_ATTRIBUTES2 and struct kvm_memory_attributes2 in the VM
> ioctl.
>
> The process of setting memory attributes is set up such that the later half
> will not fail due to allocation. Any necessary checks are performed before
> the point of no return.
>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Sean Christoperson <seanjc@google.com>
> Signed-off-by: Sean Christoperson <seanjc@google.com>
Typo on the "person".
(Sent this earlier but looks like some of my emails never hit the
list so re-sending. Apologies if this is a dupe).
Thanks,
Mike
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
* Re: [PATCH v4 02/47] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: Borislav Petkov @ 2026-06-02 3:49 UTC
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86,
Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-3-seanjc@google.com>
On Fri, May 29, 2026 at 07:43:49AM -0700, Sean Christopherson wrote:
> +static int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
> +{
> + unsigned int ecx_hz, edx;
> +
> + memset(info, 0, sizeof(*info));
Let's not clear this unnecessarily...
> +
> + if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
> + return -ENOENT;
... just to return here...
> +
> + /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
> + cpuid(CPUID_LEAF_TSC, &info->denominator, &info->numerator, &ecx_hz, &edx);
> +
> + if (!info->denominator || !info->numerator)
> + return -ENOENT;
... or here.
We wanna clear it here, when we'll return success.
> +
> + /*
> + * Note, some CPUs provide the multiplier information, but not the core
Note: some CPUs...
> + * crystal frequency. The multiplier information is still useful for
> + * such CPUs, as the crystal frequency can be gleaned from CPUID.0x16.
> + */
> + info->crystal_khz = ecx_hz / 1000;
> + return 0;
> +}
> +
> +int __init cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
> +{
> + if (cpuid_get_tsc_info(info) || !info->crystal_khz)
> + return -ENOENT;
> +
> + info->tsc_khz = info->crystal_khz * info->numerator / info->denominator;
> + return 0;
> +}
Unused here. Add it with its first user pls.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
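
One possible reading of the review comments above, sketched in userspace C: validate first, and only write *info once success is certain. The function names and the idea of passing the raw CPUID.0x15 register values as arguments are assumptions made so the example runs outside the kernel. The struct reuses the field names from the quoted patch, but its layout is assumed. The 64-bit widening in the frequency calculation is also an addition, not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Field names follow the quoted patch; the layout is assumed. */
struct cpuid_tsc_info {
	uint32_t denominator;
	uint32_t numerator;
	uint32_t crystal_khz;
	uint32_t tsc_khz;
};

/*
 * Hypothetical variant taking raw CPUID.0x15 values:
 * EAX = ratio denominator, EBX = ratio numerator, ECX = crystal Hz.
 * *info is left untouched on failure and cleared only on success.
 */
static int tsc_info_from_regs(uint32_t eax, uint32_t ebx, uint32_t ecx_hz,
			      struct cpuid_tsc_info *info)
{
	assert(info != NULL);

	if (!eax || !ebx)
		return -ENOENT;

	memset(info, 0, sizeof(*info));
	info->denominator = eax;
	info->numerator = ebx;
	/* May be 0: some CPUs report the ratio but not the crystal rate. */
	info->crystal_khz = ecx_hz / 1000;
	return 0;
}

/* tsc_khz = crystal_khz * numerator / denominator, widened to 64 bits. */
static int tsc_freq_from_regs(uint32_t eax, uint32_t ebx, uint32_t ecx_hz,
			      struct cpuid_tsc_info *info)
{
	if (tsc_info_from_regs(eax, ebx, ecx_hz, info) || !info->crystal_khz)
		return -ENOENT;

	info->tsc_khz = (uint32_t)((uint64_t)info->crystal_khz *
				   info->numerator / info->denominator);
	return 0;
}
```

As a worked example, a 24 MHz crystal with a ratio of 176/2 gives 24000 * 176 / 2 = 2112000 kHz.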
* Re: [PATCH v5 5/5] iommufd/vdevice: add TSM request ioctl
From: Aneesh Kumar K.V @ 2026-06-02 5:10 UTC
To: Dan Williams (nvidia), Dan Williams (nvidia),
Alexey Kardashevskiy, linux-coco, iommu, linux-kernel, kvm
Cc: Bjorn Helgaas, Dan Williams, Jason Gunthorpe, Joerg Roedel,
Jonathan Cameron, Kevin Tian, Nicolin Chen, Samuel Ortiz,
Steven Price, Suzuki K Poulose, Will Deacon, Xu Yilun,
Shameer Kolothum, Paolo Bonzini, Tony Krowiak, Halil Pasic,
Jason Herne, Harald Freudenberger, Holger Dengler, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Alex Williamson, Matthew Rosato, Farhan Ali,
Eric Farman, linux-s390
In-Reply-To: <6a1774dd80f74_19737610095@djbw-dev.notmuch>
"Dan Williams (nvidia)" <djbw@kernel.org> writes:
> Aneesh Kumar K.V wrote:
>> >> I am leaning towards the latter at this point.
>> >
>> > But we already have struct pci_tsm_ops::guest_req, which is specific to
>> > the underlying CC architecture. From the above, pci_tsm_req_scope also
>> > appears to carry the same information. Is that useful?
>> >
>>
>> I think there is value in having the VMM express the guest’s
>> confidential computing architecture, so that the TSM backend can
>> validate whether it should handle that guest request.
>
> Yes, that is the idea.
>
>> So it would not be the IOMMU validating the scope value, but rather
>> pci_tsm_ops::guest_req.
>>
>> static ssize_t cca_tsm_guest_req(struct pci_tdi *tdi, enum pci_tsm_req_scope scope,
>> sockptr_t req, size_t req_len, sockptr_t resp,
>> size_t resp_len, u64 *tsm_code)
>> {
>> struct pci_dev *pdev = tdi->pdev;
>>
>> /* reject the guest request if VMM was using the link tsm wrongly. The guest
>> * was using a wrong CC architecture with this link tsm
>> */
>> if (scope != TSM_REQ_TYPE_CCA)
>> return -EINVAL;
>
> Right, iommufd is tunneling TSM requests. The tunnel should have an
> envelope of TSM_REQ_TYPE_* and an @op field. The TSM driver gets those
> from iommufd, validates the envelope and then processes @req.
>
> This self-consistency and explicitness also buys some future-proofing.
> It allows for alternate command sets within an arch, cross TSM
> implementation shared commands, IOMMUFD-to-TSM requests outside of guest
> requests.
>
>> Jason Gunthorpe <jgg@ziepe.ca> writes:
>>
>> > On Tue, May 26, 2026 at 11:17:50PM -0700, Dan Williams (nvidia) wrote:
>> >
>> >> In that case pci_tsm_req_scope becomes tsm_req_type and is just:
>> >>
>> >> TSM_REQ_TYPE_CCA
>> >> TSM_REQ_TYPE_SEV
>> >> TSM_REQ_TYPE_TDX
>> >>
>> >> I am leaning towards the latter at this point.
>> >
>> > Yeah, this sounds good. I would also include an common op field that
>> > can be decoded by the TSM driver based on the TYPE above, and the
>> > usual in/out message buffers.
>>
>> We already have iommufd_vdevice_tsm_op_ioctl() to handle common
>> operations.
>
> Per above, I believe this is about an @op value in a common location
> that iommufd can forward to the backend for validation of guest
> requests.
>
>> Right now, it handles IOMMU_VDEVICE_TSM_BIND and
>> IOMMU_VDEVICE_TSM_UNBIND. I guess we should move TSM_REQ_SET_TDI_STATE
>> operations to that as well?
>
> I think we can wait to move it to its own IOMMU operation unless/until
> there is a need to set RUN outside of an explicit guest request, right?
Something like the below? (the diff against this series)
I have not yet integrated this into the full CCA patchset for testing,
but I wanted to make sure we are aligned on the UAPI.
diff --git a/drivers/iommu/iommufd/tsm.c b/drivers/iommu/iommufd/tsm.c
index 56bb499ba7a9..345efba2e66e 100644
--- a/drivers/iommu/iommufd/tsm.c
+++ b/drivers/iommu/iommufd/tsm.c
@@ -61,17 +61,30 @@ int iommufd_vdevice_tsm_op_ioctl(struct iommufd_ucmd *ucmd)
return ret;
}
-static bool iommufd_vdevice_tsm_req_scope_valid(u32 scope)
+static bool iommufd_vdevice_tsm_req_arch_valid(u32 tvm_arch)
{
- if (scope > IOMMU_VDEVICE_TSM_REQ_SCOPE_PCI_LAST)
+ switch (tvm_arch) {
+ case IOMMU_VDEVICE_TSM_TVM_ARCH_CCA:
+ case IOMMU_VDEVICE_TSM_TVM_ARCH_SEV:
+ case IOMMU_VDEVICE_TSM_TVM_ARCH_TDX:
+ return true;
+ default:
return false;
+ }
+}
- switch (scope) {
- case IOMMU_VDEVICE_TSM_REQ_PCI_INFO:
- case IOMMU_VDEVICE_TSM_REQ_PCI_STATE_CHANGE:
- case IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_READ:
- case IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_WRITE:
+static bool iommufd_vdevice_tsm_req_op_valid(u32 op, u32 tvm_arch)
+{
+ switch (op) {
+ case TSM_REQ_READ_OBJECT:
+ case TSM_REQ_REGEN_OBJECT:
+ case TSM_REQ_OBJECT_INFO:
+ case TSM_REQ_VALIDATE_MMIO:
+ case TSM_REQ_SET_TDI_STATE:
return true;
+ case TSM_REQ_SEV_ENABLE_DMA:
+ case TSM_REQ_SEV_DISABLE_DMA:
+ return tvm_arch == IOMMU_VDEVICE_TSM_TVM_ARCH_SEV;
default:
return false;
}
@@ -99,7 +112,8 @@ int iommufd_vdevice_tsm_req_ioctl(struct iommufd_ucmd *ucmd)
struct iommufd_vdevice *vdev;
struct iommu_vdevice_tsm_req *cmd = ucmd->cmd;
struct tsm_guest_req_info info = {
- .scope = cmd->scope,
+ .op = cmd->op,
+ .tvm_arch = cmd->tvm_arch,
.req = {
.user = u64_to_user_ptr(cmd->req_uptr),
.is_kernel = false,
@@ -112,10 +126,10 @@ int iommufd_vdevice_tsm_req_ioctl(struct iommufd_ucmd *ucmd)
.resp_len = cmd->resp_len,
};
- if (cmd->__reserved)
- return -EOPNOTSUPP;
+ if (!iommufd_vdevice_tsm_req_arch_valid(cmd->tvm_arch))
+ return -EINVAL;
- if (!iommufd_vdevice_tsm_req_scope_valid(cmd->scope))
+ if (!iommufd_vdevice_tsm_req_op_valid(cmd->op, cmd->tvm_arch))
return -EINVAL;
vdev = iommufd_get_vdevice(ucmd->ictx, cmd->vdevice_id);
diff --git a/drivers/pci/tsm.c b/drivers/pci/tsm.c
index 5fdcd7f2e820..439241c756fd 100644
--- a/drivers/pci/tsm.c
+++ b/drivers/pci/tsm.c
@@ -378,7 +378,8 @@ EXPORT_SYMBOL_GPL(pci_tsm_bind);
/**
* pci_tsm_guest_req() - helper to marshal guest requests to the TSM driver
* @pdev: @pdev representing a bound tdi
- * @scope: caller asserts this passthrough request is limited to TDISP operations
+ * @op: guest-initiated request operation
+ * @tvm_arch: guest TVM architecture
* @req_in: Input payload forwarded from the guest
* @in_len: Length of @req_in
* @req_out: Output payload buffer response to the guest
@@ -387,7 +388,7 @@ EXPORT_SYMBOL_GPL(pci_tsm_bind);
*
* This is a common entry point for requests triggered by userspace KVM-exit
* service handlers responding to TDI information or state change requests. The
- * scope parameter limits requests to TDISP state management, or limited debug.
+ * operation parameter limits requests to guest-initiated TSM operations.
* This path is only suitable for commands and results that are the host kernel
* has no use, the host is only facilitating guest to TSM communication.
*
@@ -400,7 +401,9 @@ EXPORT_SYMBOL_GPL(pci_tsm_bind);
* Context: Caller is responsible for calling this within the pci_tsm_bind()
* state of the TDI.
*/
-ssize_t pci_tsm_guest_req(struct pci_dev *pdev, enum pci_tsm_req_scope scope,
+ssize_t pci_tsm_guest_req(struct pci_dev *pdev,
+ enum iommu_vdevice_tsm_guest_req_op op,
+ enum iommu_vdevice_tsm_guest_tvm_arch tvm_arch,
sockptr_t req_in, size_t in_len, sockptr_t req_out,
size_t out_len, u64 *tsm_code)
{
@@ -408,9 +411,30 @@ ssize_t pci_tsm_guest_req(struct pci_dev *pdev, enum pci_tsm_req_scope scope,
struct pci_tdi *tdi;
int rc;
- /* Forbid requests that are not directly related to TDISP operations */
- if (scope > PCI_TSM_REQ_STATE_CHANGE)
+ switch (tvm_arch) {
+ case IOMMU_VDEVICE_TSM_TVM_ARCH_CCA:
+ case IOMMU_VDEVICE_TSM_TVM_ARCH_SEV:
+ case IOMMU_VDEVICE_TSM_TVM_ARCH_TDX:
+ break;
+ default:
return -EINVAL;
+ }
+
+ switch (op) {
+ case TSM_REQ_READ_OBJECT:
+ case TSM_REQ_REGEN_OBJECT:
+ case TSM_REQ_OBJECT_INFO:
+ case TSM_REQ_VALIDATE_MMIO:
+ case TSM_REQ_SET_TDI_STATE:
+ break;
+ case TSM_REQ_SEV_ENABLE_DMA:
+ case TSM_REQ_SEV_DISABLE_DMA:
+ if (tvm_arch == IOMMU_VDEVICE_TSM_TVM_ARCH_SEV)
+ break;
+ fallthrough;
+ default:
+ return -EINVAL;
+ }
ACQUIRE(rwsem_read_intr, lock)(&pci_tsm_rwsem);
if ((rc = ACQUIRE_ERR(rwsem_read_intr, &lock)))
@@ -430,8 +454,9 @@ ssize_t pci_tsm_guest_req(struct pci_dev *pdev, enum pci_tsm_req_scope scope,
tdi = pdev->tsm->tdi;
if (!tdi)
return -ENXIO;
- return to_pci_tsm_ops(pdev->tsm)->guest_req(tdi, scope, req_in, in_len,
- req_out, out_len, tsm_code);
+ return to_pci_tsm_ops(pdev->tsm)->guest_req(tdi, op, tvm_arch, req_in,
+ in_len, req_out, out_len,
+ tsm_code);
}
EXPORT_SYMBOL_GPL(pci_tsm_guest_req);
diff --git a/drivers/virt/coco/tsm-core.c b/drivers/virt/coco/tsm-core.c
index ce01b19990f5..88cb168d8120 100644
--- a/drivers/virt/coco/tsm-core.c
+++ b/drivers/virt/coco/tsm-core.c
@@ -128,42 +128,15 @@ int tsm_unbind(struct device *dev)
}
EXPORT_SYMBOL_GPL(tsm_unbind);
-static int tsm_pci_req_scope(u32 scope, enum pci_tsm_req_scope *pci_scope)
-{
- switch (scope) {
- case IOMMU_VDEVICE_TSM_REQ_PCI_INFO:
- *pci_scope = PCI_TSM_REQ_INFO;
- return 0;
- case IOMMU_VDEVICE_TSM_REQ_PCI_STATE_CHANGE:
- *pci_scope = PCI_TSM_REQ_STATE_CHANGE;
- return 0;
- case IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_READ:
- *pci_scope = PCI_TSM_REQ_DEBUG_READ;
- return 0;
- case IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_WRITE:
- *pci_scope = PCI_TSM_REQ_DEBUG_WRITE;
- return 0;
- default:
- return -EINVAL;
- }
-}
-
ssize_t tsm_guest_req(struct device *dev,
struct tsm_guest_req_info *info, u64 *tsm_code)
{
- int ret;
- enum pci_tsm_req_scope pci_scope;
-
if (!dev_is_pci(dev))
return -EINVAL;
- ret = tsm_pci_req_scope(info->scope, &pci_scope);
- if (ret)
- return ret;
-
- return pci_tsm_guest_req(to_pci_dev(dev), pci_scope, info->req,
- info->req_len, info->resp, info->resp_len,
- tsm_code);
+ return pci_tsm_guest_req(to_pci_dev(dev), info->op, info->tvm_arch,
+ info->req, info->req_len, info->resp,
+ info->resp_len, tsm_code);
}
EXPORT_SYMBOL_GPL(tsm_guest_req);
diff --git a/include/linux/pci-tsm.h b/include/linux/pci-tsm.h
index ec2236a7a279..30a60551fcf5 100644
--- a/include/linux/pci-tsm.h
+++ b/include/linux/pci-tsm.h
@@ -9,7 +9,6 @@
struct pci_tsm;
struct tsm_dev;
struct kvm;
-enum pci_tsm_req_scope;
/*
* struct pci_tsm_ops - manage confidential links and security state
@@ -55,7 +54,8 @@ struct pci_tsm_ops {
struct kvm *kvm, u32 tdi_id);
void (*unbind)(struct pci_tdi *tdi);
ssize_t (*guest_req)(struct pci_tdi *tdi,
- enum pci_tsm_req_scope scope,
+ enum iommu_vdevice_tsm_guest_req_op op,
+ enum iommu_vdevice_tsm_guest_tvm_arch tvm_arch,
sockptr_t req_in, size_t in_len,
sockptr_t req_out, size_t out_len,
u64 *tsm_code);
@@ -160,46 +160,6 @@ static inline bool is_pci_tsm_pf0(struct pci_dev *pdev)
return PCI_FUNC(pdev->devfn) == 0;
}
-/**
- * enum pci_tsm_req_scope - Scope of guest requests to be validated by TSM
- *
- * Guest requests are a transport for a TVM to communicate with a TSM + DSM for
- * a given TDI. A TSM driver is responsible for maintaining the kernel security
- * model and limit commands that may affect the host, or are otherwise outside
- * the typical TDISP operational model.
- */
-enum pci_tsm_req_scope {
- /**
- * @PCI_TSM_REQ_INFO: Read-only, without side effects, request for
- * typical TDISP collateral information like Device Interface Reports.
- * No device secrets are permitted, and no device state is changed.
- */
- PCI_TSM_REQ_INFO = IOMMU_VDEVICE_TSM_REQ_PCI_INFO,
- /**
- * @PCI_TSM_REQ_STATE_CHANGE: Request to change the TDISP state from
- * UNLOCKED->LOCKED, LOCKED->RUN, or other architecture specific state
- * changes to support those transitions for a TDI. No other (unrelated
- * to TDISP) device / host state, configuration, or data change is
- * permitted.
- */
- PCI_TSM_REQ_STATE_CHANGE = IOMMU_VDEVICE_TSM_REQ_PCI_STATE_CHANGE,
- /**
- * @PCI_TSM_REQ_DEBUG_READ: Read-only request for debug information
- *
- * A method to facilitate TVM information retrieval outside of typical
- * TDISP operational requirements. No device secrets are permitted.
- */
- PCI_TSM_REQ_DEBUG_READ = IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_READ,
- /**
- * @PCI_TSM_REQ_DEBUG_WRITE: Device state changes for debug purposes
- *
- * The request may affect the operational state of the device outside of
- * the TDISP operational model. If allowed, requires CAP_SYS_RAW_IO, and
- * will taint the kernel.
- */
- PCI_TSM_REQ_DEBUG_WRITE = IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_WRITE,
-};
-
#ifdef CONFIG_PCI_TSM
int pci_tsm_register(struct tsm_dev *tsm_dev);
void pci_tsm_unregister(struct tsm_dev *tsm_dev);
@@ -214,7 +174,9 @@ int pci_tsm_bind(struct pci_dev *pdev, struct kvm *kvm, u32 tdi_id);
void pci_tsm_unbind(struct pci_dev *pdev);
void pci_tsm_tdi_constructor(struct pci_dev *pdev, struct pci_tdi *tdi,
struct kvm *kvm, u32 tdi_id);
-ssize_t pci_tsm_guest_req(struct pci_dev *pdev, enum pci_tsm_req_scope scope,
+ssize_t pci_tsm_guest_req(struct pci_dev *pdev,
+ enum iommu_vdevice_tsm_guest_req_op op,
+ enum iommu_vdevice_tsm_guest_tvm_arch tvm_arch,
sockptr_t req_in, size_t in_len, sockptr_t req_out,
size_t out_len, u64 *tsm_code);
#else
@@ -233,7 +195,8 @@ static inline void pci_tsm_unbind(struct pci_dev *pdev)
{
}
static inline ssize_t pci_tsm_guest_req(struct pci_dev *pdev,
- enum pci_tsm_req_scope scope,
+ enum iommu_vdevice_tsm_guest_req_op op,
+ enum iommu_vdevice_tsm_guest_tvm_arch tvm_arch,
sockptr_t req_in, size_t in_len,
sockptr_t req_out, size_t out_len,
u64 *tsm_code)
diff --git a/include/linux/tsm.h b/include/linux/tsm.h
index b83b72bbf5e3..cba0ada5f4cb 100644
--- a/include/linux/tsm.h
+++ b/include/linux/tsm.h
@@ -7,6 +7,7 @@
#include <linux/uuid.h>
#include <linux/device.h>
#include <linux/sockptr.h>
+#include <uapi/linux/iommufd.h>
#define TSM_REPORT_INBLOB_MAX 64
#define TSM_REPORT_OUTBLOB_MAX SZ_16M
@@ -132,14 +133,16 @@ int tsm_unbind(struct device *dev);
/**
* struct tsm_guest_req_info - parameter for tsm_guest_req()
- * @scope: iommufd allocated scope for tsm guest request
+ * @op: operation for the guest-initiated request
+ * @tvm_arch: guest TVM architecture
* @req: request data buffer filled by guest
* @req_len: the size of @req filled by guest
* @resp: response data buffer filled by host
* @resp_len: the size of @resp buffer filled by guest
*/
struct tsm_guest_req_info {
- u32 scope;
+ enum iommu_vdevice_tsm_guest_req_op op;
+ enum iommu_vdevice_tsm_guest_tvm_arch tvm_arch;
sockptr_t req;
size_t req_len;
sockptr_t resp;
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 70c2927c18bc..0789a705bb07 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -1375,54 +1375,46 @@ struct iommu_hw_queue_alloc {
};
#define IOMMU_HW_QUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HW_QUEUE_ALLOC)
-/*
- * TSM request scope values are allocated by iommufd. Each device-bus transport
- * gets a range from this number space.
+/**
+ * enum iommu_vdevice_tsm_guest_tvm_arch - guest TVM architecture
+ * @IOMMU_VDEVICE_TSM_TVM_ARCH_CCA: Arm CCA TVM
+ * @IOMMU_VDEVICE_TSM_TVM_ARCH_SEV: AMD SEV TVM
+ * @IOMMU_VDEVICE_TSM_TVM_ARCH_TDX: Intel TDX TVM
*/
-#define IOMMU_VDEVICE_TSM_REQ_SCOPE_PCI_BASE 0
+enum iommu_vdevice_tsm_guest_tvm_arch {
+ IOMMU_VDEVICE_TSM_TVM_ARCH_CCA = 1,
+ IOMMU_VDEVICE_TSM_TVM_ARCH_SEV,
+ IOMMU_VDEVICE_TSM_TVM_ARCH_TDX,
+};
-enum iommu_vdevice_tsm_req_scope {
- /*
- * Read-only, without side effects, request for typical TDISP
- * collateral information like Device Interface Reports. No device
- * secrets are permitted, and no device state is changed.
- */
- IOMMU_VDEVICE_TSM_REQ_PCI_INFO =
- IOMMU_VDEVICE_TSM_REQ_SCOPE_PCI_BASE,
- /*
- * Request to change the TDISP state from UNLOCKED->LOCKED,
- * LOCKED->RUN, or other architecture specific state changes to
- * support those transitions for a TDI. No other device or host state,
- * configuration, or data change is permitted.
- */
- IOMMU_VDEVICE_TSM_REQ_PCI_STATE_CHANGE =
- IOMMU_VDEVICE_TSM_REQ_SCOPE_PCI_BASE + 1,
- /*
- * Read-only request for debug information outside of typical TDISP
- * operational requirements. No device secrets are permitted.
- */
- IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_READ =
- IOMMU_VDEVICE_TSM_REQ_SCOPE_PCI_BASE + 2,
- /*
- * Device state changes for debug purposes. The request may affect the
- * operational state of the device outside of the TDISP operational
- * model. If allowed, this requires CAP_SYS_RAW_IO and taints the
- * kernel.
- */
- IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_WRITE =
- IOMMU_VDEVICE_TSM_REQ_SCOPE_PCI_BASE + 3,
- IOMMU_VDEVICE_TSM_REQ_SCOPE_PCI_LAST =
- IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_WRITE,
+/**
+ * enum iommu_vdevice_tsm_guest_req_op - operation for guest TSM requests
+ * @TSM_REQ_READ_OBJECT: Read a TSM object
+ * @TSM_REQ_REGEN_OBJECT: Regenerate a TSM object
+ * @TSM_REQ_OBJECT_INFO: Read TSM object information
+ * @TSM_REQ_VALIDATE_MMIO: Validate MMIO for the TDI
+ * @TSM_REQ_SET_TDI_STATE: Set TDI state
+ * @TSM_REQ_SEV_ENABLE_DMA: Enable SEV DMA
+ * @TSM_REQ_SEV_DISABLE_DMA: Disable SEV DMA
+ */
+enum iommu_vdevice_tsm_guest_req_op {
+ TSM_REQ_READ_OBJECT = 1,
+ TSM_REQ_REGEN_OBJECT,
+ TSM_REQ_OBJECT_INFO,
+ TSM_REQ_VALIDATE_MMIO,
+ TSM_REQ_SET_TDI_STATE,
+ TSM_REQ_SEV_ENABLE_DMA,
+ TSM_REQ_SEV_DISABLE_DMA,
};
/**
* struct iommu_vdevice_tsm_req - ioctl(IOMMU_VDEVICE_TSM_REQ)
* @size: sizeof(struct iommu_vdevice_tsm_req)
* @vdevice_id: vDevice ID the guest request is for
- * @scope: One of enum iommu_vdevice_tsm_req_scope
+ * @op: One of enum iommu_vdevice_tsm_guest_req_op
+ * @tvm_arch: One of enum iommu_vdevice_tsm_guest_tvm_arch
* @req_len: Size in bytes of the input payload at @req_uptr
* @resp_len: Size in bytes of the output buffer at @resp_uptr
- * @__reserved: Must be 0
* @req_uptr: Userspace pointer to the guest-provided request payload
* @resp_uptr: Userspace pointer to the guest response buffer
* @tsm_code: TSM-specific result code returned by the TSM implementation
@@ -1431,9 +1423,9 @@ enum iommu_vdevice_tsm_req_scope {
* guest TSM/TDISP message transport where the host kernel only marshals
* bytes between userspace and the TSM implementation.
*
- * Requests outside the iommufd allocated scope values are rejected. Lower
- * layers may reject scope values that are valid in the global iommufd
- * namespace, but not permitted for a specific bus.
+ * The request operation is guest initiated. Operations that may also be host
+ * initiated are handled through IOMMU_VDEVICE_TSM_OP instead. The TSM backend
+ * validates @tvm_arch against its bound TVM architecture assumptions.
*
* The request payload is read from @req_uptr/@req_len. If a response is
* expected, userspace provides @resp_uptr/@resp_len as writable storage for
@@ -1445,10 +1437,10 @@ enum iommu_vdevice_tsm_req_scope {
struct iommu_vdevice_tsm_req {
__u32 size;
__u32 vdevice_id;
- __u32 scope;
+ __u32 op;
+ __u32 tvm_arch;
__u32 req_len;
__u32 resp_len;
- __u32 __reserved;
__aligned_u64 req_uptr;
__aligned_u64 resp_uptr;
__aligned_u64 tsm_code;
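A minimal userspace sketch of how a VMM might populate the reworked request structure. The struct and enums are re-declared locally with fixed-width types to mirror the layout in the hunk above; a real caller would include <linux/iommufd.h> instead. The names tsm_req_mirror and fill_tsm_req() are hypothetical, and the ioctl() call itself is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the enums proposed in the hunk above. */
enum sketch_tvm_arch {
	SKETCH_TVM_ARCH_CCA = 1,
	SKETCH_TVM_ARCH_SEV,
	SKETCH_TVM_ARCH_TDX,
};

enum sketch_req_op {
	SKETCH_REQ_READ_OBJECT = 1,
	SKETCH_REQ_REGEN_OBJECT,
	SKETCH_REQ_OBJECT_INFO,
	SKETCH_REQ_VALIDATE_MMIO,
	SKETCH_REQ_SET_TDI_STATE,
	SKETCH_REQ_SEV_ENABLE_DMA,
	SKETCH_REQ_SEV_DISABLE_DMA,
};

/* Assumption: __aligned_u64 is modelled as an 8-byte aligned uint64_t. */
struct tsm_req_mirror {
	uint32_t size;
	uint32_t vdevice_id;
	uint32_t op;
	uint32_t tvm_arch;
	uint32_t req_len;
	uint32_t resp_len;
	_Alignas(8) uint64_t req_uptr;
	_Alignas(8) uint64_t resp_uptr;
	_Alignas(8) uint64_t tsm_code;
};

/* Six __u32 fields followed by three 64-bit fields: no padding expected. */
static_assert(sizeof(struct tsm_req_mirror) == 48, "unexpected padding");

/* Hypothetical helper: zero the request, then fill in the caller's fields. */
static void fill_tsm_req(struct tsm_req_mirror *r, uint32_t vdevice_id,
			 uint32_t op, uint32_t tvm_arch,
			 const void *req, uint32_t req_len,
			 void *resp, uint32_t resp_len)
{
	memset(r, 0, sizeof(*r));
	r->size = sizeof(*r);
	r->vdevice_id = vdevice_id;
	r->op = op;
	r->tvm_arch = tvm_arch;
	r->req_len = req_len;
	r->resp_len = resp_len;
	r->req_uptr = (uint64_t)(uintptr_t)req;
	r->resp_uptr = (uint64_t)(uintptr_t)resp;
	/* tsm_code is an output field and stays zero on input. */
}
```

Dropping @__reserved and adding @tvm_arch keeps the six 32-bit fields ahead of the pointers, so the 64-bit members stay naturally aligned at offset 24.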
^ permalink raw reply related
* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Xu Yilun @ 2026-06-02 5:36 UTC (permalink / raw)
To: Sohil Mehta
Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
linux-kernel, kvm, yilun.xu, baolu.lu, zhenzhong.duan, xiaoyao.li
In-Reply-To: <9e6107a9-71b1-4764-96f7-2d8e68060173@intel.com>
On Mon, Jun 01, 2026 at 01:17:59PM -0700, Sohil Mehta wrote:
>
> >>
> >> Let's say a future platform has a lot more features and needs
> >> significantly more memory. Wouldn't loading a legacy kernel with this
> >> default policy lead to excessive wastage?
> >
> > A legacy kernel won't consume Extensions memory. The Extensions memory
> > is only required by TDX module when add-ons features are explicitly
> > configured via TDH.SYS.CONFIG [1].
>
> So, the TDX module will only report memory_pool_required_pages for
> add-on features that have been configured by the kernel? This would be
Correct.
> good to clarify in the cover letter.
Will do.
>
> > For legacy kernel, no add-on features configured so no memory
> > consumption.
> >
>
> I was referring to the first kernel that has support for one TDX
> extension. I am mainly trying to ensure that a kernel with support for
> one TDX extension only consumes memory for that feature (even when it is
> loaded on a hardware platform that supports multiple TDX extensions).
Yes. The first kernel that supports one add-on feature will only
consume memory for that feature. The other HW/FW supported features
will not be configured, so they will not consume extra memory.
I think I should refactor the cover letter and changelogs based on all
these comments. Thanks for all the input; it helped me see what I missed.
>
> > But yes, if the features grow rapidly out of expectation, may need new
> > options to switch something off. I think if we discuss later when the
> > need actually arises.
> >
>
^ permalink raw reply
* RE: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Aneesh Kumar K.V @ 2026-06-02 6:05 UTC (permalink / raw)
To: Michael Kelley, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy (CS GROUP), Alexander Gordeev, Gerald Schaefer,
Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <SN6PR02MB415754E94A9505C2B9739E4DD4092@SN6PR02MB4157.namprd02.prod.outlook.com>
Michael Kelley <mhklinux@outlook.com> writes:
> From: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Sent: Thursday, May 21, 2026 9:28 PM
>>
>> Teach the atomic DMA pool code to distinguish between encrypted and
>> unencrypted pools, and make pool allocation select the matching pool based
>> on DMA attributes.
>>
>> Introduce a dma_gen_pool wrapper that records whether a pool is
>> unencrypted, initialize that state when the atomic pools are created, and
>> use it when expanding and resizing the pools. Update dma_alloc_from_pool()
>> to take attrs and skip pools whose encrypted state does not match
>> DMA_ATTR_CC_SHARED. Update dma_free_from_pool() accordingly.
>>
>> Also pass DMA_ATTR_CC_SHARED from the swiotlb atomic allocation path so
>> decrypted swiotlb allocations are taken from the correct atomic pool.
>>
>> Tested-by: Jiri Pirko <jiri@nvidia.com>
>> Reviewed-by: Mostafa Saleh <smostafa@google.com>
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> ---
>> drivers/iommu/dma-iommu.c | 2 +-
>> include/linux/dma-map-ops.h | 2 +-
>> kernel/dma/direct.c | 11 ++-
>> kernel/dma/pool.c | 167 +++++++++++++++++++++++-------------
>> kernel/dma/swiotlb.c | 7 +-
>> 5 files changed, 123 insertions(+), 66 deletions(-)
>>
>
> [snip]
>
>> +static __init struct dma_gen_pool *__dma_atomic_pool_init(struct dma_gen_pool *dma_pool,
>> + size_t pool_size, gfp_t gfp)
>> {
>> - struct gen_pool *pool;
>> int ret;
>>
>> - pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
>> - if (!pool)
>> + dma_pool->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
>> + if (!dma_pool->pool)
>> return NULL;
>>
>> - gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);
>> + gen_pool_set_algo(dma_pool->pool, gen_pool_first_fit_order_align, NULL);
>> +
>> + /* if platform is using memory encryption atomic pools are by default decrypted. */
>> + if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> + dma_pool->unencrypted = true;
>> + else
>> + dma_pool->unencrypted = false;
>
> I'm curious about the name of the "unencrypted" field in struct dma_gen_pool,
> and similarly in Patch 7 of the series for the swiotlb struct io_tlb_pool and
> struct io_tlb_mem. Up through v3 of this series, you used "decrypted", but
> starting in v4 switched to "unencrypted".
>
> To me, the above "if" statement has some cognitive dissonance in that if
> CC_ATTR_MEM_ENCRYPT is false (i.e., a normal VM), "unencrypted" is set
> to false. But I think of memory in a normal VM as "unencrypted" since it
> was never encrypted. A similar "if" statement occurs in your swiotlb changes.
>
> Two related concepts are captured by the field:
> 1) Is some action needed to put the memory into the unencrypted state,
> and to remove it from that state? This applies when assigning memory to the
> pool, or freeing the memory in the pool.
> 2) Is the memory currently in the unencrypted state? This applies when
> allocating memory from the pool to a caller.
>
> It's hard to capture all that in a short field name. But I think I prefer "decrypted"
> over "unencrypted". The former implies that some action was taken. It's a
> little easier to think of a normal VM as *not* having decrypted memory. The
> memory was never encrypted in the first place, so no decryption action was taken.
>
> Throughout the kernel, "decrypted" occurs much more frequently than
> "unencrypted". We have set_memory_encrypted() and set_memory_decrypted()
> that are "take action" names. But we also have force_dma_unencrypted(),
> phys_to_dma_unencrypted(), and dma_addr_unencrypted(). So it's a bit
> of a mess.
>
>
> But maybe there's more background here that led to the change
> between your v3 and v4.
>
> Michael
The current APIs, phys_to_dma_unencrypted() and dma_addr_unencrypted(),
are the reason I changed the pool attribute name from decrypted to
unencrypted. The rationale was that nobody actually decrypted the
memory; the memory was already in an unencrypted state.
In other words, the DMA pool did not contain encrypted content that was
later decrypted. Rather, the DMA pool itself was in an unencrypted
state.
IMHO, set_memory_decrypted()/set_memory_encrypted() is the right naming
because those APIs describe an operation that transitions memory between
states. In contrast, the pool attribute describes the state of the
memory itself, which is why I used unencrypted rather than decrypted.
-aneesh
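The state-versus-action distinction above can be illustrated with a small userspace model of pool selection. This is only a sketch: struct sketch_dma_pool, sketch_pick_pool() and the bit value of SKETCH_DMA_ATTR_CC_SHARED are hypothetical stand-ins, not the definitions from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit value; the series defines the real DMA_ATTR_CC_SHARED. */
#define SKETCH_DMA_ATTR_CC_SHARED (1UL << 0)

struct sketch_dma_pool {
	const char *name;
	bool unencrypted;	/* state of the backing memory, not an action */
};

/*
 * Return the first pool whose memory state matches what the caller asked
 * for via attrs, or NULL if no pool matches.
 */
static const struct sketch_dma_pool *
sketch_pick_pool(const struct sketch_dma_pool *pools, size_t n,
		 unsigned long attrs)
{
	bool want_unencrypted = (attrs & SKETCH_DMA_ATTR_CC_SHARED) != 0;

	assert(pools != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (pools[i].unencrypted == want_unencrypted)
			return &pools[i];
	}
	return NULL;
}
```

The comparison reads the field as a description of the memory, which is the argument made above for "unencrypted": the allocator never asks whether anyone performed a decryption, only what state the pool is in.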
^ permalink raw reply
* RE: [PATCH v5 10/20] dma-direct: make dma_direct_map_phys() honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-06-02 6:10 UTC (permalink / raw)
To: Michael Kelley, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy (CS GROUP), Alexander Gordeev, Gerald Schaefer,
Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <SN6PR02MB41574064D14D4A2734222C51D40B2@SN6PR02MB4157.namprd02.prod.outlook.com>
Michael Kelley <mhklinux@outlook.com> writes:
> From: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Sent: Thursday, May 21, 2026 9:28 PM
>>
>> Teach dma_direct_map_phys() to select the DMA address encoding based on
>> DMA_ATTR_CC_SHARED.
>>
>> Use phys_to_dma_unencrypted() for decrypted mappings and
>> phys_to_dma_encrypted() otherwise. If a device requires unencrypted DMA
>> but the source physical address is still encrypted, force the mapping
>> through swiotlb so the DMA address and backing memory attributes remain
>> consistent.
>>
>> Update the arm64, x86, s390 and powerpc secure-guest setup to not use
>> swiotlb force option
>>
>> Tested-by: Jiri Pirko <jiri@nvidia.com>
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
...
> With this patch removing SWIOTLB_FORCE from four places in
> kernel code, there are no remaining places where it is set.
> The test of SWIOTLB_FORCE could be removed from
> swiotlb_init_remap(), and its definition could be deleted
> from include/linux/swiotlb.h.
>
Sure, I’ll add that as a separate patch in the series.
-aneesh
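For readers following the thread, the mapping decision described in the quoted commit message can be modelled roughly as below. This is a sketch under stated assumptions: the enum and sketch_choose_path() are hypothetical, and the two booleans stand in for "DMA_ATTR_CC_SHARED was passed" and "the device requires unencrypted DMA".

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_map_path {
	SKETCH_MAP_ENCRYPTED,	/* would use phys_to_dma_encrypted() */
	SKETCH_MAP_UNENCRYPTED,	/* would use phys_to_dma_unencrypted() */
	SKETCH_MAP_SWIOTLB,	/* bounce through swiotlb */
};

/*
 * Assumption: attr_cc_shared means the source memory is already shared
 * (decrypted), so it can be mapped directly with the unencrypted encoding.
 */
static enum sketch_map_path
sketch_choose_path(bool attr_cc_shared, bool dev_needs_unencrypted)
{
	if (attr_cc_shared)
		return SKETCH_MAP_UNENCRYPTED;

	/* Source is still encrypted but the device cannot access it. */
	if (dev_needs_unencrypted)
		return SKETCH_MAP_SWIOTLB;

	return SKETCH_MAP_ENCRYPTED;
}
```

With the bounce decision made per mapping like this, a global swiotlb force option is no longer needed to keep the DMA address and the backing memory attributes consistent.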
^ permalink raw reply
* Re: [PATCH v5 5/5] iommufd/vdevice: add TSM request ioctl
From: Alexey Kardashevskiy @ 2026-06-02 8:40 UTC (permalink / raw)
To: Dan Williams (nvidia), Aneesh Kumar K.V (Arm), linux-coco, iommu,
linux-kernel, kvm
Cc: Bjorn Helgaas, Jason Gunthorpe, Joerg Roedel, Jonathan Cameron,
Kevin Tian, Nicolin Chen, Samuel Ortiz, Steven Price,
Suzuki K Poulose, Will Deacon, Xu Yilun, Shameer Kolothum,
Paolo Bonzini, Tony Krowiak, Halil Pasic, Jason Herne,
Harald Freudenberger, Holger Dengler, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Alex Williamson, Matthew Rosato, Farhan Ali,
Eric Farman, linux-s390
In-Reply-To: <6a168c8ea7d10_2129b2100e@djbw-dev.notmuch>
On 27/5/26 16:17, Dan Williams (nvidia) wrote:
>
> Alexey Kardashevskiy wrote:
>>
>>
>> On 26/5/26 01:48, Aneesh Kumar K.V (Arm) wrote:
>>> Add IOMMU_VDEVICE_TSM_REQUEST for issuing TSM guest request/response
>>> transactions against an iommufd vdevice.
>>>
>>> The ioctl takes a vdevice_id plus request/response user buffers and length
>>> fields, and forwards the request through tsm_guest_req() to the PCI TSM
>>> backend. This provides the host-side passthrough path used by CoCo guests
>>> for TSM device attestation and acceptance flows after the device has been
>>> bound to TSM.
>>>
>>> Also add the supporting tsm_guest_req() helper and associated TSM core
>>> interface definitions.
>>>
>>> Based on changes from: Alexey Kardashevskiy <aik@amd.com>
>>>
>>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>>> ---
>>> drivers/iommu/iommufd/iommufd_private.h | 6 ++
>>> drivers/iommu/iommufd/main.c | 3 +
>>> drivers/iommu/iommufd/tsm.c | 68 +++++++++++++++++++++
>>> drivers/virt/coco/tsm-core.c | 39 ++++++++++++
>>> include/linux/pci-tsm.h | 9 +--
>>> include/linux/tsm.h | 25 ++++++++
>>> include/uapi/linux/iommufd.h | 80 +++++++++++++++++++++++++
>>> 7 files changed, 226 insertions(+), 4 deletions(-)
> [..]
>>> diff --git a/drivers/iommu/iommufd/tsm.c b/drivers/iommu/iommufd/tsm.c
>>> index 09ee668dbed9..342fbdb6a6b9 100644
>>> --- a/drivers/iommu/iommufd/tsm.c
>>> +++ b/drivers/iommu/iommufd/tsm.c
>>> @@ -60,3 +60,71 @@ int iommufd_vdevice_tsm_op_ioctl(struct iommufd_ucmd *ucmd)
>>> iommufd_put_object(ucmd->ictx, &vdev->obj);
>>> return rc;
>>> }
>>> +
>>> +static bool iommufd_vdevice_tsm_req_scope_valid(u32 scope)
>>> +{
>>> + if (scope > IOMMU_VDEVICE_TSM_REQ_SCOPE_PCI_LAST)
>>> + return false;
>>> +
>>> + switch (scope) {
>>> + case IOMMU_VDEVICE_TSM_REQ_PCI_INFO:
>>> + case IOMMU_VDEVICE_TSM_REQ_PCI_STATE_CHANGE:
>>> + case IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_READ:
>>> + case IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_WRITE:
>>
>> This scope thing still needs clarification.
>>
>> I have 3 types of requests to fit here, all go via VM -> KVM -> QEMU -> IOMMUFD -> TSM.
>>
>> 1) bind/unbind TDI <- moves to CONFIG_LOCKED, this is "OP";
>> 2) start/stop TDI <- moves to RUN, this is "GR"? Right now I route it via "OP";
>> 3) enable/disable MMIO/DMA <- no TDI state change, this is "GR" but which scope is it here?
>
> The scope parameter was meant to enumerate a security model for classes
> of commands that are otherwise opaque to the kernel. However, none of
> the commands we are targeting are opaque (private specification with
> unknown effect). It now turns out there is no role for @scope for
> security.
>
> Now a command family that iommufd can validate seems useful. As it
> stands this implementation aliases command codes across TSMs. Do we
> proceed with creating an actual shared command uapi for the truly shared
> commands:
>
> TSM_REQ_TYPE_DEFAULT: Commands every arch needs
> TSM_REQ_READ_OBJECT
> TSM_REQ_REGEN_OBJECT
> TSM_REQ_OBJECT_INFO
These 3 are already in that netlink interface of the TSM (so common for all arches), right?
> TSM_REQ_VALIDATE_MMIO
SEV handles this in KVM, since that is where the RMP and NPT are managed, plus an opaque guest request to the TSM; I'd think it is the same for the others.
> TSM_REQ_SET_TDI_STATE
This is a common one.
> TSM_REQ_TYPE_SEV: Commands only SEV needs
> TSM_REQ_SEV_ENABLE_DMA
> TSM_REQ_SEV_DISABLE_DMA
Is there no change to the host-owned part of the IOMMU when TDX or CCA moves the device to secure? Or is it packed into those opaque requests to the TSM?
> ...or just observe that per CC arch commands are needed to setup the VM
> so per CC arch commands are needed to marshal device assignment support
> requests.
>
> In that case pci_tsm_req_scope becomes tsm_req_type and is just:
>
> TSM_REQ_TYPE_CCA
> TSM_REQ_TYPE_SEV
> TSM_REQ_TYPE_TDX
>
> I am leaning towards the latter at this point.
Dunno, besides the DMA thing, these CCA/SEV/TDX types will only appear in WARN_ON of the arch TSM drivers and will not really be seen. If a wrong TSM driver is loaded (say, TDX on AMD), then something just went terribly wrong. Thanks,
--
Alexey
^ permalink raw reply
* Re: [RFC PATCH v4 01/14] coco: host: arm64: Add host TSM callback and IDE stream allocation support
From: Aneesh Kumar K.V @ 2026-06-02 8:42 UTC (permalink / raw)
To: Dan Williams (nvidia), linux-coco, kvmarm, linux-arm-kernel,
linux-kernel
Cc: Alexey Kardashevskiy, Catalin Marinas, Dan Williams,
Jason Gunthorpe, Jonathan Cameron, Marc Zyngier, Samuel Ortiz,
Steven Price, Suzuki K Poulose, Will Deacon, Xu Yilun
In-Reply-To: <6a17d6f1d6371_2b1fb710057@djbw-dev.notmuch>
"Dan Williams (nvidia)" <djbw@kernel.org> writes:
> Aneesh Kumar K.V (Arm) wrote:
>> Register the TSM callback when the DA feature is supported by KVM.
>>
>> This driver handles IDE stream setup for both the root port and PCIe
>> endpoints. Root port IDE stream enablement itself is managed by RMM.
>>
>> In addition, the driver registers pci_tsm_ops with the TSM subsystem.
>
> Do you want to call out that this is an infrastructure / scaffolding
> patch that only handles the PCI-TSM skeleton. The CCA meat comes later,
> in particular IDE key management. Tell a bit more of the story
>
> Otherwise, mostly looks good.
>
Sure, I’ll update the commit message.
>
> Minor comments below...
>
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> ---
>> arch/arm64/include/asm/rmi_smc.h | 2 +
>> drivers/firmware/smccc/rmm.c | 12 ++
>> drivers/firmware/smccc/rmm.h | 8 +
>> drivers/firmware/smccc/smccc.c | 1 +
>> drivers/virt/coco/Kconfig | 2 +
>> drivers/virt/coco/Makefile | 1 +
>> drivers/virt/coco/arm-cca-host/Kconfig | 19 ++
>> drivers/virt/coco/arm-cca-host/Makefile | 5 +
>> drivers/virt/coco/arm-cca-host/arm-cca.c | 225 +++++++++++++++++++++++
>> drivers/virt/coco/arm-cca-host/rmi-da.h | 46 +++++
>> 10 files changed, 321 insertions(+)
>> create mode 100644 drivers/virt/coco/arm-cca-host/Kconfig
>> create mode 100644 drivers/virt/coco/arm-cca-host/Makefile
>> create mode 100644 drivers/virt/coco/arm-cca-host/arm-cca.c
>> create mode 100644 drivers/virt/coco/arm-cca-host/rmi-da.h
>>
>> diff --git a/arch/arm64/include/asm/rmi_smc.h b/arch/arm64/include/asm/rmi_smc.h
>> index fa23818e1b4c..109d6cc6ef37 100644
>> --- a/arch/arm64/include/asm/rmi_smc.h
>> +++ b/arch/arm64/include/asm/rmi_smc.h
> [..]
>> diff --git a/drivers/firmware/smccc/rmm.c b/drivers/firmware/smccc/rmm.c
>> index 2a6187df3285..7444cc3a588c 100644
>> --- a/drivers/firmware/smccc/rmm.c
>> +++ b/drivers/firmware/smccc/rmm.c
> [..]
>> diff --git a/drivers/firmware/smccc/rmm.h b/drivers/firmware/smccc/rmm.h
>> index a47a650d4f51..37d0d95a099e 100644
>> --- a/drivers/firmware/smccc/rmm.h
>> +++ b/drivers/firmware/smccc/rmm.h
> [..]
>> diff --git a/drivers/firmware/smccc/smccc.c b/drivers/firmware/smccc/smccc.c
>> index fc9b44b7c687..2bf2d59e686d 100644
>> --- a/drivers/firmware/smccc/smccc.c
>> +++ b/drivers/firmware/smccc/smccc.c
>> @@ -97,6 +97,7 @@ static int __init smccc_devices_init(void)
>> * the required SMCCC function IDs at a supported revision.
>> */
>> register_rsi_device(pdev);
>> + register_rmi_device(pdev);
>> }
>
> Would splitting the above three hunks make this series stand on its own
> relative to the base CCA series? I assume likely not as soon as we get
> to patch2.
>
> Otherwise, just curious what your intended merge strategy is for this,
> tsm.git or arm.git, and what help this needs?
>
> [..]
> snip code that looks good.
>
Yes, I’ll split this into a separate patch.
>
>> diff --git a/drivers/virt/coco/arm-cca-host/Makefile b/drivers/virt/coco/arm-cca-host/Makefile
>> new file mode 100644
>> index 000000000000..c236827f002c
>> --- /dev/null
>> +++ b/drivers/virt/coco/arm-cca-host/Makefile
>> @@ -0,0 +1,5 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +#
>> +obj-$(CONFIG_ARM_CCA_HOST) += arm-cca-host.o
>> +
>> +arm-cca-host-y += arm-cca.o
>> diff --git a/drivers/virt/coco/arm-cca-host/arm-cca.c b/drivers/virt/coco/arm-cca-host/arm-cca.c
>> new file mode 100644
>> index 000000000000..67f7e80106e8
>> --- /dev/null
>> +++ b/drivers/virt/coco/arm-cca-host/arm-cca.c
>> @@ -0,0 +1,225 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2026 ARM Ltd.
>> + */
>> +
>> +#include <linux/auxiliary_bus.h>
>> +#include <linux/pci-tsm.h>
>> +#include <linux/pci-ide.h>
>> +#include <linux/module.h>
>> +#include <linux/pci.h>
>> +#include <linux/tsm.h>
>> +#include <linux/vmalloc.h>
>> +#include <linux/cleanup.h>
>> +
>> +#include "rmi-da.h"
>> +
>> +/* Total number of stream id supported at root port level */
>> +#define MAX_STREAM_ID 256
>> +
>> +static struct pci_tsm *cca_tsm_pci_probe(struct tsm_dev *tsm_dev, struct pci_dev *pdev)
>> +{
>> + int ret;
>> +
>> + if (!is_pci_tsm_pf0(pdev)) {
>> + struct cca_host_fn_dsc *fn_dsc __free(kfree) =
>> + kzalloc(sizeof(*fn_dsc), GFP_KERNEL);
>
> kzalloc_obj(*fn_dsc)
>
>> +
>> + if (!fn_dsc)
>> + return NULL;
>> +
>> + ret = pci_tsm_link_constructor(pdev, &fn_dsc->pci, tsm_dev);
>> + if (ret)
>> + return NULL;
>> +
>> + return &no_free_ptr(fn_dsc)->pci;
>> + }
>> +
>> + if (!pdev->ide_cap)
>> + return NULL;
>
> Bailing early?
>
> Maybe the RMM knows something about this device not needing IDE? I have
> a similar question in patch2 around trusted sources for whether a device
> is internal or not.
>
Yes. This gets updated later in
https://lore.kernel.org/all/20260427065121.916615-14-aneesh.kumar@kernel.org
>
>> +
>> + struct cca_host_pf0_ep_dsc *pf0_ep_dsc __free(kfree) =
>> + kzalloc(sizeof(*pf0_ep_dsc), GFP_KERNEL);
>> + if (!pf0_ep_dsc)
>> + return NULL;
>> +
>> + ret = pci_tsm_pf0_constructor(pdev, &pf0_ep_dsc->pci, tsm_dev);
>> + if (ret)
>> + return NULL;
>> +
>> + pci_dbg(pdev, "tsm enabled\n");
>> + return &no_free_ptr(pf0_ep_dsc)->pci.base_tsm;
>> +}
>> +
>> +static void cca_tsm_pci_remove(struct pci_tsm *tsm)
>> +{
>> + struct pci_dev *pdev = tsm->pdev;
>> +
>> + if (is_pci_tsm_pf0(pdev)) {
>> + struct cca_host_pf0_ep_dsc *pf0_ep_dsc = to_cca_pf0_ep_dsc(pdev);
>> +
>> + pci_tsm_pf0_destructor(&pf0_ep_dsc->pci);
>> + kfree(pf0_ep_dsc);
>> + } else {
>> + kfree(to_cca_fn_dsc(pdev));
>> + }
>> +}
>> +
>> +/* For now global for simplicity. Protected by pci_tsm_rwsem */
>> +static DECLARE_BITMAP(cca_stream_ids, MAX_STREAM_ID);
>> +static int alloc_stream_id(struct pci_host_bridge *hb)
>> +{
>> + int stream_id;
>> +
>> +redo_alloc:
>> + stream_id = find_first_zero_bit(cca_stream_ids, MAX_STREAM_ID);
>> + if (stream_id == MAX_STREAM_ID)
>> + return stream_id;
>> +
>> + if (ida_exists(&hb->ide_stream_ids_ida, stream_id)) {
>> + /* mark the stream allocated in the global bitmap. */
>> + set_bit(stream_id, cca_stream_ids);
>> + goto redo_alloc;
>> + }
>> + return stream_id;
>
> Is 256 total an RMM limit, and/or does it require globally unique
> stream-ids? If not you could do what SEV-TIO does and just set stream-id
> == stream-index.
>
Yes, I’ll switch to that.
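For reference, the quoted allocation loop can be modelled in userspace as follows. This is a sketch only: struct sketch_ids and sketch_alloc_stream_id() are hypothetical, plain bool arrays stand in for the kernel bitmap and the host bridge IDA, and the stream-id == stream-index alternative is not shown.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_STREAM_ID 256

struct sketch_ids {
	bool used[SKETCH_MAX_STREAM_ID];
};

/*
 * Return the first id that is free in both the global set and the host
 * bridge set, or SKETCH_MAX_STREAM_ID if none is left. Ids already claimed
 * on the bridge are marked in the global set on the way, as the quoted
 * redo_alloc loop does. The caller marks the returned id as used.
 */
static int sketch_alloc_stream_id(struct sketch_ids *global,
				  const struct sketch_ids *bridge)
{
	assert(global && bridge);
	for (int id = 0; id < SKETCH_MAX_STREAM_ID; id++) {
		if (global->used[id])
			continue;
		if (bridge->used[id]) {
			global->used[id] = true;
			continue;
		}
		return id;
	}
	return SKETCH_MAX_STREAM_ID;
}
```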
>
>> +}
>> +
>> +static inline bool cca_pdev_need_sel_ide_streams(struct pci_dev *pdev)
>> +{
>> + return pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT;
>> +}
>> +
>> +static int cca_tsm_connect(struct pci_dev *pdev)
>> +{
>> + struct pci_dev *rp = pcie_find_root_port(pdev);
>> + struct cca_host_pf0_ep_dsc *pf0_ep_dsc;
>> + struct pci_ide *ide;
>> + int ret, stream_id = 0;
>> +
>> + /* Only function 0 supports connect in host */
>> + if (WARN_ON(!is_pci_tsm_pf0(pdev)))
>> + return -EIO;
>> +
>> + pf0_ep_dsc = to_cca_pf0_ep_dsc(pdev);
>> + if (cca_pdev_need_sel_ide_streams(pdev)) {
>> + /* Allocate stream id */
>> + stream_id = alloc_stream_id(pci_find_host_bridge(pdev->bus));
>> + if (stream_id == MAX_STREAM_ID)
>> + return -EBUSY;
>> + set_bit(stream_id, cca_stream_ids);
>> +
>> + ide = pci_ide_stream_alloc(pdev);
>> + if (!ide) {
>> + ret = -ENOMEM;
>> + goto err_stream_alloc;
>> + }
>> +
>> + pf0_ep_dsc->sel_stream = ide;
>> + ide->stream_id = stream_id;
>> + ret = pci_ide_stream_register(ide);
>> + if (ret)
>> + goto err_stream;
>> + /*
>> + * Configure IDE capability for target device
>> + *
>> + * Some test devices work only with DEFAULT_STREAM enabled.
>> + * For simplicity, enable DEFAULT_STREAM for all devices. A
>> + * future decent solution may be to have a quirk table to
>> + * specify which devices need DEFAULT_STREAM.
>> + */
>> + ide->partner[PCI_IDE_EP].default_stream = 1;
>> + pci_ide_stream_setup(pdev, ide);
>> + pci_ide_stream_setup(rp, ide);
>> +
>> + ret = tsm_ide_stream_register(ide);
>> + if (ret)
>> + goto err_tsm;
>> +
>> + /*
>> + * Once ide is setup, enable the stream at the endpoint
>> + * Root port will be done by RMM
>> + */
>> + pci_ide_stream_enable(pdev, ide);
>
> The end point of these patches follows the spec recommendation of
> delaying enable until after key programming.
>
>> + }
>> + return 0;
>
> Should this be making security claims to userspace without taking any
> action for non-endpoint devices that happen to be passed in?
>
> Thinking about a bisection case this should either fail here, print a
> message that is removed in the final enabling patch, or do the
> __maybe_unused arrangement to land all the CCA bits first and then do
> this hookup. Up to you.
Will do the latter, i.e., I’ll call tsm_register() only in the final
patch.
-aneesh
^ permalink raw reply
* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Suzuki K Poulose @ 2026-06-02 8:55 UTC (permalink / raw)
To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-7-2f0fae496530@google.com>
On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> All-shared guest_memfd used to be only supported for non-CoCo VMs where
> preparation doesn't apply. INIT_SHARED is about to be supported for
> non-CoCo VMs in a later patch in this series.
nit: s/non-CoCo/CoCo ?
>
> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> guest_memfd in a later patch in this series.
>
> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> shared folio for a CoCo VM where preparation applies.
>
> Add a check to make sure that preparation is only performed for private
> folios.
>
> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
> conversion to shared.
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
nit: Missing Co-developed-by: ?
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> virt/kvm/guest_memfd.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 78e5435967341..adf57a3a1f5dd 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -894,6 +894,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> int *max_order)
> {
> pgoff_t index = kvm_gmem_get_index(slot, gfn);
> + struct inode *inode;
> struct folio *folio;
> int r = 0;
>
> @@ -901,7 +902,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> if (!file)
> return -EFAULT;
>
> - filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> + inode = file_inode(file);
> + filemap_invalidate_lock_shared(inode->i_mapping);
>
> folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> if (IS_ERR(folio)) {
> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> folio_mark_uptodate(folio);
> }
>
> - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> + if (kvm_gmem_is_private_mem(inode, index))
Don't we need to make sure the entire folio is private ? Not just the
page at the index ?
if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?
Suzuki
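A whole-range check of the kind asked about here might look like the userspace model below. It is only a sketch: struct sketch_gmem and sketch_range_is_private() are hypothetical, and a bool array stands in for the per-page private/shared attribute tracked by guest_memfd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_gmem {
	const bool *is_private;	/* one entry per page in the file */
	size_t nr_pages;
};

/* True only if every page in [index, index + nr) is private. */
static bool sketch_range_is_private(const struct sketch_gmem *g,
				    size_t index, size_t nr)
{
	assert(g != NULL);

	/* Reject ranges that run past the end of the file. */
	if (index > g->nr_pages || nr > g->nr_pages - index)
		return false;

	for (size_t i = 0; i < nr; i++) {
		if (!g->is_private[index + i])
			return false;
	}
	return true;
}
```

A single shared page inside a large folio makes the check fail for the whole folio, which is the case the single-index test in the patch would not notice.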
> + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>
> folio_unlock(folio);
>
> @@ -924,7 +927,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> folio_put(folio);
>
> out:
> - filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> + filemap_invalidate_unlock_shared(inode->i_mapping);
> return r;
> }
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>
^ permalink raw reply
* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Suzuki K Poulose @ 2026-06-02 9:10 UTC (permalink / raw)
To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <d01cf1ec-b85d-4af6-9810-8107c0e2a4ec@arm.com>
On 02/06/2026 09:55, Suzuki K Poulose wrote:
> On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> All-shared guest_memfd used to be only supported for non-CoCo VMs where
>> preparation doesn't apply. INIT_SHARED is about to be supported for
>> non-CoCo VMs in a later patch in this series.
>
> nit: s/non-CoCo/CoCo ?
>
>>
>> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
>> guest_memfd in a later patch in this series.
>>
>> This means that the kvm fault handler may now call kvm_gmem_get_pfn()
>> on a
>> shared folio for a CoCo VM where preparation applies.
>>
>> Add a check to make sure that preparation is only performed for private
>> folios.
>>
>> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
>> conversion to shared.
>>
>> Signed-off-by: Michael Roth <michael.roth@amd.com>
>
> nit: Missing Co-Developed-by: ?
>
>> Reviewed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>> virt/kvm/guest_memfd.c | 9 ++++++---
>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 78e5435967341..adf57a3a1f5dd 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -894,6 +894,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> int *max_order)
>> {
>> pgoff_t index = kvm_gmem_get_index(slot, gfn);
>> + struct inode *inode;
>> struct folio *folio;
>> int r = 0;
>> @@ -901,7 +902,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> if (!file)
>> return -EFAULT;
>> - filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>> + inode = file_inode(file);
>> + filemap_invalidate_lock_shared(inode->i_mapping);
>> folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
>> if (IS_ERR(folio)) {
>> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> folio_mark_uptodate(folio);
>> }
>> - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>> + if (kvm_gmem_is_private_mem(inode, index))
>
> Don't we need to make sure the entire folio is private ? Not just the
> page at the index ?
> if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?
Or rather, should we go through the individual pages and apply the
preparation only to the ones that are private?
Suzuki
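A minimal user-space sketch of the two options raised above, for
illustration only. It assumes a hypothetical per-page "private" array in
place of the real guest_memfd attributes; struct gmem_model and both
model_*() helpers are invented names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-page attributes: true means private. */
struct gmem_model {
	const bool *is_private;	/* one entry per page index */
	size_t nr_pages;
};

/* Option 1: true only if every page in [index, index + nr) is private. */
static bool model_range_is_private(const struct gmem_model *m,
				   size_t index, size_t nr)
{
	assert(index + nr <= m->nr_pages);

	for (size_t i = 0; i < nr; i++)
		if (!m->is_private[index + i])
			return false;
	return true;
}

/*
 * Option 2: walk the pages of the folio and "prepare" only the private
 * ones. Returns the number of pages that were marked prepared.
 */
static size_t model_prepare_private_pages(const struct gmem_model *m,
					  size_t index, size_t nr,
					  bool *prepared)
{
	size_t count = 0;

	assert(index + nr <= m->nr_pages);

	for (size_t i = 0; i < nr; i++) {
		if (m->is_private[index + i]) {
			prepared[index + i] = true;
			count++;
		}
	}
	return count;
}
```

With a mixed folio, option 1 rejects the whole range while option 2
prepares only the private pages; which behaviour is right for
kvm_gmem_get_pfn() is the open question in this thread.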
>
> Suzuki
>
>> + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>> folio_unlock(folio);
>> @@ -924,7 +927,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> folio_put(folio);
>> out:
>> - filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>> + filemap_invalidate_unlock_shared(inode->i_mapping);
>> return r;
>> }
>> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>>
>
^ permalink raw reply
* Re: [PATCH v14 13/44] arm64: RMI: Define the user ABI
From: Suzuki K Poulose @ 2026-06-02 11:15 UTC (permalink / raw)
To: Marc Zyngier, Steven Price
Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <86jysovpxf.wl-maz@kernel.org>
Hi Marc
On 27/05/2026 16:21, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:21 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> There is one CAP which identified the presence of CCA, and one ioctl.
>> The ioctl is used to populate memory during creation of the realm as
>> this requires the RMM to copy data from an unprotected address to the
>> protected memory - CCA does not support memory conversion where the
>> memory contents is preserved as this is incompatible with memory
>> encryption.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>> * KVM_ARM_VCPU_RMI_PSCI_COMPLETE removed.
>> * KVM_ARM_RMI_POPULATE documentation updated to reflect that the
>> structure is written by the kernel.
>> * CAP number bumped.
>> Changes since v12:
>> * Change KVM_ARM_RMI_POPULATE to update the structure with the amount
>> that has been progressed rather than return the number of bytes
>> populated.
>> * Describe the flag KVM_ARM_RMI_POPULATE_FLAGS_MEASURE.
>> * CAP number is bumped.
>> * NOTE: The PSCI ioctl may be removed in a future spec release.
>> Changes since v11:
>> * Completely reworked to be more implicit. Rather than having explicit
>> CAP operations to progress the realm construction these operations
>> are done when needed (on populating and on first vCPU run).
>> * Populate and PSCI complete are promoted to proper ioctls.
>> Changes since v10:
>> * Rename symbols from RME to RMI.
>> Changes since v9:
>> * Improvements to documentation.
>> * Bump the magic number for KVM_CAP_ARM_RME to avoid conflicts.
>> Changes since v8:
>> * Minor improvements to documentation following review.
>> * Bump the magic numbers to avoid conflicts.
>> Changes since v7:
>> * Add documentation of new ioctls
>> * Bump the magic numbers to avoid conflicts
>> Changes since v6:
>> * Rename some of the symbols to make their usage clearer and avoid
>> repetition.
>> Changes from v5:
>> * Actually expose the new VCPU capability (KVM_ARM_VCPU_REC) by bumping
>> KVM_VCPU_MAX_FEATURES - note this also exposes KVM_ARM_VCPU_HAS_EL2!
>> ---
>> Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++++++++++++++
>> include/uapi/linux/kvm.h | 13 +++++++++++
>> 2 files changed, 53 insertions(+)
>
> $SUBJECT looks wrong. This is a KVM change, not an RMI change.
>
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 52bbbb553ce1..ca68aae7faa2 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6553,6 +6553,37 @@ KVM_S390_KEYOP_SSKE
>> Sets the storage key for the guest address ``guest_addr`` to the key
>> specified in ``key``, returning the previous value in ``key``.
>>
>> +4.145 KVM_ARM_RMI_POPULATE
>> +--------------------------
>> +
>> +:Capability: KVM_CAP_ARM_RMI
>> +:Architectures: arm64
>> +:Type: vm ioctl
>> +:Parameters: struct kvm_arm_rmi_populate (in/out)
>> +:Returns: 0 on success, < 0 on error
>> +
>> +::
>> +
>> + struct kvm_arm_rmi_populate {
>> + __u64 base;
>> + __u64 size;
>> + __u64 source_uaddr;
>> + __u32 flags;
>> + __u32 reserved;
>> + };
>> +
>> +Populate a region of protected address space by copying the data from the
>> +(non-protected) user space pointer provided into a protected region (backed by
>> +guestmem_fd). It implicitly sets the destination region to RIPAS RAM. This is
>> +only valid before any VCPUs have been run. The ioctl might not populate the
>> +entire region and in this case the kernel updates the fields `base`, `size` and
>> +`source_uaddr`. User space may have to repeatedly call it until `size` is 0 to
>> +populate the entire region.
>> +
>> +`flags` can be set to `KVM_ARM_RMI_POPULATE_FLAGS_MEASURE` to request that the
>> +populated data is hashed and added to the guest's Realm Initial Measurement
>> +(RIM).
>
> Where is that measurement stored? And retrieved? At least a pointer to
> that would help.
The measurement is stored by the RMM and is made available to the guest
via the RSI interface (RSI_ATTEST_TOKEN_{INIT,CONTINUE}) as part of the
attestation report, along with the platform attestation. On a Linux guest,
it can be fetched using the TSM report infrastructure. This could be
added to the doc.
Suzuki
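For reference, the retry behaviour described in the quoted
KVM_ARM_RMI_POPULATE documentation (call again until `size` reaches 0)
could look roughly like the sketch below. It is a user-space model only:
the struct is a local copy of the quoted layout, and model_populate_once()
is a hypothetical stand-in for ioctl(vm_fd, KVM_ARM_RMI_POPULATE, &args)
that makes at most MODEL_CHUNK bytes of progress per call. The chunk size
is an assumption, not something the patch specifies.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the layout quoted above (normally from <linux/kvm.h>). */
struct rmi_populate_model {
	uint64_t base;
	uint64_t size;
	uint64_t source_uaddr;
	uint32_t flags;
	uint32_t reserved;
};

#define MODEL_CHUNK 4096u	/* hypothetical per-call progress limit */

/*
 * Hypothetical stand-in for the ioctl: populates part of the region and
 * updates base, size and source_uaddr to describe what is left. The real
 * ioctl can also fail with a negative value; this model never does.
 */
static int model_populate_once(struct rmi_populate_model *args)
{
	uint64_t done = args->size < MODEL_CHUNK ? args->size : MODEL_CHUNK;

	args->base += done;
	args->source_uaddr += done;
	args->size -= done;
	return 0;
}

/* Retry until the whole region is populated; returns the call count. */
static int model_populate_all(struct rmi_populate_model *args)
{
	int calls = 0;

	while (args->size != 0) {
		if (model_populate_once(args) < 0)
			return -1;
		calls++;
	}
	return calls;
}
```

Because the kernel writes the remaining range back into the structure,
the caller can reuse the same structure for every retry without
recomputing offsets.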
>
>> +
>> .. _kvm_run:
>>
>> 5. The kvm_run structure
>> @@ -8904,6 +8935,15 @@ helpful if user space wants to emulate instructions which are not
>> This capability can be enabled dynamically even if VCPUs were already
>> created and are running.
>>
>> +7.47 KVM_CAP_ARM_RMI
>> +--------------------
>> +
>> +:Architectures: arm64
>> +:Target: VM
>> +:Parameters: None
>> +
>> +This capability indicates that support for CCA realms is available.
>> +
>> 8. Other capabilities.
>> ======================
>>
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 6c8afa2047bf..b8cff0938041 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -996,6 +996,7 @@ struct kvm_enable_cap {
>> #define KVM_CAP_S390_USER_OPEREXEC 246
>> #define KVM_CAP_S390_KEYOP 247
>> #define KVM_CAP_S390_VSIE_ESAMODE 248
>> +#define KVM_CAP_ARM_RMI 249
>>
>> struct kvm_irq_routing_irqchip {
>> __u32 irqchip;
>> @@ -1669,4 +1670,16 @@ struct kvm_pre_fault_memory {
>> __u64 padding[5];
>> };
>>
>> +/* Available with KVM_CAP_ARM_RMI, only for VMs with KVM_VM_TYPE_ARM_REALM */
>> +#define KVM_ARM_RMI_POPULATE _IOWR(KVMIO, 0xd7, struct kvm_arm_rmi_populate)
>> +#define KVM_ARM_RMI_POPULATE_FLAGS_MEASURE (1 << 0)
>> +
>> +struct kvm_arm_rmi_populate {
>> + __u64 base;
>> + __u64 size;
>> + __u64 source_uaddr;
>> + __u32 flags;
>> + __u32 reserved;
>> +};
>> +
>> #endif /* __LINUX_KVM_H */
>
> Thanks,
>
> M.
>
^ permalink raw reply
* RE: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Michael Kelley @ 2026-06-02 14:24 UTC (permalink / raw)
To: Aneesh Kumar K.V, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy (CS GROUP), Alexander Gordeev, Gerald Schaefer,
Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <yq5afr35sciu.fsf@kernel.org>
From: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Sent: Monday, June 1, 2026 11:05 PM
>
> Michael Kelley <mhklinux@outlook.com> writes:
>
> > From: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Sent: Thursday, May 21, 2026 9:28 PM
> >>
> >> Teach the atomic DMA pool code to distinguish between encrypted and
> >> unencrypted pools, and make pool allocation select the matching pool based
> >> on DMA attributes.
> >>
> >> Introduce a dma_gen_pool wrapper that records whether a pool is
> >> unencrypted, initialize that state when the atomic pools are created, and
> >> use it when expanding and resizing the pools. Update dma_alloc_from_pool()
> >> to take attrs and skip pools whose encrypted state does not match
> >> DMA_ATTR_CC_SHARED. Update dma_free_from_pool() accordingly.
> >>
> >> Also pass DMA_ATTR_CC_SHARED from the swiotlb atomic allocation path so
> >> decrypted swiotlb allocations are taken from the correct atomic pool.
> >>
> >> Tested-by: Jiri Pirko <jiri@nvidia.com>
> >> Reviewed-by: Mostafa Saleh <smostafa@google.com>
> >> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> >> ---
> >> drivers/iommu/dma-iommu.c | 2 +-
> >> include/linux/dma-map-ops.h | 2 +-
> >> kernel/dma/direct.c | 11 ++-
> >> kernel/dma/pool.c | 167 +++++++++++++++++++++++-------------
> >> kernel/dma/swiotlb.c | 7 +-
> >> 5 files changed, 123 insertions(+), 66 deletions(-)
> >>
> >
> > [snip]
> >
> >> +static __init struct dma_gen_pool *__dma_atomic_pool_init(struct dma_gen_pool *dma_pool,
> >> + size_t pool_size, gfp_t gfp)
> >> {
> >> - struct gen_pool *pool;
> >> int ret;
> >>
> >> - pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> >> - if (!pool)
> >> + dma_pool->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> >> + if (!dma_pool->pool)
> >> return NULL;
> >>
> >> - gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);
> >> + gen_pool_set_algo(dma_pool->pool, gen_pool_first_fit_order_align, NULL);
> >> +
> >> + /* if platform is using memory encryption atomic pools are by default decrypted. */
> >> + if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> >> + dma_pool->unencrypted = true;
> >> + else
> >> + dma_pool->unencrypted = false;
> >
> > I'm curious about the name of the "unencrypted" field in struct dma_gen_pool,
> > and similarly in Patch 7 of the series for the swiotlb struct io_tlb_pool and
> > struct io_tlb_mem. Up through v3 of this series, you used "decrypted", but
> > starting in v4 switched to "unencrypted".
> >
> > To me, the above "if" statement has some cognitive dissonance in that if
> > CC_ATTR_MEM_ENCRYPT is false (i.e., a normal VM), "unencrypted" is set
> > to false. But I think of memory in a normal VM as "unencrypted" since it
> > was never encrypted. A similar "if" statement occurs in your swiotlb changes.
> >
> > Two related concepts are captured by the field:
> > 1) Is some action needed to put the memory into the unencrypted state,
> > and to remove it from that state? This applies when assigning memory to the
> > pool, or freeing the memory in the pool.
> > 2) Is the memory currently in the unencrypted state? This applies when
> > allocating memory from the pool to a caller.
> >
> > It's hard to capture all that in a short field name. But I think I prefer "decrypted"
> > over "unencrypted". The former implies that some action was taken. It's a
> > little easier to think of a normal VM as *not* having decrypted memory. The
> > memory was never encrypted in the first place, so no decryption action was taken.
> >
> > Throughout the kernel, "decrypted" occurs much more frequently than
> > "unencrypted". We have set_memory_encrypted() and set_memory_decrypted()
> > that are "take action" names. But we also have force_dma_unencrypted(),
> > phys_to_dma_unencrypted(), and dma_addr_unencrypted(). So it's a bit
> > of a mess.
> >
> >
> > But maybe there's more background here that led to the change
> > between your v3 and v4.
> >
> > Michael
>
> The current APIs, phys_to_dma_unencrypted() and dma_addr_unencrypted(),
> are the reason I changed the pool attribute name from decrypted to
> unencrypted. The rationale was that nobody actually decrypted the
> memory; the memory was already in an unencrypted state.
>
> In other words, the DMA pool did not contain encrypted content that was
> later decrypted. Rather, the DMA pool itself was in an unencrypted
> state.
>
> IMHO, set_memory_decrypted()/set_memory_encrypted() is the right naming
> because those APIs describe an operation that transitions memory between
> states. In contrast, the pool attribute describes the state of the
> memory itself, which is why I used unencrypted rather than decrypted.
>
Except that in a normal VM, the "unencrypted" pool attribute does *not*
describe the state of the memory itself. In a normal VM, the memory is
unencrypted, but the "unencrypted" pool attribute is false. That
contradiction is the essence of my concern.
Michael
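To make the contradiction concrete, here is a small user-space model. It
is a sketch under stated assumptions: model_pool_init() mirrors the quoted
hunk (the flag follows the platform's memory-encryption support), and
model_memory_is_plaintext() encodes the claim made in this thread that
atomic pool memory is unencrypted in both a normal VM and a CoCo VM. Both
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct pool_flag_model {
	bool unencrypted;
};

/* Mirrors the quoted hunk: the flag follows CC_ATTR_MEM_ENCRYPT. */
static struct pool_flag_model model_pool_init(bool platform_mem_encrypt)
{
	struct pool_flag_model p = { .unencrypted = platform_mem_encrypt };

	return p;
}

/*
 * Actual state of the pool memory as described in the thread:
 * - normal VM: memory was never encrypted, so it is plaintext;
 * - CoCo VM: atomic pools are decrypted by default, so also plaintext.
 */
static bool model_memory_is_plaintext(bool platform_mem_encrypt)
{
	(void)platform_mem_encrypt;
	return true;
}
```

In this model the flag and the real memory state agree on a CoCo platform
and disagree in a normal VM, which is the case being objected to.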
^ permalink raw reply
* Re: [PATCH v14 14/44] arm64: RMI: Basic infrastructure for creating a realm.
From: Suzuki K Poulose @ 2026-06-02 14:49 UTC (permalink / raw)
To: Marc Zyngier, Steven Price
Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <86ik88ui0g.wl-maz@kernel.org>
Hi Marc
On 28/05/2026 08:10, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:22 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> Introduce the skeleton functions for creating and destroying a realm.
>> The IPA size requested is checked against what the RMM supports.
>>
>> The actual work of constructing the realm will be added in future
>> patches.
>
> Again, $SUBJECT doesn't reflect that this is purely a KVM patch.
>
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>> * Rebased and updated to RMM-v2.0-bet1.
>> * Auxiliary granules have been removed in RMM-v2.0-bet1
>> Changes since v12:
>> * Drop the RMM_PAGE_{SHIFT,SIZE} defines - the RMM is now configured to
>> be the same as the host's page size.
>> * Rework delegate/undelegate functions to use the new RMI range based
>> operations.
>> Changes since v11:
>> * Major rework to drop the realm configuration and make the
>> construction of realms implicit rather than driven by the VMM
>> directly.
>> * The code to create RDs, handle VMIDs etc is moved to later patches.
>> Changes since v10:
>> * Rename from RME to RMI.
>> * Move the stage2 cleanup to a later patch.
>> Changes since v9:
>> * Avoid walking the stage 2 page tables when destroying the realm -
>> the real ones are not accessible to the non-secure world, and the RMM
>> may leave junk in the physical pages when returning them.
>> * Fix an error path in realm_create_rd() to actually return an error value.
>> Changes since v8:
>> * Fix free_delegated_granule() to not call kvm_account_pgtable_pages();
>> a separate wrapper will be introduced in a later patch to deal with
>> RTTs.
>> * Minor code cleanups following review.
>> Changes since v7:
>> * Minor code cleanup following Gavin's review.
>> Changes since v6:
>> * Separate RMM RTT calculations from host PAGE_SIZE. This allows the
>> host page size to be larger than 4k while still communicating with an
>> RMM which uses 4k granules.
>> Changes since v5:
>> * Introduce free_delegated_granule() to replace many
>> undelegate/free_page() instances and centralise the comment on
>> leaking when the undelegate fails.
>> * Several other minor improvements suggested by reviews - thanks for
>> the feedback!
>> Changes since v2:
>> * Improved commit description.
>> * Improved return failures for rmi_check_version().
>> * Clear contents of PGD after it has been undelegated in case the RMM
>> left stale data.
>> * Minor changes to reflect changes in previous patches.
>> ---
>> arch/arm64/include/asm/kvm_emulate.h | 29 ++++++++++++++
>> arch/arm64/include/asm/kvm_rmi.h | 51 +++++++++++++++++++++++++
>> arch/arm64/kvm/arm.c | 12 ++++++
>> arch/arm64/kvm/mmu.c | 12 +++++-
>> arch/arm64/kvm/rmi.c | 57 ++++++++++++++++++++++++++++
>> 5 files changed, 159 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 5bf3d7e1d92c..82fd777bd9bb 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -688,4 +688,33 @@ static inline void vcpu_set_hcrx(struct kvm_vcpu *vcpu)
>> vcpu->arch.hcrx_el2 |= HCRX_EL2_EnASR;
>> }
>> }
>> +
>> +static inline bool kvm_is_realm(struct kvm *kvm)
>> +{
>> + if (static_branch_unlikely(&kvm_rmi_is_available))
>> + return kvm->arch.is_realm;
>> + return false;
>> +}
>> +
>> +static inline enum realm_state kvm_realm_state(struct kvm *kvm)
>> +{
>> + return READ_ONCE(kvm->arch.realm.state);
>> +}
>> +
>> +static inline void kvm_set_realm_state(struct kvm *kvm,
>> + enum realm_state new_state)
>> +{
>> + WRITE_ONCE(kvm->arch.realm.state, new_state);
>> +}
>> +
>> +static inline bool kvm_realm_is_created(struct kvm *kvm)
>> +{
>> + return kvm_is_realm(kvm) && kvm_realm_state(kvm) != REALM_STATE_NONE;
>> +}
>> +
>> +static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
>> +{
>> + return false;
>> +}
>> +
>> #endif /* __ARM64_KVM_EMULATE_H__ */
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>> index 4936007947fd..9de34983ee52 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -6,12 +6,63 @@
>> #ifndef __ASM_KVM_RMI_H
>> #define __ASM_KVM_RMI_H
>>
>> +#include <asm/rmi_smc.h>
>> +
>> +/**
>> + * enum realm_state - State of a Realm
>> + */
>> +enum realm_state {
>> + /**
>> + * @REALM_STATE_NONE:
>> + * Realm has not yet been created. rmi_realm_create() has not
>> + * yet been called.
>> + */
>> + REALM_STATE_NONE,
>> + /**
>> + * @REALM_STATE_NEW:
>> + * Realm is under construction, rmi_realm_create() has been
>> + * called, but it is not yet activated. Pages may be populated.
>> + */
>> + REALM_STATE_NEW,
>> + /**
>> + * @REALM_STATE_ACTIVE:
>> + * Realm has been created and is eligible for execution with
>> + * rmi_rec_enter(). Pages may no longer be populated with
>> + * rmi_data_create().
>> + */
>> + REALM_STATE_ACTIVE,
>> + /**
>> + * @REALM_STATE_DYING:
>> + * Realm is in the process of being destroyed or has already been
>> + * destroyed.
>> + */
>> + REALM_STATE_DYING,
>> + /**
>> + * @REALM_STATE_DEAD:
>> + * Realm has been destroyed.
>> + */
>> + REALM_STATE_DEAD
>> +};
>
> What is the ABI status of this state? Is it purely internal to KVM? Or
> is it something that the RMM actively tracks?
The states are in line with what the RMM maintains for the Realm state
(Section A2.2.5 Realm Lifecycle), except that:
1. REALM_STATE_DYING is purely a KVM-internal state indicating that we
are in the process of destroying the Realm and that no further requests
need to be serviced.
2. We don't track the REALM_SYSTEM_OFF and REALM_ZOMBIE states separately,
because:
a) We always TERMINATE the Realm just before the DESTROY.
b) SYSTEM_OFF naturally triggers the teardown path, leading to
DYING.
>
>> +
>> /**
>> * struct realm - Additional per VM data for a Realm
>> + *
>> + * @state: The lifetime state machine for the realm
>> + * @rd: Kernel mapping of the Realm Descriptor (RD)
>> + * @params: Parameters for the RMI_REALM_CREATE command
>> + * @ia_bits: Number of valid Input Address bits in the IPA
>> */
>> struct realm {
>> + enum realm_state state;
>> + void *rd;
>
> Why is this void? Doesn't it have a proper type?
Not really. This is an object that the RMM manages (the Realm Descriptor)
in the Realm world. We use it as a parameter to address the Realm.
>
>> + struct realm_params *params;
>> + unsigned int ia_bits;
>
> Consider reordering this structure to avoid holes.
>
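As an illustration of the reordering suggestion above, the sketch below
compares the posted field order with one that groups the pointers first.
The types are simplified stand-ins (void * for struct realm_params *, a
two-value enum for enum realm_state), so the sizes are only indicative; on
a typical LP64 ABI the posted order is expected to occupy 32 bytes and the
reordered one 24.

```c
#include <assert.h>
#include <stddef.h>

enum realm_state_model { MODEL_STATE_NONE, MODEL_STATE_NEW };

/* Field order as in the quoted patch. */
struct realm_as_posted {
	enum realm_state_model state;	/* 4 bytes, then a hole on LP64 */
	void *rd;
	void *params;			/* stands in for struct realm_params * */
	unsigned int ia_bits;		/* 4 bytes, then tail padding on LP64 */
};

/* Pointers first, then the two 4-byte members side by side. */
struct realm_reordered {
	void *rd;
	void *params;
	enum realm_state_model state;
	unsigned int ia_bits;
};

/* Bytes saved by the reordering on the current ABI (may be 0 on ILP32). */
static size_t model_padding_saved(void)
{
	return sizeof(struct realm_as_posted) - sizeof(struct realm_reordered);
}
```

In the kernel tree, pahole is the usual tool for checking the real layout
of struct realm.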
>> };
>>
>> void kvm_init_rmi(void);
>> +u32 kvm_realm_ipa_limit(void);
>
> The use of 'realm' is confusing. This is not a per-realm property, but
> something global. I'd rather reserve the term 'realm' for CCA VMs (cue
> the two prototypes below).
Agreed. Perhaps kvm_rmm_ipa_limit()?
>
>> +
>> +int kvm_init_realm(struct kvm *kvm);
>> +void kvm_destroy_realm(struct kvm *kvm);
>>
>> #endif /* __ASM_KVM_RMI_H */
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 247e03b33035..18251e561524 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -264,6 +264,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>
>> bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES);
>>
>> + /* Initialise the realm bits after the generic bits are enabled */
>> + if (kvm_is_realm(kvm)) {
>> + ret = kvm_init_realm(kvm);
>> + if (ret)
>> + goto err_uninit_mmu;
>> + }
>> +
>> return 0;
>>
>> err_uninit_mmu:
>> @@ -326,6 +333,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>> kvm_unshare_hyp(kvm, kvm + 1);
>>
>> kvm_arm_teardown_hypercalls(kvm);
>> + if (kvm_is_realm(kvm))
>> + kvm_destroy_realm(kvm);
>> }
>>
>> static bool kvm_has_full_ptr_auth(void)
>> @@ -486,6 +495,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> else
>> r = kvm_supports_cacheable_pfnmap();
>> break;
>> + case KVM_CAP_ARM_RMI:
>> + r = static_key_enabled(&kvm_rmi_is_available);
>> + break;
>>
>> default:
>> r = 0;
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index d089c107d9b7..ba8286472286 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -877,10 +877,14 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>>
>> static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
>> {
>> + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>> u32 kvm_ipa_limit = get_kvm_ipa_limit();
>> u64 mmfr0, mmfr1;
>> u32 phys_shift;
>>
>> + if (kvm_is_realm(kvm))
>> + kvm_ipa_limit = kvm_realm_ipa_limit();
>> +
>> phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
>> if (is_protected_kvm_enabled()) {
>> phys_shift = kvm_ipa_limit;
>> @@ -974,6 +978,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>> return -EINVAL;
>> }
>>
>> + mmu->arch = &kvm->arch;
>> +
>> err = kvm_init_ipa_range(mmu, type);
>> if (err)
>> return err;
>> @@ -982,7 +988,6 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>> if (!pgt)
>> return -ENOMEM;
>>
>> - mmu->arch = &kvm->arch;
>
> Why moving this init?
Because we need to know the "kvm" instance in kvm_init_ipa_range() to
detect the limit that applies to Realms, and kvm_s2_mmu_to_kvm() derives
it from mmu->arch.
>
>> err = KVM_PGT_FN(kvm_pgtable_stage2_init)(pgt, mmu, &kvm_s2_mm_ops);
>> if (err)
>> goto out_free_pgtable;
>> @@ -1114,7 +1119,10 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>> write_unlock(&kvm->mmu_lock);
>>
>> if (pgt) {
>> - kvm_stage2_destroy(pgt);
>> + if (!kvm_is_realm(kvm))
>> + kvm_stage2_destroy(pgt);
>> + else
>> + kvm_pgtable_stage2_destroy_pgd(pgt);
>
> Why can't you make kvm_stage2_destroy() do the right thing? Surely the
> PTs have to be reclaimed one way or another.
Actually yes, we could make it work. We need to skip walking the page
table for Realms. We may be able to do the checks via
pgt->mmu->arch->kvm and skip the walking for Realms. (The S2 is
unmapped and torn down before the RD is destroyed in kvm_destroy_realm().
We can't rely on the contents of the PGDs being zero, e.g. with MEC.)
>
>> kfree(pgt);
>> }
>> }
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index 6e28b669ded2..f51ec667445e 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -5,6 +5,8 @@
>>
>> #include <linux/kvm_host.h>
>>
>> +#include <asm/kvm_emulate.h>
>> +#include <asm/kvm_mmu.h>
>> #include <asm/kvm_pgtable.h>
>> #include <asm/rmi_cmds.h>
>> #include <asm/virt.h>
>> @@ -14,6 +16,61 @@ static bool rmi_has_feature(unsigned long feature)
>> return !!u64_get_bits(rmm_feat_reg0, feature);
>> }
>>
>> +u32 kvm_realm_ipa_limit(void)
>> +{
>> + return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
>> +}
>> +
>> +void kvm_destroy_realm(struct kvm *kvm)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> + size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
>> +
>> + if (realm->params) {
>> + free_page((unsigned long)realm->params);
>> + realm->params = NULL;
>> + }
>> +
>> + if (!kvm_realm_is_created(kvm))
>> + return;
>> +
>> + kvm_set_realm_state(kvm, REALM_STATE_DYING);
>> +
>> + write_lock(&kvm->mmu_lock);
>> + kvm_stage2_unmap_range(&kvm->arch.mmu, 0,
>> + BIT(realm->ia_bits - 1), true);
>> + write_unlock(&kvm->mmu_lock);
>> +
>> + if (realm->rd) {
>> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
>> +
>> + if (WARN_ON(rmi_realm_terminate(rd_phys)))
>> + return;
>> +
>> + if (WARN_ON(rmi_realm_destroy(rd_phys)))
>> + return;
>> + free_delegated_page(rd_phys);
>> + realm->rd = NULL;
>> + }
>> +
>> + if (WARN_ON(rmi_undelegate_range(kvm->arch.mmu.pgd_phys, pgd_size)))
>> + return;
>> +
>> + kvm_set_realm_state(kvm, REALM_STATE_DEAD);
>> +
>> + /* Now that the Realm is destroyed, free the entry level RTTs */
>> + kvm_free_stage2_pgd(&kvm->arch.mmu);
>> +}
>
> This really needs documentation: what happens at each stage? What
> memory is reclaimed when?
Agreed.
>
> But even more importantly, why is this built in a completely parallel
> way, potentially deviating from the existing KVM S2 management?
The RMM requires that a Realm is not live at the time of REALM_DESTROY
(see section A2.2.4 Realm Liveness), i.e., all RECs are destroyed and the
root RTTs are wiped clean (no live mappings) before the RD is destroyed.
So we need to make sure all of this is done at Realm destroy. Hence we
delay the kvm_free_stage2_pgd() until we destroy the RD.
Does that help? Maybe we could improve the comments around it.
Suzuki
> Thanks,
>
> M.
>
^ permalink raw reply
* SVSM Development Call June 3rd, 2026
From: Jörg Rödel @ 2026-06-02 16:07 UTC (permalink / raw)
To: coconut-svsm, linux-coco
Hi,
Here is the call for agenda items for this week's SVSM development call. Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.
We will use the LF Zoom instance. Details of the meeting can be found in our
governance repository at:
https://github.com/coconut-svsm/governance
The link to the COCONUT-SVSM calendar is:
https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week
The meeting will be recorded and the recording eventually published.
Regards,
Jörg
^ permalink raw reply
* [PATCH v6 0/6] Add RMPOPT support.
From: Ashish Kalra @ 2026-06-02 20:00 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
From: Ashish Kalra <ashish.kalra@amd.com>
In the SEV-SNP architecture, hypervisor and non-SNP guests are subject
to RMP checks on writes to provide integrity of SEV-SNP guest memory.
The RMPOPT architecture enables optimizations whereby the RMP checks
can be skipped if 1GB regions of memory are known to not contain any
SNP guest memory.
RMPOPT is a new instruction designed to minimize the performance
overhead of RMP checks for the hypervisor and non-SNP guests.
The RMPOPT instruction currently supports two functions. For the
verify-and-report-status function, the CPU reads the RMP contents and
verifies that the entire 1GB region starting at the provided SPA is
HV-owned, i.e., that all RMP entries in the region are not in the
assigned state. It then updates the RMPOPT table accordingly to indicate
whether the optimization has been enabled, and reports to software
whether the optimization was successful.
For the report-status function, the CPU returns the optimization
status for the 1GB region.
The RMPOPT table is managed by a combination of software and hardware.
Software uses the RMPOPT instruction to set bits in the table,
indicating that regions of memory are entirely HV-owned. Hardware
automatically clears bits in the RMPOPT table when RMP contents are
changed by the RMPUPDATE instruction.
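The split between software-set and hardware-cleared bits can be pictured
with the following user-space model. It is only a sketch: the bitmap
layout, the per-region assigned-page counter and all model_*() names are
assumptions made for illustration and do not describe the real RMPOPT
table format or instruction encoding. The 2048-region size corresponds to
the 2TB coverage mentioned in this cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_REGION_SHIFT 30u		/* 1GB regions */
#define MODEL_NR_REGIONS   2048u	/* 2TB of RAM / 1GB per region */

struct rmpopt_model {
	uint8_t  optimized[MODEL_NR_REGIONS / 8];	/* one bit per region */
	uint32_t assigned_pages[MODEL_NR_REGIONS];	/* hypothetical counter */
};

static unsigned int model_region(uint64_t spa)
{
	uint64_t region = spa >> MODEL_REGION_SHIFT;

	assert(region < MODEL_NR_REGIONS);
	return (unsigned int)region;
}

/* "Report status": return the current optimization bit for the region. */
static bool model_rmpopt_report(const struct rmpopt_model *m, uint64_t spa)
{
	unsigned int r = model_region(spa);

	return (m->optimized[r / 8] >> (r % 8)) & 1u;
}

/* "Verify and report status": set the bit only if nothing is assigned. */
static bool model_rmpopt_verify(struct rmpopt_model *m, uint64_t spa)
{
	unsigned int r = model_region(spa);

	if (m->assigned_pages[r] != 0)
		return false;
	m->optimized[r / 8] |= (uint8_t)(1u << (r % 8));
	return true;
}

/* Models RMPUPDATE assigning a page: the region's bit is cleared. */
static void model_rmpupdate_assign(struct rmpopt_model *m, uint64_t spa)
{
	unsigned int r = model_region(spa);

	m->assigned_pages[r]++;
	m->optimized[r / 8] &= (uint8_t)~(1u << (r % 8));
}
```

In this model, a region that has lost its bit stays unoptimized until
software runs the verify function again after the pages are returned,
which is what the re-optimization work in this series is for.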
For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.
As SNP is enabled by default, the hypervisor and non-SNP guests are
subject to RMP write checks to provide integrity of SNP guest memory.
This patch series adds support to enable RMP optimizations for up to
2TB of system RAM across the system and allows RMPUPDATE to disable
those optimizations as SNP guests are launched.
Support for RAM larger than 2TB will be added in a follow-on series.
This series also introduces support to re-enable RMP optimizations
during SNP guest termination, after guest pages have been converted
back to shared.
RMP optimizations are performed asynchronously by queuing work on a
dedicated workqueue after a 10 second delay.
Delaying work allows batching of multiple SNP guest terminations.
Once 1GB hugetlb guest_memfd support is merged, support for
re-enabling RMPOPT optimizations during 1GB page cleanup will be added
in a follow-on series.
Additionally, add a debugfs interface to report per-CPU RMPOPT status
across all system RAM.
v6:
- Drop wrmsrq_on_cpus() helper; use for_each_cpu() with wrmsrq_on_cpu()
instead, as RMPOPT_BASE MSR programming is not performance-critical.
- Rewrite rmpopt_work_handler() leader selection to use a local
follower_mask copy instead of modifying the global rmpopt_cpumask.
This eliminates the current_cpu_cleared tracking and the restore at
the end, and removes the need for synchronization comments about
transient cpumask inconsistency.
- Add three-way leader selection in rmpopt_work_handler():
1. Current CPU is a primary thread in cpumask: run leader locally.
2. Current CPU is a sibling thread whose primary is in cpumask:
run leader locally (RMPOPT_BASE MSR is per-core), remove the
primary from followers via cpumask_andnot(topology_sibling_cpumask).
3. Current CPU's core has no RMPOPT_BASE MSR programmed: pick an
explicit leader via cpumask_first() + smp_call_function_single()
to avoid #UD, with cpus_read_lock() around the IPI loop.
- Add WARN_ON_ONCE guard for empty cpumask in the explicit leader
fallback path, with migrate_enable() before goto out.
- Add .llseek = seq_lseek to rmpopt_table_fops for consistency with
other seq_file-based debugfs files and to support tools like "less".
- Change debugfs file permissions from 0444 to 0400 to restrict access
to root only.
- Add comment in rmpopt_table_seq_show() explaining why cpu_online_mask
is safe: RMPOPT_BASE MSR is per-core and snp_prepare() ensures all
CPUs are online when the MSR is programmed.
Sashiko AI code review identified several of the above issues.
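The three-way leader selection listed in the v6 notes can be modelled as
follows. This is a user-space sketch, not the code from the series: the
64-bit mask, the primary_of[] topology table and model_pick_leader() are
hypothetical, and removing the remote leader from the follower set in
case 3 is an assumption the changelog does not spell out.

```c
#include <assert.h>
#include <stdint.h>

enum leader_case {
	LEADER_NONE          = 0,	/* empty cpumask, nothing to do */
	LEADER_LOCAL_PRIMARY = 1,
	LEADER_LOCAL_SIBLING = 2,
	LEADER_REMOTE        = 3,
};

/*
 * cpumask:    bit per primary thread whose core has RMPOPT_BASE programmed
 * primary_of: primary_of[cpu] is the primary thread of cpu's core
 * cur:        CPU the work handler is currently running on
 */
static enum leader_case model_pick_leader(uint64_t cpumask,
					  const int *primary_of, int cur,
					  int *leader, uint64_t *followers)
{
	int primary = primary_of[cur];

	assert(cur >= 0 && cur < 64 && primary >= 0 && primary < 64);

	/* Case 1: current CPU is a primary thread in the mask. */
	if (cpumask & (1ull << cur)) {
		*leader = cur;
		*followers = cpumask & ~(1ull << cur);
		return LEADER_LOCAL_PRIMARY;
	}

	/* Case 2: sibling of a primary in the mask; the MSR is per-core. */
	if (cpumask & (1ull << primary)) {
		*leader = cur;
		*followers = cpumask & ~(1ull << primary);
		return LEADER_LOCAL_SIBLING;
	}

	/* Case 3: pick the first CPU in the mask as an explicit leader. */
	for (int cpu = 0; cpu < 64; cpu++) {
		if (cpumask & (1ull << cpu)) {
			*leader = cpu;
			*followers = cpumask & ~(1ull << cpu);
			return LEADER_REMOTE;
		}
	}

	/* Empty mask: corresponds to the WARN_ON_ONCE guard in the notes. */
	*leader = -1;
	*followers = 0;
	return LEADER_NONE;
}
```

Working on a local follower set, as in the v6 rewrite, means the shared
mask is never modified, so there is nothing to restore afterwards.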
v5:
- Introduce rmpopt_cleanup() to tear down workqueue, debugfs, cpumask,
and MSR state, called from snp_shutdown().
- Introduce rmpopt_wq_mutex to serialize snp_setup_rmpopt(),
snp_rmpopt_all_physmem(), and rmpopt_cleanup().
- Introduce rmpopt_show_mutex to serialize debugfs reporting of
rmpopt_report_cpumask.
- Move snp_rmpopt_all_physmem() call after SNP DECOMMISSION during
guest shutdown.
- Use migrate_disable()/migrate_enable() for CPU pinning in the
rmpopt_work_handler() leader loop to maintain CPU affinity without
disabling preemption for the entire RMPOPT scan.
- Add cpus_read_lock()/cpus_read_unlock() around the follower
on_each_cpu_mask() loop in rmpopt_work_handler().
- Guard snp_setup_rmpopt() against re-initialization when
SNP_SHUTDOWN_EX with x86_snp_shutdown=0 skips rmpopt_cleanup()
but clears snp_initialized, preventing workqueue and resource
leaks on repeated init/shutdown cycles.
- Replace setup_clear_cpu_cap() with pr_err() on alloc_workqueue()
failure in snp_setup_rmpopt(), as setup_clear_cpu_cap() cannot be
used after alternatives are patched; callers check rmpopt_wq != NULL
as the runtime guard instead.
- Add pr_info() when RMPOPT coverage is capped at 2TB.
- Add comments noting CPU hotplug is not supported with SNP enabled
and only online primary threads are covered by rmpopt_cpumask.
- Add comment in setup_rmptable() noting Segmented RMP must be
enabled to enable RMPOPT.
- Simplify cpumask setup loop to set if primary thread rather than
skip if not primary.
- Improve grammar and clarity in snp_setup_rmpopt() comments.
- Add Reviewed-by tags.
Sashiko AI code review identified several of the above issues.
v4:
- Add new wrmsrq_on_cpus() helper to write the same u64 value to a
per-CPU MSR across a cpumask without per-cpu struct allocation
overhead.
- Rename configure_and_enable_rmpopt() to snp_setup_rmpopt().
- Use wrmsrq_on_cpus() instead of wrmsrq_on_cpu() loop for
programming RMPOPT_BASE MSRs.
- Add setup_clear_cpu_cap(X86_FEATURE_RMPOPT) if segmented RMP
setup fails or workqueue allocation fails.
- Add X86_FEATURE_RMPOPT feature clear logic in amd_cc_platform_clear()
for CC_ATTR_HOST_SEV_SNP.
- All of the above allow checking for only X86_FEATURE_RMPOPT for both
RMPOPT setup/enable and RMP re-optimizations.
- Rename snp_perform_rmp_optimization() to snp_rmpopt_all_physmem().
- Split rmpopt() into rmpopt() and rmpopt_smp() for SMP callback use.
- Introduce separate rmpopt_report_cpumask for debugfs reporting,
distinct from rmpopt_cpumask used for primary thread tracking.
- Remove snp_perform_rmp_optimization() call from __sev_snp_init_locked()
and instead setup and enable RMPOPT after SNP is enabled and
initialized.
v3:
- Drop all RMPOPT kthread support and introduce a custom,
dedicated workqueue to schedule delayed and asynchronous RMPOPT work.
- Drop the guest_memfd inode cleanup interface and add support to
re-enable RMP optimizations during guest shutdown using the
asynchronous and delayed workqueue interface.
- Introduce new __rmpopt() helper and rmpopt() and
rmpopt_report_status() wrappers on top which use rax and rcx
parameters to closely match RMPOPT specs.
- Use new optimized RMPOPT loop to issue RMPOPT instructions on all
system RAM up to 2TB and all CPUs, by optimizing each range on one CPU
first, then letting other CPUs execute RMPOPT in parallel so they can skip
most work as the range has already been optimized.
- Also add support for running the optimized RMPOPT loop only on
one thread per core.
- Replace all PUD_SIZE references with SZ_1G to conform to 1GB regions
as specified by RMPOPT specifications and not be dependent on PUD_SIZE
which makes the RMPOPT patch-set independent of x86 page table sizes.
- Use wrmsrq_on_cpu() to program the RMPOPT_BASE MSR registers on
all CPUs, which removes the ugly casting needed to use on_each_cpu_mask().
- Fix inline comments and patch commit messages.
v2:
- Drop all NUMA and Socket configuration and enablement support and
enable RMPOPT support for up to 2TB of system RAM.
- Drop get_cpumask_of_primary_threads() and enable per-core RMPOPT
base MSRs and issue RMPOPT instruction on all CPUs.
- Drop the configfs interface to manually re-enable RMP optimizations.
- Add new guest_memfd cleanup interface to automatically re-enable
RMP optimizations during guest shutdown.
- Include references to the public RMPOPT documentation.
- Move debugfs directory for RMPOPT under the architecture-specific
parent directory.
Ashish Kalra (6):
x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag
x86/sev: Initialize RMPOPT configuration MSRs
x86/sev: Add support to perform RMP optimizations asynchronously
x86/sev: Add interface to re-enable RMP optimizations.
KVM: SEV: Perform RMP optimizations on SNP guest shutdown
x86/sev: Add debugfs support for RMPOPT
arch/x86/coco/core.c | 1 +
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/include/asm/msr-index.h | 3 +
arch/x86/include/asm/sev.h | 4 +
arch/x86/kernel/cpu/scattered.c | 1 +
arch/x86/kvm/svm/sev.c | 2 +
arch/x86/virt/svm/sev.c | 398 ++++++++++++++++++++++++++++-
drivers/crypto/ccp/sev-dev.c | 3 +
8 files changed, 412 insertions(+), 2 deletions(-)
--
2.43.0
* [PATCH v6 1/6] x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag
From: Ashish Kalra @ 2026-06-02 20:00 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780427587.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
Add a flag indicating whether the RMPOPT instruction is supported.
RMPOPT is a new instruction that reduces the performance overhead of
RMP checks for the hypervisor and non-SNP guests by allowing those
checks to be skipped when 1-GB memory regions are known to contain no
SEV-SNP guest memory.
For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.
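As an illustration only, the enumeration added to scattered.c in this patch (leaf 0x80000025, EDX bit 0) can be decoded from an already-read EDX value as sketched below. The helper names cpuid_reg_bit() and edx_reports_rmpopt() are hypothetical; the kernel itself consumes the table entry through its scattered-feature code rather than a helper like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the cpuid_bits[] entry added by this patch. */
#define RMPOPT_CPUID_LEAF	0x80000025u
#define RMPOPT_CPUID_EDX_BIT	0

/* Hypothetical helper: test one bit of a 32-bit CPUID output register. */
static bool cpuid_reg_bit(uint32_t reg, unsigned int bit)
{
	assert(bit < 32);
	return (reg >> bit) & 1u;
}

/* Hypothetical helper: edx is the EDX output of CPUID leaf RMPOPT_CPUID_LEAF. */
static bool edx_reports_rmpopt(uint32_t edx)
{
	return cpuid_reg_bit(edx, RMPOPT_CPUID_EDX_BIT);
}
```

Note that the host may still clear the flag later, for example when segmented RMP is not enabled, so the CPUID bit is necessary but not sufficient.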
Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/kernel/cpu/scattered.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 1d506e5d6f46..794cc96b8493 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
#define X86_FEATURE_K8 ( 3*32+ 4) /* Opteron, Athlon64 */
#define X86_FEATURE_ZEN5 ( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
#define X86_FEATURE_ZEN6 ( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free ( 3*32+ 7) */
+#define X86_FEATURE_RMPOPT ( 3*32+ 7) /* Support for AMD RMPOPT instruction */
#define X86_FEATURE_CONSTANT_TSC ( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
#define X86_FEATURE_UP ( 3*32+ 9) /* "up" SMP kernel running on UP */
#define X86_FEATURE_ART ( 3*32+10) /* "art" Always running timer (ART) */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 937129ce6a96..021c0bf22de2 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -67,6 +67,7 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_PERFMON_V2, CPUID_EAX, 0, 0x80000022, 0 },
{ X86_FEATURE_AMD_LBR_V2, CPUID_EAX, 1, 0x80000022, 0 },
{ X86_FEATURE_AMD_LBR_PMC_FREEZE, CPUID_EAX, 2, 0x80000022, 0 },
+ { X86_FEATURE_RMPOPT, CPUID_EDX, 0, 0x80000025, 0 },
{ X86_FEATURE_AMD_HTR_CORES, CPUID_EAX, 30, 0x80000026, 0 },
{ 0, 0, 0, 0, 0 }
};
--
2.43.0
* [PATCH v6 2/6] x86/sev: Initialize RMPOPT configuration MSRs
From: Ashish Kalra @ 2026-06-02 20:01 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780427587.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
The new RMPOPT instruction helps manage per-CPU RMP optimization
structures inside the CPU. It takes a 1GB-aligned physical address
and either returns the status of the optimizations or tries to enable
the optimizations.
Per-CPU RMPOPT tables support at most 2 TB of addressable memory for
RMP optimizations.
Initialize the per-CPU RMPOPT table base to the starting physical
address. This enables RMP optimization for up to 2 TB of system RAM on
all CPUs.
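A small userspace sketch of how the RMPOPT_BASE value is composed may help here: the lowest physical address is aligned down to 1 GB and the enable bit (bit 0) is OR-ed in. The function names align_down_1g() and rmpopt_base_value() are hypothetical; SZ_1G and the enable bit position follow the definitions used in this patch.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1G		(1ULL << 30)
#define RMPOPT_ENABLE	(1ULL << 0)	/* bit 0, as MSR_AMD64_RMPOPT_ENABLE */

/* Hypothetical helper: round a physical address down to a 1 GB boundary. */
static uint64_t align_down_1g(uint64_t pa)
{
	return pa & ~(SZ_1G - 1);
}

/* Hypothetical helper: value to write to RMPOPT_BASE for a given lowest PA. */
static uint64_t rmpopt_base_value(uint64_t lowest_pa)
{
	uint64_t start = align_down_1g(lowest_pa);

	/* A 1 GB-aligned address never collides with the enable bit. */
	assert((start & RMPOPT_ENABLE) == 0);

	return start | RMPOPT_ENABLE;
}
```

For a system whose lowest usable address is 0x1000, the aligned base is 0 and the resulting MSR value is 0x1.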
Additionally, add support to set up and enable RMPOPT once SNP is
enabled and initialized.
Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/coco/core.c | 1 +
arch/x86/include/asm/msr-index.h | 3 ++
arch/x86/include/asm/sev.h | 2 +
arch/x86/virt/svm/sev.c | 65 +++++++++++++++++++++++++++++++-
drivers/crypto/ccp/sev-dev.c | 3 ++
5 files changed, 73 insertions(+), 1 deletion(-)
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 989ca9f72ba3..7fdef00ca8f2 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -172,6 +172,7 @@ static void amd_cc_platform_clear(enum cc_attr attr)
switch (attr) {
case CC_ATTR_HOST_SEV_SNP:
cc_flags.host_sev_snp = 0;
+ setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
break;
default:
break;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 86554de9a3f5..28540744f1eb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -761,6 +761,9 @@
#define MSR_AMD64_SEG_RMP_ENABLED_BIT 0
#define MSR_AMD64_SEG_RMP_ENABLED BIT_ULL(MSR_AMD64_SEG_RMP_ENABLED_BIT)
#define MSR_AMD64_RMP_SEGMENT_SHIFT(x) (((x) & GENMASK_ULL(13, 8)) >> 8)
+#define MSR_AMD64_RMPOPT_BASE 0xc0010139
+#define MSR_AMD64_RMPOPT_ENABLE_BIT 0
+#define MSR_AMD64_RMPOPT_ENABLE BIT_ULL(MSR_AMD64_RMPOPT_ENABLE_BIT)
#define MSR_SVSM_CAA 0xc001f000
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 594cfa19cbd4..6fd72a44a51e 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
__snp_leak_pages(pfn, pages, true);
}
int snp_prepare(void);
+void snp_setup_rmpopt(void);
void snp_shutdown(void);
#else
static inline bool snp_probe_rmptable_info(void) { return false; }
@@ -680,6 +681,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
static inline void kdump_sev_callback(void) { }
static inline void snp_fixup_e820_tables(void) {}
static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_setup_rmpopt(void) {}
static inline void snp_shutdown(void) {}
#endif
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 8bcdce98f6dc..089c9a14edc7 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -124,6 +124,9 @@ static void *rmp_bookkeeping __ro_after_init;
static u64 probed_rmp_base, probed_rmp_size;
+static cpumask_t rmpopt_cpumask;
+static phys_addr_t rmpopt_pa_start;
+
static LIST_HEAD(snp_leaked_pages_list);
static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
@@ -488,9 +491,13 @@ static bool __init setup_segmented_rmptable(void)
static bool __init setup_rmptable(void)
{
if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
- if (!setup_segmented_rmptable())
+ if (!setup_segmented_rmptable()) {
+ setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
return false;
+ }
} else {
+ /* Note that Segmented RMP must be enabled to enable RMPOPT. */
+ setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
if (!setup_contiguous_rmptable())
return false;
}
@@ -555,6 +562,21 @@ int snp_prepare(void)
}
EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
+static void rmpopt_cleanup(void)
+{
+ int cpu;
+
+ cpus_read_lock();
+
+ for_each_cpu(cpu, &rmpopt_cpumask)
+ wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, 0);
+
+ cpus_read_unlock();
+
+ cpumask_clear(&rmpopt_cpumask);
+ rmpopt_pa_start = 0;
+}
+
void snp_shutdown(void)
{
u64 syscfg;
@@ -563,11 +585,52 @@ void snp_shutdown(void)
if (syscfg & MSR_AMD64_SYSCFG_SNP_EN)
return;
+ rmpopt_cleanup();
+
clear_rmp();
on_each_cpu(mfd_reconfigure, NULL, 1);
}
EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
+void snp_setup_rmpopt(void)
+{
+ u64 rmpopt_base;
+ int cpu;
+
+ if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
+ return;
+
+ cpus_read_lock();
+
+ /*
+ * The RMPOPT_BASE MSR is per-core, so only one thread per core needs
+ * to set up the RMPOPT_BASE MSR.
+ *
+ * Note: only online primary threads are included. If a core's
+ * primary thread is offline, that core is not covered. CPU hotplug
+ * is not currently supported with SNP enabled.
+ */
+
+ for_each_online_cpu(cpu)
+ if (topology_is_primary_thread(cpu))
+ cpumask_set_cpu(cpu, &rmpopt_cpumask);
+
+ rmpopt_pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
+ rmpopt_base = rmpopt_pa_start | MSR_AMD64_RMPOPT_ENABLE;
+
+ /*
+ * Per-CPU RMPOPT tables support at most 2 TB of addressable memory
+ * for RMP optimizations. Initialize the per-CPU RMPOPT table base
+ * to the starting physical address to enable RMP optimizations for
+ * up to 2 TB of system RAM on all CPUs.
+ */
+ for_each_cpu(cpu, &rmpopt_cpumask)
+ wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
+
+ cpus_read_unlock();
+}
+EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
+
/*
* Do the necessary preparations which are verified by the firmware as
* described in the SNP_INIT_EX firmware command description in the SNP
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 78f98aee7a66..217b6b19802e 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1478,6 +1478,9 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
}
snp_hv_fixed_pages_state_update(sev, HV_FIXED);
+
+ snp_setup_rmpopt();
+
sev->snp_initialized = true;
dev_dbg(sev->dev, "SEV-SNP firmware initialized, SEV-TIO is %s\n",
data.tio_en ? "enabled" : "disabled");
--
2.43.0
* [PATCH v6 3/6] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ashish Kalra @ 2026-06-02 20:01 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780427587.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
When SEV-SNP is enabled, all writes to memory are checked to ensure
integrity of SNP guest memory. This imposes performance overhead on the
whole system.
RMPOPT is a new instruction that minimizes the performance overhead of
RMP checks on the hypervisor and on non-SNP guests by allowing RMP
checks to be skipped for 1GB regions of memory that are known not to
contain any SEV-SNP guest memory.
Add support for performing RMP optimizations asynchronously using a
dedicated workqueue.
Enable RMPOPT optimizations for up to 2TB of system RAM starting from
the lowest physical memory address aligned down to a 1GB boundary at
RMP initialization time. RMP checks can initially be skipped for 1GB
memory ranges that do not contain SEV-SNP guest memory (excluding
preassigned pages such as the RMP table and firmware pages). As SNP
guests are launched, RMPUPDATE will disable the corresponding RMPOPT
optimizations.
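The scan window described above can be sketched in userspace as follows: the start is aligned down to 1 GB, the end is aligned up to 1 GB (as the diff below does for max_pfn), and the span is capped at 2 TB. The names struct rmpopt_window, rmpopt_compute_window() and rmpopt_region_count() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1G	(1ULL << 30)
#define SZ_2T	(1ULL << 41)

/* Hypothetical type: [start, end) range that the RMPOPT loop walks. */
struct rmpopt_window {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical helper: derive the window from the lowest and highest PA. */
static struct rmpopt_window rmpopt_compute_window(uint64_t min_pa, uint64_t max_pa)
{
	struct rmpopt_window w;

	assert(max_pa >= min_pa);

	w.start = min_pa & ~(SZ_1G - 1);			/* align down */
	w.end = (max_pa + SZ_1G - 1) & ~(SZ_1G - 1);	/* align up */

	/* RMPOPT tables cover at most 2 TB starting at the base. */
	if (w.end - w.start > SZ_2T)
		w.end = w.start + SZ_2T;

	return w;
}

/* Number of 1 GB regions, i.e. RMPOPT invocations per CPU per pass. */
static uint64_t rmpopt_region_count(struct rmpopt_window w)
{
	return (w.end - w.start) / SZ_1G;
}
```

A host with 4 TB of RAM starting at address 0 therefore gets a window of 2048 regions, and memory above the 2 TB mark is left unoptimized.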
Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/virt/svm/sev.c | 196 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 193 insertions(+), 3 deletions(-)
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 089c9a14edc7..d7e40a5fe5ca 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -19,6 +19,7 @@
#include <linux/iommu.h>
#include <linux/amd-iommu.h>
#include <linux/nospec.h>
+#include <linux/workqueue.h>
#include <asm/sev.h>
#include <asm/processor.h>
@@ -125,7 +126,18 @@ static void *rmp_bookkeeping __ro_after_init;
static u64 probed_rmp_base, probed_rmp_size;
static cpumask_t rmpopt_cpumask;
-static phys_addr_t rmpopt_pa_start;
+static phys_addr_t rmpopt_pa_start, rmpopt_pa_end;
+
+enum rmpopt_function {
+ RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS,
+ RMPOPT_FUNC_REPORT_STATUS
+};
+
+#define RMPOPT_WORK_TIMEOUT 10000
+
+static struct workqueue_struct *rmpopt_wq;
+static struct delayed_work rmpopt_delayed_work;
+static DEFINE_MUTEX(rmpopt_wq_mutex);
static LIST_HEAD(snp_leaked_pages_list);
static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
@@ -566,6 +578,14 @@ static void rmpopt_cleanup(void)
{
int cpu;
+ guard(mutex)(&rmpopt_wq_mutex);
+
+ if (!rmpopt_wq)
+ return;
+
+ cancel_delayed_work_sync(&rmpopt_delayed_work);
+ destroy_workqueue(rmpopt_wq);
+
cpus_read_lock();
for_each_cpu(cpu, &rmpopt_cpumask)
@@ -574,7 +594,8 @@ static void rmpopt_cleanup(void)
cpus_read_unlock();
cpumask_clear(&rmpopt_cpumask);
- rmpopt_pa_start = 0;
+ rmpopt_pa_start = rmpopt_pa_end = 0;
+ rmpopt_wq = NULL;
}
void snp_shutdown(void)
@@ -592,6 +613,134 @@ void snp_shutdown(void)
}
EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
+static inline bool __rmpopt(u64 pa_start, u64 op_type)
+{
+ bool optimized;
+
+ asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
+ : "=@ccc" (optimized)
+ : "a" (pa_start), "c" (op_type)
+ : "memory", "cc");
+
+ return optimized;
+}
+
+static void rmpopt(u64 pa)
+{
+ u64 pa_start = ALIGN_DOWN(pa, SZ_1G);
+ u64 op_type = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
+
+ __rmpopt(pa_start, op_type);
+}
+
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_smp(void *val)
+{
+ rmpopt((u64)val);
+}
+
+/*
+ * RMPOPT optimizations skip RMP checks at 1GB granularity if this
+ * range of memory does not contain any SNP guest memory.
+ */
+static void rmpopt_work_handler(struct work_struct *work)
+{
+ cpumask_var_t follower_mask;
+ phys_addr_t pa;
+ int this_cpu;
+
+ pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
+ rmpopt_pa_start, rmpopt_pa_end);
+
+ if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
+ return;
+
+ /*
+ * RMPOPT scans the RMP table and stores the result of the scan in
+ * reserved processor memory. The RMP scan is the most expensive
+ * part. If a second RMPOPT occurs, it can skip the expensive scan
+ * if it sees a cached result in the reserved processor memory.
+ *
+ * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
+ * on every other primary thread. Followers are "designed to"
+ * skip the scan if they see the "cached" scan results.
+ */
+ cpumask_copy(follower_mask, &rmpopt_cpumask);
+
+ /*
+ * Pin the worker to the current CPU for the leader loop so that
+ * this_cpu remains valid and the RMPOPT instruction executes on
+ * the correct CPU.
+ *
+ * Use migrate_disable() rather than get_cpu() to prevent
+ * migration while still allowing preemption.
+ */
+ migrate_disable();
+ this_cpu = smp_processor_id();
+
+ if (cpumask_test_cpu(this_cpu, follower_mask)) {
+ /*
+ * Current CPU is a primary thread in rmpopt_cpumask.
+ * Run leader locally and remove from follower mask.
+ */
+ cpumask_clear_cpu(this_cpu, follower_mask);
+
+ for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
+ rmpopt(pa);
+ } else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
+ follower_mask)) {
+ /*
+ * Current CPU is a sibling thread whose primary is in
+ * rmpopt_cpumask. RMPOPT_BASE MSR is per-core, so it
+ * is safe to run the leader locally. Remove the sibling's
+ * primary from the follower mask as this core is already
+ * covered by the leader.
+ */
+ cpumask_andnot(follower_mask, follower_mask,
+ topology_sibling_cpumask(this_cpu));
+
+ for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
+ rmpopt(pa);
+ } else {
+ /*
+ * Current CPU does not have RMPOPT_BASE MSR programmed.
+ * Pick an explicit leader from the cpumask to avoid #UD.
+ */
+ int leader_cpu = cpumask_first(follower_mask);
+
+ if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
+ migrate_enable();
+ goto out;
+ }
+
+ cpumask_clear_cpu(leader_cpu, follower_mask);
+
+ cpus_read_lock();
+ for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
+ smp_call_function_single(leader_cpu, rmpopt_smp,
+ (void *)pa, true);
+ cpus_read_unlock();
+ }
+
+ migrate_enable();
+
+ /* Followers: run RMPOPT on remaining cores */
+ cpus_read_lock();
+ for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+ on_each_cpu_mask(follower_mask, rmpopt_smp,
+ (void *)pa, true);
+
+ /* Give a chance for other threads to run */
+ cond_resched();
+ }
+ cpus_read_unlock();
+
+out:
+ free_cpumask_var(follower_mask);
+}
+
void snp_setup_rmpopt(void)
{
u64 rmpopt_base;
@@ -600,11 +749,35 @@ void snp_setup_rmpopt(void)
if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
return;
+ guard(mutex)(&rmpopt_wq_mutex);
+
+ /*
+ * Guard against re-initialization. When SNP_SHUTDOWN_EX is issued
+ * with x86_snp_shutdown=0, snp_shutdown() is not called and
+ * rmpopt_cleanup() is skipped, but snp_initialized is still cleared.
+ * A subsequent __sev_snp_init_locked() would call snp_setup_rmpopt()
+ * again, leaking the existing workqueue, delayed work, debugfs
+ * entries, and cpumask state.
+ */
+ if (rmpopt_wq)
+ return;
+
+ /*
+ * Create an RMPOPT-specific workqueue to avoid scheduling
+ * RMPOPT workitem on the global system workqueue.
+ */
+ rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
+ if (!rmpopt_wq) {
+ pr_err("Failed to allocate RMPOPT workqueue\n");
+ return;
+ }
+
cpus_read_lock();
/*
* The RMPOPT_BASE MSR is per-core, so only one thread per core needs
- * to set up the RMPOPT_BASE MSR.
+ * to set up the RMPOPT_BASE MSR. Likewise, only one thread per core
+ * needs to issue the RMPOPT instruction.
*
* Note: only online primary threads are included. If a core's
* primary thread is offline, that core is not covered. CPU hotplug
@@ -628,6 +801,23 @@ void snp_setup_rmpopt(void)
wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
cpus_read_unlock();
+
+ INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
+
+ rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
+
+ /* Limit memory scanning to 2TB of RAM */
+ if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T) {
+ pr_info("RMPOPT coverage limited to 2TB; memory above 0x%llx not optimized\n",
+ rmpopt_pa_start + SZ_2T);
+ rmpopt_pa_end = rmpopt_pa_start + SZ_2T;
+ }
+
+ /*
+ * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
+ * optimizations on all physical memory.
+ */
+ queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
}
EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
--
2.43.0