Linux Confidential Computing Development
 help / color / mirror / Atom feed
* Re: [PATCH v2 03/15] KVM: x86/xen: Don't truncate RAX when handling hypercall from protected guest
From: David Woodhouse @ 2026-06-04 21:48 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
	Kiryl Shutsemau, Paul Durrant, Dave Hansen, Rick Edgecombe, kvm,
	x86, linux-coco, linux-kernel, Yosry Ahmed, Kai Huang
In-Reply-To: <dc62e58e-b6ee-41e1-84a5-0716822fefc8@linux.intel.com>

[-- Attachment #1: Type: text/plain, Size: 587 bytes --]

On Mon, 2026-05-18 at 17:55 +0800, Binbin Wu wrote:
>  
> > I'm suggesting that you clean up longmode→is_64bit for the *hypercalls*
> > but leave 'long_mode' as is.
> > 
> 
> Yes, will only do it for is_64_bit_hypercall().

If you did this, I'm not sure I saw it? 

In response to https://lore.kernel.org/all/aiHPPUk5DY7rH-zL@v4bel/#r I
now find myself with both 'longmode' (current vCPU mode, should be
called is_64bit), and 'long_mode' (latched VM-wide mode) in the *same*
function.

I cannot live with that; I'm going to do the longmode→is_64bit change
locally.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-06-04 21:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <aiHeDZEPkAcWcSkn@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Wed, Jun 03, 2026, Ackerley Tng wrote:
>> Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
>> writes:
>>
>> > This is v7 of guest_memfd in-place conversion support.
>> >
>>
>> Here's the outstanding items after going over everyone's comments
>> including Sashiko's:
>>
>> + KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
>>     + Need to move page clearing into __kvm_gmem_get_pfn to resolve
>>       leak where populate can put initialized kernel memory into TDX
>>       guest
>>     + See suggested fix at [1]
>
> That fix works for me.  The initial guest image will typically be a tiny subset
> of guest memory, so unnecessarily zeroing a few pages isn't a performance concern.
>

In regular usage moving the zeroing in [1] doesn't change anything,
since the same zeroing would have first happened when the host faults
the pages to put the initial image. When populating, there's no more
zeroing since it was zeroed.

[1] covers the case where the host doesn't write anything to the pages
and directly tries to populate the pages to the guest.

>> + KVM: guest_memfd: Only prepare folios for private pages,
>>     + s/non-CoCo/CoCo in commit message "INIT_SHARED is about to be
>>       supported for non-CoCo VMs in a later patch in this series
>>     + Use Suggested-by: Michael Roth <michael.roth@amd.com>
>> + KVM: selftests: Test that shared/private status is consistent across
>>   processes
>>     + Improve test reliability using pthread_mutex
>>     + I have a fixup patch offline.
>> 	
>> I would like feedback on these:
>> 	
>> + KVM: selftests: Test conversion with elevated page refcount
>>     + Askar pointed out that soon vmsplice may not pin pages. Should I
>>       pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
>>       take a dependency on CONFIG_GUP_TEST.
>
> I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
> it probably is the least awful choice.  E.g. KVM also pins pages is certain flows,
> but we're _also_ actively working to remove the need to pin.
>
> Hmm, maybe IORING_REGISTER_PBUF_RING?  AFAICT, it's almost literally a "pin user
> memory" syscall.
>

Hmm that takes a dependency on io_uring, which isn't always compiled
in. Between CONFIG_IO_URING and CONFIG_GUP_TEST, I'd rather
CONFIG_GUP_TEST.

>> + KVM: selftests: Add script to exercise private_mem_conversions_test
>>     + Would like to know what people think of a wrapper script before
>>       I address Sashiko's comments.
>
> NAK to a wrapper script.  This sounds like a perfect fit for Vipin's selftest
> runner (which I'm like 4 months overdue for reviewing, testing, and merging).
> If the runner _can't_ do what you want, then I'd rather improve the runner.
>
> [*] https://lore.kernel.org/all/20260331194202.1722082-1-vipinsh@google.com
>

Good to know we have this!

Thanks, I'll work on a v8 to clean up the above.

>>
>> [1] https://lore.kernel.org/all/CAEvNRgEVC=fFuKVgZYvWyZD7t_zvUZihFG8hrACjvtkD5cwugw@mail.gmail.com/
>> [2] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/
>>
>> >
>> > [...snip...]
>> >

^ permalink raw reply

* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Sean Christopherson @ 2026-06-04 20:20 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgGpaggjd3=ooyzv7iEbmA-x1mWJHgjLSjPi8=5CPrk-yQ@mail.gmail.com>

On Wed, Jun 03, 2026, Ackerley Tng wrote:
> Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
> writes:
> 
> > This is v7 of guest_memfd in-place conversion support.
> >
> 
> Here's the outstanding items after going over everyone's comments
> including Sashiko's:
> 
> + KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
>     + Need to move page clearing into __kvm_gmem_get_pfn to resolve
>       leak where populate can put initialized kernel memory into TDX
>       guest
>     + See suggested fix at [1]

That fix works for me.  The initial guest image will typically be a tiny subset
of guest memory, so unnecessarily zeroing a few pages isn't a performance concern.

> + KVM: guest_memfd: Only prepare folios for private pages,
>     + s/non-CoCo/CoCo in commit message "INIT_SHARED is about to be
>       supported for non-CoCo VMs in a later patch in this series
>     + Use Suggested-by: Michael Roth <michael.roth@amd.com>
> + KVM: selftests: Test that shared/private status is consistent across
>   processes
>     + Improve test reliability using pthread_mutex
>     + I have a fixup patch offline.
> 	
> I would like feedback on these:
> 	
> + KVM: selftests: Test conversion with elevated page refcount
>     + Askar pointed out that soon vmsplice may not pin pages. Should I
>       pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
>       take a dependency on CONFIG_GUP_TEST.

I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
it probably is the least awful choice.  E.g. KVM also pins pages is certain flows,
but we're _also_ actively working to remove the need to pin.

Hmm, maybe IORING_REGISTER_PBUF_RING?  AFAICT, it's almost literally a "pin user
memory" syscall.

> + KVM: selftests: Add script to exercise private_mem_conversions_test
>     + Would like to know what people think of a wrapper script before
>       I address Sashiko's comments.

NAK to a wrapper script.  This sounds like a perfect fit for Vipin's selftest
runner (which I'm like 4 months overdue for reviewing, testing, and merging).
If the runner _can't_ do what you want, then I'd rather improve the runner.

[*] https://lore.kernel.org/all/20260331194202.1722082-1-vipinsh@google.com

> 
> [1] https://lore.kernel.org/all/CAEvNRgEVC=fFuKVgZYvWyZD7t_zvUZihFG8hrACjvtkD5cwugw@mail.gmail.com/
> [2] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/
> 
> >
> > [...snip...]
> >

^ permalink raw reply

* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Michael Roth @ 2026-06-04 20:11 UTC (permalink / raw)
  To: Suzuki K Poulose
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <9d15479e-e36b-4865-804c-7d93eb339e4e@arm.com>

On Thu, Jun 04, 2026 at 04:29:19PM +0100, Suzuki K Poulose wrote:
> On 23/05/2026 01:18, Ackerley Tng via B4 Relay wrote:
> > From: Michael Roth <michael.roth@amd.com>
> > 
> > For vm_memory_attributes=1, in-place conversion/population is not
> > supported, so the initial contents necessarily must need to come
> > from a separate src address, which is enforced by the current
> > implementation. However, for vm_memory_attributes=0, it is possible for
> > guest memory to be initialized directly from userspace by mmap()'ing the
> > guest_memfd and writing to it while the corresponding GPA ranges are in
> > a 'shared' state before converting them to the 'private' state expected
> > by KVM_SEV_SNP_LAUNCH_UPDATE.
> > 
> > Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
> > for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
> > SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
> > copy in data from a separate memory location. Continue to enforce
> > non-NULL for the original vm_memory_attributes=1 case.
> > 
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> > [Added src_page check in error handling path when the firmware command fails]
> > [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> 
> 
> 
> 
> > ---
> >   Documentation/virt/kvm/x86/amd-memory-encryption.rst | 15 +++++++++++----
> >   arch/x86/kvm/svm/sev.c                               | 18 +++++++++++++-----
> >   virt/kvm/kvm_main.c                                  |  1 +
> >   3 files changed, 25 insertions(+), 9 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > index b2395dd4769de..43085f65b2d85 100644
> > --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> > @@ -503,7 +503,8 @@ secrets.
> >   It is required that the GPA ranges initialized by this command have had the
> >   KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
> > -for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
> > +for KVM_SET_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES2 for more details on
> > +this aspect.
> >   Upon success, this command is not guaranteed to have processed the entire
> >   range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> > @@ -511,9 +512,15 @@ range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> >   remaining range that has yet to be processed. The caller should continue
> >   calling this command until those fields indicate the entire range has been
> >   processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
> > -range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
> > -buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
> > -``uaddr`` will be ignored completely.
> > +range plus 1, and ``uaddr`` (if specified) is the last byte of the
> > +userspace-provided source buffer address plus 1.
> > +
> > +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
> > +ignored completely. Otherwise, ``uaddr`` is required if
> > +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
> > +in the latter case guest memory can be initialized directly from userspace
> > +prior to converting it to private and passing the GPA range on to this
> > +interface.
> 
> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in the
> process of making it "private" ? i.e., the contents of a SNP shared
> page are preserved while transitioning to "SNP Private" (via RMP
> update).

sev_gmem_prepare() does sort of destroy contents since it finalizes the
shared->private conversion which puts the page in an unusable state
until the guest 'accepts' it as private memory and re-initializes the
contents.

But that's run-time, when the guest is doing conversions. The
documentation here is relating to initialization time when we are
setting up the initial pre-encrypted/pre-measured guest memory image,
via SNP_LAUNCH_UPDATE. That path calls into kvm_gmem_populate(), and it
is then sev_gmem_post_populate() callback that actually finalizes the
shared->private conversion. The sev_gmem_prepare() hook doesn't get used
in this flow (kvm_gmem_populate() calls __kvm_gmem_get_pfn() which skips
preparation).

-Mike

> 
> Suzuki
> 
> 
> 
> >   Parameters (in): struct  kvm_sev_snp_launch_update
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index 1a361f08c7a3d..e1dbc827c2807 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -2343,7 +2343,15 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >   	int level;
> >   	int ret;
> > -	if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
> > +	/*
> > +	 * For vm_memory_attributes=1, in-place conversion/population is not
> > +	 * supported, so the initial contents necessarily need to come from a
> > +	 * separate src address. For vm_memory_attributes=0, this isn't
> > +	 * necessarily the case, since the pages may have been populated
> > +	 * directly from userspace before calling KVM_SEV_SNP_LAUNCH_UPDATE.
> > +	 */
> > +	if (vm_memory_attributes &&
> > +	    sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page)
> >   		return -EINVAL;
> >   	ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);
> > @@ -2390,7 +2398,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >   	 */
> >   	if (ret && !snp_page_reclaim(kvm, pfn) &&
> >   	    sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
> > -	    sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) {
> > +	    sev_populate_args->fw_error == SEV_RET_INVALID_PARAM && src_page) {
> >   		void *src_vaddr = kmap_local_page(src_page);
> >   		void *dst_vaddr = kmap_local_pfn(pfn);
> > @@ -2423,8 +2431,8 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >   	if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
> >   		return -EFAULT;
> > -	pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d\n", __func__,
> > -		 params.gfn_start, params.len, params.type, params.flags);
> > +	pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d src %llx\n", __func__,
> > +		 params.gfn_start, params.len, params.type, params.flags, params.uaddr);
> >   	if (!params.len || !PAGE_ALIGNED(params.len) || params.flags ||
> >   	    (params.type != KVM_SEV_SNP_PAGE_TYPE_NORMAL &&
> > @@ -2481,7 +2489,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >   	params.gfn_start += count;
> >   	params.len -= count * PAGE_SIZE;
> > -	if (params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> > +	if (src && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> >   		params.uaddr += count * PAGE_SIZE;
> >   	if (copy_to_user(u64_to_user_ptr(argp->data), &params, sizeof(params)))
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ba195bb239aaa..3bf212fd99193 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -105,6 +105,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
> >   #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> >   bool vm_memory_attributes = true;
> >   module_param(vm_memory_attributes, bool, 0444);
> > +EXPORT_SYMBOL_FOR_KVM_INTERNAL(vm_memory_attributes);
> >   #endif
> >   DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> >   EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> > 
> 

^ permalink raw reply

* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Ackerley Tng @ 2026-06-04 19:05 UTC (permalink / raw)
  To: Suzuki K Poulose, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <9d15479e-e36b-4865-804c-7d93eb339e4e@arm.com>

Suzuki K Poulose <suzuki.poulose@arm.com> writes:

>
> [...snip...]
>
>> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
>> +ignored completely. Otherwise, ``uaddr`` is required if
>> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
>> +in the latter case guest memory can be initialized directly from userspace
>> +prior to converting it to private and passing the GPA range on to this
>> +interface.
>
> Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in
> the process of making it "private" ? i.e., the contents of a SNP shared
> page are preserved while transitioning to "SNP Private" (via RMP
> update).
>
> Suzuki
>

The following is the guest_memfd perspective, I didn't look at the SNP
spec:

Do you mean specifically for KVM_SEV_SNP_PAGE_TYPE_ZERO, or for any
type?

guest_memfd has no plans to do any special zeroing based on type.

guest_memfd decoupled zeroing from preparation a while ago (Michael had
some patches), so zeroing is supposed to be once during folio ownership
by guest_memfd, tracked by the uptodate flag, and preparation is tracked
outside of guest_memfd. So far only SNP does preparation.

>
>
>>
>> [...snip...]
>>

^ permalink raw reply

* [CfP] Confidential Computing Microconference (LPC 2026)
From: Jörg Rödel @ 2026-06-04 18:29 UTC (permalink / raw)
  To: linux-coco, linux-kernel, kvm, virtualization, coconut-svsm,
	linux-sgx
  Cc: Dhaval Giani

Hi everyone,

We are pleased to officially open the Call for Presentations for the 2026
Confidential Computing Microconference at LPC this year. LPC will take place
from October 5th to 7th in Prague.

Our goal is to bring open-source developers and industry experts together for
productive discussions that lead to concrete solutions. We are looking for
interactive discussions, ongoing development topics, and problem-solving
sessions rather than static status updates.

We are looking for proposals covering, but not limited to, the following
developments and challenges:

	* Enhancements to CVM memory backing via `guest_memfd`
	* KVM Support for ARM CCA
	* Privilege separation features in KVM
	* CVM live migration
	* Secure VM Service Module (SVSM) architecture and Linux support
	* Trusted I/O software architecture
	* Solutions for the full CVM (remote) attestation problem
	* Linux as a CVM operating system across hypervisors
	* CVM Performance optimization and benchmarking

If you are working on any of these areas or have another critical Confidential
Computing topic that requires community alignment, please submit your proposal!

LPC microconferences are built around discussion and collaboration. Proposals
should focus on open problems, architectural roadblocks, or design choices that
would benefit from in-person feedback from the community.

* Submit here:		https://lpc.events/event/20/abstracts/
* Submission Deadline:	August 7, 2026

Make sure to select "Confidential Computing MC" as the track! This year the LPC
organization committee will grant pre-registration vouchers to anyone who has
submitted a topic. These are at the usual price ($600) which must be used
before registration opens. If your topic is not accepted you should be eligible
for a refund if your employer doesn’t approve your travel. For more details see

	https://lpc.events/blog/current/index.php/2026/04/06/changes-to-registration-availability-for-2026/

Looking forward to seeing you there,


- Dhaval and Joerg

^ permalink raw reply

* Re: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Jason Gunthorpe @ 2026-06-04 18:24 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Michael Kelley, iommu@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Mostafa Saleh, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev@lists.ozlabs.org,
	linux-s390@vger.kernel.org, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <yq5apl26qrof.fsf@kernel.org>

On Thu, Jun 04, 2026 at 08:27:36PM +0530, Aneesh Kumar K.V wrote:
> I already sent a v6 in the hope of getting this merged for the next
> merge window. Should I send a v7, or would you prefer that I do the
> rename on top of v6?

I think it is too late for such a major change, but this should be
imaginged to be for rc2ish next cycle. You also have to spell out how
the pkvm patch will get sequenced as well, it would be best to push
that it gets picked up right away.

Jason

^ permalink raw reply

* Re: [PATCH v6 10/11] x86/virt/tdx: Enable Dynamic PAMT
From: Kiryl Shutsemau @ 2026-06-04 17:14 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: bp, dave.hansen, hpa, kvm, linux-coco, linux-doc, linux-kernel,
	mingo, nik.borisov, pbonzini, seanjc, tglx, vannapurve, x86,
	chao.gao, yan.y.zhao, kai.huang, Kirill A. Shutemov
In-Reply-To: <20260526023515.288829-11-rick.p.edgecombe@intel.com>

On Mon, May 25, 2026 at 07:35:14PM -0700, Rick Edgecombe wrote:
> @@ -152,7 +156,12 @@ const struct tdx_sys_info *tdx_get_sysinfo(void);
>  
>  static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
>  {
> -	return false; /* To be enabled when kernel is ready */
> +	/*
> +	 * The TDX Module's internal Dynamic PAMT tree structure can't
> +	 * handle physical addresses with more than 48 bits.
> +	 */
> +	return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT &&
> +	       boot_cpu_data.x86_phys_bits <= 48;

Should we warn for >48?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH v6 08/11] x86/tdx: Add APIs to support Dynamic PAMT ops from KVM's fault path
From: Kiryl Shutsemau @ 2026-06-04 17:11 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: bp, dave.hansen, hpa, kvm, linux-coco, linux-doc, linux-kernel,
	mingo, nik.borisov, pbonzini, seanjc, tglx, vannapurve, x86,
	chao.gao, yan.y.zhao, kai.huang
In-Reply-To: <20260526023515.288829-9-rick.p.edgecombe@intel.com>

On Mon, May 25, 2026 at 07:35:12PM -0700, Rick Edgecombe wrote:
> When handling an EPT violation, KVM holds a spinlock while manipulating
> the EPT. Before entering the spinlock it doesn't know how many EPT page
> tables will need to be installed or whether a huge page will be used. For
> this reason it allocates a worst case number of page tables that it might
> need as part of servicing the EPT violation.
> 
> Under Dynamic PAMT these pre-allocated pages will potentially need to have
> Dynamic PAMT backing pages installed for them. KVM already has helpers to
> manage topping up page caches before taking the MMU lock, but they cannot be
> passed from KVM to arch/x86 code.
> 
> The problem of how and when to install the DPAMT backing pages for the
> pages given to the TDX module during the fault path has had a lot of
> design attempts.
>  - Extracting KVM's MMU caches requires too much inlined code added to
>    headers.
>  - A few varieties of installing Dynamic PAMT backing when allocating the
>    S-EPT page tables. [0][1]
>  - Using mempool_t to transfer the pages between KVM and arch/x86 doesn't
>    work because it is the component is designed more around maintaining a
>    pool of pages, rather than topping up a continually drained cache.
> 
> So don't do these as they all had various problems. Instead just create a
> small simple data structure to use for handing a pre-allocated list of
> pages between KVM and arch/x86 code. Model this on KVM's existing MMU
> memory caches.
> 
> Add a tdx_pamt_cache arg to tdx_pamt_get() so it can draw pages from a
> cache when needed. Not all DPAMT page installations will happen under
> spinlock, for example control pages. So have tdx_pamt_get() maintain the
> existing behavior of allocating from the page allocator when NULL is
> passed for the struct tdx_pamt_cache arg. This prevents excess allocations
> for cases where it can be avoided.
> 
> Export the new helpers for KVM.
> 
> Assisted-by: GitHub Copilot:claude-opus-4-6 Claude:claude-opus-4-7
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Link: https://lore.kernel.org/kvm/de05853257e9cc66998101943f78a4b7e6e3d741.camel@intel.com/ [0]
> Link: https://lore.kernel.org/kvm/aYprxnSHKHUtk7pt@google.com/ [1]

Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Kiryl Shutsemau @ 2026-06-04 16:59 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Gao, Chao, kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	Huang, Kai, Hansen, Dave, Zhao, Yan Y, seanjc@google.com,
	mingo@redhat.com, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com,
	linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
	Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
	x86@kernel.org
In-Reply-To: <fe08f03a22acfe758cd97f7c2880deeafbc5fe58.camel@intel.com>

On Tue, May 26, 2026 at 04:42:24PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2026-05-26 at 16:57 +0800, Chao Gao wrote:
> > > -	scoped_guard(spinlock, &pamt_lock) {
> > 
> > This converts the scoped_guard() added by the previous patch to
> > explicit lock/unlock and goto. It would reduce code churn if the
> > previous patch used that form directly.
> 
> Yea, it's a good point. I actually debated doing it, but decided not to because
> the scoped version is cleaner for the non-optimized version. But for
> reviewability, never doing the scoped version is probably better.

I don't see a reason why we can't keep the scoped_guard() on get side.

On put side, we cannot get atomic_get_and_lock() semantics without
dropping the scoped_guard().

Maybe we should keep it for get?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* RE: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Michael Kelley @ 2026-06-04 16:18 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Michael Kelley, Jason Gunthorpe, Michael Kelley
  Cc: iommu@lists.linux.dev, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Mostafa Saleh, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev@lists.ozlabs.org,
	linux-s390@vger.kernel.org, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <yq5apl26qrof.fsf@kernel.org>

From: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Sent: Thursday, June 4, 2026 7:58 AM
> 
> Michael Kelley <mhklinux@outlook.com> writes:
> 
> > From: Jason Gunthorpe <jgg@ziepe.ca> Sent: Tuesday, June 2, 2026 5:55 PM
> >>
> >> On Tue, Jun 02, 2026 at 02:24:40PM +0000, Michael Kelley wrote:
> >>
> >> > Except that in a normal VM, the "unencrypted" pool attribute does *not*
> >> > describe the state of the memory itself.  In a normal VM, the memory is
> >> > unencrypted, but the "unencrypted" pool attribute is false. That
> >> > contradiction is the essence of my concern.
> >>
> >> I would argue no..
> >>
> >> When CC is enabled the default state of memory in a Linux environment
> >> is "encrypted". You have to take a special action to "decrypt" it.
> >>
> >> Thus the default state of memory in a non-CC environment is also
> >> paradoxically "encrypted" too.
> >
> > The need to have such an unnatural premise is usually an indication
> > of a conceptual problem with the overall model, or perhaps just a
> > terminology problem.
> >
> > Here's a proposal. The new DMA attribute is DMA_ATTR_CC_SHARED.
> > Name the pool attribute "cc_shared" instead of "unencrypted". Having
> > "cc_shared" set to false in a normal VM doesn't lead to the non-sensical
> > situation of claiming that a normal VM is encrypted. The boolean
> > "unencrypted" parameter that has been added to various calls also
> > becomes "cc_shared".  If "CC_SHARED" is a suitable name for the DMA
> > attribute, it ought to be suitable as the pool attribute. And everything
> > matches as well.
> >
> 
> That is better. It would also simplify:
> 
> 	if (mem->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
> 		return NULL;
> 
> to
> 	if (mem->cc_shared != !!(attrs & DMA_ATTR_CC_SHARED))
> 		return NULL;
> 
> 
> I already sent a v6 in the hope of getting this merged for the next
> merge window. Should I send a v7, or would you prefer that I do the
> rename on top of v6?
> 

I would advocate for a v7 with the rename, vs. a separate follow-on
patch to do the rename, just to reduce churn. But I don't know what
the tradeoffs are in trying to hit the next merge window. If a follow-on
patch is more practical from a timing standpoint, I won't object.

Michael


^ permalink raw reply

* Re: [PATCH v6 02/11] x86/virt/tdx: Allocate page bitmap for Dynamic PAMT
From: Kiryl Shutsemau @ 2026-06-04 16:14 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: bp, dave.hansen, hpa, kvm, linux-coco, linux-doc, linux-kernel,
	mingo, nik.borisov, pbonzini, seanjc, tglx, vannapurve, x86,
	chao.gao, yan.y.zhao, kai.huang, Kirill A. Shutemov, Binbin Wu
In-Reply-To: <20260526023515.288829-3-rick.p.edgecombe@intel.com>

On Mon, May 25, 2026 at 07:35:06PM -0700, Rick Edgecombe wrote:
> @@ -579,7 +591,12 @@ static __init int tdmr_set_up_pamt(struct tdmr_info *tdmr,
>  	 * Calculate the PAMT size for each TDX supported page size
>  	 * and the total PAMT size.
>  	 */
> -	tdmr->pamt_4k_size = tdmr_get_pamt_sz(tdmr, TDX_PS_4K);
> +	if (tdx_supports_dynamic_pamt(&tdx_sysinfo)) {
> +		/* With Dynamic PAMT, PAMT_4K is replaced with a bitmap */
> +		tdmr->pamt_4k_size = tdmr_get_pamt_bitmap_sz(tdmr);
> +	} else {
> +		tdmr->pamt_4k_size = tdmr_get_pamt_sz(tdmr, TDX_PS_4K);
> +	}
>  	tdmr->pamt_2m_size = tdmr_get_pamt_sz(tdmr, TDX_PS_2M);
>  	tdmr->pamt_1g_size = tdmr_get_pamt_sz(tdmr, TDX_PS_1G);
>  	tdmr_pamt_size = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;

Maybe it would more readable if we reverse the size order:

	/*
	 * Calculate the PAMT size for each TDX supported page size
	 * and the total PAMT size.
	 */
  	tdmr->pamt_1g_size = tdmr_get_pamt_sz(tdmr, TDX_PS_1G);
  	tdmr->pamt_2m_size = tdmr_get_pamt_sz(tdmr, TDX_PS_2M);

	if (tdx_supports_dynamic_pamt(&tdx_sysinfo)) {
		/* With Dynamic PAMT, PAMT_4K is replaced with a bitmap */
		tdmr->pamt_4k_size = tdmr_get_pamt_bitmap_sz(tdmr);
	} else {
		tdmr->pamt_4k_size = tdmr_get_pamt_sz(tdmr, TDX_PS_4K);
	}

  	tdmr_pamt_size = tdmr->pamt_1g_size + tdmr->pamt_2m_size + tdmr->pamt_4k_size;

It allows split it into logical blocks while keeping the comment attached.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH v6 01/11] x86/virt/tdx: Simplify tdmr_get_pamt_sz()
From: Kiryl Shutsemau @ 2026-06-04 16:05 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: bp, dave.hansen, hpa, kvm, linux-coco, linux-doc, linux-kernel,
	mingo, nik.borisov, pbonzini, seanjc, tglx, vannapurve, x86,
	chao.gao, yan.y.zhao, kai.huang, Binbin Wu
In-Reply-To: <20260526023515.288829-2-rick.p.edgecombe@intel.com>

On Mon, May 25, 2026 at 07:35:05PM -0700, Rick Edgecombe wrote:
> For each memory region that the TDX module might use (called TDMR), three
> separate traditional PAMT allocations are needed. One for each supported
> page size (1GB, 2MB, 4KB). These store information on each page in the
> TDMR. In Linux, they are allocated out of one physically contiguous block,
> in order to more efficiently use some internal TDX module book keeping
> resources. So some simple math is needed to break the single large
> allocation into three smaller allocations for each page size.
> 
> There are some commonalities in the math needed to calculate the base and
> size for each smaller allocation, and so an effort was made to share logic
> across the three. Unfortunately doing this turned out unnaturally tortured,
> with a loop iterating over the three page sizes, only to call into a
> function with cases statement for each page size. In the future Dynamic
> PAMT will add more logic that is special to the 4KB page size, making the
> benefit of the math sharing even more questionable.
> 
> Three is not a very high number, so get rid of the loop and just duplicate
> the small calculation three times. In doing so, setup for future Dynamic
> PAMT changes.
> 
> Since the loop that iterates over it is gone, further simplify the code by
> dropping the array of intermediate size and base storage. Just store the
> values to their final locations. Accept the small complication of having
> to clear tdmr->pamt_4k_base in the error path, so that tdmr_do_pamt_func()
> will not try to operate on the TDMR struct when attempting to free it.
> 
> Assisted-by: GitHub Copilot:claude-opus-4-6 Claude:claude-opus-4-7
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

Couple of nits below.

> ---
> v6:
>  - Drop {} by moving a comment (Binbin)
>  - Log tweaks
> 
> v4:
>  - Just refer to global var instead of passing pamt_entry_size around
>    (Xiaoyao)
>  - Remove setting pamt_4k_base to zero, because it already is zero.
>    Adjust the comment appropriately (Kai)
> 
> v3:
>  - New patch
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 93 ++++++++++++-------------------------
>  1 file changed, 29 insertions(+), 64 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 967482ae3c801..487f389f52f4b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -516,31 +516,21 @@ static __init int fill_out_tdmrs(struct list_head *tmb_list,
>   * Calculate PAMT size given a TDMR and a page size.  The returned
>   * PAMT size is always aligned up to 4K page boundary.
>   */
> -static __init unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz,
> -					     u16 pamt_entry_size)
> +static __init unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz)
>  {
>  	unsigned long pamt_sz, nr_pamt_entries;
> +	const int tdx_pg_size_shift[] = { PAGE_SHIFT, PMD_SHIFT, PUD_SHIFT };
> +	const u16 pamt_entry_size[TDX_PS_NR] = {
> +		tdx_sysinfo.tdmr.pamt_4k_entry_size,
> +		tdx_sysinfo.tdmr.pamt_2m_entry_size,
> +		tdx_sysinfo.tdmr.pamt_1g_entry_size,
> +	};
>  
> -	switch (pgsz) {
> -	case TDX_PS_4K:
> -		nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
> -		break;
> -	case TDX_PS_2M:
> -		nr_pamt_entries = tdmr->size >> PMD_SHIFT;
> -		break;
> -	case TDX_PS_1G:
> -		nr_pamt_entries = tdmr->size >> PUD_SHIFT;
> -		break;
> -	default:
> -		WARN_ON_ONCE(1);
> -		return 0;
> -	}
> +	nr_pamt_entries = tdmr->size >> tdx_pg_size_shift[pgsz];
> +	pamt_sz = nr_pamt_entries * pamt_entry_size[pgsz];
>  
> -	pamt_sz = nr_pamt_entries * pamt_entry_size;
>  	/* TDX requires PAMT size must be 4K aligned */
> -	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> -
> -	return pamt_sz;
> +	return PAGE_ALIGN(pamt_sz);
>  }
>  
>  /*
> @@ -578,28 +568,21 @@ static __init int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_lis
>   * within @tdmr, and set up PAMTs for @tdmr.
>   */
>  static __init int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> -				   struct list_head *tmb_list,
> -				   u16 pamt_entry_size[])
> +				   struct list_head *tmb_list)
>  {
> -	unsigned long pamt_base[TDX_PS_NR];
> -	unsigned long pamt_size[TDX_PS_NR];
> -	unsigned long tdmr_pamt_base;
>  	unsigned long tdmr_pamt_size;
>  	struct page *pamt;
> -	int pgsz, nid;
> -
> +	int nid;

Add a newline here?

>  	nid = tdmr_get_nid(tdmr, tmb_list);
>  
>  	/*
>  	 * Calculate the PAMT size for each TDX supported page size
>  	 * and the total PAMT size.
>  	 */
> -	tdmr_pamt_size = 0;
> -	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
> -		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
> -					pamt_entry_size[pgsz]);
> -		tdmr_pamt_size += pamt_size[pgsz];
> -	}
> +	tdmr->pamt_4k_size = tdmr_get_pamt_sz(tdmr, TDX_PS_4K);
> +	tdmr->pamt_2m_size = tdmr_get_pamt_sz(tdmr, TDX_PS_2M);
> +	tdmr->pamt_1g_size = tdmr_get_pamt_sz(tdmr, TDX_PS_1G);
> +	tdmr_pamt_size = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
>  
>  	/*
>  	 * Allocate one chunk of physically contiguous memory for all
> @@ -607,26 +590,18 @@ static __init int tdmr_set_up_pamt(struct tdmr_info *tdmr,
>  	 * in overlapped TDMRs.
>  	 */
>  	pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
> -			nid, &node_online_map);
> +				  nid, &node_online_map);
> +

Looks like unrelated whitespace change. Is it intentional?

> +	/*
> +	 * tdmr->pamt_4k_base is still zero so the error
> +	 * path of the caller will skip freeing the pamt.
> +	 */
>  	if (!pamt)
>  		return -ENOMEM;
>  
> -	/*
> -	 * Break the contiguous allocation back up into the
> -	 * individual PAMTs for each page size.
> -	 */
> -	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> -	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
> -		pamt_base[pgsz] = tdmr_pamt_base;
> -		tdmr_pamt_base += pamt_size[pgsz];
> -	}
> -
> -	tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> -	tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> -	tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
> -	tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
> -	tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
> -	tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
> +	tdmr->pamt_4k_base = page_to_phys(pamt);
> +	tdmr->pamt_2m_base = tdmr->pamt_4k_base + tdmr->pamt_4k_size;
> +	tdmr->pamt_1g_base = tdmr->pamt_2m_base + tdmr->pamt_2m_size;
>  
>  	return 0;
>  }
> @@ -657,10 +632,7 @@ static __init void tdmr_do_pamt_func(struct tdmr_info *tdmr,
>  	tdmr_get_pamt(tdmr, &pamt_base, &pamt_size);
>  
>  	/* Do nothing if PAMT hasn't been allocated for this TDMR */
> -	if (!pamt_size)
> -		return;
> -
> -	if (WARN_ON_ONCE(!pamt_base))
> +	if (!pamt_base)
>  		return;
>  
>  	pamt_func(pamt_base, pamt_size);
> @@ -686,14 +658,12 @@ static __init void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
>  
>  /* Allocate and set up PAMTs for all TDMRs */
>  static __init int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
> -					struct list_head *tmb_list,
> -					u16 pamt_entry_size[])
> +				 struct list_head *tmb_list)
>  {
>  	int i, ret = 0;
>  
>  	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> -		ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list,
> -				pamt_entry_size);
> +		ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list);
>  		if (ret)
>  			goto err;
>  	}
> @@ -970,18 +940,13 @@ static __init int construct_tdmrs(struct list_head *tmb_list,
>  				  struct tdmr_info_list *tdmr_list,
>  				  struct tdx_sys_info_tdmr *sysinfo_tdmr)
>  {
> -	u16 pamt_entry_size[TDX_PS_NR] = {
> -		sysinfo_tdmr->pamt_4k_entry_size,
> -		sysinfo_tdmr->pamt_2m_entry_size,
> -		sysinfo_tdmr->pamt_1g_entry_size,
> -	};
>  	int ret;
>  
>  	ret = fill_out_tdmrs(tmb_list, tdmr_list);
>  	if (ret)
>  		return ret;
>  
> -	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list, pamt_entry_size);
> +	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list);
>  	if (ret)
>  		return ret;
>  
> -- 
> 2.54.0
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH v14 14/44] arm64: RMI: Basic infrastructure for creating a realm.
From: Steven Price @ 2026-06-04 15:55 UTC (permalink / raw)
  To: Suzuki K Poulose, Marc Zyngier
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <1bb9da2a-bcc2-4232-944e-0730e9e1f45a@arm.com>

On 02/06/2026 15:49, Suzuki K Poulose wrote:
> Hi Marc
> 
> On 28/05/2026 08:10, Marc Zyngier wrote:
>> On Wed, 13 May 2026 14:17:22 +0100,
>> Steven Price <steven.price@arm.com> wrote:
>>>
>>> Introduce the skeleton functions for creating and destroying a realm.
>>> The IPA size requested is checked against what the RMM supports.
>>>
>>> The actual work of constructing the realm will be added in future
>>> patches.
>>
>> Again, $SUBJECT doesn't reflect that this is purely a KVM patch.

Indeed - "KVM: arm64: CCA" is a better prefix.

>>>
>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>> ---
>>> Changes since v13:
>>>   * Rebased and updated to RMM-v2.0-bet1.
>>>   * Auxiliary granules have been removed in RMM-v2.0-bet1
>>> Changes since v12:
>>>   * Drop the RMM_PAGE_{SHIFT,SIZE} defines - the RMM is now
>>> configured to
>>>     be the same as the host's page size.
>>>   * Rework delegate/undelegate functions to use the new RMI range based
>>>     operations.
>>> Changes since v11:
>>>   * Major rework to drop the realm configuration and make the
>>>     construction of realms implicit rather than driven by the VMM
>>>     directly.
>>>   * The code to create RDs, handle VMIDs etc is moved to later patches.
>>> Changes since v10:
>>>   * Rename from RME to RMI.
>>>   * Move the stage2 cleanup to a later patch.
>>> Changes since v9:
>>>   * Avoid walking the stage 2 page tables when destroying the realm -
>>>     the real ones are not accessible to the non-secure world, and the
>>> RMM
>>>     may leave junk in the physical pages when returning them.
>>>   * Fix an error path in realm_create_rd() to actually return an
>>> error value.
>>> Changes since v8:
>>>   * Fix free_delegated_granule() to not call
>>> kvm_account_pgtable_pages();
>>>     a separate wrapper will be introduced in a later patch to deal with
>>>     RTTs.
>>>   * Minor code cleanups following review.
>>> Changes since v7:
>>>   * Minor code cleanup following Gavin's review.
>>> Changes since v6:
>>>   * Separate RMM RTT calculations from host PAGE_SIZE. This allows the
>>>     host page size to be larger than 4k while still communicating
>>> with an
>>>     RMM which uses 4k granules.
>>> Changes since v5:
>>>   * Introduce free_delegated_granule() to replace many
>>>     undelegate/free_page() instances and centralise the comment on
>>>     leaking when the undelegate fails.
>>>   * Several other minor improvements suggested by reviews - thanks for
>>>     the feedback!
>>> Changes since v2:
>>>   * Improved commit description.
>>>   * Improved return failures for rmi_check_version().
>>>   * Clear contents of PGD after it has been undelegated in case the RMM
>>>     left stale data.
>>>   * Minor changes to reflect changes in previous patches.
>>> ---
>>>   arch/arm64/include/asm/kvm_emulate.h | 29 ++++++++++++++
>>>   arch/arm64/include/asm/kvm_rmi.h     | 51 +++++++++++++++++++++++++
>>>   arch/arm64/kvm/arm.c                 | 12 ++++++
>>>   arch/arm64/kvm/mmu.c                 | 12 +++++-
>>>   arch/arm64/kvm/rmi.c                 | 57 ++++++++++++++++++++++++++++
>>>   5 files changed, 159 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/
>>> include/asm/kvm_emulate.h
>>> index 5bf3d7e1d92c..82fd777bd9bb 100644
>>> --- a/arch/arm64/include/asm/kvm_emulate.h
>>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>>> @@ -688,4 +688,33 @@ static inline void vcpu_set_hcrx(struct kvm_vcpu
>>> *vcpu)
>>>               vcpu->arch.hcrx_el2 |= HCRX_EL2_EnASR;
>>>       }
>>>   }
>>> +
>>> +static inline bool kvm_is_realm(struct kvm *kvm)
>>> +{
>>> +    if (static_branch_unlikely(&kvm_rmi_is_available))
>>> +        return kvm->arch.is_realm;
>>> +    return false;
>>> +}
>>> +
>>> +static inline enum realm_state kvm_realm_state(struct kvm *kvm)
>>> +{
>>> +    return READ_ONCE(kvm->arch.realm.state);
>>> +}
>>> +
>>> +static inline void kvm_set_realm_state(struct kvm *kvm,
>>> +                       enum realm_state new_state)
>>> +{
>>> +    WRITE_ONCE(kvm->arch.realm.state, new_state);
>>> +}
>>> +
>>> +static inline bool kvm_realm_is_created(struct kvm *kvm)
>>> +{
>>> +    return kvm_is_realm(kvm) && kvm_realm_state(kvm) !=
>>> REALM_STATE_NONE;
>>> +}
>>> +
>>> +static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
>>> +{
>>> +    return false;
>>> +}
>>> +
>>>   #endif /* __ARM64_KVM_EMULATE_H__ */
>>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/
>>> asm/kvm_rmi.h
>>> index 4936007947fd..9de34983ee52 100644
>>> --- a/arch/arm64/include/asm/kvm_rmi.h
>>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>>> @@ -6,12 +6,63 @@
>>>   #ifndef __ASM_KVM_RMI_H
>>>   #define __ASM_KVM_RMI_H
>>>   +#include <asm/rmi_smc.h>
>>> +
>>> +/**
>>> + * enum realm_state - State of a Realm
>>> + */
>>> +enum realm_state {
>>> +    /**
>>> +     * @REALM_STATE_NONE:
>>> +     *      Realm has not yet been created. rmi_realm_create() has not
>>> +     *      yet been called.
>>> +     */
>>> +    REALM_STATE_NONE,
>>> +    /**
>>> +     * @REALM_STATE_NEW:
>>> +     *      Realm is under construction, rmi_realm_create() has been
>>> +     *      called, but it is not yet activated. Pages may be
>>> populated.
>>> +     */
>>> +    REALM_STATE_NEW,
>>> +    /**
>>> +     * @REALM_STATE_ACTIVE:
>>> +     *      Realm has been created and is eligible for execution with
>>> +     *      rmi_rec_enter(). Pages may no longer be populated with
>>> +     *      rmi_data_create().
>>> +     */
>>> +    REALM_STATE_ACTIVE,
>>> +    /**
>>> +     * @REALM_STATE_DYING:
>>> +     *      Realm is in the process of being destroyed or has
>>> already been
>>> +     *      destroyed.
>>> +     */
>>> +    REALM_STATE_DYING,
>>> +    /**
>>> +     * @REALM_STATE_DEAD:
>>> +     *      Realm has been destroyed.
>>> +     */
>>> +    REALM_STATE_DEAD
>>> +};
>>
>> What is the ABI status of this state? Is it purely internal to KVM? Or
>> is it something that the RMM actively tracks?
> 
> The states are in line with what the RMM maintains for the Realm state,
> (Section A2.2.5 Realm Lifecycle)
> except for :
> 
> 1. REALM_STATE_DYING is really a KVM internal state to indicate, we
> are in the process of destroying the Realm and no further requests
> needs to be serviced
> 
> 2. We don't track the REALM_SYSTEM_OFF, REALM_ZOMBIE states separately
> as we :
>  a) Always TERMINATE the Realm, just before the DESTROY
>  b) SYSTEM_OFF is naturally triggering the tear down path, leading to
> DYING.
> 

I'll add a comment:

+ * Mirrors the RMM's Realm lifecycle states where they are meaningful to KVM,
+ * with REALM_STATE_DYING being a KVM-internal state used to prevent further
+ * requests while teardown is in progress. KVM does not track REALM_SYSTEM_OFF
+ * or REALM_ZOMBIE separately as they naturally lead to teardown.

> 
> 
>>
>>> +
>>>   /**
>>>    * struct realm - Additional per VM data for a Realm
>>> + *
>>> + * @state: The lifetime state machine for the realm
>>> + * @rd: Kernel mapping of the Realm Descriptor (RD)
>>> + * @params: Parameters for the RMI_REALM_CREATE command
>>> + * @ia_bits: Number of valid Input Address bits in the IPA
>>>    */
>>>   struct realm {
>>> +    enum realm_state state;
>>> +    void *rd;
>>
>> Why is this void? Doesn't it have a proper type?
> 
> Not really. This is an object that RMM manages (Realm Descriptor)
> in the Realm world. We use it as a parameter to address the Realm.
> 
> 
>>
>>> +    struct realm_params *params;
>>> +    unsigned int ia_bits;
>>
>> Consider reordering this structure to avoid holes.

Sure

>>>   };
>>>     void kvm_init_rmi(void);
>>> +u32 kvm_realm_ipa_limit(void);
>>
>> The use of 'realm' is confusing. This is not a per-realm property, but
>> something global. I'd rather reserve the term 'realm' for CCA VMs (cue
>> the two prototypes below).
> 
> Agreed. Perhaps, kvm_rmm_ipa_limit() ?

Sounds good to me.

> 
>>
>>> +
>>> +int kvm_init_realm(struct kvm *kvm);
>>> +void kvm_destroy_realm(struct kvm *kvm);
>>>     #endif /* __ASM_KVM_RMI_H */
>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>> index 247e03b33035..18251e561524 100644
>>> --- a/arch/arm64/kvm/arm.c
>>> +++ b/arch/arm64/kvm/arm.c
>>> @@ -264,6 +264,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned
>>> long type)
>>>         bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES);
>>>   +    /* Initialise the realm bits after the generic bits are
>>> enabled */
>>> +    if (kvm_is_realm(kvm)) {
>>> +        ret = kvm_init_realm(kvm);
>>> +        if (ret)
>>> +            goto err_uninit_mmu;
>>> +    }
>>> +
>>>       return 0;
>>>     err_uninit_mmu:
>>> @@ -326,6 +333,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>>>       kvm_unshare_hyp(kvm, kvm + 1);
>>>         kvm_arm_teardown_hypercalls(kvm);
>>> +    if (kvm_is_realm(kvm))
>>> +        kvm_destroy_realm(kvm);
>>>   }
>>>     static bool kvm_has_full_ptr_auth(void)
>>> @@ -486,6 +495,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm,
>>> long ext)
>>>           else
>>>               r = kvm_supports_cacheable_pfnmap();
>>>           break;
>>> +    case KVM_CAP_ARM_RMI:
>>> +        r = static_key_enabled(&kvm_rmi_is_available);
>>> +        break;
>>>         default:
>>>           r = 0;
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index d089c107d9b7..ba8286472286 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -877,10 +877,14 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>>>     static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned
>>> long type)
>>>   {
>>> +    struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>>>       u32 kvm_ipa_limit = get_kvm_ipa_limit();
>>>       u64 mmfr0, mmfr1;
>>>       u32 phys_shift;
>>>   +    if (kvm_is_realm(kvm))
>>> +        kvm_ipa_limit = kvm_realm_ipa_limit();
>>> +
>>>       phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
>>>       if (is_protected_kvm_enabled()) {
>>>           phys_shift = kvm_ipa_limit;
>>> @@ -974,6 +978,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct
>>> kvm_s2_mmu *mmu, unsigned long t
>>>           return -EINVAL;
>>>       }
>>>   +    mmu->arch = &kvm->arch;
>>> +
>>>       err = kvm_init_ipa_range(mmu, type);
>>>       if (err)
>>>           return err;
>>> @@ -982,7 +988,6 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct
>>> kvm_s2_mmu *mmu, unsigned long t
>>>       if (!pgt)
>>>           return -ENOMEM;
>>>   -    mmu->arch = &kvm->arch;
>>
>> Why moving this init?
> 
> Because, we need to know the "kvm" instance for kvm_init_ipa_range to
> detect the limit that applies to Realms.
> 
>>
>>>       err = KVM_PGT_FN(kvm_pgtable_stage2_init)(pgt, mmu,
>>> &kvm_s2_mm_ops);
>>>       if (err)
>>>           goto out_free_pgtable;
>>> @@ -1114,7 +1119,10 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>>>       write_unlock(&kvm->mmu_lock);
>>>         if (pgt) {
>>> -        kvm_stage2_destroy(pgt);
>>> +        if (!kvm_is_realm(kvm))
>>> +            kvm_stage2_destroy(pgt);
>>> +        else
>>> +            kvm_pgtable_stage2_destroy_pgd(pgt);
>>
>> Why can't you make kvm_stage2_destroy() do the right thing? Surely the
>> PTs have to be reclaimed one way or another.
> 
> Actually yes, we could make it work. We need to skip walking the page
> table for Realms. We may be able to do the checks via pgt->mmu->arch-
>>kvm and skip the walking for Realms. ( The S2 is unmapped and torn
> down before the RD is destroyed in kvm_destroy_realm(). We can't
> rely on the contents of the PGDs to be zero - e.g., with MEC.)

Yes I'll move the check into kvm_stage2_destroy() instead with a comment
explaining what's going on.

>>
>>>           kfree(pgt);
>>>       }
>>>   }
>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>> index 6e28b669ded2..f51ec667445e 100644
>>> --- a/arch/arm64/kvm/rmi.c
>>> +++ b/arch/arm64/kvm/rmi.c
>>> @@ -5,6 +5,8 @@
>>>     #include <linux/kvm_host.h>
>>>   +#include <asm/kvm_emulate.h>
>>> +#include <asm/kvm_mmu.h>
>>>   #include <asm/kvm_pgtable.h>
>>>   #include <asm/rmi_cmds.h>
>>>   #include <asm/virt.h>
>>> @@ -14,6 +16,61 @@ static bool rmi_has_feature(unsigned long feature)
>>>       return !!u64_get_bits(rmm_feat_reg0, feature);
>>>   }
>>>   +u32 kvm_realm_ipa_limit(void)
>>> +{
>>> +    return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
>>> +}
>>> +
>>> +void kvm_destroy_realm(struct kvm *kvm)
>>> +{
>>> +    struct realm *realm = &kvm->arch.realm;
>>> +    size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
>>> +
>>> +    if (realm->params) {
>>> +        free_page((unsigned long)realm->params);
>>> +        realm->params = NULL;
>>> +    }
>>> +
>>> +    if (!kvm_realm_is_created(kvm))
>>> +        return;
>>> +
>>> +    kvm_set_realm_state(kvm, REALM_STATE_DYING);
>>> +
>>> +    write_lock(&kvm->mmu_lock);
>>> +    kvm_stage2_unmap_range(&kvm->arch.mmu, 0,
>>> +                   BIT(realm->ia_bits - 1), true);
>>> +    write_unlock(&kvm->mmu_lock);
>>> +
>>> +    if (realm->rd) {
>>> +        phys_addr_t rd_phys = virt_to_phys(realm->rd);
>>> +
>>> +        if (WARN_ON(rmi_realm_terminate(rd_phys)))
>>> +            return;
>>> +
>>> +        if (WARN_ON(rmi_realm_destroy(rd_phys)))
>>> +            return;
>>> +        free_delegated_page(rd_phys);
>>> +        realm->rd = NULL;
>>> +    }
>>> +
>>> +    if (WARN_ON(rmi_undelegate_range(kvm->arch.mmu.pgd_phys,
>>> pgd_size)))
>>> +        return;
>>> +
>>> +    kvm_set_realm_state(kvm, REALM_STATE_DEAD);
>>> +
>>> +    /* Now that the Realm is destroyed, free the entry level RTTs */
>>> +    kvm_free_stage2_pgd(&kvm->arch.mmu);
>>> +}
>>
>> This really needs documentation: what happens at each stage? What
>> memory is reclaimed when?
> 
> Agreed.
> 
>>
>> But even more importantly, why is this built in a completely parallel
>> way, potentially deviating from the existing KVM S2 management?
> 
> 
> RMM requires a Realm is not live at the time of REALM_DESTROY.
> (See section A2.2.4 Realm Liveness).
> i.e., All RECs are destroyed, Root RTTs wiped clean (no live mappings)
> before the RD is destroyed. So, we need to make sure all of this is
> done at Realm Destroy. Hence we delay the kvm_free_stage2_pgd() until
> we destroy the RD.
> 
> Does that help? May be we could improve the comments around it.

I'll add a comment in kvm_destroy_realm().

Thanks,
Steve

> 
> Suzuki
> 
> 
> 
>> Thanks,>
>>     M.
>>
> 


^ permalink raw reply

* Re: [PATCH v7 20/42] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Suzuki K Poulose @ 2026-06-04 15:29 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-20-2f0fae496530@google.com>

On 23/05/2026 01:18, Ackerley Tng via B4 Relay wrote:
> From: Michael Roth <michael.roth@amd.com>
> 
> For vm_memory_attributes=1, in-place conversion/population is not
> supported, so the initial contents necessarily must need to come
> from a separate src address, which is enforced by the current
> implementation. However, for vm_memory_attributes=0, it is possible for
> guest memory to be initialized directly from userspace by mmap()'ing the
> guest_memfd and writing to it while the corresponding GPA ranges are in
> a 'shared' state before converting them to the 'private' state expected
> by KVM_SEV_SNP_LAUNCH_UPDATE.
> 
> Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
> for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
> SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
> copy in data from a separate memory location. Continue to enforce
> non-NULL for the original vm_memory_attributes=1 case.
> 
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> [Added src_page check in error handling path when the firmware command fails]
> [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>




> ---
>   Documentation/virt/kvm/x86/amd-memory-encryption.rst | 15 +++++++++++----
>   arch/x86/kvm/svm/sev.c                               | 18 +++++++++++++-----
>   virt/kvm/kvm_main.c                                  |  1 +
>   3 files changed, 25 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index b2395dd4769de..43085f65b2d85 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -503,7 +503,8 @@ secrets.
>   
>   It is required that the GPA ranges initialized by this command have had the
>   KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
> -for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
> +for KVM_SET_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES2 for more details on
> +this aspect.
>   
>   Upon success, this command is not guaranteed to have processed the entire
>   range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> @@ -511,9 +512,15 @@ range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
>   remaining range that has yet to be processed. The caller should continue
>   calling this command until those fields indicate the entire range has been
>   processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
> -range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
> -buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
> -``uaddr`` will be ignored completely.
> +range plus 1, and ``uaddr`` (if specified) is the last byte of the
> +userspace-provided source buffer address plus 1.
> +
> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
> +ignored completely. Otherwise, ``uaddr`` is required if
> +kvm.vm_memory_attributes=1 and optional if kvm.vm_memory_attributes=0, since
> +in the latter case guest memory can be initialized directly from userspace
> +prior to converting it to private and passing the GPA range on to this
> +interface.

Just to confirm, so the sev_gmem_prepare doesn't destroy the contents in 
the process of making it "private" ? i.e., the contents of a SNP shared
page are preserved while transitioning to "SNP Private" (via RMP
update).

Suzuki



>   
>   Parameters (in): struct  kvm_sev_snp_launch_update
>   
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 1a361f08c7a3d..e1dbc827c2807 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2343,7 +2343,15 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>   	int level;
>   	int ret;
>   
> -	if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
> +	/*
> +	 * For vm_memory_attributes=1, in-place conversion/population is not
> +	 * supported, so the initial contents necessarily need to come from a
> +	 * separate src address. For vm_memory_attributes=0, this isn't
> +	 * necessarily the case, since the pages may have been populated
> +	 * directly from userspace before calling KVM_SEV_SNP_LAUNCH_UPDATE.
> +	 */
> +	if (vm_memory_attributes &&
> +	    sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page)
>   		return -EINVAL;
>   
>   	ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);
> @@ -2390,7 +2398,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>   	 */
>   	if (ret && !snp_page_reclaim(kvm, pfn) &&
>   	    sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
> -	    sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) {
> +	    sev_populate_args->fw_error == SEV_RET_INVALID_PARAM && src_page) {
>   		void *src_vaddr = kmap_local_page(src_page);
>   		void *dst_vaddr = kmap_local_pfn(pfn);
>   
> @@ -2423,8 +2431,8 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
>   	if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
>   		return -EFAULT;
>   
> -	pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d\n", __func__,
> -		 params.gfn_start, params.len, params.type, params.flags);
> +	pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d src %llx\n", __func__,
> +		 params.gfn_start, params.len, params.type, params.flags, params.uaddr);
>   
>   	if (!params.len || !PAGE_ALIGNED(params.len) || params.flags ||
>   	    (params.type != KVM_SEV_SNP_PAGE_TYPE_NORMAL &&
> @@ -2481,7 +2489,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
>   
>   	params.gfn_start += count;
>   	params.len -= count * PAGE_SIZE;
> -	if (params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> +	if (src && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
>   		params.uaddr += count * PAGE_SIZE;
>   
>   	if (copy_to_user(u64_to_user_ptr(argp->data), &params, sizeof(params)))
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ba195bb239aaa..3bf212fd99193 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -105,6 +105,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
>   #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>   bool vm_memory_attributes = true;
>   module_param(vm_memory_attributes, bool, 0444);
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(vm_memory_attributes);
>   #endif
>   DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
>   EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> 


^ permalink raw reply

* Re: [PATCH v14 13/44] arm64: RMI: Define the user ABI
From: Steven Price @ 2026-06-04 15:27 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <86jysovpxf.wl-maz@kernel.org>

On 27/05/2026 16:21, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:21 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> There is one CAP which identified the presence of CCA, and one ioctl.
>> The ioctl is used to populate memory during creation of the realm as
>> this requires the RMM to copy data from an unprotected address to the
>> protected memory - CCA does not support memory conversion where the
>> memory contents is preserved as this is incompatible with memory
>> encryption.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>  * KVM_ARM_VCPU_RMI_PSCI_COMPLETE removed.
>>  * KVM_ARM_RMI_POPULATE documentation updated to reflect that the
>>    structure is written by the kernel.
>>  * CAP number bumped.
>> Changes since v12:
>>  * Change KVM_ARM_RMI_POPULATE to update the structure with the amount
>>    that has been progressed rather than return the number of bytes
>>    populated.
>>  * Describe the flag KVM_ARM_RMI_POPULATE_FLAGS_MEASURE.
>>  * CAP number is bumped.
>>  * NOTE: The PSCI ioctl may be removed in a future spec release.
>> Changes since v11:
>>  * Completely reworked to be more implicit. Rather than having explicit
>>    CAP operations to progress the realm construction these operations
>>    are done when needed (on populating and on first vCPU run).
>>  * Populate and PSCI complete are promoted to proper ioctls.
>> Changes since v10:
>>  * Rename symbols from RME to RMI.
>> Changes since v9:
>>  * Improvements to documentation.
>>  * Bump the magic number for KVM_CAP_ARM_RME to avoid conflicts.
>> Changes since v8:
>>  * Minor improvements to documentation following review.
>>  * Bump the magic numbers to avoid conflicts.
>> Changes since v7:
>>  * Add documentation of new ioctls
>>  * Bump the magic numbers to avoid conflicts
>> Changes since v6:
>>  * Rename some of the symbols to make their usage clearer and avoid
>>    repetition.
>> Changes from v5:
>>  * Actually expose the new VCPU capability (KVM_ARM_VCPU_REC) by bumping
>>    KVM_VCPU_MAX_FEATURES - note this also exposes KVM_ARM_VCPU_HAS_EL2!
>> ---
>>  Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/kvm.h       | 13 +++++++++++
>>  2 files changed, 53 insertions(+)
> 
> $SUBJECT looks wrong. This is a KVM change, not an RMI change.

Ah, true I guess "KVM: arm64: Define the user ABI for CCA" is more accurate.

>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 52bbbb553ce1..ca68aae7faa2 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6553,6 +6553,37 @@ KVM_S390_KEYOP_SSKE
>>    Sets the storage key for the guest address ``guest_addr`` to the key
>>    specified in ``key``, returning the previous value in ``key``.
>>  
>> +4.145 KVM_ARM_RMI_POPULATE
>> +--------------------------
>> +
>> +:Capability: KVM_CAP_ARM_RMI
>> +:Architectures: arm64
>> +:Type: vm ioctl
>> +:Parameters: struct kvm_arm_rmi_populate (in/out)
>> +:Returns: 0 on success, < 0 on error
>> +
>> +::
>> +
>> +  struct kvm_arm_rmi_populate {
>> +	__u64 base;
>> +	__u64 size;
>> +	__u64 source_uaddr;
>> +	__u32 flags;
>> +	__u32 reserved;
>> +  };
>> +
>> +Populate a region of protected address space by copying the data from the
>> +(non-protected) user space pointer provided into a protected region (backed by
>> +guestmem_fd). It implicitly sets the destination region to RIPAS RAM. This is
>> +only valid before any VCPUs have been run. The ioctl might not populate the
>> +entire region and in this case the kernel updates the fields `base`, `size` and
>> +`source_uaddr`. User space may have to repeatedly call it until `size` is 0 to
>> +populate the entire region.
>> +
>> +`flags` can be set to `KVM_ARM_RMI_POPULATE_FLAGS_MEASURE` to request that the
>> +populated data is hashed and added to the guest's Realm Initial Measurement
>> +(RIM).
> 
> Where is that measurement stored? And retrieved? At least a pointer to
> that would help.

It's stored within the RMM and retrieved by the guest (using the RSI
interface). I'll update to:

`flags` can be set to `KVM_ARM_RMI_POPULATE_FLAGS_MEASURE` to request
that the populated data is hashed and added to the guest's Realm Initial
Measurement (RIM) stored by the RMM. This can then be retrieved by the
guest (using the RSI interface) to present to an attestation server.

Thanks,

Steve

>> +
>>  .. _kvm_run:
>>  
>>  5. The kvm_run structure
>> @@ -8904,6 +8935,15 @@ helpful if user space wants to emulate instructions which are not
>>  This capability can be enabled dynamically even if VCPUs were already
>>  created and are running.
>>  
>> +7.47 KVM_CAP_ARM_RMI
>> +--------------------
>> +
>> +:Architectures: arm64
>> +:Target: VM
>> +:Parameters: None
>> +
>> +This capability indicates that support for CCA realms is available.
>> +
>>  8. Other capabilities.
>>  ======================
>>  
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 6c8afa2047bf..b8cff0938041 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -996,6 +996,7 @@ struct kvm_enable_cap {
>>  #define KVM_CAP_S390_USER_OPEREXEC 246
>>  #define KVM_CAP_S390_KEYOP 247
>>  #define KVM_CAP_S390_VSIE_ESAMODE 248
>> +#define KVM_CAP_ARM_RMI 249
>>  
>>  struct kvm_irq_routing_irqchip {
>>  	__u32 irqchip;
>> @@ -1669,4 +1670,16 @@ struct kvm_pre_fault_memory {
>>  	__u64 padding[5];
>>  };
>>  
>> +/* Available with KVM_CAP_ARM_RMI, only for VMs with KVM_VM_TYPE_ARM_REALM */
>> +#define KVM_ARM_RMI_POPULATE	_IOWR(KVMIO, 0xd7, struct kvm_arm_rmi_populate)
>> +#define KVM_ARM_RMI_POPULATE_FLAGS_MEASURE	(1 << 0)
>> +
>> +struct kvm_arm_rmi_populate {
>> +	__u64 base;
>> +	__u64 size;
>> +	__u64 source_uaddr;
>> +	__u32 flags;
>> +	__u32 reserved;
>> +};
>> +
>>  #endif /* __LINUX_KVM_H */
> 
> Thanks,
> 
> 	M.
> 


^ permalink raw reply

* Re: [PATCH v14 13/44] arm64: RMI: Define the user ABI
From: Steven Price @ 2026-06-04 15:27 UTC (permalink / raw)
  To: Wei-Lin Chang, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, Lorenzo.Pieralisi2
In-Reply-To: <vixej2afqamvr7eaksoepr3wmpwzp73rtgrageeyuawa6azotw@payxzyhymwbt>

On 26/05/2026 23:17, Wei-Lin Chang wrote:
> On Wed, May 13, 2026 at 02:17:21PM +0100, Steven Price wrote:
>> There is one CAP which identified the presence of CCA, and one ioctl.
>> The ioctl is used to populate memory during creation of the realm as
>> this requires the RMM to copy data from an unprotected address to the
>> protected memory - CCA does not support memory conversion where the
>> memory contents is preserved as this is incompatible with memory
>> encryption.
> 
> Nit:
> I believe spelling out the CAP and ioctl names can improve the commit
> message. Also "memory conversion" is a little vague, maybe
> 
> ... CCA does not support shared <-> private memory conversion where ...
> 
> would make this clearer?

Thanks for the suggestions - yes I agree that would make it clearer.

Thanks,
Steve

> Thanks,
> Wei-Lin Chang
> 
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>  * KVM_ARM_VCPU_RMI_PSCI_COMPLETE removed.
>>  * KVM_ARM_RMI_POPULATE documentation updated to reflect that the
>>    structure is written by the kernel.
>>  * CAP number bumped.
>> Changes since v12:
>>  * Change KVM_ARM_RMI_POPULATE to update the structure with the amount
>>    that has been progressed rather than return the number of bytes
>>    populated.
>>  * Describe the flag KVM_ARM_RMI_POPULATE_FLAGS_MEASURE.
>>  * CAP number is bumped.
>>  * NOTE: The PSCI ioctl may be removed in a future spec release.
>> Changes since v11:
>>  * Completely reworked to be more implicit. Rather than having explicit
>>    CAP operations to progress the realm construction these operations
>>    are done when needed (on populating and on first vCPU run).
>>  * Populate and PSCI complete are promoted to proper ioctls.
>> Changes since v10:
>>  * Rename symbols from RME to RMI.
>> Changes since v9:
>>  * Improvements to documentation.
>>  * Bump the magic number for KVM_CAP_ARM_RME to avoid conflicts.
>> Changes since v8:
>>  * Minor improvements to documentation following review.
>>  * Bump the magic numbers to avoid conflicts.
>> Changes since v7:
>>  * Add documentation of new ioctls
>>  * Bump the magic numbers to avoid conflicts
>> Changes since v6:
>>  * Rename some of the symbols to make their usage clearer and avoid
>>    repetition.
>> Changes from v5:
>>  * Actually expose the new VCPU capability (KVM_ARM_VCPU_REC) by bumping
>>    KVM_VCPU_MAX_FEATURES - note this also exposes KVM_ARM_VCPU_HAS_EL2!
>> ---
>>  Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/kvm.h       | 13 +++++++++++
>>  2 files changed, 53 insertions(+)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 52bbbb553ce1..ca68aae7faa2 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6553,6 +6553,37 @@ KVM_S390_KEYOP_SSKE
>>    Sets the storage key for the guest address ``guest_addr`` to the key
>>    specified in ``key``, returning the previous value in ``key``.
>>  
>> +4.145 KVM_ARM_RMI_POPULATE
>> +--------------------------
>> +
>> +:Capability: KVM_CAP_ARM_RMI
>> +:Architectures: arm64
>> +:Type: vm ioctl
>> +:Parameters: struct kvm_arm_rmi_populate (in/out)
>> +:Returns: 0 on success, < 0 on error
>> +
>> +::
>> +
>> +  struct kvm_arm_rmi_populate {
>> +	__u64 base;
>> +	__u64 size;
>> +	__u64 source_uaddr;
>> +	__u32 flags;
>> +	__u32 reserved;
>> +  };
>> +
>> +Populate a region of protected address space by copying the data from the
>> +(non-protected) user space pointer provided into a protected region (backed by
>> +guestmem_fd). It implicitly sets the destination region to RIPAS RAM. This is
>> +only valid before any VCPUs have been run. The ioctl might not populate the
>> +entire region and in this case the kernel updates the fields `base`, `size` and
>> +`source_uaddr`. User space may have to repeatedly call it until `size` is 0 to
>> +populate the entire region.
>> +
>> +`flags` can be set to `KVM_ARM_RMI_POPULATE_FLAGS_MEASURE` to request that the
>> +populated data is hashed and added to the guest's Realm Initial Measurement
>> +(RIM).
>> +
>>  .. _kvm_run:
>>  
>>  5. The kvm_run structure
>> @@ -8904,6 +8935,15 @@ helpful if user space wants to emulate instructions which are not
>>  This capability can be enabled dynamically even if VCPUs were already
>>  created and are running.
>>  
>> +7.47 KVM_CAP_ARM_RMI
>> +--------------------
>> +
>> +:Architectures: arm64
>> +:Target: VM
>> +:Parameters: None
>> +
>> +This capability indicates that support for CCA realms is available.
>> +
>>  8. Other capabilities.
>>  ======================
>>  
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 6c8afa2047bf..b8cff0938041 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -996,6 +996,7 @@ struct kvm_enable_cap {
>>  #define KVM_CAP_S390_USER_OPEREXEC 246
>>  #define KVM_CAP_S390_KEYOP 247
>>  #define KVM_CAP_S390_VSIE_ESAMODE 248
>> +#define KVM_CAP_ARM_RMI 249
>>  
>>  struct kvm_irq_routing_irqchip {
>>  	__u32 irqchip;
>> @@ -1669,4 +1670,16 @@ struct kvm_pre_fault_memory {
>>  	__u64 padding[5];
>>  };
>>  
>> +/* Available with KVM_CAP_ARM_RMI, only for VMs with KVM_VM_TYPE_ARM_REALM */
>> +#define KVM_ARM_RMI_POPULATE	_IOWR(KVMIO, 0xd7, struct kvm_arm_rmi_populate)
>> +#define KVM_ARM_RMI_POPULATE_FLAGS_MEASURE	(1 << 0)
>> +
>> +struct kvm_arm_rmi_populate {
>> +	__u64 base;
>> +	__u64 size;
>> +	__u64 source_uaddr;
>> +	__u32 flags;
>> +	__u32 reserved;
>> +};
>> +
>>  #endif /* __LINUX_KVM_H */
>> -- 
>> 2.43.0
>>


^ permalink raw reply

* Re: [PATCH v14 10/44] arm64: RMI: Add support for SRO
From: Steven Price @ 2026-06-04 15:19 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <864ik0x22q.wl-maz@kernel.org>

On 21/05/2026 15:35, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:18 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> RMM v2.0 introduces the concept of "Stateful RMI Operations" (SRO). This
>> means that an SMC can return with an operation still in progress. The
>> host is excepted to continue the operation until is reaches a conclusion
>> (either success or failure). During this process the RMM can request
>> additional memory ('donate') or hand memory back to the host
>> ('reclaim'). The host can request an in progress operation is cancelled,
>> but still continue the operation until it has completed (otherwise the
>> incomplete operation may cause future RMM operations to fail).
>>
>> The SRO is tracked using a struct rmi_sro_state object which keeps track
>> of any memory which has been allocated but not yet consumed by the RMM
>> or reclaimed from the RMM. This allows the memory to be reused in a
>> future request within the same operation. It will also permit an
>> operation to be done in a context where memory allocation may be
>> difficult (e.g. atomic context) with the option to abort the operation
>> and retry the memory allocation outside of the atomic context. The
>> memory stored in the struct rmi_sro_state object can then be reused on
>> the subsequent attempt.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> v14:
>>  * SRO support has improved although is still not fully complete. The
>>    infrastructure has been moved out of KVM.
>> ---
>>  arch/arm64/include/asm/rmi_cmds.h |   1 +
>>  arch/arm64/kernel/rmi.c           | 359 ++++++++++++++++++++++++++++++
>>  2 files changed, 360 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/rmi_cmds.h b/arch/arm64/include/asm/rmi_cmds.h
>> index eb213c8e6f26..1a7b0c8f1e38 100644
>> --- a/arch/arm64/include/asm/rmi_cmds.h
>> +++ b/arch/arm64/include/asm/rmi_cmds.h
>> @@ -35,6 +35,7 @@ struct rmi_sro_state {
>>  
>>  int rmi_delegate_range(phys_addr_t phys, unsigned long size);
>>  int rmi_undelegate_range(phys_addr_t phys, unsigned long size);
>> +int free_delegated_page(phys_addr_t phys);
>>  
>>  static inline int rmi_delegate_page(phys_addr_t phys)
>>  {
>> diff --git a/arch/arm64/kernel/rmi.c b/arch/arm64/kernel/rmi.c
>> index 08cef54acadb..a8107ca9bb6d 100644
>> --- a/arch/arm64/kernel/rmi.c
>> +++ b/arch/arm64/kernel/rmi.c
>> @@ -48,6 +48,365 @@ int rmi_undelegate_range(phys_addr_t phys, unsigned long size)
>>  	return ret;
>>  }
>>  
>> +static unsigned long donate_req_to_size(unsigned long donatereq)
>> +{
>> +	unsigned long unit_size = RMI_DONATE_SIZE(donatereq);
>> +
>> +	switch (unit_size) {
>> +	case 0:
>> +		return PAGE_SIZE;
>> +	case 1:
>> +		return PMD_SIZE;
>> +	case 2:
>> +		return PUD_SIZE;
>> +	case 3:
>> +		return P4D_SIZE;
> 
> How does this work when we have folded levels? If this is supposed to
> be the architected size, then it should actively express that:
> 
> 	return BIT(unit_size * (PAGE_SHIFT - 3) + PAGE_SHIFT);

It doesn't work (as Gavin also pointed out). There's an existing macro
to make this even cleaner:

return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(3 - unit_size));

>> +	}
>> +	unreachable();
>> +}
>> +
>> +static void rmi_smccc_invoke(struct arm_smccc_1_2_regs *regs_in,
>> +			     struct arm_smccc_1_2_regs *regs_out)
>> +{
>> +	struct arm_smccc_1_2_regs regs = *regs_in;
>> +	unsigned long status;
>> +
>> +	do {
>> +		arm_smccc_1_2_invoke(&regs, regs_out);
>> +		status = RMI_RETURN_STATUS(regs_out->a0);
>> +	} while (status == RMI_BUSY || status == RMI_BLOCKED);
>> +}
>> +
>> +int free_delegated_page(phys_addr_t phys)
>> +{
>> +	if (WARN_ON(rmi_undelegate_page(phys))) {
> 
> Please drop this WARN_ON(). Or at least make it ONCE. Everywhere.

Happy to change to WARN_ON_ONCE(). I think we should keep a WARN of some
sort as this is causing Linux to leak pages - it's definitely something
the sysadmin would want to know about.

>> +		/* Undelegate failed: leak the page */
>> +		return -EBUSY;
>> +	}
>> +
>> +	free_page((unsigned long)phys_to_virt(phys));
>> +
>> +	return 0;
>> +}
>> +
>> +static int rmi_sro_ensure_capacity(struct rmi_sro_state *sro,
>> +				   unsigned long count)
>> +{
>> +	if (WARN_ON_ONCE(sro->addr_count > RMI_MAX_ADDR_LIST))
>> +		return -EOVERFLOW;
>> +
>> +	if (count > RMI_MAX_ADDR_LIST - sro->addr_count)
>> +		return -ENOSPC;
>> +
>> +	return 0;
>> +}
>> +
>> +static int rmi_sro_donate_contig(struct rmi_sro_state *sro,
>> +				 unsigned long sro_handle,
>> +				 unsigned long donatereq,
>> +				 struct arm_smccc_1_2_regs *out_regs,
>> +				 gfp_t gfp)
>> +{
>> +	unsigned long unit_size = RMI_DONATE_SIZE(donatereq);
>> +	unsigned long unit_size_bytes = donate_req_to_size(donatereq);
>> +	unsigned long count = RMI_DONATE_COUNT(donatereq);
>> +	unsigned long state = RMI_DONATE_STATE(donatereq);
>> +	unsigned long size = unit_size_bytes * count;
>> +	unsigned long addr_range;
>> +	int ret;
>> +	void *virt;
>> +	phys_addr_t phys;
>> +	struct arm_smccc_1_2_regs regs = {
>> +		SMC_RMI_OP_MEM_DONATE,
>> +		sro_handle
>> +	};
>> +
>> +	for (int i = 0; i < sro->addr_count; i++) {
>> +		unsigned long entry = sro->addr_list[i];
>> +
>> +		if (RMI_ADDR_RANGE_SIZE(entry) == unit_size &&
>> +		    RMI_ADDR_RANGE_COUNT(entry) == count &&
>> +		    RMI_ADDR_RANGE_STATE(entry) == state) {
>> +			sro->addr_count--;
>> +			swap(sro->addr_list[sro->addr_count],
>> +			     sro->addr_list[i]);
>> +
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	ret = rmi_sro_ensure_capacity(sro, 1);
>> +	if (ret)
>> +		return ret;
>> +
>> +	virt = alloc_pages_exact(size, gfp);
>> +	if (!virt)
>> +		return -ENOMEM;
>> +	phys = virt_to_phys(virt);
>> +
>> +	if (state == RMI_OP_MEM_DELEGATED) {
>> +		if (rmi_delegate_range(phys, size)) {
>> +			free_pages_exact(virt, size);
>> +			return -ENXIO;
>> +		}
>> +	}
>> +
>> +	addr_range = phys & RMI_ADDR_RANGE_ADDR_MASK;
>> +	FIELD_MODIFY(RMI_ADDR_RANGE_SIZE_MASK, &addr_range, unit_size);
>> +	FIELD_MODIFY(RMI_ADDR_RANGE_COUNT_MASK, &addr_range, count);
>> +	FIELD_MODIFY(RMI_ADDR_RANGE_STATE_MASK, &addr_range, state);
>> +
>> +	sro->addr_list[sro->addr_count] = addr_range;
>> +
> 
> Shouldn't this be moved to a helper that ensures capacity, and returns
> an error otherwise?

I'm not sure quite what you are suggesting. I already have a
rmi_sro_ensure_capacity() helper. By this point we know there's space.

>> +out:
>> +	regs.a2 = virt_to_phys(&sro->addr_list[sro->addr_count]);
>> +	regs.a3 = 1;
> 
> This could really do with context specific helpers that populate regs
> based on a set of parameters. I have no idea what this 1 here is, and
> the init is spread over too much code. Think of the children!
> 
> That's valid for the whole patch.

That's a good point. SRO is a bit tricky because I wanted the actual SMC
call to be done in one place so we can handle all the RMI_INCOMPLETE
cases together. But I could certainly add some helpers to setup the
registers rather than assigning directly to regs.a<n>.

Thanks,
Steve

> 	M.
>> +	rmi_smccc_invoke(&regs, out_regs);
>> +
>> +	unsigned long donated_granules = out_regs->a1;
>> +	unsigned long donated_size = donated_granules << PAGE_SHIFT;
>> +
>> +	if (donated_granules == 0) {
>> +		/* No pages used by the RMM */
>> +		sro->addr_count++;
>> +	} else if (donated_size < size) {
>> +		phys = sro->addr_list[sro->addr_count] & RMI_ADDR_RANGE_ADDR_MASK;
>> +
>> +		/* Not all granules used by the RMM, free the remaining pages */
>> +		for (long i = donated_size; i < size; i += PAGE_SIZE) {
>> +			if (state == RMI_OP_MEM_DELEGATED)
>> +				free_delegated_page(phys + i);
>> +			else
>> +				__free_page(phys_to_page(phys + i));
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int rmi_sro_donate_noncontig(struct rmi_sro_state *sro,
>> +				    unsigned long sro_handle,
>> +				    unsigned long donatereq,
>> +				    struct arm_smccc_1_2_regs *out_regs,
>> +				    gfp_t gfp)
>> +{
>> +	unsigned long unit_size = RMI_DONATE_SIZE(donatereq);
>> +	unsigned long unit_size_bytes = donate_req_to_size(donatereq);
>> +	unsigned long count = RMI_DONATE_COUNT(donatereq);
>> +	unsigned long state = RMI_DONATE_STATE(donatereq);
>> +	unsigned long found = 0;
>> +	unsigned long addr_list_start = sro->addr_count;
>> +	int ret;
>> +	struct arm_smccc_1_2_regs regs = {
>> +		SMC_RMI_OP_MEM_DONATE,
>> +		sro_handle
>> +	};
>> +
>> +	for (int i = 0; i < addr_list_start && found < count; i++) {
>> +		unsigned long entry = sro->addr_list[i];
>> +
>> +		if (RMI_ADDR_RANGE_SIZE(entry) == unit_size &&
>> +		    RMI_ADDR_RANGE_COUNT(entry) == 1 &&
>> +		    RMI_ADDR_RANGE_STATE(entry) == state) {
>> +			addr_list_start--;
>> +			swap(sro->addr_list[addr_list_start],
>> +			     sro->addr_list[i]);
>> +			found++;
>> +			i--;
>> +		}
>> +	}
>> +
>> +	ret = rmi_sro_ensure_capacity(sro, count - found);
>> +	if (ret)
>> +		return ret;
>> +
>> +	while (found < count) {
>> +		unsigned long addr_range;
>> +		void *virt = alloc_pages_exact(unit_size_bytes, gfp);
>> +		phys_addr_t phys;
>> +
>> +		if (!virt)
>> +			return -ENOMEM;
>> +
>> +		phys = virt_to_phys(virt);
>> +
>> +		if (state == RMI_OP_MEM_DELEGATED) {
>> +			if (rmi_delegate_range(phys, unit_size_bytes)) {
>> +				free_pages_exact(virt, unit_size_bytes);
>> +				return -ENXIO;
>> +			}
>> +		}
>> +
>> +		addr_range = phys & RMI_ADDR_RANGE_ADDR_MASK;
>> +		FIELD_MODIFY(RMI_ADDR_RANGE_SIZE_MASK, &addr_range, unit_size);
>> +		FIELD_MODIFY(RMI_ADDR_RANGE_COUNT_MASK, &addr_range, 1);
>> +		FIELD_MODIFY(RMI_ADDR_RANGE_STATE_MASK, &addr_range, state);
>> +
>> +		sro->addr_list[sro->addr_count++] = addr_range;
>> +		found++;
>> +	}
>> +
>> +	regs.a2 = virt_to_phys(&sro->addr_list[addr_list_start]);
>> +	regs.a3 = found;
>> +	rmi_smccc_invoke(&regs, out_regs);
>> +
>> +	unsigned long donated_granules = out_regs->a1;
>> +
>> +	if (WARN_ON(donated_granules & ((unit_size_bytes >> PAGE_SHIFT) - 1))) {
>> +		/*
>> +		 * FIXME: RMM has only consumed part of a huge page, this leaks
>> +		 * the rest of the huge page
>> +		 */
>> +		donated_granules = ALIGN(donated_granules,
>> +					 (unit_size_bytes >> PAGE_SHIFT));
>> +	}
>> +	unsigned long donated_blocks = donated_granules / (unit_size_bytes >> PAGE_SHIFT);
>> +
>> +	if (WARN_ON(donated_blocks > found))
>> +		donated_blocks = found;
>> +
>> +	unsigned long undonated_blocks = found - donated_blocks;
>> +
>> +	while (donated_blocks && undonated_blocks) {
>> +		sro->addr_count--;
>> +		swap(sro->addr_list[addr_list_start],
>> +		     sro->addr_list[sro->addr_count]);
>> +		addr_list_start++;
>> +
>> +		donated_blocks--;
>> +		undonated_blocks--;
>> +	}
>> +	sro->addr_count -= donated_blocks;
>> +
>> +	return 0;
>> +}
>> +
>> +static int rmi_sro_donate(struct rmi_sro_state *sro,
>> +			  unsigned long sro_handle,
>> +			  unsigned long donatereq,
>> +			  struct arm_smccc_1_2_regs *regs,
>> +			  gfp_t gfp)
>> +{
>> +	unsigned long count = RMI_DONATE_COUNT(donatereq);
>> +
>> +	if (WARN_ON(!count))
>> +		return 0;
>> +
>> +	if (RMI_DONATE_CONTIG(donatereq)) {
>> +		return rmi_sro_donate_contig(sro, sro_handle, donatereq,
>> +					     regs, gfp);
>> +	} else {
>> +		return rmi_sro_donate_noncontig(sro, sro_handle, donatereq,
>> +						regs, gfp);
>> +	}
>> +}
>> +
>> +static int rmi_sro_reclaim(struct rmi_sro_state *sro,
>> +			   unsigned long sro_handle,
>> +			   struct arm_smccc_1_2_regs *out_regs)
>> +{
>> +	unsigned long capacity;
>> +	struct arm_smccc_1_2_regs regs;
>> +	int ret;
>> +
>> +	ret = rmi_sro_ensure_capacity(sro, 1);
>> +	if (ret)
>> +		rmi_sro_free(sro);
>> +
>> +	capacity = RMI_MAX_ADDR_LIST - sro->addr_count;
>> +
>> +	regs = (struct arm_smccc_1_2_regs){
>> +		SMC_RMI_OP_MEM_RECLAIM,
>> +		sro_handle,
>> +		virt_to_phys(&sro->addr_list[sro->addr_count]),
>> +		capacity
>> +	};
>> +	rmi_smccc_invoke(&regs, out_regs);
>> +
>> +	if (WARN_ON_ONCE(out_regs->a1 > capacity))
>> +		out_regs->a1 = capacity;
>> +
>> +	sro->addr_count += out_regs->a1;
>> +
>> +	return 0;
>> +}
>> +
>> +void rmi_sro_free(struct rmi_sro_state *sro)
>> +{
>> +	for (int i = 0; i < sro->addr_count; i++) {
>> +		unsigned long entry = sro->addr_list[i];
>> +		unsigned long addr = RMI_ADDR_RANGE_ADDR(entry);
>> +		unsigned long unit_size = RMI_ADDR_RANGE_SIZE(entry);
>> +		unsigned long count = RMI_ADDR_RANGE_COUNT(entry);
>> +		unsigned long state = RMI_ADDR_RANGE_STATE(entry);
>> +		unsigned long size = donate_req_to_size(unit_size) * count;
>> +
>> +		if (state == RMI_OP_MEM_DELEGATED) {
>> +			if (WARN_ON(rmi_undelegate_range(addr, size))) {
>> +				/* Leak the pages */
>> +				continue;
>> +			}
>> +		}
>> +		free_pages_exact(phys_to_virt(addr), size);
>> +	}
>> +
>> +	sro->addr_count = 0;
>> +}
>> +
>> +unsigned long rmi_sro_execute(struct rmi_sro_state *sro, gfp_t gfp)
>> +{
>> +	unsigned long sro_handle;
>> +	struct arm_smccc_1_2_regs regs;
>> +	struct arm_smccc_1_2_regs *regs_in = &sro->regs;
>> +
>> +	rmi_smccc_invoke(regs_in, &regs);
>> +
>> +	sro_handle = regs.a1;
>> +
>> +	while (RMI_RETURN_STATUS(regs.a0) == RMI_INCOMPLETE) {
>> +		bool can_cancel = RMI_RETURN_CAN_CANCEL(regs.a0);
>> +		int ret;
>> +
>> +		switch (RMI_RETURN_MEMREQ(regs.a0)) {
>> +		case RMI_OP_MEM_REQ_NONE:
>> +			regs = (struct arm_smccc_1_2_regs){
>> +				SMC_RMI_OP_CONTINUE, sro_handle, 0
>> +			};
>> +			rmi_smccc_invoke(&regs, &regs);
>> +			break;
>> +		case RMI_OP_MEM_REQ_DONATE:
>> +			ret = rmi_sro_donate(sro, sro_handle, regs.a2, &regs,
>> +					     gfp);
>> +			break;
>> +		case RMI_OP_MEM_REQ_RECLAIM:
>> +			ret = rmi_sro_reclaim(sro, sro_handle, &regs);
>> +			break;
>> +		default:
>> +			ret = WARN_ON(1);
>> +			break;
>> +		}
>> +
>> +		if (ret) {
>> +			if (can_cancel) {
>> +				/*
>> +				 * FIXME: Handle cancelling properly!
>> +				 *
>> +				 * If the operation has failed due to memory
>> +				 * allocation failure then the information on
>> +				 * the memory allocation should be saved, so
>> +				 * that the allocation can be repeated outside
>> +				 * of any context which prevented the
>> +				 * allocation.
> 
> Honestly, this is the sort of stuff that I'd expect to be solved
> *before* posting this code. Since this is so central to the whole
> memory management, it needs to be correct from day-1.
> 
> If you can't make it work in time, then tone the supported features
> down. But FIXMEs and WARN_ONs are not the way to go.
> 
> 	M.
> 


^ permalink raw reply

* Re: [PATCH v14 10/44] arm64: RMI: Add support for SRO
From: Steven Price @ 2026-06-04 15:19 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <872ce762-0a83-42a4-bbe8-b21ea6f4e1cf@redhat.com>

On 21/05/2026 05:38, Gavin Shan wrote:
> Hi Steven,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> RMM v2.0 introduces the concept of "Stateful RMI Operations" (SRO). This
>> means that an SMC can return with an operation still in progress. The
>> host is excepted to continue the operation until is reaches a conclusion
>> (either success or failure). During this process the RMM can request
>> additional memory ('donate') or hand memory back to the host
>> ('reclaim'). The host can request an in progress operation is cancelled,
>> but still continue the operation until it has completed (otherwise the
>> incomplete operation may cause future RMM operations to fail).
>>
>> The SRO is tracked using a struct rmi_sro_state object which keeps track
>> of any memory which has been allocated but not yet consumed by the RMM
>> or reclaimed from the RMM. This allows the memory to be reused in a
>> future request within the same operation. It will also permit an
>> operation to be done in a context where memory allocation may be
>> difficult (e.g. atomic context) with the option to abort the operation
>> and retry the memory allocation outside of the atomic context. The
>> memory stored in the struct rmi_sro_state object can then be reused on
>> the subsequent attempt.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> v14:
>>   * SRO support has improved although is still not fully complete. The
>>     infrastructure has been moved out of KVM.
>> ---
>>   arch/arm64/include/asm/rmi_cmds.h |   1 +
>>   arch/arm64/kernel/rmi.c           | 359 ++++++++++++++++++++++++++++++
>>   2 files changed, 360 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/rmi_cmds.h b/arch/arm64/include/
>> asm/rmi_cmds.h
>> index eb213c8e6f26..1a7b0c8f1e38 100644
>> --- a/arch/arm64/include/asm/rmi_cmds.h
>> +++ b/arch/arm64/include/asm/rmi_cmds.h
>> @@ -35,6 +35,7 @@ struct rmi_sro_state {
>>     int rmi_delegate_range(phys_addr_t phys, unsigned long size);
>>   int rmi_undelegate_range(phys_addr_t phys, unsigned long size);
>> +int free_delegated_page(phys_addr_t phys);
>>     static inline int rmi_delegate_page(phys_addr_t phys)
>>   {
>> diff --git a/arch/arm64/kernel/rmi.c b/arch/arm64/kernel/rmi.c
>> index 08cef54acadb..a8107ca9bb6d 100644
>> --- a/arch/arm64/kernel/rmi.c
>> +++ b/arch/arm64/kernel/rmi.c
>> @@ -48,6 +48,365 @@ int rmi_undelegate_range(phys_addr_t phys,
>> unsigned long size)
>>       return ret;
>>   }
>>   +static unsigned long donate_req_to_size(unsigned long donatereq)
>> +{
>> +    unsigned long unit_size = RMI_DONATE_SIZE(donatereq);
>> +
>> +    switch (unit_size) {
>> +    case 0:
>> +        return PAGE_SIZE;
>> +    case 1:
>> +        return PMD_SIZE;
>> +    case 2:
>> +        return PUD_SIZE;
>> +    case 3:
>> +        return P4D_SIZE;
>> +    }
>> +    unreachable();
>> +}
>> +
> 
> It's worthy to have 'inline'.

Generally it's best to let the compiler make the decision. A small
static function like this is highly likely to be inlined by the compiler
already.

The exception of course is when the function is in a header and then
it's necessary to use "static inline".

> {P4D, PUD, PMD}_SIZE can be equal if there> are
> no P4D and PUD, depending on CONFIG_PGTABLE_LEVELS. In this case, can the
> 'unit_size' be translated to wrong value?

Technically yes. I think I can actually rewrite this more simply as:

static unsigned long donate_req_to_size(unsigned long donatereq)
{
	unsigned long unit_size = RMI_DONATE_SIZE(donatereq);

	return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(3 - unit_size));
}

which neatly sidesteps the CONFIG_PGTABLE_LEVELS problem too.

>> +static void rmi_smccc_invoke(struct arm_smccc_1_2_regs *regs_in,
>> +                 struct arm_smccc_1_2_regs *regs_out)
>> +{
>> +    struct arm_smccc_1_2_regs regs = *regs_in;
>> +    unsigned long status;
>> +
>> +    do {
>> +        arm_smccc_1_2_invoke(&regs, regs_out);
>> +        status = RMI_RETURN_STATUS(regs_out->a0);
>> +    } while (status == RMI_BUSY || status == RMI_BLOCKED);
>> +}
>> +
>> +int free_delegated_page(phys_addr_t phys)
>> +{
>> +    if (WARN_ON(rmi_undelegate_page(phys))) {
>> +        /* Undelegate failed: leak the page */
>> +        return -EBUSY;
>> +    }
>> +
>> +    free_page((unsigned long)phys_to_virt(phys));
>> +
>> +    return 0;
>> +}
>> +
>> +static int rmi_sro_ensure_capacity(struct rmi_sro_state *sro,
>> +                   unsigned long count)
>> +{
>> +    if (WARN_ON_ONCE(sro->addr_count > RMI_MAX_ADDR_LIST))
>> +        return -EOVERFLOW;
>> +
>> +    if (count > RMI_MAX_ADDR_LIST - sro->addr_count)
>> +        return -ENOSPC;
>> +
>> +    return 0;
>> +}
>> +
>> +static int rmi_sro_donate_contig(struct rmi_sro_state *sro,
>> +                 unsigned long sro_handle,
>> +                 unsigned long donatereq,
>> +                 struct arm_smccc_1_2_regs *out_regs,
>> +                 gfp_t gfp)
>> +{
>> +    unsigned long unit_size = RMI_DONATE_SIZE(donatereq);
>> +    unsigned long unit_size_bytes = donate_req_to_size(donatereq);
>> +    unsigned long count = RMI_DONATE_COUNT(donatereq);
>> +    unsigned long state = RMI_DONATE_STATE(donatereq);
>> +    unsigned long size = unit_size_bytes * count;
>> +    unsigned long addr_range;
>> +    int ret;
>> +    void *virt;
>> +    phys_addr_t phys;
>> +    struct arm_smccc_1_2_regs regs = {
>> +        SMC_RMI_OP_MEM_DONATE,
>> +        sro_handle
>> +    };
>> +
>> +    for (int i = 0; i < sro->addr_count; i++) {
>> +        unsigned long entry = sro->addr_list[i];
>> +
>> +        if (RMI_ADDR_RANGE_SIZE(entry) == unit_size &&
>> +            RMI_ADDR_RANGE_COUNT(entry) == count &&
>> +            RMI_ADDR_RANGE_STATE(entry) == state) {
>> +            sro->addr_count--;
>> +            swap(sro->addr_list[sro->addr_count],
>> +                 sro->addr_list[i]);
>> +
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    ret = rmi_sro_ensure_capacity(sro, 1);
>> +    if (ret)
>> +        return ret;
>> +
>> +    virt = alloc_pages_exact(size, gfp);
>> +    if (!virt)
>> +        return -ENOMEM;
>> +    phys = virt_to_phys(virt);
>> +
> 
> alloc_pages_exact() will fail if the requested size exceeds the maximal
> allowed
> size (1 << MAX_PAGE_ORDER). The maximal size is usually smaller than
> PUD_SIZE
> but PUD_SIZE is allowed by the RMM.

This is an area where to be honest I'm really not sure what to do.
Technically the RMM is allowed to ask for a contiguous range of 512GB
pages (on a 4K system - larger with larger page sizes) - but clearly no
real OS is going to be able to provide anything like that.

In practise we don't expect the RMM to do anything so crazy. It's not
really clear to be whether even 2MB (PMD_SIZE) is needed. But the spec
is written to be generic.

So my current approach is to calculate the required size and pass it
into alloc_pages_exact(). For "stupidly large" values this will fail and
Linux just doesn't support an RMM which attempts this. If there is ever
a usecase which needs this then we'd need to find a different method of
providing the memory (most likely some form of carveout to avoid
fragmentation). But my view is we should wait for that usecase to be
identified first.

>> +    if (state == RMI_OP_MEM_DELEGATED) {
>> +        if (rmi_delegate_range(phys, size)) {
>> +            free_pages_exact(virt, size);
>> +            return -ENXIO;
>> +        }
>> +    }
>> +
>> +    addr_range = phys & RMI_ADDR_RANGE_ADDR_MASK;
>> +    FIELD_MODIFY(RMI_ADDR_RANGE_SIZE_MASK, &addr_range, unit_size);
>> +    FIELD_MODIFY(RMI_ADDR_RANGE_COUNT_MASK, &addr_range, count);
>> +    FIELD_MODIFY(RMI_ADDR_RANGE_STATE_MASK, &addr_range, state);
>> +
>> +    sro->addr_list[sro->addr_count] = addr_range;
>> +
>> +out:
>> +    regs.a2 = virt_to_phys(&sro->addr_list[sro->addr_count]);
>> +    regs.a3 = 1;
>> +    rmi_smccc_invoke(&regs, out_regs);
>> +
>> +    unsigned long donated_granules = out_regs->a1;
>> +    unsigned long donated_size = donated_granules << PAGE_SHIFT;
>> +
>> +    if (donated_granules == 0) {
>> +        /* No pages used by the RMM */
>> +        sro->addr_count++;
>> +    } else if (donated_size < size) {
>> +        phys = sro->addr_list[sro->addr_count] &
>> RMI_ADDR_RANGE_ADDR_MASK;
>> +
>> +        /* Not all granules used by the RMM, free the remaining pages */
>> +        for (long i = donated_size; i < size; i += PAGE_SIZE) {
>> +            if (state == RMI_OP_MEM_DELEGATED)
>> +                free_delegated_page(phys + i);
>> +            else
>> +                __free_page(phys_to_page(phys + i));
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int rmi_sro_donate_noncontig(struct rmi_sro_state *sro,
>> +                    unsigned long sro_handle,
>> +                    unsigned long donatereq,
>> +                    struct arm_smccc_1_2_regs *out_regs,
>> +                    gfp_t gfp)
>> +{
>> +    unsigned long unit_size = RMI_DONATE_SIZE(donatereq);
>> +    unsigned long unit_size_bytes = donate_req_to_size(donatereq);
>> +    unsigned long count = RMI_DONATE_COUNT(donatereq);
>> +    unsigned long state = RMI_DONATE_STATE(donatereq);
>> +    unsigned long found = 0;
>> +    unsigned long addr_list_start = sro->addr_count;
>> +    int ret;
>> +    struct arm_smccc_1_2_regs regs = {
>> +        SMC_RMI_OP_MEM_DONATE,
>> +        sro_handle
>> +    };
>> +
>> +    for (int i = 0; i < addr_list_start && found < count; i++) {
>> +        unsigned long entry = sro->addr_list[i];
>> +
>> +        if (RMI_ADDR_RANGE_SIZE(entry) == unit_size &&
>> +            RMI_ADDR_RANGE_COUNT(entry) == 1 &&
>> +            RMI_ADDR_RANGE_STATE(entry) == state) {
>> +            addr_list_start--;
>> +            swap(sro->addr_list[addr_list_start],
>> +                 sro->addr_list[i]);
>> +            found++;
>> +            i--;
>> +        }
>> +    }
>> +
>> +    ret = rmi_sro_ensure_capacity(sro, count - found);
>> +    if (ret)
>> +        return ret;
>> +
>> +    while (found < count) {
>> +        unsigned long addr_range;
>> +        void *virt = alloc_pages_exact(unit_size_bytes, gfp);
>> +        phys_addr_t phys;
>> +
>> +        if (!virt)
>> +            return -ENOMEM;
>> +
>> +        phys = virt_to_phys(virt);
>> +
>> +        if (state == RMI_OP_MEM_DELEGATED) {
>> +            if (rmi_delegate_range(phys, unit_size_bytes)) {
>> +                free_pages_exact(virt, unit_size_bytes);
>> +                return -ENXIO;
>> +            }
>> +        }
>> +
>> +        addr_range = phys & RMI_ADDR_RANGE_ADDR_MASK;
>> +        FIELD_MODIFY(RMI_ADDR_RANGE_SIZE_MASK, &addr_range, unit_size);
>> +        FIELD_MODIFY(RMI_ADDR_RANGE_COUNT_MASK, &addr_range, 1);
>> +        FIELD_MODIFY(RMI_ADDR_RANGE_STATE_MASK, &addr_range, state);
>> +
>> +        sro->addr_list[sro->addr_count++] = addr_range;
>> +        found++;
>> +    }
>> +
>> +    regs.a2 = virt_to_phys(&sro->addr_list[addr_list_start]);
>> +    regs.a3 = found;
>> +    rmi_smccc_invoke(&regs, out_regs);
>> +
>> +    unsigned long donated_granules = out_regs->a1;
>> +
>> +    if (WARN_ON(donated_granules & ((unit_size_bytes >> PAGE_SHIFT) -
>> 1))) {
>> +        /*
>> +         * FIXME: RMM has only consumed part of a huge page, this leaks
>> +         * the rest of the huge page
>> +         */
>> +        donated_granules = ALIGN(donated_granules,
>> +                     (unit_size_bytes >> PAGE_SHIFT));
>> +    }
>> +    unsigned long donated_blocks = donated_granules /
>> (unit_size_bytes >> PAGE_SHIFT);
>> +
>> +    if (WARN_ON(donated_blocks > found))
>> +        donated_blocks = found;
>> +
>> +    unsigned long undonated_blocks = found - donated_blocks;
>> +
>> +    while (donated_blocks && undonated_blocks) {
>> +        sro->addr_count--;
>> +        swap(sro->addr_list[addr_list_start],
>> +             sro->addr_list[sro->addr_count]);
>> +        addr_list_start++;
>> +
>> +        donated_blocks--;
>> +        undonated_blocks--;
>> +    }
>> +    sro->addr_count -= donated_blocks;
>> +
>> +    return 0;
>> +}
>> +
>> +static int rmi_sro_donate(struct rmi_sro_state *sro,
>> +              unsigned long sro_handle,
>> +              unsigned long donatereq,
>> +              struct arm_smccc_1_2_regs *regs,
>> +              gfp_t gfp)
>> +{
>> +    unsigned long count = RMI_DONATE_COUNT(donatereq);
>> +
>> +    if (WARN_ON(!count))
>> +        return 0;
>> +
>> +    if (RMI_DONATE_CONTIG(donatereq)) {
>> +        return rmi_sro_donate_contig(sro, sro_handle, donatereq,
>> +                         regs, gfp);
>> +    } else {
>> +        return rmi_sro_donate_noncontig(sro, sro_handle, donatereq,
>> +                        regs, gfp);
>> +    }
>> +}
>> +
>> +static int rmi_sro_reclaim(struct rmi_sro_state *sro,
>> +               unsigned long sro_handle,
>> +               struct arm_smccc_1_2_regs *out_regs)
>> +{
>> +    unsigned long capacity;
>> +    struct arm_smccc_1_2_regs regs;
>> +    int ret;
>> +
>> +    ret = rmi_sro_ensure_capacity(sro, 1);
>> +    if (ret)
>> +        rmi_sro_free(sro);
>> +
>> +    capacity = RMI_MAX_ADDR_LIST - sro->addr_count;
>> +
>> +    regs = (struct arm_smccc_1_2_regs){
>> +        SMC_RMI_OP_MEM_RECLAIM,
>> +        sro_handle,
>> +        virt_to_phys(&sro->addr_list[sro->addr_count]),
>> +        capacity
>> +    };
>> +    rmi_smccc_invoke(&regs, out_regs);
>> +
>> +    if (WARN_ON_ONCE(out_regs->a1 > capacity))
>> +        out_regs->a1 = capacity;
>> +
>> +    sro->addr_count += out_regs->a1;
>> +
>> +    return 0;
>> +}
>> +
>> +void rmi_sro_free(struct rmi_sro_state *sro)
>> +{
>> +    for (int i = 0; i < sro->addr_count; i++) {
>> +        unsigned long entry = sro->addr_list[i];
>> +        unsigned long addr = RMI_ADDR_RANGE_ADDR(entry);
>> +        unsigned long unit_size = RMI_ADDR_RANGE_SIZE(entry);
>> +        unsigned long count = RMI_ADDR_RANGE_COUNT(entry);
>> +        unsigned long state = RMI_ADDR_RANGE_STATE(entry);
>> +        unsigned long size = donate_req_to_size(unit_size) * count;
>> +
>> +        if (state == RMI_OP_MEM_DELEGATED) {
>> +            if (WARN_ON(rmi_undelegate_range(addr, size))) {
>> +                /* Leak the pages */
>> +                continue;
>> +            }
>> +        }
>> +        free_pages_exact(phys_to_virt(addr), size);
>> +    }
>> +
>> +    sro->addr_count = 0;
>> +}
>> +
>> +unsigned long rmi_sro_execute(struct rmi_sro_state *sro, gfp_t gfp)
>> +{
>> +    unsigned long sro_handle;
>> +    struct arm_smccc_1_2_regs regs;
>> +    struct arm_smccc_1_2_regs *regs_in = &sro->regs;
>> +
>> +    rmi_smccc_invoke(regs_in, &regs);
>> +
>> +    sro_handle = regs.a1;
>> +
>> +    while (RMI_RETURN_STATUS(regs.a0) == RMI_INCOMPLETE) {
>> +        bool can_cancel = RMI_RETURN_CAN_CANCEL(regs.a0);
>> +        int ret;
>> +
>> +        switch (RMI_RETURN_MEMREQ(regs.a0)) {
>> +        case RMI_OP_MEM_REQ_NONE:
>> +            regs = (struct arm_smccc_1_2_regs){
>> +                SMC_RMI_OP_CONTINUE, sro_handle, 0
>> +            };
>> +            rmi_smccc_invoke(&regs, &regs);
>> +            break;
> 
> 'ret' isn't initialized for case RMI_OP_MEM_REQ_NONE.

Good spot - ret should be initialised to 0.

Thanks,
Steve

>> +        case RMI_OP_MEM_REQ_DONATE:
>> +            ret = rmi_sro_donate(sro, sro_handle, regs.a2, &regs,
>> +                         gfp);
>> +            break;
>> +        case RMI_OP_MEM_REQ_RECLAIM:
>> +            ret = rmi_sro_reclaim(sro, sro_handle, &regs);
>> +            break;
>> +        default:
>> +            ret = WARN_ON(1);
>> +            break;
>> +        }
>> +
>> +        if (ret) {
>> +            if (can_cancel) {
>> +                /*
>> +                 * FIXME: Handle cancelling properly!
>> +                 *
>> +                 * If the operation has failed due to memory
>> +                 * allocation failure then the information on
>> +                 * the memory allocation should be saved, so
>> +                 * that the allocation can be repeated outside
>> +                 * of any context which prevented the
>> +                 * allocation.
>> +                 */
>> +            }
>> +            if (WARN_ON(ret))
>> +                return ret;
>> +        }
>> +    }
>> +
>> +    return regs.a0;
>> +}
>> +
>>   static int rmi_check_version(void)
>>   {
>>       struct arm_smccc_res res;
> 
> Thanks,
> Gavin
> 


^ permalink raw reply

* Re: [PATCH v14 10/44] arm64: RMI: Add support for SRO
From: Steven Price @ 2026-06-04 15:19 UTC (permalink / raw)
  To: Aneesh Kumar K.V, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <yq5a7bp0szrw.fsf@kernel.org>

On 19/05/2026 07:02, Aneesh Kumar K.V wrote:
> Steven Price <steven.price@arm.com> writes:
> 
>> +static unsigned long donate_req_to_size(unsigned long donatereq)
>> +{
>> +	unsigned long unit_size = RMI_DONATE_SIZE(donatereq);
>> +
>> +	switch (unit_size) {
>> +	case 0:
>> +		return PAGE_SIZE;
>> +	case 1:
>> +		return PMD_SIZE;
>> +	case 2:
>> +		return PUD_SIZE;
>> +	case 3:
>> +		return P4D_SIZE;
>> +	}
>> +	unreachable();
>> +}
>>
>> +
>> +static void rmi_smccc_invoke(struct arm_smccc_1_2_regs *regs_in,
>> +			     struct arm_smccc_1_2_regs *regs_out)
>> +{
>> +	struct arm_smccc_1_2_regs regs = *regs_in;
>> +	unsigned long status;
>> +
>> +	do {
>> +		arm_smccc_1_2_invoke(&regs, regs_out);
>> +		status = RMI_RETURN_STATUS(regs_out->a0);
>> +	} while (status == RMI_BUSY || status == RMI_BLOCKED);
>> +}
>> +
>> +int free_delegated_page(phys_addr_t phys)
>> +{
>> +	if (WARN_ON(rmi_undelegate_page(phys))) {
>> +		/* Undelegate failed: leak the page */
>> +		return -EBUSY;
>> +	}
>> +
>> +	free_page((unsigned long)phys_to_virt(phys));
>> +
>> +	return 0;
>> +}
>> +
>> +static int rmi_sro_ensure_capacity(struct rmi_sro_state *sro,
>> +				   unsigned long count)
>> +{
>> +	if (WARN_ON_ONCE(sro->addr_count > RMI_MAX_ADDR_LIST))
>> +		return -EOVERFLOW;
>> +
>> +	if (count > RMI_MAX_ADDR_LIST - sro->addr_count)
>> +		return -ENOSPC;
>> +
>> +	return 0;
>> +}
>> +
>> +static int rmi_sro_donate_contig(struct rmi_sro_state *sro,
>> +				 unsigned long sro_handle,
>> +				 unsigned long donatereq,
>> +				 struct arm_smccc_1_2_regs *out_regs,
>> +				 gfp_t gfp)
>> +{
>> +	unsigned long unit_size = RMI_DONATE_SIZE(donatereq);
>> +	unsigned long unit_size_bytes = donate_req_to_size(donatereq);
>> +	unsigned long count = RMI_DONATE_COUNT(donatereq);
>> +	unsigned long state = RMI_DONATE_STATE(donatereq);
>> +	unsigned long size = unit_size_bytes * count;
>> +	unsigned long addr_range;
>>
> 
> Looking at above and the related code, I am wondering whether we should
> use u64 instead of unsigned long for everything that the specification
> defines as 64-bit.

I'm split on this. The kernel makes use of "(unsigned) long" for a
register sized value in quite a few places. Not least in struct
arm_smccc_1_2_regs which is ultimately where most of the values are read
from or written to.

I'm not a great fan of the kernel's approach to using long like this -
there's a good argument that uintptr_t is more correct. Equally when
we're in arch code for a 64 bit architecture (i.e. "arm64") then we know
the size is u64.

The disadvantage here is that if I use u64 then there's a bunch of
implicit conversions going on between unsigned long and u64 - which
might come back to bite if anything changes. Hence my current view that
"unsigned long" is the best option here in the kernel.

Anyone else have any view on the best type here?

Thanks,
Steve


^ permalink raw reply

* RE: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Aneesh Kumar K.V @ 2026-06-04 14:57 UTC (permalink / raw)
  To: Michael Kelley, Jason Gunthorpe, Michael Kelley
  Cc: iommu@lists.linux.dev, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Mostafa Saleh, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev@lists.ozlabs.org,
	linux-s390@vger.kernel.org, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <SN6PR02MB4157F94C902B78E55E99372DD4102@SN6PR02MB4157.namprd02.prod.outlook.com>

Michael Kelley <mhklinux@outlook.com> writes:

> From: Jason Gunthorpe <jgg@ziepe.ca> Sent: Tuesday, June 2, 2026 5:55 PM
>> 
>> On Tue, Jun 02, 2026 at 02:24:40PM +0000, Michael Kelley wrote:
>> 
>> > Except that in a normal VM, the "unencrypted" pool attribute does *not*
>> > describe the state of the memory itself.  In a normal VM, the memory is
>> > unencrypted, but the "unencrypted" pool attribute is false. That
>> > contradiction is the essence of my concern.
>> 
>> I would argue no..
>> 
>> When CC is enabled the default state of memory in a Linux environment
>> is "encrypted". You have to take a special action to "decrypt" it.
>> 
>> Thus the default state of memory in a non-CC environment is also
>> paradoxically "encrypted" too. 
>
> The need to have such an unnatural premise is usually an indication
> of a conceptual problem with the overall model, or perhaps just a
> terminology problem. 
>
> Here's a proposal. The new DMA attribute is DMA_ATTR_CC_SHARED.
> Name the pool attribute "cc_shared" instead of "unencrypted". Having
> "cc_shared" set to false in a normal VM doesn't lead to the non-sensical
> situation of claiming that a normal VM is encrypted. The boolean
> "unencrypted" parameter that has been added to various calls also
> becomes "cc_shared".  If "CC_SHARED" is a suitable name for the DMA
> attribute, it ought to be suitable as the pool attribute. And everything
> matches as well.
>

That is better. It would also simplify:

	if (mem->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
		return NULL;

to
	if (mem->cc_shared != !!(attrs & DMA_ATTR_CC_SHARED))
		return NULL;


I already sent a v6 in the hope of getting this merged for the next
merge window. Should I send a v7, or would you prefer that I do the
rename on top of v6?

-aneesh

^ permalink raw reply

* [PATCH v4 3/3] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Kiryl Shutsemau (Meta) @ 2026-06-04 14:47 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen
  Cc: seanjc, pbonzini, sathyanarayanan.kuppuswamy, kai.huang,
	xiaoyao.li, binbin.wu, rick.p.edgecombe, david.laight.linux, ak,
	djbw, tsyrulnikov.borys, x86, kvm, linux-coco, linux-kernel,
	Kiryl Shutsemau (Meta), stable
In-Reply-To: <cover.1780584300.git.kas@kernel.org>

According to x86 architecture rules, 32-bit operations zero-extend the
result to 64 bits. The current implementation of handle_in() only masks
the lower 32 bits, which preserves the upper 32 bits of RAX when a
32-bit port IN instruction is emulated.

Use insn_assign_reg() to write the result back into RAX with proper
partial-register-write semantics: 1- and 2-byte forms leave the upper
bits untouched, the 4-byte form zero-extends to the full register.

Fixes: 03149948832a ("x86/tdx: Port I/O: Add runtime hypercalls")
Reported-by: Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
Link: https://lore.kernel.org/all/CAKw_Dz96rfSQc6Rn+9QBcUFHhmkK+9zu+P=bxowfZwxrATCBRg@mail.gmail.com/
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Cc: stable@vger.kernel.org
---
 arch/x86/coco/tdx/tdx.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 65119362f9a2..41cc23cc63dd 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -693,8 +693,8 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
 		.r13 = PORT_READ,
 		.r14 = port,
 	};
-	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
 	bool success;
+	u64 val;
 
 	/*
 	 * Emulate the I/O read via hypercall. More info about ABI can be found
@@ -702,11 +702,9 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
 	 * "TDG.VP.VMCALL<Instruction.IO>".
 	 */
 	success = !__tdx_hypercall(&args);
+	val = success ? args.r11 : 0;
 
-	/* Update part of the register affected by the emulated instruction */
-	regs->ax &= ~mask;
-	if (success)
-		regs->ax |= args.r11 & mask;
+	insn_assign_reg(&regs->ax, val, size);
 
 	return success;
 }
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 2/3] x86/insn-eval: Add insn_assign_reg() helper
From: Kiryl Shutsemau (Meta) @ 2026-06-04 14:47 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen
  Cc: seanjc, pbonzini, sathyanarayanan.kuppuswamy, kai.huang,
	xiaoyao.li, binbin.wu, rick.p.edgecombe, david.laight.linux, ak,
	djbw, tsyrulnikov.borys, x86, kvm, linux-coco, linux-kernel,
	Kiryl Shutsemau (Meta)
In-Reply-To: <cover.1780584300.git.kas@kernel.org>

KVM's instruction emulator has a small helper, assign_register(), that
writes a value into a sub-register with x86 partial-register-write
semantics: 1- and 2-byte writes leave the upper bits of the destination
untouched, 4-byte writes zero-extend to 64 bits, 8-byte writes overwrite
the full register.

The TDX guest #VE handler needs the same logic for port I/O emulation
to get 32-bit zero-extension right. Rather than copy-pasting the helper,
lift it to <asm/insn-eval.h> as insn_assign_reg() so both can use it.

Rewrite the body using arithmetic instead of pointer punning so the
helper does not depend on -fno-strict-aliasing or little-endian byte
order, and add <asm/insn.h> to the header's includes so it builds
standalone in callers that have not pulled it in transitively.

No functional change.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 arch/x86/include/asm/insn-eval.h | 25 +++++++++++++++++++++++++
 arch/x86/kvm/emulate.c           | 26 ++++----------------------
 2 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index 4733e9064ee5..85251e718a77 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -9,6 +9,7 @@
 #include <linux/compiler.h>
 #include <linux/bug.h>
 #include <linux/err.h>
+#include <asm/insn.h>
 #include <asm/ptrace.h>
 
 #define INSN_CODE_SEG_ADDR_SZ(params) ((params >> 4) & 0xf)
@@ -46,4 +47,28 @@ enum insn_mmio_type insn_decode_mmio(struct insn *insn, int *bytes);
 
 bool insn_is_nop(struct insn *insn);
 
+/*
+ * Write @val into *@reg with x86 partial-register-write semantics: a 1-
+ * or 2-byte write leaves the upper bits of the destination untouched; a
+ * 4-byte write zero-extends to 64 bits (matching IN[BWL], MOV[BWL]
+ * etc.); an 8-byte write overwrites the full register.
+ */
+static inline void insn_assign_reg(unsigned long *reg, u64 val, int bytes)
+{
+	switch (bytes) {
+	case 1:
+		*reg = (*reg & ~0xfful)   | (val & 0xff);
+		break;
+	case 2:
+		*reg = (*reg & ~0xfffful) | (val & 0xffff);
+		break;
+	case 4:
+		*reg = (u32)val;
+		break;
+	case 8:
+		*reg = val;
+		break;
+	}
+}
+
 #endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8013dccb3110..74972c17edb8 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -24,6 +24,7 @@
 #include "kvm_emulate.h"
 #include <linux/stringify.h>
 #include <asm/debugreg.h>
+#include <asm/insn-eval.h>
 #include <asm/nospec-branch.h>
 #include <asm/ibt.h>
 #include <asm/text-patching.h>
@@ -439,25 +440,6 @@ static void assign_masked(ulong *dest, ulong src, ulong mask)
 	*dest = (*dest & ~mask) | (src & mask);
 }
 
-static void assign_register(unsigned long *reg, u64 val, int bytes)
-{
-	/* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
-	switch (bytes) {
-	case 1:
-		*(u8 *)reg = (u8)val;
-		break;
-	case 2:
-		*(u16 *)reg = (u16)val;
-		break;
-	case 4:
-		*reg = (u32)val;
-		break;	/* 64b: zero-extend */
-	case 8:
-		*reg = val;
-		break;
-	}
-}
-
 static inline unsigned long ad_mask(struct x86_emulate_ctxt *ctxt)
 {
 	return (1UL << (ctxt->ad_bytes << 3)) - 1;
@@ -505,7 +487,7 @@ register_address_increment(struct x86_emulate_ctxt *ctxt, int reg, int inc)
 {
 	ulong *preg = reg_rmw(ctxt, reg);
 
-	assign_register(preg, *preg + inc, ctxt->ad_bytes);
+	insn_assign_reg(preg, *preg + inc, ctxt->ad_bytes);
 }
 
 static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc)
@@ -1766,7 +1748,7 @@ static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
 
 static void write_register_operand(struct operand *op)
 {
-	return assign_register(op->addr.reg, op->val, op->bytes);
+	return insn_assign_reg(op->addr.reg, op->val, op->bytes);
 }
 
 static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op)
@@ -2007,7 +1989,7 @@ static int em_popa(struct x86_emulate_ctxt *ctxt)
 		rc = emulate_pop(ctxt, &val, ctxt->op_bytes);
 		if (rc != X86EMUL_CONTINUE)
 			break;
-		assign_register(reg_rmw(ctxt, reg), val, ctxt->op_bytes);
+		insn_assign_reg(reg_rmw(ctxt, reg), val, ctxt->op_bytes);
 		--reg;
 	}
 	return rc;
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 1/3] x86/tdx: Fix off-by-one in port I/O handling
From: Kiryl Shutsemau (Meta) @ 2026-06-04 14:46 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen
  Cc: seanjc, pbonzini, sathyanarayanan.kuppuswamy, kai.huang,
	xiaoyao.li, binbin.wu, rick.p.edgecombe, david.laight.linux, ak,
	djbw, tsyrulnikov.borys, x86, kvm, linux-coco, linux-kernel,
	Kiryl Shutsemau (Meta), stable
In-Reply-To: <cover.1780584300.git.kas@kernel.org>

handle_in() and handle_out() in arch/x86/coco/tdx/tdx.c use:

    u64 mask = GENMASK(BITS_PER_BYTE * size, 0);

GENMASK(h, l) includes bit h. For size=1 (INB), this produces
GENMASK(8, 0) = 0x1FF (9 bits) instead of GENMASK(7, 0) = 0xFF (8
bits). The mask is one bit too wide for all I/O sizes.

Fix the mask calculation.

Fixes: 03149948832a ("x86/tdx: Port I/O: Add runtime hypercalls")
Reported-by: Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
Link: https://lore.kernel.org/all/CAKw_Dz96rfSQc6Rn+9QBcUFHhmkK+9zu+P=bxowfZwxrATCBRg@mail.gmail.com/
Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: stable@vger.kernel.org
---
 arch/x86/coco/tdx/tdx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 186915a17c50..65119362f9a2 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -693,7 +693,7 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
 		.r13 = PORT_READ,
 		.r14 = port,
 	};
-	u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
+	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
 	bool success;
 
 	/*
@@ -713,7 +713,7 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
 
 static bool handle_out(struct pt_regs *regs, int size, int port)
 {
-	u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
+	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
 
 	/*
 	 * Emulate the I/O write via hypercall. More info about ABI can be found
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 0/3] x86/tdx: Fix port I/O handling bugs
From: Kiryl Shutsemau (Meta) @ 2026-06-04 14:46 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen
  Cc: seanjc, pbonzini, sathyanarayanan.kuppuswamy, kai.huang,
	xiaoyao.li, binbin.wu, rick.p.edgecombe, david.laight.linux, ak,
	djbw, tsyrulnikov.borys, x86, kvm, linux-coco, linux-kernel,
	Kiryl Shutsemau (Meta)

Two bugs in the TDX guest port I/O #VE emulation, plus a small helper
extracted from KVM to avoid open-coding partial-register-write logic
in the second fix.

Patch 1 is an off-by-one in the mask used to clip the I/O value:
GENMASK(BITS_PER_BYTE * size, 0) is one bit too wide. Unchanged from
v3 1/2.

Patch 2 lifts KVM's instruction-emulator helper assign_register() out
of arch/x86/kvm/emulate.c into <asm/insn-eval.h>, renamed to
insn_assign_reg(). Dave suggested consolidating rather than adding a
third copy of the same partial-register switch; the body is rewritten
using plain arithmetic (suggested by David Laight) so the helper does
not rely on -fno-strict-aliasing or little-endian byte order. KVM
behaviour is unchanged.

Patch 3 fixes the architectural zero-extension of 32-bit IN: the old
mask-based handle_in() preserves RAX[63:32] after inl, which is wrong.
Now done by calling the helper.

Changes since v3:
  - Patch 1/2 carried over unchanged as 1/3.
  - Helper extracted from KVM (new patch 2/3) and used from
    handle_in() (Dave, David Laight).
  - Reviewed-by tags from v3 2/2 dropped on patch 3/3 because the
    implementation changed substantially. v3 1/2 -> v4 1/3 Rb tags
    preserved (patch unchanged).

v3: https://lore.kernel.org/all/20260527120544.2903923-1-kas@kernel.org/

Kiryl Shutsemau (Meta) (3):
  x86/tdx: Fix off-by-one in port I/O handling
  x86/insn-eval: Add insn_assign_reg() helper
  x86/tdx: Fix zero-extension for 32-bit port I/O

 arch/x86/coco/tdx/tdx.c          | 10 ++++------
 arch/x86/include/asm/insn-eval.h | 25 +++++++++++++++++++++++++
 arch/x86/kvm/emulate.c           | 26 ++++----------------------
 3 files changed, 33 insertions(+), 28 deletions(-)

-- 
2.54.0


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox