* Re: [PATCH v8 15/46] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Ackerley Tng @ 2026-06-24 17:46 UTC (permalink / raw)
To: Sean Christopherson, Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ajneQVLriUshjFIO@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Fri, Jun 19, 2026, Fuad Tabba wrote:
>> On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
>> <devnull+ackerleytng.google.com@kernel.org> wrote:
>> >
>> > From: Ackerley Tng <ackerleytng@google.com>
>> >
>> > When memory in guest_memfd is converted from private to shared, the
>> > platform-specific state associated with the guest-private pages must be
>> > invalidated or cleaned up.
>> >
>> > Iterate over the folios in the affected range and call the
>> > kvm_arch_gmem_invalidate() hook for each PFN range. This allows
>> > architectures to perform necessary teardown, such as updating hardware
>> > metadata or encryption states, before the pages are transitioned to the
>> > shared state.
>> >
>> > Invoke this helper after indicating to KVM's mmu code that an invalidation
>> > is in progress to stop in-flight page faults from succeeding.
>> >
>> > Reviewed-by: Fuad Tabba <tabba@google.com>
>> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>> Coming back to this after working through the arm64/pKVM side. My
>> Reviewed-by here is from the previous round and the patch hasn't
>> changed, but I missed an implication for arm64.
>>
>> kvm_arch_gmem_invalidate() is now called from two paths with the same
>> (start, end) signature: folio teardown (kvm_gmem_free_folio) and
>> private->shared conversion (here). For SNP/TDX that's fine, conversion is
>> destructive anyway. For pKVM the two need opposite content semantics:
>> conversion must preserve the page in place (same physical page, the point
>> of in-place conversion without encryption), while teardown must scrub it
>> before returning it to the host.
>>
>> The hook gets only a pfn range with no indication of which caller it's
>> serving, so arm64 can't give the two paths the behaviour they need. It
>> would help to signal intent on the conversion path: a reason/flag, a
>> separate hook, or not routing non-destructive conversion through the
>> teardown hook.
>>
>> arm64 isn't here yet, so this isn't urgent, but the hook is gaining a
>> second caller now, and it's cheaper to leave room for the distinction
>> than to change a generic contract other arches depend on later.
>
> Crud. It may not be urgent for arm64, but it's urgent for other reasons that
> I "can't" describe in detail at the moment, and even if that weren't the case, I
> think we should clean things up now. More below.
>
>> > virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
>> > 1 file changed, 41 insertions(+)
>> >
>> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> > index 433f79047b9d1..3c94442bc8131 100644
>> > --- a/virt/kvm/guest_memfd.c
>> > +++ b/virt/kvm/guest_memfd.c
>> > @@ -607,6 +607,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>> > return safe;
>> > }
>> >
>> > +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
>> > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>
> Not your fault, but kvm_arch_gmem_invalidate() is badly misnamed. It's not
> "invalidating" anything, it's much more of a "free" callback, as SNP uses it to
> put physical pages back into a shared state when a maybe-private folio is freed.
>
> As Fuad points out, (ab)using that hook for the private=>shared conversion case
> "works", but not broadly. And it makes the bad name worse, because it's called
> from code that _is_ doing true invalidations. For pKVM, it may not even need to
> do anything invalidation-like.
>
Thanks, I also didn't like the naming of kvm_gmem_invalidate(),
especially when conversions also calls
kvm_gmem_invalidate_{start,end}() and those do different things.
> To avoid a conflict with patches that are going to have priority over this series,
> to set the stage for arm64 support, and to avoid avoid bleeding vendor details
> into guest_memfd, as if they are core guest_memfd behavior (only SNP needs the
> "invalidation" on this specific transition), I think we should add an arch hook
> to do conversions straightaway.
>
> Unless there's a clever option I'm missing, it'll mean adding yet another
> HAVE_KVM_ARCH_GMEM_XXX flag? Hmm, especially because IIUC, arm64/pKVM doesn't
> need a callback for this case, only the free_folio case.
>
>> > +{
>> > + struct folio_batch fbatch;
>> > + pgoff_t next = start;
>> > + int i;
>> > +
>> > + folio_batch_init(&fbatch);
>> > + while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
>> > + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
>> > + struct folio *folio = fbatch.folios[i];
>> > + pgoff_t start_index, end_index;
>> > + kvm_pfn_t start_pfn, end_pfn;
>> > +
>> > + start_index = max(start, folio->index);
>> > + end_index = min(end, folio_next_index(folio));
>> > + /*
>> > + * end_index is either in folio or points to
>> > + * the first page of the next folio. Hence,
>> > + * all pages in range [start_index, end_index)
>> > + * are contiguous.
>> > + */
>> > + start_pfn = folio_file_pfn(folio, start_index);
>> > + end_pfn = start_pfn + end_index - start_index;
>> > +
>> > + kvm_arch_gmem_invalidate(start_pfn, end_pfn);
>> > + }
>> > +
>> > + folio_batch_release(&fbatch);
>> > + cond_resched();
>> > + }
>> > +}
>> > +#else
>> > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
>> > +#endif
>> > +
>> > static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>> > size_t nr_pages, uint64_t attrs,
>> > pgoff_t *err_index)
>> > @@ -647,7 +683,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>> > */
>> >
>> > kvm_gmem_invalidate_start(inode, start, end);
>> > +
>> > + if (!to_private)
>> > + kvm_gmem_invalidate(inode, start, end);
>
> E.g. instead make this something like this?
>
> kvm_gmem_set_pfn_attributes(...)
>
> Hrm, though that wastes folio lookups in the to_private case. So maybe just this,
> assuming pKVM doesn't need to take additional action on conversions?
>
> if (!to_private)
> kvm_gmem_make_shared(...)
>
> Actually, if we do that, then we don't need a separate arch hook, just a separate
> config. It'll still bleed SNP details into guest_memfd, but it'll at least be
> done in a way that's more explicitly arch specific (and it's no different than
> what we already do for PREPARE...).
>
pKVM needs some arch guest_memfd lifecycle functions that
+ for conversion, doesn't do anything,
+ for teardown, resets page state (IIUC it'll be reset to
PKVM_PAGE_OWNED (by the host))
So I think we need different functions for those two stages in the
lifecycle of a page with guest_memfd? What if we have
CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES, which gates
+ kvm_gmem_should_set_pfn_attributes(attributes) and
.gmem_should_set_pfn_attributes
+ kvm_gmem_set_pfn_attributes(start_pfn, end_pfn, attributes) and
.gmem_set_pfn_attributes
CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN, which gates
+ kvm_gmem_teardown() and .gmem_teardown
SNP:
+ .gmem_should_set_pfn_attributes = sev_gmem_should_set_pfn_attributes,
and sev_gmem_should_set_pfn_attributes returns !is_private
+ Rename .gmem_invalidate and sev_gmem_invalidate to *set_pfn_attributes
+ .gmem_teardown = sev_gmem_set_pfn_attributes
TDX:
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN
pKVM:
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
+ .gmem_teardown = pkvm_gmem_set_pfn_attributes
Suzuki, does this work for ARM CCA?
This way,
+ The if (is_private) check doesn't leak SNP details into guest_memfd
+ .gmem_make_shared doesn't stick out without a .gmem_make_private
+ .gmem_set_pfn_attributes, .gmem_prepare and .gmem_teardown are aligned
conceptually as lifecycle hooks
+ I think the private/shared check for prepare can also be folded into
preparation.
+ Preparation perhaps doesn't need a should_prepare equivalent since
there's no iteration and getting the gfn is just doing some math?
+ In another patch series?
> E.g. this? There will still be a looming rename conflict, but that's easy enough
> to handle.
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index 9ce5be7843f2..8aead0abd788 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -648,8 +648,8 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> return safe;
> }
>
> -#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +#ifdef CONFIG_KVM_ARCH_GMEM_FREE_ON_SHARED_CONVERSION
> +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end)
> {
> struct folio_batch fbatch;
> pgoff_t next = start;
> @@ -681,7 +681,7 @@ static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> }
> }
> #else
> -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end) { }
> #endif
>
> static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> @@ -729,7 +729,7 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> kvm_gmem_invalidate_start(inode, start, end);
>
> if (!to_private)
> - kvm_gmem_invalidate(inode, start, end);
> + kvm_gmem_make_shared(inode, start, end);
>
> mas_store_prealloc(&mas, xa_mk_value(attrs));
^ permalink raw reply
* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-24 17:01 UTC (permalink / raw)
To: Binbin Wu
Cc: ackerleytng, aik, andrew.jones, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <6fc7f450-6d0a-494d-b295-297e4703148d@linux.intel.com>
On Tue, Jun 23, 2026, Binbin Wu wrote:
> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> > @@ -606,12 +608,20 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> > next = start;
> > while (safe && filemap_get_folios(mapping, &next, last, &fbatch)) {
> >
> > - for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> > + for (i = 0; i < folio_batch_count(&fbatch);) {
> > struct folio *folio = fbatch.folios[i];
> >
> > - if (folio_ref_count(folio) !=
> > - folio_nr_pages(folio) + filemap_get_folios_refcount) {
> > - safe = false;
> > + safe = (folio_ref_count(folio) ==
> > + folio_nr_pages(folio) +
> > + filemap_get_folios_refcount);
> > +
> > + if (safe) {
> > + ++i;
> > + } else if (folio_may_be_lru_cached(folio) &&
> > + !lru_drained) {
> > + lru_add_drain_all();
>
> It seems unprivileged userspace is able to trigger lru_add_drain_all() repeatedly
> by invoking KVM_SET_MEMORY_ATTRIBUTES2 in a loop, which could lead to DoS risk?
FIW, if there's a risk, then AFAICT fadvise() and memfd's F_ADD_SEALS already
have the same risk.
^ permalink raw reply
* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-24 16:57 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-18-9d2959357853@google.com>
On Thu, Jun 18, 2026, Ackerley Tng wrote:
> When checking if a guest_memfd folio is safe for conversion, its refcount
> is examined. A folio may be present in a per-CPU lru_add fbatch, which
> temporarily increases its refcount.
Under what circumstances does this happen, and what alternatives are there for
userspace to work around the issue?
^ permalink raw reply
* Re: [PATCH v13 11/22] KVM: selftests: Set up TDX boot parameters region
From: Xiaoyao Li @ 2026-06-24 16:07 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-11-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
> +void tdx_vm_load_common_boot_parameters(struct kvm_vm *vm)
> +{
> + struct td_boot_parameters *params =
> + addr_gpa2hva(vm, TD_BOOT_PARAMETERS_GPA);
> + u32 cr4;
> +
> + TEST_ASSERT_EQ(vm->mode, VM_MODE_PXXVYY_4K);
Why TDX needs to check it explicitly?
x86's virt_arch_pgd_alloc() already has such assertation. I think we can
just drop it.
^ permalink raw reply
* Re: [PATCH v13 13/22] KVM: selftests: Set first memory region as shared if guest_memfd
From: Xiaoyao Li @ 2026-06-24 15:43 UTC (permalink / raw)
To: Ackerley Tng, Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <CAEvNRgEBiivp9tOHZnFmWvF6ek-Ar-m2B+hb=-H0dgAcM2=z8Q@mail.gmail.com>
On 6/16/2026 7:46 AM, Ackerley Tng wrote:
> Lisa Wang <wyihan@google.com> writes:
>
>> Set the initial state of the first memory region as shared if it is
>> backed by guest_memfd, so that the KVM selftest framework functions can
>> populate mmap()-ed guest_memfd memory the same way memory from other
>> memory providers are populated.
>>
>> For CoCo VMs, pages that need to be private are explicitly set to
>> private before executing the VM.
>>
>>
>> [...snip...]
>>
>> @@ -495,14 +497,16 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
>> vm = ____vm_create(shape);
>>
>> /*
>> - * Force GUEST_MEMFD for the primary memory region if necessary, e.g.
>> - * for CoCo VMs that require GUEST_MEMFD backed private memory.
>> + * Force GUEST_MEMFD for the primary memory region if necessary, and
>> + * initialize it as shared so the selftest framework can populate it
>> + * exactly like other memory providers.
>> */
>> - flags = 0;
>> - if (is_guest_memfd_required(shape))
>> + if (is_guest_memfd_required(shape)) {
>> flags |= KVM_MEM_GUEST_MEMFD;
>> + gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
>> + }
>>
>
> Just noticed this while hacking some SNP tests.
>
>> - vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
>> + vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
>> for (i = 0; i < NR_MEM_REGIONS; i++)
>> vm->memslots[i] = 0;
>>
>>
>> --
>> 2.54.0.746.g67dd491aae-goog
>
> I think this patch should fully buy into in-place conversions, so we
> need to also set GUEST_MEMFD_FLAG_MMAP:
>
> @@ -483,6 +483,7 @@ struct kvm_vm *__vm_create(struct vm_shape shape,
> u32 nr_runnable_vcpus,
> {
> u64 nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
> nr_extra_pages);
> + enum vm_mem_backing_src_type src_type = VM_MEM_SRC_ANONYMOUS;
> struct userspace_mem_region *slot0;
> u64 gmem_flags = 0;
> struct kvm_vm *vm;
> @@ -503,10 +504,16 @@ struct kvm_vm *__vm_create(struct vm_shape
> shape, u32 nr_runnable_vcpus,
> */
> if (is_guest_memfd_required(shape)) {
> flags |= KVM_MEM_GUEST_MEMFD;
> - gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> + gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED | GUEST_MEMFD_FLAG_MMAP;
GUEST_MEMFD_FLAG_INIT_SHARED is valid only when the memory attributes is
per-gmem.
we need to check KVM_CAP_GUEST_MEMFD_FLAGS or kvm_has_gmem_attributes.
^ permalink raw reply
* Re: [PATCH v8 04/46] KVM: Decouple kvm_has_arch_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
From: Sean Christopherson @ 2026-06-24 15:12 UTC (permalink / raw)
To: Ackerley Tng
Cc: Binbin Wu, aik, andrew.jones, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgGF+O7r-YHqcLp-ZgoXTCbqjuUhpOdD5eE5w2wu3YYYpw@mail.gmail.com>
On Tue, Jun 23, 2026, Ackerley Tng wrote:
> Binbin Wu <binbin.wu@linux.intel.com> writes:
>
> > On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> >> From: Sean Christopherson <seanjc@google.com>
> >>
> >> When memory attributes become trackable in guest_memfd, the concept of
> >> having private memory is no longer dependent on
> >> CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
> >>
> >> With this, on x86, kvm_arch_has_private_mem() is defined if some CoCo
> >> platform support (or the testing CONFIG_KVM_SW_PROTECTED_VM) is compiled
> >> in.
> >>
> >> Signed-off-by: Sean Christopherson <seanjc@google.com>
> >> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> >> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> >
> > Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> >
> > One nit below.
> >
> >> ---
> >> arch/x86/include/asm/kvm_host.h | 4 +++-
> >> include/linux/kvm_host.h | 2 +-
> >> 2 files changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index 8e8eb8a5e8a6b..1bde67cf6eb0e 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -2394,7 +2394,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> >> int tdp_max_root_level, int tdp_huge_page_level);
> >>
> >>
> >> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> >> +#if defined(CONFIG_KVM_SW_PROTECTED_VM) || \
> >> + defined(CONFIG_KVM_INTEL_TDX) || \
> >> + defined(CONFIG_KVM_AMD_SEV)
> >
> > Nit:
> > Vertically align the defined(XXX) statements for better readability?
> >
>
> Sean had this aligned with spaces, and checkpatch complained about
checkpatch is a tool, it is neither omniscient nor authoritative. And for things
like this, the *entire* purpose for rules/guildlines like "no tabs after spaces"
is to help ensure the code is easier to read, e.g. doesn't end up with wonky
formatting when viewed in certain editors or whatever. So, ignore checkpatch if
it complains about formatting that is visually superior to what makes checkpatch
happy.
> having no spaces before tabs, so I switched it to tabs instead since I
> don't think alignment like that is officially documented either way.
This exact case may not be "officially" documented, but the general gist is in
Documentation/process/maintainer-tip.rst:
When splitting function declarations or function calls, then please align
the first argument in the second line with the first argument in the first
line::
And there is lots and lots of prior art on-list (from me and others) that is more
or less as good as official documentation.
> Either way is fine :)
Please restore the alignment.
^ permalink raw reply
* Re: [PATCH v8 09/46] KVM: guest_memfd: Introduce function to check GFN private/shared status
From: Ackerley Tng @ 2026-06-24 14:38 UTC (permalink / raw)
To: Binbin Wu
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <1b59fec2-a464-4429-8532-880394912af5@linux.intel.com>
Binbin Wu <binbin.wu@linux.intel.com> writes:
>
> [...snip...]
>
>> +bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn)
>> +{
>> + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>> + struct inode *inode;
>> +
>> + /*
>> + * If this gfn has no associated memslot, there's no chance of the gfn
>> + * being backed by private memory, since guest_memfd must be used for
>> + * private memory,
>
> "guest_memfd must be used for private memory" is a bit confusing to me.
>
Hmm good point. Is the source of confusion that guest_memfd can be used
for both shared and private memory?
Perhaps this can be rephrased as:
guest_memfd is the only provider of private memory and guest_memfd must
be used with a memslot, hence if there's no associated memslot, there's
no chance of this gfn being private.
>> and guest_memfd must be associated with some memslot.
>> + */
>> + if (!slot)
>> + return 0;
>> +
>>
>> [...snip...]
>>
^ permalink raw reply
* Re: [RFCv2 PATCH 1/6] efi/unaccepted: Support hotplug memory in unaccepted bitmap via SRAT
From: Pratik R. Sampat @ 2026-06-24 14:23 UTC (permalink / raw)
To: Kiryl Shutsemau, Zhenzhong Duan
Cc: marcandre.lureau, david, rick.p.edgecombe, pbonzini, mst, peterx,
chenyi.qiang, elena.reshetova, michael.roth, ackerleytng,
linux-kernel, linux-coco, virtualization, x86, yilun.xu,
xiaoyao.li, chao.p.peng
In-Reply-To: <ajvLaBs62bDoxC3W@thinkstation>
On 6/24/26 8:25 AM, Kiryl Shutsemau wrote:
> On Tue, Jun 23, 2026 at 06:17:32AM -0400, Zhenzhong Duan wrote:
>> Currently, allocate_unaccepted_bitmap() only scans the initial EFI
>> boot memory map. This misses hotpluggable ranges described in the
>> ACPI SRAT. Without early tracking, hotplug pages are accessed without
>> acceptance and this triggers guest crash.
>>
>> Introduce a lightweight ACPI SRAT parser to scan these regions early.
>> If a region has both ACPI_SRAT_MEM_ENABLED and ACPI_SRAT_MEM_HOT_PLUGGABLE
>> flags, expand the tracking boundaries. This avoids pulling in the full
>> ACPI subsystem while ensuring the bitmap covers both static memory and
>> hotplug memory.
>
> Ugh.. Parsing SRAT there is ugly. I would rather avoid it.
>
I agree. Parsing it here means SRAT gets parsed twice, which doesn't make much
sense.
> Do I understand correctly that we don't have a way represent pluggable,
> but not present memory in EFI memory map?
>
> IIUC, EFI_MEMORY_HOT_PLUGGABLE is actually present, but unpluggable
> memory.
>
Right. And repurposing EFI_MEMORY_HOT_PLUGGABLE (plus updating the spec) would
likely make this messier: by its current definition it describes cold-plugged
pages that may be removed, not pages that may be hot-added later.
> Maybe it would be better just allocate bitmap upto maxmem?
>
> And fix EFI spec to add pluggable-but-not-present attribute.
>
I am currently working with the UEFI community around two proposals for a spec
change:
1. Add a new attribute, as Kiryl suggested, or
2. Add a generic new hotplug memory type that represents all the memory that
could be added later.
In either case, we could then precisely allocate the bitmap by parsing the
region with the attribute/type.
I prefer (1), but I have RFC proposals, code-first edk2 changes, and the Linux
plumbing ready for both approaches, and plan to post them in the following week
after ironing out a few kinks.
Thanks,
--Pratik
^ permalink raw reply
* Re: [PATCH v8 08/46] KVM: Provide generic interface for checking memory private/shared status
From: Ackerley Tng @ 2026-06-24 14:18 UTC (permalink / raw)
To: Suzuki K Poulose, Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <3ec15992-2a29-434b-8c99-8b86bfcf007e@arm.com>
Suzuki K Poulose <suzuki.poulose@arm.com> writes:
>
> [...snip...]
>
>>>> @@ -2546,7 +2546,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>>>> bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>>>> struct kvm_gfn_range *range);
>>>>
>>>> -static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>>>> +static inline bool kvm_vm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>>
>> Should have read the Sashiko review first, but where is this used?
>> It's not used at all in this series...
>
> See below:
>
>>
>> /fuad
>>
>>>> {
>>>> return kvm_get_vm_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
>>>> }
>>>> @@ -2557,6 +2557,16 @@ static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
>>>> KVM_MEMORY_ATTRIBUTE_PRIVATE,
>>>> KVM_MEMORY_ATTRIBUTE_PRIVATE);
>>>> }
>>>> +#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>>>> +
>>>> +#ifdef kvm_arch_has_private_mem
>>>> +typedef bool (kvm_mem_is_private_t)(struct kvm *kvm, gfn_t gfn);
>>>> +DECLARE_STATIC_CALL(__kvm_mem_is_private, kvm_mem_is_private_t);
>>>> +
>>>> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>>>> +{
>>>> + return static_call(__kvm_mem_is_private)(kvm, gfn);
>>>> +}
>>>> #else
>>>> static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>>>> {
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index 6669f1477013c..8b238e461b854 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -2627,6 +2627,20 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>>>> }
>>>> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>>>>
>>>> +#ifdef kvm_arch_has_private_mem
>>>> +DEFINE_STATIC_CALL_RET0(__kvm_mem_is_private, kvm_mem_is_private_t);
>>>> +EXPORT_STATIC_CALL_GPL(__kvm_mem_is_private);
>>>> +
>>>> +static void kvm_init_memory_attributes(void)
>>>> +{
>>>> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>>>> + static_call_update(__kvm_mem_is_private, kvm_vm_mem_is_private);
>>>> +#endif
>>>> +}
>
>
> Here ^^ as the static call update ?
>
>
> Suzuki
Thanks Suzuki, it is used here. kvm_mem_is_private() was and still is
the function used to check if some gfn is private or shared. Hence, in
this patch, the usages of kvm_mem_is_private() were not
updated. Instead, kvm_mem_is_private() is now set up as a static call,
and the static call is hard-wired to kvm_vm_mem_is_private() in this
patch.
In the later wiring patch, all the places where attributes are looked up
are updated all at once: if conversion enabled, take gmem route, else
take VM route.
kvm_mem_is_private() is special in that the if-else is done at KVM load
time rather than runtime, and I believe that's for performance reasons
since this is checked quite often from the KVM fault handling code.
Buut I think perhaps Fuad was referring to kvm_mem_range_is_private(),
which is indeed not used anywhere. Binbin also asked about this, I think
we should drop kvm_mem_range_is_private(). My reply to Binbin is at [1].
[1] https://lore.kernel.org/all/CAEvNRgGbBcrX5Fw3vNTsTOBNC=Ypi=9-S07674yPxLU9i4akjA@mail.gmail.com/
^ permalink raw reply
* Re: [PATCH v8 07/46] KVM: Rename memory attribute APIs to prepare for in-place gmem conversion
From: Ackerley Tng @ 2026-06-24 13:44 UTC (permalink / raw)
To: Binbin Wu
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <96fb369d-dbff-4ed6-b1f9-0ce63d7d4ed0@linux.intel.com>
Binbin Wu <binbin.wu@linux.intel.com> writes:
>
> [...snip...]
>
>> +static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
>> + gfn_t end)
>> +{
>> + return kvm_range_has_vm_memory_attributes(kvm, start, end,
>> + KVM_MEMORY_ATTRIBUTE_PRIVATE,
>> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
>> }
>
> This function is added, but never used in this patch series.
> Is it intended to be called only when CONFIG_KVM_VM_MEMORY_ATTRIBUTES is
> enabled?
>
Thank you for catching this! I think in some earlier revision this was
meant to be used from the guest_memfd populate flow.
I think the version of kvm_gmem_range_is_private in this revision is
good because it is symmetric. If conversion is enabled, call the gmem
range-has-attributes function, and if conversion is disabled, use the VM
range-has-attributes function.
Sean, if no new revision is needed would you be able to drop
kvm_mem_range_is_private() while you're pulling it in?
>>
>> [...snip...]
>>
^ permalink raw reply
* [PATCH] virt: coco: harden TSM MR attribute allocation
From: Yousef Alhouseen @ 2026-06-24 13:00 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-coco, linux-kernel, Yousef Alhouseen
tsm_mr_create_attribute_group() combines the bin_attribute pointer table
and generated MR name strings into one allocation. It open-coded both the
aggregate name length calculation and the final allocation size as plain
additions and multiplication.
The current in-tree caller uses a small static MR table, but this helper is
exported for confidential-computing guest drivers. Reject impossible MR
definitions instead of allowing arithmetic wraparound to under-allocate the
combined attributes buffer.
Use size_add() and array_size() for the name-length accumulation and the
final allocation size.
Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
drivers/virt/coco/guest/tsm-mr.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/drivers/virt/coco/guest/tsm-mr.c b/drivers/virt/coco/guest/tsm-mr.c
index 657b9c573..789988111 100644
--- a/drivers/virt/coco/guest/tsm-mr.c
+++ b/drivers/virt/coco/guest/tsm-mr.c
@@ -140,7 +140,11 @@ static ssize_t tm_digest_write(struct file *filp, struct kobject *kobj,
const struct attribute_group *
tsm_mr_create_attribute_group(const struct tsm_measurements *tm)
{
+ const struct bin_attribute **attrs __free(kfree) = NULL;
+ struct tm_context *ctx __free(kfree) = NULL;
+ size_t attrs_size, name_len;
size_t nlen;
+ char *name, *end;
if (!tm || !tm->mrs)
return ERR_PTR(-EINVAL);
@@ -164,8 +168,12 @@ tsm_mr_create_attribute_group(const struct tsm_measurements *tm)
return ERR_PTR(-EINVAL);
/* MR sysfs attribute names have the form of MRNAME:HASH */
- nlen += strlen(tm->mrs[i].mr_name) + 1 +
- strlen(hash_algo_name[tm->mrs[i].mr_hash]) + 1;
+ name_len = size_add(strlen(tm->mrs[i].mr_name),
+ strlen(hash_algo_name[tm->mrs[i].mr_hash]));
+ name_len = size_add(name_len, 2);
+ nlen = size_add(nlen, name_len);
+ if (name_len == SIZE_MAX || nlen == SIZE_MAX)
+ return ERR_PTR(-EINVAL);
}
/*
@@ -173,11 +181,13 @@ tsm_mr_create_attribute_group(const struct tsm_measurements *tm)
* so that we don't have to free MR names one-by-one in
* tsm_mr_free_attribute_group()
*/
- const struct bin_attribute **attrs __free(kfree) =
- kzalloc(sizeof(*attrs) * (tm->nr_mrs + 1) + nlen, GFP_KERNEL);
- struct tm_context *ctx __free(kfree) =
- kzalloc_flex(*ctx, mrs, tm->nr_mrs);
- char *name, *end;
+ attrs_size = size_add(array_size(size_add(tm->nr_mrs, 1),
+ sizeof(*attrs)), nlen);
+ if (attrs_size == SIZE_MAX)
+ return ERR_PTR(-EINVAL);
+
+ attrs = kzalloc(attrs_size, GFP_KERNEL);
+ ctx = kzalloc_flex(*ctx, mrs, tm->nr_mrs);
if (!ctx || !attrs)
return ERR_PTR(-ENOMEM);
--
2.54.0
^ permalink raw reply related
* Re: [RFCv2 PATCH 5/6] mm/memory_hotplug: Support ACPI hotplug/unplug for coco guest
From: Kiryl Shutsemau @ 2026-06-24 12:33 UTC (permalink / raw)
To: Zhenzhong Duan
Cc: marcandre.lureau, david, rick.p.edgecombe, prsampat, pbonzini,
mst, peterx, chenyi.qiang, elena.reshetova, michael.roth,
ackerleytng, linux-kernel, linux-coco, virtualization, x86,
yilun.xu, xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-6-zhenzhong.duan@intel.com>
On Tue, Jun 23, 2026 at 06:17:36AM -0400, Zhenzhong Duan wrote:
> + spin_lock_irqsave(&unaccepted_memory_lock, flags);
> + for (; range_start < bitmap_size; range_start = range_end) {
> + unsigned long phys_start, phys_end;
> + unsigned long unaccepted_one, plugged_zero;
> +
> + range_start = find_next_andnot_bit(plugged_bitmap, unaccepted->bitmap,
> + bitmap_size, range_start);
> +
> + if (range_start >= bitmap_size)
> + break;
> +
> + unaccepted_one = find_next_bit(unaccepted->bitmap, bitmap_size, range_start);
> + plugged_zero = find_next_zero_bit(plugged_bitmap, bitmap_size, range_start);
> + range_end = min(unaccepted_one, plugged_zero);
> +
> + phys_start = range_start * unit_size + unaccepted->phys_base;
> + phys_end = range_end * unit_size + unaccepted->phys_base;
> +
> + arch_unaccept_memory(phys_start, phys_end);
> + bitmap_set(unaccepted->bitmap, range_start, range_end - range_start);
> + }
> + spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
Accept TDCALL under the spin lock will kill scalability.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [RFCv2 PATCH 2/6] efi/unaccepted: Set unaccepted bits for all hotplug memory
From: Kiryl Shutsemau @ 2026-06-24 12:29 UTC (permalink / raw)
To: Zhenzhong Duan
Cc: marcandre.lureau, david, rick.p.edgecombe, prsampat, pbonzini,
mst, peterx, chenyi.qiang, elena.reshetova, michael.roth,
ackerleytng, linux-kernel, linux-coco, virtualization, x86,
yilun.xu, xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-3-zhenzhong.duan@intel.com>
On Tue, Jun 23, 2026 at 06:17:33AM -0400, Zhenzhong Duan wrote:
> In coco guests, hotpluggable memory ranges are initially unaccepted.
> While a previous change expanded the unaccepted memory bitmap boundaries
> to include these hotplug spaces, the actual bits inside the bitmap are
> not yet marked as unaccepted.
>
> Walks SRAT a second time after the bitmap is allocated and sets the bits
> corresponding to hotpluggable ranges.
>
> This ensures the bitmap state accurately reflects all static and hotplug
> memory ranges before booting kernel.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> .../firmware/efi/libstub/unaccepted_memory.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
> index bfbb78bd7b8a..01bed8e751ca 100644
> --- a/drivers/firmware/efi/libstub/unaccepted_memory.c
> +++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
> @@ -92,6 +92,23 @@ static void update_mem_boundaries(struct acpi_srat_mem_affinity *mem, struct sra
> *(ctx->mem_end) = range_end;
> }
>
> +static void mark_hotplug_memory_unaccepted(struct acpi_srat_mem_affinity *mem,
> + struct srat_parse_ctx *ctx)
> +{
> + u64 unit_size = unaccepted_table->unit_size;
> + u64 start, end;
> +
> + start = round_up(mem->base_address, unit_size);
> + end = round_down(mem->base_address + mem->length, unit_size);
We can get here with start > end if srat range is less then unit_size.
> +
> + /* Translate to offsets from the beginning of the bitmap */
> + start -= unaccepted_table->phys_base;
> + end -= unaccepted_table->phys_base;
> +
> + bitmap_set(unaccepted_table->bitmap,
> + start / unit_size, (end - start) / unit_size);
> +}
> +
> efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
> struct efi_boot_memmap *map)
> {
> @@ -169,6 +186,7 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
> unaccepted_table->phys_base = unaccepted_start;
> unaccepted_table->size = bitmap_size;
> memset(unaccepted_table->bitmap, 0, bitmap_size);
> + parse_acpi_srat_regions(mark_hotplug_memory_unaccepted, &ctx);
>
> status = efi_bs_call(install_configuration_table,
> &unaccepted_table_guid, unaccepted_table);
> --
> 2.52.0
>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [RFCv2 PATCH 1/6] efi/unaccepted: Support hotplug memory in unaccepted bitmap via SRAT
From: Kiryl Shutsemau @ 2026-06-24 12:25 UTC (permalink / raw)
To: Zhenzhong Duan
Cc: marcandre.lureau, david, rick.p.edgecombe, prsampat, pbonzini,
mst, peterx, chenyi.qiang, elena.reshetova, michael.roth,
ackerleytng, linux-kernel, linux-coco, virtualization, x86,
yilun.xu, xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-2-zhenzhong.duan@intel.com>
On Tue, Jun 23, 2026 at 06:17:32AM -0400, Zhenzhong Duan wrote:
> Currently, allocate_unaccepted_bitmap() only scans the initial EFI
> boot memory map. This misses hotpluggable ranges described in the
> ACPI SRAT. Without early tracking, hotplug pages are accessed without
> acceptance and this triggers guest crash.
>
> Introduce a lightweight ACPI SRAT parser to scan these regions early.
> If a region has both ACPI_SRAT_MEM_ENABLED and ACPI_SRAT_MEM_HOT_PLUGGABLE
> flags, expand the tracking boundaries. This avoids pulling in the full
> ACPI subsystem while ensuring the bitmap covers both static memory and
> hotplug memory.
Ugh.. Parsing SRAT there is ugly. I would rather avoid it.
Do I understand correctly that we don't have a way represent pluggable,
but not present memory in EFI memory map?
IIUC, EFI_MEMORY_HOT_PLUGGABLE is actually present, but unpluggable
memory.
Maybe it would be better just allocate bitmap upto maxmem?
And fix EFI spec to add pluggable-but-not-present attribute.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCH v2 02/17] x86/virt/tdx: Configure add-on features on TDX module init and update
From: Xu Yilun @ 2026-06-24 12:00 UTC (permalink / raw)
To: Dave Hansen
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
tony.lindgren, peter.fang, baolu.lu, zhenzhong.duan, dave.hansen,
seanjc
In-Reply-To: <4f4b0f29-424b-45ed-8cfd-c77da2ea390f@intel.com>
> There's also zero stopping us from putting version in args:
>
> struct tdx_module_args args = {};
> int ret;
>
> if (tdx_addon_feature0) {
> args.r9 = tdx_addon_feature0;
> args.version = 1;
> }
>
> ret = seamcall_prerr(TDH_SYS_UPDATE, &args);
>
> Eh?
>
> That gives args.version==0 in all the normal cases which just happens to
> be the exact behavior we want. It also avoids having to plumb version
> through all the seamcall*() wrappers.
Ah, on 2nd reading, I'm pretty sure now I understand your logical argument in
patch 1 and 2. It's good to me. I append my diff at the end.
>
> But this is *exactly* the kind of thing that shouldn't be a part of an
> attestation patch series. This could very much have been a separate
> discussion and happened a month or a year ago. But now it is blocking
> this DICE thing from getting done <grumble>.
Sorry, I should have been more active in searching for the solution
rather than sticking to "kernel never keeps versions", when I've found
the problem that public modules are not available.
----8<----
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index f20e91d7ac35..972880910a5e 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -143,6 +143,8 @@ struct tdx_module_args {
u64 rbx;
u64 rdi;
u64 rsi;
+ /* for RAX encoding */
+ u8 version;
};
/* Used to communicate with the TDX module */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 081816888f7a..b3c00ff4d819 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -95,6 +95,7 @@ static void __used common(void)
OFFSET(TDX_MODULE_rbx, tdx_module_args, rbx);
OFFSET(TDX_MODULE_rdi, tdx_module_args, rdi);
OFFSET(TDX_MODULE_rsi, tdx_module_args, rsi);
+ OFFSET(TDX_MODULE_version, tdx_module_args, version);
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
index 016a2a1ec1d6..d1d3d40c5614 100644
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -48,6 +48,14 @@
/* Move Leaf ID to RAX */
mov %rdi, %rax
+ /*
+ * Extract the version from 'struct tdx_module_args', append it to
+ * RAX[23:16]
+ */
+ movzbl TDX_MODULE_version(%rsi), %ecx
+ shll $16, %ecx
+ orq %rcx, %rax
+
/* Move other input regs from 'struct tdx_module_args' */
movq TDX_MODULE_rcx(%rsi), %rcx
movq TDX_MODULE_rdx(%rsi), %rdx
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index a6f8fd0a3df0..bc3aa1f78fc8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1036,7 +1036,6 @@ static __init void set_tdx_addon_features(void)
static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
u64 global_keyid)
{
- u64 seamcall_fn = TDH_SYS_CONFIG_V0;
struct tdx_module_args args = {};
u64 *tdmr_pa_array;
size_t array_sz;
@@ -1059,18 +1058,18 @@ static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));
+ set_tdx_addon_features();
+
args.rcx = __pa(tdmr_pa_array);
args.rdx = tdmr_list->nr_consumed_tdmrs;
args.r8 = global_keyid;
- set_tdx_addon_features();
-
if (tdx_addon_feature0) {
args.r9 = tdx_addon_feature0;
- seamcall_fn = TDH_SYS_CONFIG;
+ args.version = 1;
}
- ret = seamcall_prerr(seamcall_fn, &args);
+ ret = seamcall_prerr(TDH_SYS_CONFIG, &args);
/* Free the array as it is not required anymore. */
kfree(tdmr_pa_array);
@@ -1761,16 +1760,15 @@ int tdx_module_shutdown(void)
int tdx_module_run_update(void)
{
- u64 seamcall_fn = TDH_SYS_UPDATE_V0;
struct tdx_module_args args = {};
int ret;
if (tdx_addon_feature0) {
args.r9 = tdx_addon_feature0;
- seamcall_fn = TDH_SYS_UPDATE;
+ args.version = 1;
}
- ret = seamcall_prerr(seamcall_fn, &args);
+ ret = seamcall_prerr(TDH_SYS_UPDATE, &args);
if (ret)
return ret;
@@ -2353,6 +2351,7 @@ u64 tdh_vp_init(struct tdx_vp *vp, u64 initial_rcx, u32 x2apicid)
.rcx = vp->tdvpr_pa,
.rdx = initial_rcx,
.r8 = x2apicid,
+ .version = 1,
};
return seamcall(TDH_VP_INIT, &args);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 32b13b0c85f9..018988c25caa 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -44,7 +44,7 @@
#define TDH_VP_CREATE 10
#define TDH_MNG_KEY_FREEID 20
#define TDH_MNG_INIT 21
-#define TDH_VP_INIT SEAMCALL_LEAF_VER(22, 1)
+#define TDH_VP_INIT 22
#define TDH_PHYMEM_PAGE_RDMD 24
#define TDH_VP_RD 26
#define TDH_PHYMEM_PAGE_RECLAIM 28
@@ -58,11 +58,9 @@
#define TDH_PHYMEM_CACHE_WB 40
#define TDH_PHYMEM_PAGE_WBINVD 41
#define TDH_VP_WR 43
-#define TDH_SYS_CONFIG_V0 45
-#define TDH_SYS_CONFIG SEAMCALL_LEAF_VER(TDH_SYS_CONFIG_V0, 1)
+#define TDH_SYS_CONFIG 45
#define TDH_SYS_SHUTDOWN 52
-#define TDH_SYS_UPDATE_V0 53
-#define TDH_SYS_UPDATE SEAMCALL_LEAF_VER(TDH_SYS_UPDATE_V0, 1)
+#define TDH_SYS_UPDATE 53
#define TDH_EXT_INIT 60
#define TDH_EXT_MEM_ADD 61
#define TDH_SYS_DISABLE 69
^ permalink raw reply related
* Re: [PATCH v8 05/46] KVM: Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable
From: Ackerley Tng @ 2026-06-24 0:14 UTC (permalink / raw)
To: Sean Christopherson, Julian Braha
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ajnQVkLvFl_lMuGB@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Fri, Jun 19, 2026, Julian Braha wrote:
>> Hi Ackerley,
>>
>> On 6/19/26 01:31, Ackerley Tng via B4 Relay wrote:
>>
>> > config KVM_VM_MEMORY_ATTRIBUTES
>> > - bool
>> > + depends on KVM_SW_PROTECTED_VM || KVM_INTEL_TDX || KVM_AMD_SEV
>> > + bool "Enable per-VM PRIVATE vs. SHARED attributes (for CoCo VMs)"
>>
>> Sorry for the style nitpick, but could you keep the type and prompt as
>> the first attribute in the Kconfig option definition (like the other
>> options do)?
>
> No need to be sorry, I've no idea why I put the "depends" first. I don't even
> know if that qualifies as a nit :-)
>
> Ackerley, if you can provide your SoB (for Fuad's feedback), I can fixup when
> applying (assuming nothing else necessitates v9).
Thanks, didn't notice this when consolidating this revision.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
^ permalink raw reply
* Re: [PATCH v8 04/46] KVM: Decouple kvm_has_arch_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
From: Ackerley Tng @ 2026-06-24 0:13 UTC (permalink / raw)
To: Binbin Wu
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <a21bfc05-787e-4cd8-89af-8579357e6a12@linux.intel.com>
Binbin Wu <binbin.wu@linux.intel.com> writes:
> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
>> From: Sean Christopherson <seanjc@google.com>
>>
>> When memory attributes become trackable in guest_memfd, the concept of
>> having private memory is no longer dependent on
>> CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
>>
>> With this, on x86, kvm_arch_has_private_mem() is defined if some CoCo
>> platform support (or the testing CONFIG_KVM_SW_PROTECTED_VM) is compiled
>> in.
>>
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
>
> One nit below.
>
>> ---
>> arch/x86/include/asm/kvm_host.h | 4 +++-
>> include/linux/kvm_host.h | 2 +-
>> 2 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 8e8eb8a5e8a6b..1bde67cf6eb0e 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -2394,7 +2394,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>> int tdp_max_root_level, int tdp_huge_page_level);
>>
>>
>> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>> +#if defined(CONFIG_KVM_SW_PROTECTED_VM) || \
>> + defined(CONFIG_KVM_INTEL_TDX) || \
>> + defined(CONFIG_KVM_AMD_SEV)
>
> Nit:
> Vertically align the defined(XXX) statements for better readability?
>
Sean had this aligned with spaces, and checkpatch complained about
having no spaces before tabs, so I switched it to tabs instead since I
don't think alignment like that is officially documented either way.
Either way is fine :)
>> #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
>> #endif
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 201d0f2143976..d370e834d619e 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>> }
>> #endif
>>
>> -#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>> +#ifndef kvm_arch_has_private_mem
>> static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>> {
>> return false;
>>
^ permalink raw reply
* Re: [PATCH v8 01/46] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
From: Ackerley Tng @ 2026-06-24 0:09 UTC (permalink / raw)
To: Sean Christopherson, Binbin Wu
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ajnjTJdQKD1Kz3tf@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Mon, Jun 22, 2026, Binbin Wu wrote:
>> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
>>
>> [...]
>>
>> >
>> > +static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
>> > +{
>> > + struct maple_tree *mt = &GMEM_I(inode)->attributes;
>> > + void *entry = mtree_load(mt, index);
>> > +
>> > + return WARN_ON_ONCE(!entry) ? 0 : xa_to_value(entry);
>>
>> If the entry is unexpectedly missing, returning 0 means the attribute would
>> be treated as shared. And then in kvm_gmem_fault_user_mapping(), it would
>> allow the userspace to fault in the folio.
>>
>> Should gmem deny such edge case?
>
> After several bugs this year where a WARN_ON_ONCE() fired, but was entirely
> insufficient to prevent true badness, I'm definitely senstive to making the "bad"
> behavior as harmless as possible.
>
I guess both are indeed awkward.
> However, in this case I think we're just hosed. If KVM treats the memory as
> private, KVM will incorrectly do prepare(), incorrectly allow populate(), and
> will caused missed invalidations (though I suppose __kvm_gmem_set_attributes()
> "only" lies to userspace in that case).
>
> That said, assuming SHARED is definitely odd for cases where guest_memfd *can't*
> hold shared memory. Ditto for assuming PRIVATE. What if we instead fall back to
> the "init" state, e.g.?
>
> static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
> {
> struct maple_tree *mt = &GMEM_I(inode)->attributes;
> void *entry = mtree_load(mt, index);
>
> if (WARN_ON_ONCE(!entry)) {
> bool shared = GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED;
>
> return shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;
I was wondering if we should not only return the init state but also set
the init state, but that would involve performing a conversion to the
init state... Too complicated for an edge case.
> }
>
> return xa_to_value(entry);
> }
Thanks Binbin and Sean!
^ permalink raw reply
* Re: [PATCH v8 3/7] crypto/ccp: Disable CPU hotplug while SNP is active
From: Kalra, Ashish @ 2026-06-23 19:15 UTC (permalink / raw)
To: Ackerley Tng, Jethro Beekman, tglx, mingo, bp, dave.hansen, x86,
hpa, seanjc, peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <CAEvNRgHDNGCETxLsy0v-_cBO1=1U+tXtOXWEFrXLU7pYz7U9ow@mail.gmail.com>
Hello Ackerley,
On 6/23/2026 12:48 PM, Ackerley Tng wrote:
> Jethro Beekman <jethro@fortanix.com> writes:
>
>> On 2026-06-15 21:49, Ashish Kalra wrote:
>>> From: Ashish Kalra <ashish.kalra@amd.com>
>>>
>>> The SEV firmware enumerates the CPUs at SNP initialization and is not
>>> aware of the OS bringing CPUs online or offline afterwards, so OS CPU
>>> hotplug can diverge from the firmware's expectations and break SNP.
>>> Disable CPU hotplug while SNP is active.
>>
>> I think this is too broad. If I have a hypervisor that supports SNP virtualization, a (non-confidential) L1 guest running Linux should still support CPU hotplug while also running confidential L2 guests.
>>
>> --
>> Jethro Beekman | CTO | Fortanix
>>
>
> Were any other solutions considered other than disabling CPU hotplug?
>
> Is this temporary until something else is implemented?
>
> I'm not sure how commonly CPU hotplug is used, and if people are okay
> with trading in CPU hotplug to get SNP.
>
> Is it that fundamentally the SEV firmware can't support hotplug, so
> there's no point in keeping it enabled anyway?
Yes, essentially. The SEV firmware knows nothing about when the OS takes CPUs online or offline. At SNP_INIT it accounts for all
the CPUs enabled via the BIOS/UEFI and establishes the per-core SNP state for them; it has no notion of the OS bringing CPUs up or
down afterwards. So OS hotplug actions can diverge from the firmware's expectations and break SNP. Disabling hotplug just makes
that constraint explicit — there's nothing useful to keep it enabled for: a hot-removed core still "exists" as far as
the firmware and the per-core RMP/RMPOPT state are concerned, and a core brought online later was never set up for SNP.
>
> Is there some way of supporting hotplug for CPUs that won't be used with
> SNP, for serving non-SNP VMs on the same host as SNP VMs, or is that too
> complicated?
>
Not really. SNP's memory-integrity guarantee rests on a single invariant: every memory write is subject to RMP checks to protect
against corruption of SEV-SNP guest memory. The moment any CPU can issue writes that aren't RMP-checked, that protection is
broken for the whole system — it's not something that can be confined to "that one core."
That's because SNP isn't per-core in that sense — it's a system-wide mode. SYSCFG[SNP_EN] is set on every core, the RMP covers all
of physical memory, and once SNP is enabled every memory write is subject to RMP checks on every core. A non-SNP guest sharing the
host still runs on cores that are part of the SNP-enabled system.
By the SNP architecture there simply can't be a CPU that isn't doing RMP checks while SNP is active, so SNP_EN has to be enforced
on every core. RMP enforcement is gated per-core by SYSCFG[SNP_EN] and it must be set on every core before SNP_INIT; a core with
SNP_EN clear performs no RMP checks at all, which the architecture doesn't allow once SNP is up. A newly hotplugged CPU comes up
without SNP_EN (SNP not enabled on it), and since it wasn't present when SNP_INIT ran it isn't part of the initialized SNP
configuration either — so it does no RMP checking. And because an SNP guest's vCPUs (or any guest for that matter) can be scheduled
on any online CPU, the guest could end up running on that core, accessing memory with no RMP enforcement and breaking SNP's memory integrity.
There's no way to prevent that: KVM doesn't fence SNP guests (or any guests for that matter) off particular online cores.
And carving out cores for non-SNP-only use isn't possible by the architecture: SNP requires RMP checks on every CPU, so there's no valid
configuration with SNP active and a subset of cores exempt.
Thanks,
Ashish
>>>
>>> [...snip...]
>>>
^ permalink raw reply
* SVSM Development Call June 24th, 2026
From: Jörg Rödel @ 2026-06-23 19:01 UTC (permalink / raw)
To: coconut-svsm, linux-coco
Hi,
Here is the call for agenda items for this weeks SVSM development call. Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.
We will use the LF Zoom instance. Details of the meeting can be found in our
governance repository at:
https://github.com/coconut-svsm/governance
The link to the COCONUT-SVSM calendar is:
https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week
The meeting will be recorded and the recording eventually published.
Regards,
Jörg
^ permalink raw reply
* Re: [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ackerley Tng @ 2026-06-23 17:50 UTC (permalink / raw)
To: Kalra, Ashish, K Prateek Nayak, tglx, mingo, bp, dave.hansen, x86,
hpa, seanjc, peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
jackyli, pgonda, rientjes, jacobhxu, xin, pawan.kumar.gupta,
babu.moger, dyoung, nikunj, john.allen, darwi, linux-kernel,
linux-crypto, kvm, linux-coco
In-Reply-To: <8c5f4082-e3a5-4f65-b058-33938a7ee324@amd.com>
"Kalra, Ashish" <ashish.kalra@amd.com> writes:
>
> [...snip...]
>
>
> Yes, a simpler implementation will be like this:
> ...
>
> if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
Perhaps have a WARN_ON_ONCE() here so we know rmpopt was not performed?
Not a huge deal without though.
> return;
>
> cpumask_copy(follower_mask, &rmpopt_cpumask);
>
> /*
> * The current CPU's core always has RMPOPT_BASE programmed
> * (snp_prepare() required all CPUs online at setup and CPU hotplug
> * is disabled while SNP is active), so it can always be the leader.
> * RMPOPT_BASE is per-core; exclude this core from the followers.
> */
> migrate_disable();
> cpumask_andnot(follower_mask, follower_mask,
> topology_sibling_cpumask(smp_processor_id()));
>
> for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> rmpopt(pa);
> cond_resched();
> }
> migrate_enable();
>
> cpus_read_lock();
> for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> on_each_cpu_mask(follower_mask, rmpopt_smp, (void *)pa, true);
> cond_resched();
> }
> cpus_read_unlock();
>
> free_cpumask_var(follower_mask);
>
>
Definitely better than the version in the original patch :) Thanks!
> Here, the leader exclusion must use the sibling mask, not clear_cpu(this_cpu). That's why my collapsed version uses:
>
> cpumask_andnot(follower_mask, follower_mask,
> topology_sibling_cpumask(smp_processor_id()));
>
> - If this_cpu is a primary: its sibling mask contains itself (the primary) -> andnot removes this core's primary from the followers.
>
> - If this_cpu is a secondary: it isn't in follower_mask at all, but its sibling mask contains its primary, which is in
> follower_mask -> andnot still removes this core's primary.
>
> So either way the current core is dropped from the followers. (The old code needed two branches because case #1 used
> clear_cpu(this_cpu) — only correct when this_cpu is the primary — while case #2 used the sibling andnot. The single andnot works for
> both cases).
>
> Thanks,
> Ashish
>
>>> + goto followers;
>>> + }
>>> +
>>> + migrate_enable();
>>> +
^ permalink raw reply
* Re: [PATCH v8 3/7] crypto/ccp: Disable CPU hotplug while SNP is active
From: Ackerley Tng @ 2026-06-23 17:48 UTC (permalink / raw)
To: Jethro Beekman, Ashish Kalra, tglx, mingo, bp, dave.hansen, x86,
hpa, seanjc, peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <0df3b665-3a9c-4c46-a7aa-14388e8e1577@fortanix.com>
Jethro Beekman <jethro@fortanix.com> writes:
> On 2026-06-15 21:49, Ashish Kalra wrote:
>> From: Ashish Kalra <ashish.kalra@amd.com>
>>
>> The SEV firmware enumerates the CPUs at SNP initialization and is not
>> aware of the OS bringing CPUs online or offline afterwards, so OS CPU
>> hotplug can diverge from the firmware's expectations and break SNP.
>> Disable CPU hotplug while SNP is active.
>
> I think this is too broad. If I have a hypervisor that supports SNP virtualization, a (non-confidential) L1 guest running Linux should still support CPU hotplug while also running confidential L2 guests.
>
> --
> Jethro Beekman | CTO | Fortanix
>
Were any other solutions considered other than disabling CPU hotplug?
Is this temporary until something else is implemented?
I'm not sure how commonly CPU hotplug is used, and if people are okay
with trading in CPU hotplug to get SNP.
Is it that fundamentally the SEV firmware can't support hotplug, so
there's no point in keeping it enabled anyway?
Is there some way of supporting hotplug for CPUs that won't be used with
SNP, for serving non-SNP VMs on the same host as SNP VMs, or is that too
complicated?
>>
>> [...snip...]
>>
^ permalink raw reply
* Re: [PATCH 1/4] kvm: sev: Fix user-space triggerable WARN_ON on snp_launch_update path
From: Sean Christopherson @ 2026-06-23 14:46 UTC (permalink / raw)
To: Jörg Rödel
Cc: Paolo Bonzini, x86, Kiryl Shutsemau, Rick Edgecombe, Tom Lendacky,
Ashish Kalra, Michael Roth, kvm, linux-kernel, linux-coco,
Joerg Roedel
In-Reply-To: <20260623091556.1500930-2-joro@8bytes.org>
Please capitalize the scope, i.e. "KVM: SEV:".
On Tue, Jun 23, 2026, Jörg Rödel wrote:
> From: Joerg Roedel <joerg.roedel@amd.com>
>
> Sashiko reported on an unrelated patch:
>
> [Severity: High]
> This is a pre-existing issue, but can a host userspace process trigger a
> kernel warning by passing a NULL user address (uaddr = 0) here?
>
> If params.uaddr is 0, src becomes NULL and passes the PAGE_ALIGNED(src)
> check. kvm_gmem_populate() skips fetching the user page and passes
> src_page = NULL to sev_gmem_post_populate().
>
> That function then unconditionally evaluates:
>
> WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO &&
> !src_page)
>
> Since the type isn't ZERO, won't this allow an unprivileged user to spam
> the kernel log?
Use Reported-by: + Closes to capture Sashiko's effecitve bug report instead of
copy+pasting the finding. There's no reason to treat Sashiko any differently
than any other bot.
> The assessment is correct, so check for this condition earlier in the
> snp_launch_update() path to avoid the WARN_ON_ONCE.
>
> Fixes: dee5a47cc7a45 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> ---
> arch/x86/kvm/svm/sev.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 6c6a6d663e29..41dcba5180ca 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2438,6 +2438,13 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> if (!PAGE_ALIGNED(src))
> return -EINVAL;
>
> + /*
> + * Make sure user-mode did not pass NULL as src with
> + * type != KVM_SEV_SNP_PAGE_TYPE_ZERO.
Meh, that's pretty obvious from the code.
> + */
> + if (src == NULL && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
I think I'd prefer this over checking for KVM_SEV_SNP_PAGE_TYPE_ZERO twice,
especially since the PAGE_ALIGNED() check for the NULL pointer case is rather
weird.
if (params.type == KVM_SEV_SNP_PAGE_TYPE_ZERO)
src = NULL;
else if (!params.uaddr || !PAGE_ALIGNED(params.uaddr))
return -EINVAL;
else
src = u64_to_user_ptr(params.uaddr);
> + return -EINVAL;
Gah, we created quite the mess for ourselves. TDX returns -EOPNOTSUPP instead
of -EINVAL, I guess as a placeholder for in-place conversion? I don't care which
error code is returned, but I do want KVM to be consistent.
We should also adjust TDX to pre-check the source address, because checking only
in the post-populate flow subtly relies on tdx_vcpu_init_mem_region() returning
immediately on error. If that weren't the case (ignoring for the moment that
continuing on would be nonsensical), KVM would advace the address by PAGE_SIZE
and suddenly a NULL userspace address becomes non-NULL.
I also think it makes sense to drop the WARN in sev_gmem_post_populate(), it's
completely redundant once this bug is fixed.
Ugh, and both SNP and TDX fail to account for tags, and fail to check for
striding into kernel space. Which I suppose is fine, since gup() handles those
correctly. And I don't see a strong argument for disallowing tagged addresses,
because unlike the userspace address for memslots, KVM doesn't keep the address
around long-term.
So over two patches, the below? I can send a v2, I've already got changelogs
written (I was fiddling around with extracting and reusing kvm_set_memory_region()'s
checks on the userspace address+size, but as above, convinced myself that KVM
should continue to allow tagged addresses for SNP and TDX).
---
arch/x86/kvm/svm/sev.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 74fb15551e83..621a2eaa58f2 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2330,9 +2330,6 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
int level;
int ret;
- if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
- return -EINVAL;
-
ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);
if (ret || assigned) {
pr_debug("%s: Failed to ensure GFN 0x%llx RMP entry is initial shared state, ret: %d assigned: %d\n",
@@ -2421,10 +2418,12 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
params.type != KVM_SEV_SNP_PAGE_TYPE_CPUID))
return -EINVAL;
- src = params.type == KVM_SEV_SNP_PAGE_TYPE_ZERO ? NULL : u64_to_user_ptr(params.uaddr);
-
- if (!PAGE_ALIGNED(src))
+ if (params.type == KVM_SEV_SNP_PAGE_TYPE_ZERO)
+ src = NULL;
+ else if (!params.uaddr || !PAGE_ALIGNED(params.uaddr))
return -EINVAL;
+ else
+ src = u64_to_user_ptr(params.uaddr);
npages = params.len / PAGE_SIZE;
---
arch/x86/kvm/vmx/tdx.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ffe9d0db58c5..b0ec054732b9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3198,9 +3198,6 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
return -EIO;
- if (!src_page)
- return -EOPNOTSUPP;
-
kvm_tdx->page_add_src = src_page;
ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
kvm_tdx->page_add_src = NULL;
@@ -3247,8 +3244,8 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
if (copy_from_user(®ion, u64_to_user_ptr(cmd->data), sizeof(region)))
return -EFAULT;
- if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) ||
- !region.nr_pages ||
+ if (!PAGE_ALIGNED(region.source_addr) || !region.source_addr ||
+ !PAGE_ALIGNED(region.gpa) || !region.nr_pages ||
region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
!vt_is_tdx_private_gpa(kvm, region.gpa) ||
!vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
--
^ permalink raw reply related
* Re: [PATCH 3/4] KVM: guest_memfd: Add `write` parameter to kvm_gmem_populate()
From: Sean Christopherson @ 2026-06-23 12:57 UTC (permalink / raw)
To: Jörg Rödel
Cc: Paolo Bonzini, x86, Kiryl Shutsemau, Rick Edgecombe, Tom Lendacky,
Ashish Kalra, Michael Roth, kvm, linux-kernel, linux-coco,
Joerg Roedel
In-Reply-To: <20260623091556.1500930-4-joro@8bytes.org>
On Tue, Jun 23, 2026, Jörg Rödel wrote:
> From: Joerg Roedel <joerg.roedel@amd.com>
>
> The call-path of kvm_gmem_populate() might subsequently write to the
> page provided by user-space. This is used to provide detailed error
> information in case the page population failed.
>
> But since kvm_gmem_populate() only acquires a read-only reference to
> the user-space page via get_user_pages_fast(), the error information
> might be written to a read-only page later on.
>
> Add a parameter to kvm_gmem_populate() to optionally acquire a
> writeable reference to the source page to make sure page permissions
> can be enforced.
Already fixed, commit f13e90059908 ("KVM: SEV: Pin source page for write when
adding CPUID data for SNP guest").
^ permalink raw reply
* [RFCv2 PATCH 6/6] virtio-mem: Support memory hotplug/unplug for coco guest
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
michael.roth, ackerleytng
Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>
Integrate coco memory management operations into the virtio-mem driver to
manage the state of hotplug memory.
In virtio_mem_send_plug_request(), once the host hypervisor acknowledges a
plug request, invoke coco_set_plugged_bitmap() to set the corresponding
bits in the plugged bitmap. Conversely, in virtio_mem_send_unplug_request()
and virtio_mem_send_unplug_all_request(), call unaccept_memory() to let the
guest autonomously transition the target private pages back to "unaccepted"
state before asking the VMM to unplug them. After the VMM acknowledges the
unplug request, clear the ranges from the plugged bitmap.
Note that memory block hotplug/unplug also sets or clears the plugged
bitmap at memory block granularity. While doing this at device block
granularity here creates a slight redundancy, it is completely harmless.
Additionally, update virtio_mem_fake_online() to explicitly invoke
accept_memory() when transitioning memory out of the fake-offline state and
back into service. This ensures that any pages returning to the buddy
system are cleanly accepted by the guest architecture before they are freed
back into the allocator via free_contig_range().
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
drivers/virtio/virtio_mem.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 48051e9e98ab..9f6e53df8caf 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1211,6 +1211,7 @@ static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
generic_online_page(page, order);
} else {
virtio_mem_clear_fake_offline(pfn + i, 1 << order, true);
+ accept_memory(page_to_phys(page), PAGE_SIZE << order);
free_contig_range(pfn + i, 1 << order);
adjust_managed_page_count(page, 1 << order);
}
@@ -1436,6 +1437,7 @@ static int virtio_mem_send_plug_request(struct virtio_mem *vm, uint64_t addr,
switch (virtio_mem_send_request(vm, &req)) {
case VIRTIO_MEM_RESP_ACK:
vm->plugged_size += size;
+ WARN_ON(coco_set_plugged_bitmap(addr, size, true));
return 0;
case VIRTIO_MEM_RESP_NACK:
rc = -EAGAIN;
@@ -1471,9 +1473,12 @@ static int virtio_mem_send_unplug_request(struct virtio_mem *vm, uint64_t addr,
dev_dbg(&vm->vdev->dev, "unplugging memory: 0x%llx - 0x%llx\n", addr,
addr + size - 1);
+ unaccept_memory(addr, size);
+
switch (virtio_mem_send_request(vm, &req)) {
case VIRTIO_MEM_RESP_ACK:
vm->plugged_size -= size;
+ WARN_ON(coco_set_plugged_bitmap(addr, size, false));
return 0;
case VIRTIO_MEM_RESP_BUSY:
rc = -ETXTBSY;
@@ -1498,10 +1503,13 @@ static int virtio_mem_send_unplug_all_request(struct virtio_mem *vm)
dev_dbg(&vm->vdev->dev, "unplugging all memory");
+ unaccept_memory(vm->addr, vm->region_size);
+
switch (virtio_mem_send_request(vm, &req)) {
case VIRTIO_MEM_RESP_ACK:
vm->unplug_all_required = false;
vm->plugged_size = 0;
+ WARN_ON(coco_set_plugged_bitmap(vm->addr, vm->region_size, false));
/* usable region might have shrunk */
atomic_set(&vm->config_changed, 1);
return 0;
--
2.52.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox