* Re: [PATCH v8 36/46] KVM: selftests: Test that truncation does not change shared/private status
From: Fuad Tabba @ 2026-06-25 7:03 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-36-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Add a test to verify that deallocating a page in a guest memfd region via
> fallocate() with FALLOC_FL_PUNCH_HOLE does not alter the shared or private
> status of the corresponding memory range.
>
> When a page backing a guest memfd mapping is deallocated, e.g., by punching
> a hole or truncating the file, and then subsequently faulted back in, the
> new page must inherit the correct shared/private status tracked by
> guest_memfd.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> .../selftests/kvm/x86/guest_memfd_conversions_test.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> index 0b024fb7227f0..f03af2c46426f 100644
> --- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> @@ -10,6 +10,7 @@
> #include <linux/sizes.h>
>
> #include "kvm_util.h"
> +#include "kvm_syscalls.h"
> #include "kselftest_harness.h"
> #include "test_util.h"
> #include "ucall_common.h"
> @@ -309,6 +310,19 @@ GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(unallocated_folios, 8)
> test_convert_to_shared(t, i, 'B', 'C', 'D');
> }
>
> +/* Truncation should not affect shared/private status. */
> +GMEM_CONVERSION_TEST_INIT_SHARED(truncate)
> +{
> + host_do_rmw(t->mem, 0, 0, 'A');
> + kvm_fallocate(t->gmem_fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0, page_size);
> + host_do_rmw(t->mem, 0, 0, 'A');
> +
> + test_convert_to_private(t, 0, 'A', 'B');
> +
> + kvm_fallocate(t->gmem_fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0, page_size);
> + test_private(t, 0, 0, 'A');
> +}
> +
> int main(int argc, char *argv[])
> {
> TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 35/46] KVM: selftests: Convert with allocated folios in different layouts
From: Fuad Tabba @ 2026-06-25 7:03 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-35-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Add a guest_memfd selftest to verify that memory conversions work
> correctly with allocated folios in different layouts.
>
> By iterating through which pages are initially faulted, the test covers
> various layouts of contiguous allocated and unallocated regions, exercising
> conversion with different range layouts.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> .../kvm/x86/guest_memfd_conversions_test.c | 30 ++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> index b43ac196330f1..0b024fb7227f0 100644
> --- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> @@ -279,6 +279,36 @@ GMEM_CONVERSION_TEST_INIT_PRIVATE(before_allocation_private)
> test_convert_to_shared(t, 0, 0, 'A', 'B');
> }
>
> +/*
> + * Test that when some of the folios in the conversion range are allocated,
> + * conversion requests are handled correctly in guest_memfd. Vary the ranges
> + * allocated before conversion, using test_page, to cover various layouts of
> + * contiguous allocated and unallocated regions.
> + */
> +GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(unallocated_folios, 8)
> +{
> + const int second_page_to_fault = 4;
> + int i;
> +
> + /*
> + * Fault 2 of the pages to test filemap range operations except when
> + * test_page == second_page_to_fault.
> + */
> + host_do_rmw(t->mem, test_page, 0, 'A');
> + if (test_page != second_page_to_fault)
> + host_do_rmw(t->mem, second_page_to_fault, 0, 'A');
> +
> + gmem_set_private(t->gmem_fd, 0, nr_pages * page_size);
> + for (i = 0; i < nr_pages; ++i) {
> + char expected = (i == test_page || i == second_page_to_fault) ? 'A' : 0;
> +
> + test_private(t, i, expected, 'B');
> + }
> +
> + for (i = 0; i < nr_pages; ++i)
> + test_convert_to_shared(t, i, 'B', 'C', 'D');
> +}
> +
> int main(int argc, char *argv[])
> {
> TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 34/46] KVM: selftests: Test conversion before allocation
From: Fuad Tabba @ 2026-06-25 7:00 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-34-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Add two test cases to the guest_memfd conversions selftest to cover
> the scenario where a conversion is requested before any memory has been
> allocated in the guest_memfd region.
>
> The KVM_SET_MEMORY_ATTRIBUTES2 ioctl can be called on a memory region at
> any time. If the guest had not yet faulted in any pages for that region,
> the kernel must record the conversion request and apply the requested state
> when the pages are eventually allocated.
>
> The new tests cover both conversion directions.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> .../selftests/kvm/x86/guest_memfd_conversions_test.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> index 8e17d5c08aeb8..b43ac196330f1 100644
> --- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> @@ -265,6 +265,20 @@ GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4)
> #undef combine
> }
>
> +/*
> + * Test that even if there are no folios yet, conversion requests are recorded
> + * in guest_memfd.
> + */
> +GMEM_CONVERSION_TEST_INIT_SHARED(before_allocation_shared)
> +{
> + test_convert_to_private(t, 0, 0, 'A');
> +}
> +
> +GMEM_CONVERSION_TEST_INIT_PRIVATE(before_allocation_private)
> +{
> + test_convert_to_shared(t, 0, 0, 'A', 'B');
> +}
> +
> int main(int argc, char *argv[])
> {
> TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 33/46] KVM: selftests: Test conversion precision in guest_memfd
From: Fuad Tabba @ 2026-06-25 6:57 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-33-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> The existing guest_memfd conversion tests only use single-page memory
> regions. This provides no coverage for multi-page guest_memfd objects,
> specifically whether KVM correctly handles the page index for conversion
> operations. An incorrect implementation could, for example, always operate
> on the first page regardless of the index provided.
>
> Add a new test case to verify that conversions between private and shared
> memory correctly target the specified page within a multi-page guest_memfd.
>
> This test also verifies the precision of memory conversions by converting a
> single page an then iterating through all other pages ensure they remain in
> their original state.
>
> To support this test, add a new GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED
> macro that handles setting up and tearing down the VM for each page
> iteration. The teardown logic is adjusted to prevent a double-free in this
> new scenario.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> .../kvm/x86/guest_memfd_conversions_test.c | 66 ++++++++++++++++++++++
> 1 file changed, 66 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> index 5b070d3374eae..8e17d5c08aeb8 100644
> --- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> @@ -61,8 +61,13 @@ static void gmem_conversions_do_setup(test_data_t *t, int nr_pages,
>
> static void gmem_conversions_do_teardown(test_data_t *t)
> {
> + /* Use NULL to avoid second free in FIXTURE_TEARDOWN (multipage tests). */
> + if (!t->vcpu)
> + return;
> +
> /* No need to close gmem_fd, it's owned by the VM structure. */
> kvm_vm_free(t->vcpu->vm);
> + t->vcpu = NULL;
> }
>
> FIXTURE_TEARDOWN(gmem_conversions)
> @@ -101,6 +106,29 @@ static void __gmem_conversions_##test(test_data_t *t, int nr_pages) \
> #define GMEM_CONVERSION_TEST_INIT_SHARED(test) \
> __GMEM_CONVERSION_TEST_INIT_SHARED(test, 1)
>
> +/*
> + * Repeats test over nr_pages in a guest_memfd of size nr_pages, providing each
> + * test iteration with test_page, the index of the page under test in
> + * guest_memfd. test_page takes values 0..(nr_pages - 1) inclusive.
> + */
> +#define GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(test, __nr_pages) \
> +static void __gmem_conversions_multipage_##test(test_data_t *t, int nr_pages, \
> + const int test_page); \
> + \
> +TEST_F(gmem_conversions, test) \
> +{ \
> + const u64 flags = GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED; \
> + int i; \
> + \
> + for (i = 0; i < __nr_pages; ++i) { \
> + gmem_conversions_do_setup(self, __nr_pages, flags); \
> + __gmem_conversions_multipage_##test(self, __nr_pages, i); \
> + gmem_conversions_do_teardown(self); \
> + } \
> +} \
> +static void __gmem_conversions_multipage_##test(test_data_t *t, int nr_pages, \
> + const int test_page)
> +
> struct guest_check_data {
> void *mem;
> char expected_val;
> @@ -199,6 +227,44 @@ GMEM_CONVERSION_TEST_INIT_SHARED(init_shared)
> test_convert_to_shared(t, 0, 'C', 'D', 'E');
> }
>
> +GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4)
> +{
> + int i;
> +
> + /* Get a char that varies with both i and n. */
> +#define combine(x, n) ((x << 4) + (n))
> +#define i_(n) (combine(i, n))
> +#define t_(n) (combine(test_page, n))
> +
> + /*
> + * Start with the highest index, to catch any errors when, perhaps, the
> + * first page is returned even for the last index.
> + */
> + for (i = nr_pages - 1; i >= 0; --i)
> + test_shared(t, i, 0, i_(0), i_(2));
> +
> + test_convert_to_private(t, test_page, t_(2), t_(3));
> +
> + for (i = 0; i < nr_pages; ++i) {
> + if (i == test_page)
> + test_private(t, test_page, t_(3), t_(4));
> + else
> + test_shared(t, i, i_(2), i_(3), i_(4));
> + }
> +
> + test_convert_to_shared(t, test_page, t_(4), t_(5), t_(6));
> +
> + for (i = 0; i < nr_pages; ++i) {
> + char expected = i == test_page ? t_(6) : i_(4);
> +
> + test_shared(t, i, expected, i_(7), i_(8));
> + }
> +
> +#undef t_
> +#undef i_
> +#undef combine
> +}
> +
> int main(int argc, char *argv[])
> {
> TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 15/46] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Fuad Tabba @ 2026-06-25 6:48 UTC (permalink / raw)
To: Ackerley Tng
Cc: Sean Christopherson, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <CAEvNRgGX3GkazCWM=6y9YLgn=YemXuG==Oo+L58cac1Fd86_TQ@mail.gmail.com>
On Wed, 24 Jun 2026 at 18:46, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Sean Christopherson <seanjc@google.com> writes:
>
> > On Fri, Jun 19, 2026, Fuad Tabba wrote:
> >> On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
> >> <devnull+ackerleytng.google.com@kernel.org> wrote:
> >> >
> >> > From: Ackerley Tng <ackerleytng@google.com>
> >> >
> >> > When memory in guest_memfd is converted from private to shared, the
> >> > platform-specific state associated with the guest-private pages must be
> >> > invalidated or cleaned up.
> >> >
> >> > Iterate over the folios in the affected range and call the
> >> > kvm_arch_gmem_invalidate() hook for each PFN range. This allows
> >> > architectures to perform necessary teardown, such as updating hardware
> >> > metadata or encryption states, before the pages are transitioned to the
> >> > shared state.
> >> >
> >> > Invoke this helper after indicating to KVM's mmu code that an invalidation
> >> > is in progress to stop in-flight page faults from succeeding.
> >> >
> >> > Reviewed-by: Fuad Tabba <tabba@google.com>
> >> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> >>
> >> Coming back to this after working through the arm64/pKVM side. My
> >> Reviewed-by here is from the previous round and the patch hasn't
> >> changed, but I missed an implication for arm64.
> >>
> >> kvm_arch_gmem_invalidate() is now called from two paths with the same
> >> (start, end) signature: folio teardown (kvm_gmem_free_folio) and
> >> private->shared conversion (here). For SNP/TDX that's fine, conversion is
> >> destructive anyway. For pKVM the two need opposite content semantics:
> >> conversion must preserve the page in place (same physical page, the point
> >> of in-place conversion without encryption), while teardown must scrub it
> >> before returning it to the host.
> >>
> >> The hook gets only a pfn range with no indication of which caller it's
> >> serving, so arm64 can't give the two paths the behaviour they need. It
> >> would help to signal intent on the conversion path: a reason/flag, a
> >> separate hook, or not routing non-destructive conversion through the
> >> teardown hook.
> >>
> >> arm64 isn't here yet, so this isn't urgent, but the hook is gaining a
> >> second caller now, and it's cheaper to leave room for the distinction
> >> than to change a generic contract other arches depend on later.
> >
> > Crud. It may not be urgent for arm64, but it's urgent for other reasons that
> > I "can't" describe in detail at the moment, and even if that weren't the case, I
> > think we should clean things up now. More below.
> >
> >> > virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
> >> > 1 file changed, 41 insertions(+)
> >> >
> >> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> >> > index 433f79047b9d1..3c94442bc8131 100644
> >> > --- a/virt/kvm/guest_memfd.c
> >> > +++ b/virt/kvm/guest_memfd.c
> >> > @@ -607,6 +607,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> >> > return safe;
> >> > }
> >> >
> >> > +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> >> > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> >
> > Not your fault, but kvm_arch_gmem_invalidate() is badly misnamed. It's not
> > "invalidating" anything, it's much more of a "free" callback, as SNP uses it to
> > put physical pages back into a shared state when a maybe-private folio is freed.
> >
> > As Fuad points out, (ab)using that hook for the private=>shared conversion case
> > "works", but not broadly. And it makes the bad name worse, because it's called
> > from code that _is_ doing true invalidations. For pKVM, it may not even need to
> > do anything invalidation-like.
> >
>
> Thanks, I also didn't like the naming of kvm_gmem_invalidate(),
> especially when conversions also calls
> kvm_gmem_invalidate_{start,end}() and those do different things.
>
> > To avoid a conflict with patches that are going to have priority over this series,
> > to set the stage for arm64 support, and to avoid avoid bleeding vendor details
> > into guest_memfd, as if they are core guest_memfd behavior (only SNP needs the
> > "invalidation" on this specific transition), I think we should add an arch hook
> > to do conversions straightaway.
> >
> > Unless there's a clever option I'm missing, it'll mean adding yet another
> > HAVE_KVM_ARCH_GMEM_XXX flag? Hmm, especially because IIUC, arm64/pKVM doesn't
> > need a callback for this case, only the free_folio case.
> >
> >> > +{
> >> > + struct folio_batch fbatch;
> >> > + pgoff_t next = start;
> >> > + int i;
> >> > +
> >> > + folio_batch_init(&fbatch);
> >> > + while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
> >> > + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> >> > + struct folio *folio = fbatch.folios[i];
> >> > + pgoff_t start_index, end_index;
> >> > + kvm_pfn_t start_pfn, end_pfn;
> >> > +
> >> > + start_index = max(start, folio->index);
> >> > + end_index = min(end, folio_next_index(folio));
> >> > + /*
> >> > + * end_index is either in folio or points to
> >> > + * the first page of the next folio. Hence,
> >> > + * all pages in range [start_index, end_index)
> >> > + * are contiguous.
> >> > + */
> >> > + start_pfn = folio_file_pfn(folio, start_index);
> >> > + end_pfn = start_pfn + end_index - start_index;
> >> > +
> >> > + kvm_arch_gmem_invalidate(start_pfn, end_pfn);
> >> > + }
> >> > +
> >> > + folio_batch_release(&fbatch);
> >> > + cond_resched();
> >> > + }
> >> > +}
> >> > +#else
> >> > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> >> > +#endif
> >> > +
> >> > static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> >> > size_t nr_pages, uint64_t attrs,
> >> > pgoff_t *err_index)
> >> > @@ -647,7 +683,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> >> > */
> >> >
> >> > kvm_gmem_invalidate_start(inode, start, end);
> >> > +
> >> > + if (!to_private)
> >> > + kvm_gmem_invalidate(inode, start, end);
> >
> > E.g. instead make this something like this?
> >
> > kvm_gmem_set_pfn_attributes(...)
> >
> > Hrm, though that wastes folio lookups in the to_private case. So maybe just this,
> > assuming pKVM doesn't need to take additional action on conversions?
> >
> > if (!to_private)
> > kvm_gmem_make_shared(...)
> >
> > Actually, if we do that, then we don't need a separate arch hook, just a separate
> > config. It'll still bleed SNP details into guest_memfd, but it'll at least be
> > done in a way that's more explicitly arch specific (and it's no different than
> > what we already do for PREPARE...).
> >
>
> pKVM needs some arch guest_memfd lifecycle functions that
>
> + for conversion, doesn't do anything,
> + for teardown, resets page state (IIUC it'll be reset to
> PKVM_PAGE_OWNED (by the host))
>
> So I think we need different functions for those two stages in the
> lifecycle of a page with guest_memfd? What if we have
Yes, the split is what I was after. One PFN-range hook for both
teardown and private->shared conversion can't tell them apart, and for
pKVM the two want opposite content semantics.
Two configs rather than one is right, since the needs are independent.
pKVM wants teardown but not conversion.
>
> CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES, which gates
>
> + kvm_gmem_should_set_pfn_attributes(attributes) and
> .gmem_should_set_pfn_attributes
> + kvm_gmem_set_pfn_attributes(start_pfn, end_pfn, attributes) and
> .gmem_set_pfn_attributes
>
> CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN, which gates
>
> + kvm_gmem_teardown() and .gmem_teardown
>
> SNP:
>
> + .gmem_should_set_pfn_attributes = sev_gmem_should_set_pfn_attributes,
> and sev_gmem_should_set_pfn_attributes returns !is_private
> + Rename .gmem_invalidate and sev_gmem_invalidate to *set_pfn_attributes
> + .gmem_teardown = sev_gmem_set_pfn_attributes
>
> TDX:
>
> + Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
> + Disable CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN
>
> pKVM:
>
> + Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
> + .gmem_teardown = pkvm_gmem_set_pfn_attributes
Right for pKVM:
- teardown is not a no-op: it scrubs the page and resets the host
state to PKVM_PAGE_OWNED before the page returns to the host. Your
"reset to PKVM_PAGE_OWNED" reading is correct.
- the arch conversion hook is a no-op, so disabling SET_PFN_ATTRIBUTES
is correct. Conversions in pKVM are guest-initiated: the
share/unshare hypercall does the stage-2 and page-state transition
at EL2. The host still runs the generic conversion path (safety
check, attribute update) and accepts the conversion, but EL2 has
already done the transition, so there is nothing arch-specific left
for a hook to do. The page is preserved in place (no scrub).
If pKVM does turn out to need a step on conversion, it stays
non-destructive either way, and it can opt in later without touching
a contract others depend on.
Folding the direction check behind .gmem_should_set_pfn_attributes is
a good cleanup, it keeps the !to_private check out of generic gmem.
On naming: gmem_teardown is better. gmem_set_pfn_attributes reads a
bit close to KVM_SET_MEMORY_ATTRIBUTES, but naming is hard. :)
>
> Suzuki, does this work for ARM CCA?
>
> This way,
>
> + The if (is_private) check doesn't leak SNP details into guest_memfd
> + .gmem_make_shared doesn't stick out without a .gmem_make_private
> + .gmem_set_pfn_attributes, .gmem_prepare and .gmem_teardown are aligned
> conceptually as lifecycle hooks
>
> + I think the private/shared check for prepare can also be folded into
> preparation.
> + Preparation perhaps doesn't need a should_prepare equivalent since
> there's no iteration and getting the gfn is just doing some math?
> + In another patch series?
Agreed, separate series.
Thank you Ackerley!
/fuad
>
> > E.g. this? There will still be a looming rename conflict, but that's easy enough
> > to handle.
> >
> > diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> > index 9ce5be7843f2..8aead0abd788 100644
> > --- virt/kvm/guest_memfd.c
> > +++ virt/kvm/guest_memfd.c
> > @@ -648,8 +648,8 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> > return safe;
> > }
> >
> > -#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> > -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> > +#ifdef CONFIG_KVM_ARCH_GMEM_FREE_ON_SHARED_CONVERSION
> > +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end)
> > {
> > struct folio_batch fbatch;
> > pgoff_t next = start;
> > @@ -681,7 +681,7 @@ static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> > }
> > }
> > #else
> > -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> > +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end) { }
> > #endif
> >
> > static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> > @@ -729,7 +729,7 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> > kvm_gmem_invalidate_start(inode, start, end);
> >
> > if (!to_private)
> > - kvm_gmem_invalidate(inode, start, end);
> > + kvm_gmem_make_shared(inode, start, end);
> >
> > mas_store_prealloc(&mas, xa_mk_value(attrs));
^ permalink raw reply
* RE: [RFCv2 PATCH 2/6] efi/unaccepted: Set unaccepted bits for all hotplug memory
From: Duan, Zhenzhong @ 2026-06-25 6:38 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: marcandre.lureau@redhat.com, david@kernel.org, Edgecombe, Rick P,
prsampat@amd.com, pbonzini@redhat.com, mst@redhat.com,
peterx@redhat.com, Qiang, Chenyi, Reshetova, Elena,
michael.roth@amd.com, ackerleytng@google.com,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
virtualization@lists.linux.dev, x86@kernel.org, Xu, Yilun,
Li, Xiaoyao, Peng, Chao P
In-Reply-To: <ajvNXyYwb7FXAJhP@thinkstation>
>-----Original Message-----
>From: Kiryl Shutsemau <kas@kernel.org>
>Subject: Re: [RFCv2 PATCH 2/6] efi/unaccepted: Set unaccepted bits for all hotplug
>memory
>
>On Tue, Jun 23, 2026 at 06:17:33AM -0400, Zhenzhong Duan wrote:
>> In coco guests, hotpluggable memory ranges are initially unaccepted.
>> While a previous change expanded the unaccepted memory bitmap boundaries
>> to include these hotplug spaces, the actual bits inside the bitmap are
>> not yet marked as unaccepted.
>>
>> Walks SRAT a second time after the bitmap is allocated and sets the bits
>> corresponding to hotpluggable ranges.
>>
>> This ensures the bitmap state accurately reflects all static and hotplug
>> memory ranges before booting kernel.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> .../firmware/efi/libstub/unaccepted_memory.c | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>>
>> diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c
>b/drivers/firmware/efi/libstub/unaccepted_memory.c
>> index bfbb78bd7b8a..01bed8e751ca 100644
>> --- a/drivers/firmware/efi/libstub/unaccepted_memory.c
>> +++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
>> @@ -92,6 +92,23 @@ static void update_mem_boundaries(struct
>acpi_srat_mem_affinity *mem, struct sra
>> *(ctx->mem_end) = range_end;
>> }
>>
>> +static void mark_hotplug_memory_unaccepted(struct acpi_srat_mem_affinity
>*mem,
>> + struct srat_parse_ctx *ctx)
>> +{
>> + u64 unit_size = unaccepted_table->unit_size;
>> + u64 start, end;
>> +
>> + start = round_up(mem->base_address, unit_size);
>> + end = round_down(mem->base_address + mem->length, unit_size);
>
>We can get here with start > end if srat range is less then unit_size.
Will add a check to ignore small range less than unit_size:
+ if (start >= end)
+ return;
+
Thanks
Zhenzhong
^ permalink raw reply
* Re: [PATCH v2 02/17] x86/virt/tdx: Configure add-on features on TDX module init and update
From: Xu Yilun @ 2026-06-25 6:33 UTC (permalink / raw)
To: Peter Fang
Cc: Dave Hansen, x86, kvm, linux-coco, linux-kernel, djbw, kas,
rick.p.edgecombe, yilun.xu, xiaoyao.li, sohil.mehta,
adrian.hunter, kishen.maloor, tony.lindgren, baolu.lu,
zhenzhong.duan, dave.hansen, seanjc
In-Reply-To: <20260624221037.GD923079@pedri>
On Wed, Jun 24, 2026 at 03:10:37PM -0700, Peter Fang wrote:
> On Wed, Jun 24, 2026 at 08:00:39PM +0800, Xu Yilun wrote:
> > > There's also zero stopping us from putting version in args:
> > >
> > > struct tdx_module_args args = {};
> > > int ret;
> > >
> > > if (tdx_addon_feature0) {
> > > args.r9 = tdx_addon_feature0;
> > > args.version = 1;
> > > }
> > >
> > > ret = seamcall_prerr(TDH_SYS_UPDATE, &args);
> > >
> > > Eh?
> > >
> > > That gives args.version==0 in all the normal cases which just happens to
> > > be the exact behavior we want. It also avoids having to plumb version
> > > through all the seamcall*() wrappers.
> >
> > Ah, on 2nd reading, I'm pretty sure now I understand your logical argument in
> > patch 1 and 2. It's good to me. I append my diff at the end.
> >
>
> [ ... ]
>
> > diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> > index 016a2a1ec1d6..d1d3d40c5614 100644
> > --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> > +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> > @@ -48,6 +48,14 @@
> > /* Move Leaf ID to RAX */
> > mov %rdi, %rax
> >
> > + /*
> > + * Extract the version from 'struct tdx_module_args', append it to
> > + * RAX[23:16]
> > + */
> > + movzbl TDX_MODULE_version(%rsi), %ecx
> > + shll $16, %ecx
> > + orq %rcx, %rax
> > +
> > /* Move other input regs from 'struct tdx_module_args' */
> > movq TDX_MODULE_rcx(%rsi), %rcx
> > movq TDX_MODULE_rdx(%rsi), %rdx
>
> This approach looks much cleaner to me. Would it be better to have a
> small C helper to encode the final RAX value instead of operating on RAX
> directly in asm? Looking at the May 2026 edition of the ABI spec,
> SEAMCALL RAX encoding is starting to get quite complex. Just thinking
> about this from a readability standpoint.
I'm also good to it. I made some diff for your proposal, Some additional
effort here is to update some comments and parameter names, to reflect
the differences between "function/func/fn" (the unversioned number) and
the final composite "fn_code" for RAX.
-----8<-------
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index f20e91d7ac35..c26eca18fded 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -143,6 +143,8 @@ struct tdx_module_args {
u64 rbx;
u64 rdi;
u64 rsi;
+ /* for leaf encoding */
+ u8 version;
};
/* Used to communicate with the TDX module */
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
index 6854c52c374b..5cf3993e98f4 100644
--- a/arch/x86/virt/vmx/tdx/seamcall.S
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -10,8 +10,8 @@
*
* __seamcall() function ABI:
*
- * @fn (RDI) - SEAMCALL Leaf number, moved to RAX
- * @args (RSI) - struct tdx_module_args for input
+ * @fn_code (RDI) - SEAMCALL composite leaf code, moved to RAX
+ * @args (RSI) - struct tdx_module_args for input
*
* Only RCX/RDX/R8-R11 are used as input registers.
*
@@ -29,8 +29,8 @@ SYM_FUNC_END(__seamcall)
*
* __seamcall_ret() function ABI:
*
- * @fn (RDI) - SEAMCALL Leaf number, moved to RAX
- * @args (RSI) - struct tdx_module_args for input and output
+ * @fn_code (RDI) - SEAMCALL composite leaf code, moved to RAX
+ * @args (RSI) - struct tdx_module_args for input and output
*
* Only RCX/RDX/R8-R11 are used as input/output registers.
*
@@ -51,8 +51,8 @@ SYM_FUNC_END(__seamcall_ret)
*
* __seamcall_saved_ret() function ABI:
*
- * @fn (RDI) - SEAMCALL Leaf number, moved to RAX
- * @args (RSI) - struct tdx_module_args for input and output
+ * @fn_code (RDI) - SEAMCALL composite leaf code, moved to RAX
+ * @args (RSI) - struct tdx_module_args for input and output
*
* All registers in @args are used as input/output registers.
*
diff --git a/arch/x86/virt/vmx/tdx/seamcall_internal.h b/arch/x86/virt/vmx/tdx/seamcall_internal.h
index be5f446467df..bb17d965b453 100644
--- a/arch/x86/virt/vmx/tdx/seamcall_internal.h
+++ b/arch/x86/virt/vmx/tdx/seamcall_internal.h
@@ -11,17 +11,28 @@
#ifndef _X86_VIRT_SEAMCALL_INTERNAL_H
#define _X86_VIRT_SEAMCALL_INTERNAL_H
+#include <linux/bitfield.h>
#include <linux/printk.h>
#include <linux/types.h>
#include <asm/archrandom.h>
#include <asm/processor.h>
#include <asm/tdx.h>
-u64 __seamcall(u64 fn, struct tdx_module_args *args);
-u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
-u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
+u64 __seamcall(u64 fn_code, struct tdx_module_args *args);
+u64 __seamcall_ret(u64 fn_code, struct tdx_module_args *args);
+u64 __seamcall_saved_ret(u64 fn_code, struct tdx_module_args *args);
-typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
+typedef u64 (*sc_func_t)(u64 fn_code, struct tdx_module_args *args);
+
+#define SEAMCALL_VERSION_MASK GENMASK_U64(23, 16)
+
+static __always_inline u64 __seamcall_fn_encoding(sc_func_t func, u64 fn,
+ struct tdx_module_args *args)
+{
+ FIELD_MODIFY(SEAMCALL_VERSION_MASK, &fn, args->version);
+
+ return func(fn, args);
+}
static __always_inline u64 __seamcall_dirty_cache(sc_func_t func, u64 fn,
struct tdx_module_args *args)
@@ -39,7 +50,7 @@ static __always_inline u64 __seamcall_dirty_cache(sc_func_t func, u64 fn,
*/
this_cpu_write(cache_state_incoherent, true);
- return func(fn, args);
+ return __seamcall_fn_encoding(func, fn, args);
}
static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
index 016a2a1ec1d6..b0f7867bcd1c 100644
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -24,7 +24,7 @@
*-------------------------------------------------------------------------
* Input Registers:
*
- * RAX - TDCALL/SEAMCALL Leaf number.
+ * RAX - TDCALL/SEAMCALL composite Leaf code.
* RCX,RDX,RDI,RSI,RBX,R8-R15 - TDCALL/SEAMCALL Leaf specific input registers.
*
* Output Registers:
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 6a1c4fe202bb..8c1a5b7f603a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1019,7 +1019,6 @@ static __init void set_tdx_addon_features(void)
static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
u64 global_keyid)
{
- u64 seamcall_fn = TDH_SYS_CONFIG_V0;
struct tdx_module_args args = {};
u64 *tdmr_pa_array;
size_t array_sz;
@@ -1042,18 +1041,18 @@ static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));
+ set_tdx_addon_features();
+
args.rcx = __pa(tdmr_pa_array);
args.rdx = tdmr_list->nr_consumed_tdmrs;
args.r8 = global_keyid;
- set_tdx_addon_features();
-
if (tdx_addon_feature0) {
args.r9 = tdx_addon_feature0;
- seamcall_fn = TDH_SYS_CONFIG;
+ args.version = 1;
}
- ret = seamcall_prerr(seamcall_fn, &args);
+ ret = seamcall_prerr(TDH_SYS_CONFIG, &args);
/* Free the array as it is not required anymore. */
kfree(tdmr_pa_array);
@@ -1515,16 +1514,15 @@ int tdx_module_shutdown(void)
int tdx_module_run_update(void)
{
- u64 seamcall_fn = TDH_SYS_UPDATE_V0;
struct tdx_module_args args = {};
int ret;
if (tdx_addon_feature0) {
args.r9 = tdx_addon_feature0;
- seamcall_fn = TDH_SYS_UPDATE;
+ args.version = 1;
}
- ret = seamcall_prerr(seamcall_fn, &args);
+ ret = seamcall_prerr(TDH_SYS_UPDATE, &args);
if (ret)
return ret;
@@ -2112,6 +2110,7 @@ u64 tdh_vp_init(struct tdx_vp *vp, u64 initial_rcx, u32 x2apicid)
.rcx = vp->tdvpr_pa,
.rdx = initial_rcx,
.r8 = x2apicid,
+ .version = 1,
};
return seamcall(TDH_VP_INIT, &args);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 2deb0a5c902e..1f43d2eb2345 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -2,7 +2,6 @@
#ifndef _X86_VIRT_TDX_H
#define _X86_VIRT_TDX_H
-#include <linux/bitfield.h>
#include <linux/bits.h>
/*
@@ -12,18 +11,6 @@
* architectural definitions come first.
*/
-/*
- * SEAMCALL leaf:
- *
- * Bit 15:0 Leaf number
- * Bit 23:16 Version number
- */
-#define SEAMCALL_LEAF GENMASK(15, 0)
-#define SEAMCALL_VER GENMASK(23, 16)
-
-#define SEAMCALL_LEAF_VER(l, v) (FIELD_PREP(SEAMCALL_LEAF, l) | \
- FIELD_PREP(SEAMCALL_VER, v))
-
/*
* TDX module SEAMCALL leaf functions
*/
@@ -44,7 +31,7 @@
#define TDH_VP_CREATE 10
#define TDH_MNG_KEY_FREEID 20
#define TDH_MNG_INIT 21
-#define TDH_VP_INIT SEAMCALL_LEAF_VER(22, 1)
+#define TDH_VP_INIT 22
#define TDH_PHYMEM_PAGE_RDMD 24
#define TDH_VP_RD 26
#define TDH_PHYMEM_PAGE_RECLAIM 28
@@ -58,11 +45,9 @@
#define TDH_PHYMEM_CACHE_WB 40
#define TDH_PHYMEM_PAGE_WBINVD 41
#define TDH_VP_WR 43
-#define TDH_SYS_CONFIG_V0 45
-#define TDH_SYS_CONFIG SEAMCALL_LEAF_VER(TDH_SYS_CONFIG_V0, 1)
+#define TDH_SYS_CONFIG 45
#define TDH_SYS_SHUTDOWN 52
-#define TDH_SYS_UPDATE_V0 53
-#define TDH_SYS_UPDATE SEAMCALL_LEAF_VER(TDH_SYS_UPDATE_V0, 1)
+#define TDH_SYS_UPDATE 53
#define TDH_EXT_INIT 60
#define TDH_EXT_MEM_ADD 61
#define TDH_SYS_DISABLE 69
^ permalink raw reply related
* Re: [PATCH v2 17/17] KVM: TDX: Support event-notify interrupts only with userspace Quoting
From: Tony Lindgren @ 2026-06-25 6:28 UTC (permalink / raw)
To: Xu Yilun
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
peter.fang, baolu.lu, zhenzhong.duan, dave.hansen, dave.hansen,
seanjc
In-Reply-To: <20260618081355.3253581-18-yilun.xu@linux.intel.com>
On Thu, Jun 18, 2026 at 04:13:55PM +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -202,8 +202,15 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
>
> caps->cpuid.nent = td_conf->num_cpuid_config;
>
> - caps->user_tdvmcallinfo_1_r11 =
> - TDVMCALLINFO_SETUP_EVENT_NOTIFY_INTERRUPT;
> + /*
> + * Don't advertise userspace event-notify interrupt support if TDX
> + * quoting service is enabled, as quote generation will be handled
> + * entirely in the kernel. Support in the kernel can be added later.
> + */
> + if (!tdx_quote_enabled()) {
> + caps->user_tdvmcallinfo_1_r11 |=
> + TDVMCALLINFO_SETUP_EVENT_NOTIFY_INTERRUPT;
> + }
Can you use kvm_tdx->get_quote_in_kernel also above? Or should it maybe
be initialized here if not used earlier?
> @@ -1684,9 +1691,16 @@ static int tdx_get_quote(struct kvm_vcpu *vcpu)
>
> static int tdx_setup_event_notify_interrupt(struct kvm_vcpu *vcpu)
> {
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> struct vcpu_tdx *tdx = to_tdx(vcpu);
> u64 vector = tdx->vp_enter_args.r12;
>
> + /* See comment in init_kvm_tdx_caps() */
> + if (kvm_tdx->get_quote_in_kernel) {
> + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED);
> + return 1;
> + }
> +
Since you're using kvm_tdx->get_quote_in_kernel here.
^ permalink raw reply
* Re: [PATCH v2 15/17] KVM: TDX: Factor out userspace return path from tdx_get_quote()
From: Tony Lindgren @ 2026-06-25 6:16 UTC (permalink / raw)
To: Xu Yilun
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
peter.fang, baolu.lu, zhenzhong.duan, dave.hansen, dave.hansen,
seanjc
In-Reply-To: <20260618081355.3253581-16-yilun.xu@linux.intel.com>
On Thu, Jun 18, 2026 at 04:13:53PM +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
>
> Separate the logic that returns the GetQuote TDVMCALL exit to userspace
> so that tdx_get_quote() can be extended to support in-kernel Quote
> generation.
>
> No functional change intended.
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v2 14/17] x86/tdx: Move and rename Quote request structure
From: Tony Lindgren @ 2026-06-25 6:15 UTC (permalink / raw)
To: Xu Yilun
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
peter.fang, baolu.lu, zhenzhong.duan, dave.hansen, dave.hansen,
seanjc
In-Reply-To: <20260618081355.3253581-15-yilun.xu@linux.intel.com>
On Thu, Jun 18, 2026 at 04:13:52PM +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
>
> Move struct tdx_quote_buf to tdx.h so it can be shared by the guest
> driver and core TDX code, as the host will also need the Quote buffer
> format for in-kernel Quote generation.
>
> Rename the struct to tdx_quote_req to better reflect its purpose, and
> replace "quote_buf" with "quote_req" in tdx-guest.c.
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v2 13/17] x86/virt/tdx: Enable Quoting extension
From: Tony Lindgren @ 2026-06-25 6:13 UTC (permalink / raw)
To: Xu Yilun
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
peter.fang, baolu.lu, zhenzhong.duan, dave.hansen, dave.hansen,
seanjc
In-Reply-To: <20260618081355.3253581-14-yilun.xu@linux.intel.com>
On Thu, Jun 18, 2026 at 04:13:51PM +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
>
> The Quoting extension generates TDX attestation Quotes in the TDX
> module, without using a discrete Quoting engine. Enable this feature by
> requesting it in TDH.SYS.CONFIG and TDH.SYS.UPDATE.
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v2 12/17] x86/virt/tdx: Reinitialize the Quoting extension after TDX module update
From: Tony Lindgren @ 2026-06-25 6:12 UTC (permalink / raw)
To: Xu Yilun
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
peter.fang, baolu.lu, zhenzhong.duan, dave.hansen, dave.hansen,
seanjc
In-Reply-To: <20260618081355.3253581-13-yilun.xu@linux.intel.com>
On Thu, Jun 18, 2026 at 04:13:50PM +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
>
> Invoke TDH.QUOTE.INIT again after a runtime module update to trigger the
> necessary rekey procedure in the TDX module.
>
> Keep the existing Quote buffer since memory allocation is not permitted
> during the update. Compatible TDX module updates must not increase the
> Quote buffer size, or an undersized buffer might cause Quote generation
> to fail. See [1] for module update details.
>
> [1] Documentation/arch/x86/tdx.rst, Section "TDX module Runtime Update"
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v2 10/17] x86/virt/tdx: Move tdx_tdr_pa() up in the file
From: Tony Lindgren @ 2026-06-25 6:10 UTC (permalink / raw)
To: Xu Yilun
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
peter.fang, baolu.lu, zhenzhong.duan, dave.hansen, dave.hansen,
seanjc
In-Reply-To: <20260618081355.3253581-11-yilun.xu@linux.intel.com>
On Thu, Jun 18, 2026 at 04:13:48PM +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
>
> Move tdx_tdr_pa() earlier in the file to prepare for upcoming changes
> that add a new Extension-SEAMCALL.
>
> No functional change intended.
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v2 09/17] x86/virt/tdx: Add interface to check Quoting availability
From: Tony Lindgren @ 2026-06-25 6:09 UTC (permalink / raw)
To: Xu Yilun
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
peter.fang, baolu.lu, zhenzhong.duan, dave.hansen, dave.hansen,
seanjc
In-Reply-To: <20260618081355.3253581-10-yilun.xu@linux.intel.com>
On Thu, Jun 18, 2026 at 04:13:47PM +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
>
> KVM needs to know if the Quoting extension is available to determine
> whether userspace must be involved in Quote generation.
>
> Since the Quote buffer is always created during Quoting extension
> bringup, checking whether the buffer exists is sufficient.
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v2 08/17] x86/virt/tdx: Prepare Quote buffer during extension bringup
From: Tony Lindgren @ 2026-06-25 6:08 UTC (permalink / raw)
To: Xu Yilun
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
peter.fang, baolu.lu, zhenzhong.duan, dave.hansen, dave.hansen,
seanjc
In-Reply-To: <20260618081355.3253581-9-yilun.xu@linux.intel.com>
On Thu, Jun 18, 2026 at 04:13:46PM +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> For simplicity, let all guests share a global buffer. Build the buffer's
> HPA_LINKED_LIST at Quoting extension bringup. This saves a bunch of
> va-to-pa conversions at runtime.
To me it seems the pre-generated parts can be made into a template that
can be copied to an allocated buffer looking at patch 11/17.
^ permalink raw reply
* Re: [PATCH v2 11/17] x86/virt/tdx: Add interface to generate a Quote
From: Tony Lindgren @ 2026-06-25 6:05 UTC (permalink / raw)
To: Xu Yilun
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
peter.fang, baolu.lu, zhenzhong.duan, dave.hansen, dave.hansen,
seanjc
In-Reply-To: <20260618081355.3253581-12-yilun.xu@linux.intel.com>
On Thu, Jun 18, 2026 at 04:13:49PM +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
...
> +void *tdx_quote_generate(struct tdx_td *td, void *in_data, u32 in_data_len,
> + u32 *quote_len)
> +{
> + struct tdx_quote_data *qdata = &tdx_quote;
> + void *quote_dup = NULL;
> + u64 r, out_len;
> +
> + if (!tdx_quote_enabled())
> + return NULL;
> +
> + mutex_lock(&tdx_quote_lock);
How about make the pre-generated static tdx_quote a template page that only
gets read and copied to an allocated bufer here?
If the tdx_quote template is only read for copying here, seems you're not
going to need the mutex at all? That is assuming tdx_quote template does
not change after init.
^ permalink raw reply
* RE: [RFCv2 PATCH 5/6] mm/memory_hotplug: Support ACPI hotplug/unplug for coco guest
From: Duan, Zhenzhong @ 2026-06-25 5:56 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: marcandre.lureau@redhat.com, david@kernel.org, Edgecombe, Rick P,
prsampat@amd.com, pbonzini@redhat.com, mst@redhat.com,
peterx@redhat.com, Qiang, Chenyi, Reshetova, Elena,
michael.roth@amd.com, ackerleytng@google.com,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
virtualization@lists.linux.dev, x86@kernel.org, Xu, Yilun,
Li, Xiaoyao, Peng, Chao P
In-Reply-To: <ajvOP9NwqNstgNvl@thinkstation>
>-----Original Message-----
>From: Kiryl Shutsemau <kas@kernel.org>
>Subject: Re: [RFCv2 PATCH 5/6] mm/memory_hotplug: Support ACPI
>hotplug/unplug for coco guest
>
>On Tue, Jun 23, 2026 at 06:17:36AM -0400, Zhenzhong Duan wrote:
>> + spin_lock_irqsave(&unaccepted_memory_lock, flags);
>> + for (; range_start < bitmap_size; range_start = range_end) {
>> + unsigned long phys_start, phys_end;
>> + unsigned long unaccepted_one, plugged_zero;
>> +
>> + range_start = find_next_andnot_bit(plugged_bitmap,
>unaccepted->bitmap,
>> + bitmap_size, range_start);
>> +
>> + if (range_start >= bitmap_size)
>> + break;
>> +
>> + unaccepted_one = find_next_bit(unaccepted->bitmap,
>bitmap_size, range_start);
>> + plugged_zero = find_next_zero_bit(plugged_bitmap, bitmap_size,
>range_start);
>> + range_end = min(unaccepted_one, plugged_zero);
>> +
>> + phys_start = range_start * unit_size + unaccepted->phys_base;
>> + phys_end = range_end * unit_size + unaccepted->phys_base;
>> +
>> + arch_unaccept_memory(phys_start, phys_end);
>> + bitmap_set(unaccepted->bitmap, range_start, range_end -
>range_start);
>> + }
>> + spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
>
>Accept TDCALL under the spin lock will kill scalability.
OK, I can drop the lock during arch_unaccept_memory() and avoid race
by checking the accepting_list just like in accept_memory().
I initially wrapped this in the spinlock because TDG.MEM.PAGE.RELEASE
is a quick local TDX module call to transition pages back to PENDING state,
without the heavy VMM trapping/faulting overhead associated with
memory acceptance paths.
Thanks
Zhenzhong
^ permalink raw reply
* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: Kalra, Ashish @ 2026-06-25 5:38 UTC (permalink / raw)
To: K Prateek Nayak, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <fe9927ad-a06a-4a4b-8122-12644513ed14@amd.com>
Hello Prateek,
On 6/24/2026 10:45 PM, K Prateek Nayak wrote:
> Hello Ashish,
>
> On 6/25/2026 3:26 AM, Ashish Kalra wrote:
>> From: Ashish Kalra <ashish.kalra@amd.com>
>>
>> While SNP is active, every memory write is checked against the RMP to
>> protect the integrity of SEV-SNP guest memory. By the SNP architecture
>> these checks cannot be disabled on a subset of CPUs: they are gated
>> per-core by SYSCFG[SNP_EN], which the SEV firmware requires to be set on
>> every present CPU before SNP initialization. A CPU that does not have
>> SNP_EN set and was not initialized via SNP_INIT performs no RMP checks at
>> all, so there is no valid configuration with SNP active and any CPU exempt
>> from RMP checks.
>>
>> The firmware determines which CPUs are present from the processor and the
>> BIOS/UEFI configuration (e.g. SMT disabled in the BIOS) and enumerates
>> them at SNP init; it is not aware of the OS bringing CPUs online or
>> offline afterwards. A CPU brought online after SNP init was not
>> enumerated at SNP_INIT and does not have SNP_EN set, so writes from it are
>> not RMP-checked and could corrupt SEV-SNP guest memory, and there is no
>> way to keep work off such a CPU once it is online. OS CPU hotplug can thus
>> diverge from the firmware's expectations and break SNP.
>
> If this is true ...
>
> [..snip..]
>
>> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
>> index 217b6b19802e..66475145b3fa 100644
>> --- a/drivers/crypto/ccp/sev-dev.c
>> +++ b/drivers/crypto/ccp/sev-dev.c
>> @@ -1479,6 +1479,9 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
>>
>> snp_hv_fixed_pages_state_update(sev, HV_FIXED);
>>
>> + /* Disable CPU hotplug while SNP is active (see snp_disable_cpu_hotplug). */
>> + snp_disable_cpu_hotplug();
>
> ... then this should be done at snp_prepare() before
> on_each_cpu(snp_enable) right?
>
> If not, then any CPU hotplug between the cpus_read_unlock() there and
> the snp_disable_cpu_hotplug() here will not have the SNP_EN set.
>
> Isn't that a concern?
yes — it's a concern, and i agree the disable belongs in snp_prepare() before SNP_EN is programmed.
snp_enable runs via on_each_cpu() over the set that is online at snp_prepare() time, and SNP_INIT_EX runs
right after. With the disable where it is now (after SNP_INIT_EX/DF_FLUSH), there's a window starting at
snp_prepare()'s cpus_read_unlock() in which a CPU can come online that never had snp_enable run on it, i.e.
with SNP_EN clear. .
So hotplug needs to be frozen before SNP_EN is programmed, so the online set that gets SNP_EN cannot change underneath us.
I'll move the disable into snp_prepare(), before cpus_read_lock() rather than just before on_each_cpu(snp_enable):
cpu_hotplug_disable() takes cpu_add_remove_lock, which nests above cpu_hotplug_lock, so calling it under
cpus_read_lock() would invert the order, causing deadlock.
On the failure paths where SNP does not end up active, i.e., SNP_INIT_EX/DF_FLUSH error, then I'll
re-enable hotplug so a failed init doesn't leave it permanently disabled; the success path continues to re-enable
only on the full shutdown path.
Will fix in v10.
Thanks,
Ashish
>
> Also, this patch can probably go first since the FW assumptions on
> hotplug exists independent of RMPOPT bits.
>
>> +
>> snp_setup_rmpopt();
>>
>> sev->snp_initialized = true;
>
^ permalink raw reply
* Re: [PATCH v2 03/17] x86/virt/tdx: Detect if the extensions initialization is required
From: Tony Lindgren @ 2026-06-25 5:19 UTC (permalink / raw)
To: Xu Yilun
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
peter.fang, baolu.lu, zhenzhong.duan, dave.hansen, dave.hansen,
seanjc
In-Reply-To: <20260618081355.3253581-4-yilun.xu@linux.intel.com>
On Thu, Jun 18, 2026 at 04:13:41PM +0800, Xu Yilun wrote:
> TDX module extensions support extension SEAMCALLs that are preemptible
> and resumable, unlike normal SEAMCALLs that run to completion while
> monopolizing the CPU. This allows for higher-level API constructions,
> so better supports some add-on features that implement higher order
> security protocols.
How about "TDX module extension SEAMCALLs are preemptible and resumable..."
above to make it easier to read?
Other than that:
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: K Prateek Nayak @ 2026-06-25 3:45 UTC (permalink / raw)
To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <ba146ca15b7f76eee386c8c073fb3f1cc36e5781.1782336473.git.ashish.kalra@amd.com>
Hello Ashish,
On 6/25/2026 3:26 AM, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@amd.com>
>
> While SNP is active, every memory write is checked against the RMP to
> protect the integrity of SEV-SNP guest memory. By the SNP architecture
> these checks cannot be disabled on a subset of CPUs: they are gated
> per-core by SYSCFG[SNP_EN], which the SEV firmware requires to be set on
> every present CPU before SNP initialization. A CPU that does not have
> SNP_EN set and was not initialized via SNP_INIT performs no RMP checks at
> all, so there is no valid configuration with SNP active and any CPU exempt
> from RMP checks.
>
> The firmware determines which CPUs are present from the processor and the
> BIOS/UEFI configuration (e.g. SMT disabled in the BIOS) and enumerates
> them at SNP init; it is not aware of the OS bringing CPUs online or
> offline afterwards. A CPU brought online after SNP init was not
> enumerated at SNP_INIT and does not have SNP_EN set, so writes from it are
> not RMP-checked and could corrupt SEV-SNP guest memory, and there is no
> way to keep work off such a CPU once it is online. OS CPU hotplug can thus
> diverge from the firmware's expectations and break SNP.
If this is true ...
[..snip..]
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index 217b6b19802e..66475145b3fa 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -1479,6 +1479,9 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
>
> snp_hv_fixed_pages_state_update(sev, HV_FIXED);
>
> + /* Disable CPU hotplug while SNP is active (see snp_disable_cpu_hotplug). */
> + snp_disable_cpu_hotplug();
... then this should be done at snp_prepare() before
on_each_cpu(snp_enable) right?
If not, then any CPU hotplug between the cpus_read_unlock() there and
the snp_disable_cpu_hotplug() here will not have the SNP_EN set.
Isn't that a concern?
Also, this patch can probably go first since the FW assumptions on
hotplug exists independent of RMPOPT bits.
> +
> snp_setup_rmpopt();
>
> sev->snp_initialized = true;
--
Thanks and Regards,
Prateek
^ permalink raw reply
* Re: [PATCH v13 03/22] KVM: selftests: Initialize the TDX VM
From: Xiaoyao Li @ 2026-06-25 2:37 UTC (permalink / raw)
To: Lisa Wang
Cc: Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton,
Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <ajxtMdOFCPiMU0Pw@google.com>
On 6/25/2026 7:50 AM, Lisa Wang wrote:
> On Wed, Jun 17, 2026 at 11:21:49AM +0800, Xiaoyao Li wrote:
>>> +++ b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
>>> @@ -11,4 +11,34 @@ static inline bool is_tdx_vm(struct kvm_vm *vm)
>>> return vm->type == KVM_X86_TDX_VM;
>>> }
>
> Thanks for the review. I will change all comments for this patch I did
> not reply here in the next version.
>
>>> +/*
>>> + * TDX ioctls
>>> + * Use underscores to avoid collisions with struct member names.
>>> + */
>>> +#define __tdx_vm_ioctl(vm, cmd, _flags, arg) \
>>> +({ \
>>> + int r; \
>>> + \
>>> + union { \
>>> + struct kvm_tdx_cmd c; \
>>> + unsigned long raw; \
>>> + } tdx_cmd = { .c = { \
>>> + .id = (cmd), \
>>> + .flags = (u32)(_flags), \
>>> + .data = (u64)(arg), \
>>> + } }; \
>>> + \
>>> + r = __vm_ioctl(vm, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd.raw); \
>>> + r ?: tdx_cmd.c.hw_error; \
>>> +})
>>
>> It looks __tdx_vm_ioctl() can be implemented as the static inline function.
>>
>> Given all the existing xxx_ioctl() are implmeneted as MACRO, I'm OK with it.
>
> Most vm_ioctl() and vcpu_ioctl() used MACRO.
> I prefer to use MACRO for `tdx_vm_ioctl()` for the `#cmd`, but I am ok
> to change `__tdx_vm_ioctl()` to static inline function.
> If we can only change part of it, do you think it is better to change it
> as the static inline fucntion ?
Personally, I like function over MARCO, but it doesn't make a huge
difference either way. Let's keep it as-is and move on.
>>> +}
>>> +
>>> +static struct kvm_cpuid_entry2 *tdx_find_cpuid_config(struct kvm_tdx_capabilities *cap,
>>> + u32 leaf, u32 sub_leaf)
>>> +{
>>> + struct kvm_cpuid_entry2 *config;
>>> + u32 i;
>>> +
>>> + for (i = 0; i < cap->cpuid.nent; i++) {
>>> + config = &cap->cpuid.entries[i];
>>> +
>>> + if (config->function == leaf && config->index == sub_leaf)
>>> + return config;
>>> + }
>>> +
>>> + return NULL;
>>> +}
>>
>> No need to introduce a new fucntin. We can use get_cpuid_entry().
>
> I think we cannot use `get_cpuid_entry()` directly here.
> This function is called by `tdx_filter_cpuid()` to check if KVM's
> supported CPUID entries are present in the TDX capabilities list. Since
> the TDX list only contains a subset of CPUID leaves, some queries will
> naturally return NULL (which we want to gracefully filter out) However,
> `get_cpuid_entry()` has an embedded TEST_FAIL(), which would abort for
> any missing leaf.
> How about I add a `__get_cpuid_entry()` to share the same part of them?
yes, please.
> Lisa
>
^ permalink raw reply
* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Yan Zhao @ 2026-06-25 2:25 UTC (permalink / raw)
To: Ackerley Tng
Cc: Sean Christopherson, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <CAEvNRgG1nHipzw4=eBgwhvyXi8xYo7FQD_sy9Ax6FDf7YDu3Og@mail.gmail.com>
On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
> > On Tue, Jun 23, 2026, Yan Zhao wrote:
> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
> >> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >> > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> >> > > > > return -EIO;
> >> > > > >
> >> > > > > - if (!src_page)
> >> > > > > - return -EOPNOTSUPP;
> >> > > > > + if (!src_page) {
> >> > > > > + if (!gmem_in_place_conversion)
> >> > > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
> >> > > > without the MMAP flag, the absence of src_page should still be treated as an
> >> > > > error.
> >> > >
> >> > > Why MMAP?
> >> > Hmm, I was showing a scenario that in-place conversion couldn't occur.
> >> > I didn't mean that with the MMAP flag, mmap() and user write must occur.
> >> >
> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"? Just
> >> > > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
> >> > > and written memory. And when write() lands, MMAP wouldn't be necessary to
> >> > > initialize the memory.
> >> > Do you mean using up-to-date flag as below?
> >
> > Yes? I didn't actually look at the implementation details.
> >
> >> > if (!src_page) {
> >> > src_page = pfn_to_page(pfn);
> >> > if (!folio_test_uptodate(page_folio(src_page)))
> >> > return -EOPNOTSUPP;
> >> > }
>
> Yan is right that with the earlier patch "Zero page while getting pfn",
> folio_test_uptodate() here will always return true.
>
> Actually, this is an alternative fix for the issue Sashiko pointed out
> on v7 where userspace can do a populate() (either TDX or SNP) without
> first allocating the page, with src_address == NULL, and leak
> uninitialized memory into the guest.
>
> Advantage of using the uptodate check in populate: if the host never
> allocates the page, populate doesn't incur zeroing before writing the
> page anyway in populate().
>
> Disadvantage: Both TDX and SNP will have to implement this uptodate
> check. guest_memfd can't check centrally because for SNP, for a
> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
> firmware will zero and there's no leakage of uninitialized host memory?
Another disadvantage: the uptodate flag is per-folio. What if the folio
is only partially initialized by the userspace especially after huge page is
supported?
> >> Another concern with this fix is that:
> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
> >> folio uptodate before reaching post_populate().
> >>
> >> [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
> >>
> >> > One concern is that TDX now does not much care about the up-to-date flag since
> >> > TDX doesn't rely on the flag to clear pages on conversions.
> >> > I'm not sure if the flag can be reliably checked in this case. e.g.,
> >> > now the whole folio is marked up-to-date even if only part of it is faulted by
> >> > user access.
> >> > Ensuring that the up-to-date flag works correctly with huge page support seems
> >> > to have more effort than introducing a dedicated flag for TDX.
> >> >
> >> > > > Additionally, to properly enable in-place copying for the TDX initial memory
> >> > > > region, userspace must not only specify source_addr to NULL, but also follow
> >> > > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
> >> > > > 1. create guest_memfd with MMAP flag
> >> > > > 2. mmap the guest_memfd.
> >> > > > 3. convert the initial memory range to shared.
> >> > > > 4. copy initial content to the source page.
> >> > > > 5. convert the initial memory range to private
> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
> >> > > > 7. do not unmap the source backend.
> >> > > >
> >> > > > So, would it be reasonable to introduce a dedicated flag that allows userspace
> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
> >> > >
> >> > > Why? It's userspace's responsibility to get the above right. If userspace fails
> >> > > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
>
> Yan, is your concern that userspace forgot to update the code and
> forgets to provide a src_page, and if we keep the "Zero page while
Yes. Previously, it would be rejected after GUP fails.
> getting pfn" patch, ends up with the guest silently having a zero page?
> I think that would be found quite early in userspace VMM testing...
I actually encountered this during testing this patch.
I update most code path to follow this sequence. However, still some corner ones
for TDVF HOB, which are less obvious and harder to update.
The TD just booted up and hang silently.
> >> > I mean if userspace specifies a NULL source_addr by mistake, it's better for
> >> > kernel to detect this mistake, similar to how it validates whether source_addr
> >> > is PAGE_ALIGNED.
> >
> > The alignment case is different. If userspace provides an unaligned value, KVM
> > *can't* do what userspace is asking because hardware and thus KVM only supports
> > converting on page boundaries.
> >
> > For a NULL source, KVM can still do what userspace is asking. Rejecting userspace's
> > request would then be making assumptions about what userspace wants.
> >
>
> Also, +1 on this, what if userspace, knowing that pages are zeroed on
> allocation, actually wants to rely on that to get a zero page in the guest?
What if 0 uaddr is a valid address? :)
> >> > Since userspace already needs to perform additional steps to enable in-place
> >> > copy, specifying a dedicated flag to indicate that the NULL source_addr is
> >> > intentional seems like a reasonable burden.
> >
> > I don't see how it adds any value. I wouldn't be at all surprised if most VMMs
> > just wen up with code that does:
> >
> > if (in-place) {
> > src = NULL;
> > flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
> > }
>
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Yan Zhao @ 2026-06-25 1:51 UTC (permalink / raw)
To: Sean Christopherson
Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ajx5Vrz9ma--hrGH@google.com>
On Wed, Jun 24, 2026 at 05:41:58PM -0700, Sean Christopherson wrote:
> On Wed, Jun 24, 2026, Ackerley Tng wrote:
> > Yan Zhao <yan.y.zhao@intel.com> writes:
> > > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > > MMAP flag. In such cases, shared memory is allocated from different backends.
> > > This means this module parameter only enables per-gmem memory attribute and does
> > > not guarantee that gmem in-place conversion will actually occur.
>
> KVM module params are pretty much always about what KVM supports, not what is
> guaranteed to happen.
>
> - enable_mmio_caching doesn't guarantee there will actually be MMIO SPTEs,
> because maybe the guest never accesses emulated MMIO.
> - enable_pmu doesn't guarantee VMs will get a PMU, because userspace may elect
> not to advertise one.
> - and so on and so forth...
>
> Yes, there's a small mental jump to get from "KVM supports in-place conversion"
> to "I need to set memory attributes on the guest_memfd instance, not the VM",
> but I don't see that as a big hurdle, certainly not in the long term. And once
> the VMM code is written, I really do think most people are going to care about
> whether or not KVM supports in-place conversion, not where PRIVATE is tracked.
Sorry, I just saw this mail after posting my reply in [1].
I'm ok with gmem_in_place_conversion=true just means KVM supports in-place
conversion, while we can still create VMs with shared memory not from gmem.
Though it still feels a bit odd to require TDX huge pages to depend on
gmem_in_place_conversion=true when shared memory is not currently allocated from
gmem, it should become more natural over time once gmem supports in-place
conversions for huge page.
[1] https://lore.kernel.org/all/ajyCn0PnFtQK+Nka@yzhao56-desk.sh.intel.com
> > > To avoid confusion, could we rename this module parameter to something more
> > > accurate, such as gmem_memory_attribute?
> >
> > I asked Sean about this after getting some fixes off list. Sean said
> > gmem_in_place_conversion is named for a host admin to use, and something
> > like gmem_memory_attributes is too much implementation details for the
> > admin.
> >
> > Sean, would you reconsider since Yan also asked? If the admin compiled
> > the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> > admin would also be able to use a param like gmem_memory_attributes?
>
> No, because it's not all memory attributes, it's very specifically the PRIVATE
> attribute that will get moved to guest_memfd. I don't want to pick a name that
> will become stale and confusing when RWX attributes come along. The RWX bits
> will be per-VM, while PRIVATE will be per-guest_memfd.
^ permalink raw reply
* Re: [PATCH v8 09/46] KVM: guest_memfd: Introduce function to check GFN private/shared status
From: Binbin Wu @ 2026-06-25 1:39 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <CAEvNRgG-WDzHp-15Mig4hiU5Dag0pFCu70-R-9b=PkD69W=ZMg@mail.gmail.com>
On 6/24/2026 10:38 PM, Ackerley Tng wrote:
> Binbin Wu <binbin.wu@linux.intel.com> writes:
>
>>
>> [...snip...]
>>
>>> +bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn)
>>> +{
>>> + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>>> + struct inode *inode;
>>> +
>>> + /*
>>> + * If this gfn has no associated memslot, there's no chance of the gfn
>>> + * being backed by private memory, since guest_memfd must be used for
>>> + * private memory,
>>
>> "guest_memfd must be used for private memory" is a bit confusing to me.
>>
>
> Hmm good point. Is the source of confusion that guest_memfd can be used
> for both shared and private memory?
Yes.
>
> Perhaps this can be rephrased as:
>
> guest_memfd is the only provider of private memory and guest_memfd must
> be used with a memslot, hence if there's no associated memslot, there's
> no chance of this gfn being private.
LGTM.
>
>>> and guest_memfd must be associated with some memslot.
>>> + */
>>> + if (!slot)
>>> + return 0;
>>> +
>>>
>>> [...snip...]
>>>
>
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Yan Zhao @ 2026-06-25 1:21 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgHYTFnHbsLLgMTCSitmnp1_j9Pomikm9qmpGTh1w8YE5Q@mail.gmail.com>
On Wed, Jun 24, 2026 at 05:05:44PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
>
> >
> > [...snip...]
> >
> >>
> >> #ifdef kvm_arch_has_private_mem
> >> -bool __ro_after_init gmem_in_place_conversion = false;
> >> +bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
> >> +module_param(gmem_in_place_conversion, bool, 0444);
> >
> > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > MMAP flag. In such cases, shared memory is allocated from different backends.
> > This means this module parameter only enables per-gmem memory attribute and does
> > not guarantee that gmem in-place conversion will actually occur.
> >
> > To avoid confusion, could we rename this module parameter to something more
> > accurate, such as gmem_memory_attribute?
> >
>
> I asked Sean about this after getting some fixes off list. Sean said
> gmem_in_place_conversion is named for a host admin to use, and something
> like gmem_memory_attributes is too much implementation details for the
> admin.
Thanks for this background.
Some more context on why I'm asking:
Currently, I'm testing TDX huge pages with the following two gmem components:
1. The gmem memory attribute in this gmem in-place conversion v8.
2. The gmem 2MB from buddy allocator. (for development/testing only).
The gmem 2MB from buddy allocator allocates 2MB folios from buddy for private
memory, while shared memory is allocated from a different backend.
(To avoid fragmentation, only private mappings are split during private-to-shared
conversions. In this approach, the 2MB folios are always retained in the gmem
inode filemap cache without splitting.)
Since shared memory is not allocated from gmem, there're no in-place conversions.
The reason I'm using "gmem memory attribute" is that the per-VM attribute is
being deprecated, as suggested by Sean [1].
Besides my current usage, there may be other scenarios where gmem memory
attributes is preferred without allocating shared memory from gmem.
(e.g., PAGE.ADD from a temp extra shared source memory).
For such use cases, I'm concerns that the admins may find it confusing if they
enable gmem_in_place_conversion but still observe extra memory consumptions for
shared memory.
[1] https://lore.kernel.org/kvm/aWmEegVP_A613WIr@google.com/
> Sean, would you reconsider since Yan also asked? If the admin compiled
> the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> admin would also be able to use a param like gmem_memory_attributes?
>
> There's the additional benefit that the similar naming aids in
> understanding for both the admin and software engineers.
>
> Either way, in the next revision, I'll also add this documentation for
> this module_param:
>
> Setting the module parameter gmem_in_place_conversion to true will
> enable the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl and disables
> the KVM_SET_MEMORY_ATTRIBUTES VM ioctl. If gmem_in_place_conversion is
> true, the private/shared attribute will be tracked per-guest_memfd
> instead of per-VM.
>
> Let me know what y'all think of the wording!
>
> >>
> >> [...snip...]
> >>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox