* Re: [PATCH v8 40/46] KVM: selftests: Reset shared memory after hole-punching
From: Fuad Tabba @ 2026-06-25 8:46 UTC
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-40-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> private_mem_conversions_test used to reset the shared memory that was used
> for the test to an initial pattern at the end of each test iteration. Then,
> it would punch out the pages, which would zero memory.
>
> Without in-place conversion, the reset writes shared memory and
> hole-punching zeroes private memory, which have separate backing, so the
> test returns to the state it had at the beginning of the for loop.
>
> With in-place conversion, resetting writes memory as shared, and
> hole-punching zeroes the same physical memory, hence undoing the reset
> done before the hole punch.
>
> Move the resetting after the hole-punching, and reset the entire
> PER_CPU_DATA_SIZE instead of just the tested range.
>
> With in-place conversion, this zeroes and then resets the same physical
> memory. Without in-place conversion, the private memory is zeroed, and the
> shared memory is reset to init_p.
>
> This is sufficient since at each test stage, the memory is assumed to start
> as shared, and private memory is always assumed to start zeroed. Conversion
> zeroes memory, so the future test stages will work as expected.
>
> Fixes: 43f623f350ce1 ("KVM: selftests: Add x86-only selftest for private memory conversions")
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> tools/testing/selftests/kvm/x86/private_mem_conversions_test.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> index 861baff201e78..289ad10063fca 100644
> --- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> @@ -202,15 +202,18 @@ static void guest_test_explicit_conversion(u64 base_gpa, bool do_fallocate)
> guest_sync_shared(gpa, size, p3, p4);
> memcmp_g(gpa, p4, size);
>
> - /* Reset the shared memory back to the initial pattern. */
> - memset((void *)gpa, init_p, size);
> -
> /*
> * Free (via PUNCH_HOLE) *all* private memory so that the next
> * iteration starts from a clean slate, e.g. with respect to
> * whether or not there are pages/folios in guest_mem.
> */
> guest_map_shared(base_gpa, PER_CPU_DATA_SIZE, true);
> +
> + /*
> + * Hole-punching above zeroed private memory. Reset shared
> + * memory in preparation for the next GUEST_STAGE.
> + */
> + memset((void *)base_gpa, init_p, PER_CPU_DATA_SIZE);
> }
> }
>
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
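For readers following the ordering argument in the commit message, a
minimal standalone sketch (not part of the patch) may help. It assumes
in-place conversion, so a single buffer stands in for the one physical
range that is both reset and hole-punched. sketch_punch_hole(),
SKETCH_SIZE and SKETCH_INIT_PATTERN are hypothetical names, and the
pattern value is an arbitrary stand-in for init_p.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_SIZE 16
#define SKETCH_INIT_PATTERN 0xcc /* arbitrary stand-in for init_p */

/* Hypothetical model of PUNCH_HOLE: the freed range reads back as zero. */
static void sketch_punch_hole(uint8_t *mem, size_t size)
{
	assert(mem != NULL);
	memset(mem, 0, size);
}

/* Old order: reset first, then punch. The punch wipes the reset. */
static void sketch_reset_then_punch(uint8_t *mem)
{
	memset(mem, SKETCH_INIT_PATTERN, SKETCH_SIZE);
	sketch_punch_hole(mem, SKETCH_SIZE);
}

/* New order: punch first, then reset. The pattern survives. */
static void sketch_punch_then_reset(uint8_t *mem)
{
	sketch_punch_hole(mem, SKETCH_SIZE);
	memset(mem, SKETCH_INIT_PATTERN, SKETCH_SIZE);
}

/* Returns 1 if every byte of mem equals val, 0 otherwise. */
static int sketch_all_bytes_equal(const uint8_t *mem, uint8_t val)
{
	for (size_t i = 0; i < SKETCH_SIZE; i++)
		if (mem[i] != val)
			return 0;
	return 1;
}
```

With one shared buffer, only the punch-then-reset order leaves the
initial pattern in place for the next iteration, which is the case the
patch fixes.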
* Re: [PATCH v8 41/46] KVM: selftests: Provide function to look up guest_memfd details from gpa
From: Fuad Tabba @ 2026-06-25 8:58 UTC
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-41-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce a new helper, kvm_gpa_to_guest_memfd(), to find the
> guest_memfd-related details of a memory region that contains a given guest
> physical address (GPA).
>
> The function returns the file descriptor for the memfd, the offset into
> the file that corresponds to the GPA, and the number of bytes remaining
> in the region from that GPA.
>
> kvm_gpa_to_guest_memfd() was factored out from vm_guest_mem_fallocate();
> refactor vm_guest_mem_fallocate() to use the new helper.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> tools/testing/selftests/kvm/include/kvm_util.h | 3 +++
> tools/testing/selftests/kvm/lib/kvm_util.c | 37 ++++++++++++++++----------
> 2 files changed, 26 insertions(+), 14 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index 79ab64ac8b869..3a6b1fa7f26ef 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -428,6 +428,9 @@ static inline void vm_enable_cap(struct kvm_vm *vm, u32 cap, u64 arg0)
> vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
> }
>
> +int kvm_gpa_to_guest_memfd(struct kvm_vm *vm, gpa_t gpa, off_t *fd_offset,
> + size_t *nr_bytes);
> +
> /*
> * KVM_SET_MEMORY_ATTRIBUTES{,2} overwrites _all_ attributes. These
> * flows need significant enhancements to support multiple attributes.
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 524ef97d634bf..0b2256ea65ff9 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -1305,27 +1305,20 @@ void vm_guest_mem_fallocate(struct kvm_vm *vm, u64 base, u64 size,
> bool punch_hole)
> {
> const int mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? FALLOC_FL_PUNCH_HOLE : 0);
> - struct userspace_mem_region *region;
> u64 end = base + size;
> - gpa_t gpa, len;
> off_t fd_offset;
> - int ret;
> + int fd, ret;
> + size_t len;
> + gpa_t gpa;
>
> for (gpa = base; gpa < end; gpa += len) {
> - u64 offset;
> -
> - region = userspace_mem_region_find(vm, gpa, gpa);
> - TEST_ASSERT(region && region->region.flags & KVM_MEM_GUEST_MEMFD,
> - "Private memory region not found for GPA 0x%lx", gpa);
> + fd = kvm_gpa_to_guest_memfd(vm, gpa, &fd_offset, &len);
> + len = min(end - gpa, len);
>
> - offset = gpa - region->region.guest_phys_addr;
> - fd_offset = region->region.guest_memfd_offset + offset;
> - len = min_t(u64, end - gpa, region->region.memory_size - offset);
> -
> - ret = fallocate(region->region.guest_memfd, mode, fd_offset, len);
> + ret = fallocate(fd, mode, fd_offset, len);
> TEST_ASSERT(!ret, "fallocate() failed to %s at %lx (len = %lu), fd = %d, mode = %x, offset = %lx",
> punch_hole ? "punch hole" : "allocate", gpa, len,
> - region->region.guest_memfd, mode, fd_offset);
> + fd, mode, fd_offset);
> }
> }
>
> @@ -1662,6 +1655,22 @@ void *addr_gpa2alias(struct kvm_vm *vm, gpa_t gpa)
> return (void *) ((uintptr_t) region->host_alias + offset);
> }
>
> +int kvm_gpa_to_guest_memfd(struct kvm_vm *vm, gpa_t gpa, off_t *fd_offset,
> + size_t *nr_bytes)
> +{
> + struct userspace_mem_region *region;
> + gpa_t gpa_offset;
> +
> + region = userspace_mem_region_find(vm, gpa, gpa);
> + TEST_ASSERT(region && region->region.flags & KVM_MEM_GUEST_MEMFD,
> + "guest_memfd memory region not found for GPA 0x%lx", gpa);
> +
> + gpa_offset = gpa - region->region.guest_phys_addr;
> + *fd_offset = region->region.guest_memfd_offset + gpa_offset;
> + *nr_bytes = region->region.memory_size - gpa_offset;
> + return region->region.guest_memfd;
> +}
> +
> /* Create an interrupt controller chip for the specified VM. */
> void vm_create_irqchip(struct kvm_vm *vm)
> {
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
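The offset arithmetic the helper performs can be tried outside the
selftest framework. The sketch below is an assumption-laden model, not
the selftest code: struct sketch_region is a hypothetical stand-in that
keeps only the four fields the helper reads, and plain assert() replaces
TEST_ASSERT().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the memslot fields the helper reads. */
struct sketch_region {
	uint64_t guest_phys_addr;    /* first GPA covered by the region */
	uint64_t memory_size;        /* region size in bytes */
	uint64_t guest_memfd_offset; /* file offset backing the first GPA */
	int guest_memfd;             /* fd of the backing guest_memfd */
};

/*
 * Translate a GPA inside the region into (fd, file offset) and report
 * how many bytes remain in the region from that GPA onwards.
 */
static int sketch_gpa_to_guest_memfd(const struct sketch_region *region,
				     uint64_t gpa, uint64_t *fd_offset,
				     uint64_t *nr_bytes)
{
	uint64_t gpa_offset;

	assert(gpa >= region->guest_phys_addr);
	gpa_offset = gpa - region->guest_phys_addr;
	assert(gpa_offset < region->memory_size);

	*fd_offset = region->guest_memfd_offset + gpa_offset;
	*nr_bytes = region->memory_size - gpa_offset;
	return region->guest_memfd;
}
```

For a region at GPA 0x100000 of size 0x4000 backed at file offset
0x2000, a lookup of GPA 0x101000 yields file offset 0x3000 with 0x3000
bytes remaining.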
* Re: [PATCH v8 42/46] KVM: selftests: Provide common function to set memory attributes
From: Fuad Tabba @ 2026-06-25 9:09 UTC
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-42-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Introduce vm_mem_set_memory_attributes(), which handles setting of memory
> attributes for a range of guest physical addresses, regardless of whether
> the attributes should be set via guest_memfd or via the memory attributes
> at the VM level.
>
> Refactor existing vm_mem_set_{shared,private} functions to use the new
> function. Opportunistically update the size parameter to use size_t instead
> of u64.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> tools/testing/selftests/kvm/include/kvm_util.h | 46 +++++++++++++++++++-------
> 1 file changed, 34 insertions(+), 12 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index 3a6b1fa7f26ef..db1442da21bb1 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -454,18 +454,6 @@ static inline void vm_set_memory_attributes(struct kvm_vm *vm, gpa_t gpa,
> vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
> }
>
> -static inline void vm_mem_set_private(struct kvm_vm *vm, gpa_t gpa,
> - u64 size)
> -{
> - vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> -}
> -
> -static inline void vm_mem_set_shared(struct kvm_vm *vm, gpa_t gpa,
> - u64 size)
> -{
> - vm_set_memory_attributes(vm, gpa, size, 0);
> -}
> -
> static inline int __gmem_set_memory_attributes(int fd, u64 offset,
> size_t size, u64 attributes,
> u64 *error_offset)
> @@ -532,6 +520,40 @@ static inline void gmem_set_shared(int fd, u64 offset, size_t size)
> gmem_set_memory_attributes(fd, offset, size, 0);
> }
>
> +static inline void vm_mem_set_memory_attributes(struct kvm_vm *vm, gpa_t gpa,
> + size_t size, u64 attrs)
> +{
> + if (kvm_has_gmem_attributes) {
> + gpa_t end = gpa + size;
> + off_t fd_offset;
> + gpa_t addr;
> + size_t len;
> + int fd;
> +
> + for (addr = gpa; addr < end; addr += len) {
> + fd = kvm_gpa_to_guest_memfd(vm, addr, &fd_offset, &len);
> + len = min(end - addr, len);
> +
> + gmem_set_memory_attributes(fd, fd_offset, len, attrs);
> + }
> + } else {
> + vm_set_memory_attributes(vm, gpa, size, attrs);
> + }
> +}
> +
> +static inline void vm_mem_set_private(struct kvm_vm *vm, gpa_t gpa,
> + size_t size)
> +{
> + vm_mem_set_memory_attributes(vm, gpa, size,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +}
> +
> +static inline void vm_mem_set_shared(struct kvm_vm *vm, gpa_t gpa,
> + size_t size)
> +{
> + vm_mem_set_memory_attributes(vm, gpa, size, 0);
> +}
> +
> void vm_guest_mem_fallocate(struct kvm_vm *vm, gpa_t gpa, u64 size,
> bool punch_hole);
>
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
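The per-guest_memfd branch walks the GPA range one memslot at a time,
clamping each step to the end of the request. A standalone sketch of
that walk follows, under stated assumptions: struct sketch_region,
struct sketch_chunk and sketch_split_range() are hypothetical names, and
the sketch records each (fd, offset, len) chunk instead of issuing the
attribute ioctl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memslot model: GPA range plus backing fd and offset. */
struct sketch_region {
	uint64_t gpa;
	uint64_t size;
	uint64_t fd_offset;
	int fd;
};

/* One per-fd request that the walk would issue. */
struct sketch_chunk {
	int fd;
	uint64_t fd_offset;
	uint64_t len;
};

/* Linear search for the region containing gpa, NULL if none. */
static const struct sketch_region *
sketch_find(const struct sketch_region *regions, size_t nr, uint64_t gpa)
{
	for (size_t i = 0; i < nr; i++)
		if (gpa >= regions[i].gpa &&
		    gpa - regions[i].gpa < regions[i].size)
			return &regions[i];
	return NULL;
}

/* Split [gpa, gpa + size) into per-region chunks; returns the count. */
static size_t sketch_split_range(const struct sketch_region *regions,
				 size_t nr, uint64_t gpa, uint64_t size,
				 struct sketch_chunk *chunks, size_t max)
{
	uint64_t end = gpa + size;
	uint64_t len;
	size_t n = 0;

	for (uint64_t addr = gpa; addr < end; addr += len) {
		const struct sketch_region *r = sketch_find(regions, nr, addr);
		uint64_t off;

		assert(r && n < max);
		off = addr - r->gpa;
		len = r->size - off;      /* bytes left in this region */
		if (len > end - addr)     /* clamp to the request */
			len = end - addr;

		chunks[n++] = (struct sketch_chunk){ r->fd, r->fd_offset + off, len };
	}
	return n;
}
```

A request that straddles two memslots therefore becomes two chunks, one
per backing fd, which is why the helper cannot simply forward the whole
range in a single call.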
* Re: [PATCH v8 43/46] KVM: selftests: Check fd/flags provided to mmap() when setting up memslot
From: Fuad Tabba @ 2026-06-25 9:20 UTC
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-43-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Assert that a valid fd passed to mmap() is accompanied by MAP_SHARED.
>
> With an invalid fd (usually used for anonymous mappings), there are no
> constraints on mmap() flags.
>
> Add this check to make sure that when a guest_memfd is used as region->fd,
> the flag provided to mmap() will include MAP_SHARED.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> [Rephrase assertion message.]
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> tools/testing/selftests/kvm/lib/kvm_util.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 0b2256ea65ff9..6b304e8a0e0d5 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -1110,6 +1110,9 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
> src_type == VM_MEM_SRC_SHARED_HUGETLB);
> }
>
> + TEST_ASSERT(region->fd == -1 || backing_src_is_shared(src_type),
> + "A valid fd provided to mmap() must be accompanied by MAP_SHARED.");
> +
> region->mmap_start = __kvm_mmap(region->mmap_size, PROT_READ | PROT_WRITE,
> vm_mem_backing_src_alias(src_type)->flag,
> region->fd, mmap_offset);
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
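The rule being asserted can be written as a small predicate. This is
only a sketch: sketch_mmap_args_ok() is a hypothetical name, and it
inspects raw mmap() flags, whereas the patch checks
backing_src_is_shared(src_type).

```c
#include <assert.h> /* only needed by callers that check the result */
#include <stdbool.h>
#include <sys/mman.h>

/*
 * A mapping with no fd (-1, typically anonymous memory) may use any
 * flags; a mapping of a real fd must request MAP_SHARED.
 */
static bool sketch_mmap_args_ok(int fd, int flags)
{
	return fd == -1 || (flags & MAP_SHARED) != 0;
}
```

A private mapping of a file-backed region would see copy-on-write pages
rather than the file's own pages, which is presumably why the assertion
rejects it for guest_memfd.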
* Re: [PATCH v8 44/46] KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
From: Fuad Tabba @ 2026-06-25 9:30 UTC
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-44-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> The TEST_EXPECT_SIGBUS macro is not thread-safe as it uses a global
> sigjmp_buf and installs a global SIGBUS signal handler. If multiple threads
> execute the macro concurrently, they will race on installing the signal
> handler and stomp on other threads' jump buffers, leading to incorrect test
> behavior.
>
> Make TEST_EXPECT_SIGBUS thread-safe with the following changes:
>
> Share the KVM tests' global signal handler. sigaction() applies to all
> threads; without sharing a global signal handler, one thread may have
> removed the signal handler that another thread added, hence leading to
> unexpected signals.
>
> The alternative of layering signal handlers was considered, but calling
> sigaction() within TEST_EXPECT_SIGBUS() necessarily creates a race. To
> avoid adding new setup and teardown routines to do sigaction() and keep
> usage of TEST_EXPECT_SIGBUS() simple, share the KVM tests' global signal
> handler.
>
> Opportunistically rename report_unexpected_signal to
> catchall_signal_handler.
>
> To continue to only expect SIGBUS within specific regions of code, use a
> thread-specific variable, expecting_sigbus, to replace installing and
> removing signal handlers.
>
> Make the execution environment for the thread, sigjmp_buf, a
> thread-specific variable.
>
> As part of TEST_EXPECT_SIGBUS(), assert the prerequisite for this setup,
> that the current signal handler is the catchall_signal_handler.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> tools/testing/selftests/kvm/include/test_util.h | 32 +++++++++++++------------
> tools/testing/selftests/kvm/lib/kvm_util.c | 18 ++++++++++----
> tools/testing/selftests/kvm/lib/test_util.c | 7 ------
> 3 files changed, 30 insertions(+), 27 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
> index 51287fac8138a..bd75162ec868d 100644
> --- a/tools/testing/selftests/kvm/include/test_util.h
> +++ b/tools/testing/selftests/kvm/include/test_util.h
> @@ -82,21 +82,23 @@ do { \
> __builtin_unreachable(); \
> } while (0)
>
> -extern sigjmp_buf expect_sigbus_jmpbuf;
> -void expect_sigbus_handler(int signum);
> -
> -#define TEST_EXPECT_SIGBUS(action) \
> -do { \
> - struct sigaction sa_old, sa_new = { \
> - .sa_handler = expect_sigbus_handler, \
> - }; \
> - \
> - sigaction(SIGBUS, &sa_new, &sa_old); \
> - if (sigsetjmp(expect_sigbus_jmpbuf, 1) == 0) { \
> - action; \
> - TEST_FAIL("'%s' should have triggered SIGBUS", #action); \
> - } \
> - sigaction(SIGBUS, &sa_old, NULL); \
> +extern __thread sigjmp_buf expect_sigbus_jmpbuf;
> +extern __thread volatile sig_atomic_t expecting_sigbus;
> +extern void catchall_signal_handler(int signum);
> +
> +#define TEST_EXPECT_SIGBUS(action) \
> +do { \
> + struct sigaction __sa = {}; \
> + \
> + TEST_ASSERT_EQ(sigaction(SIGBUS, NULL, &__sa), 0); \
> + TEST_ASSERT_EQ(__sa.sa_handler, &catchall_signal_handler); \
> + \
> + expecting_sigbus = true; \
> + if (sigsetjmp(expect_sigbus_jmpbuf, 1) == 0) { \
> + action; \
> + TEST_FAIL("'%s' should have triggered SIGBUS", #action);\
> + } \
> + expecting_sigbus = false; \
> } while (0)
>
> size_t parse_size(const char *size);
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 6b304e8a0e0d5..b4f104436875b 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -2292,13 +2292,20 @@ __weak void kvm_selftest_arch_init(void)
> {
> }
>
> -static void report_unexpected_signal(int signum)
> +__thread sigjmp_buf expect_sigbus_jmpbuf;
> +__thread volatile sig_atomic_t expecting_sigbus;
> +
> +void catchall_signal_handler(int signum)
> {
> + switch (signum) {
> + case SIGBUS: {
> + if (expecting_sigbus)
> + siglongjmp(expect_sigbus_jmpbuf, 1);
> +
> + TEST_FAIL("Unexpected SIGBUS (%d)\n", signum);
> + }
> #define KVM_CASE_SIGNUM(sig) \
> case sig: TEST_FAIL("Unexpected " #sig " (%d)\n", signum)
> -
> - switch (signum) {
> - KVM_CASE_SIGNUM(SIGBUS);
> KVM_CASE_SIGNUM(SIGSEGV);
> KVM_CASE_SIGNUM(SIGILL);
> KVM_CASE_SIGNUM(SIGFPE);
> @@ -2310,12 +2317,13 @@ static void report_unexpected_signal(int signum)
> void __attribute((constructor)) kvm_selftest_init(void)
> {
> struct sigaction sig_sa = {
> - .sa_handler = report_unexpected_signal,
> + .sa_handler = catchall_signal_handler,
> };
>
> /* Tell stdout not to buffer its content. */
> setbuf(stdout, NULL);
>
> + expecting_sigbus = false;
> sigaction(SIGBUS, &sig_sa, NULL);
> sigaction(SIGSEGV, &sig_sa, NULL);
> sigaction(SIGILL, &sig_sa, NULL);
> diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
> index bab1bd2b775b6..30eb701e4becd 100644
> --- a/tools/testing/selftests/kvm/lib/test_util.c
> +++ b/tools/testing/selftests/kvm/lib/test_util.c
> @@ -18,13 +18,6 @@
>
> #include "test_util.h"
>
> -sigjmp_buf expect_sigbus_jmpbuf;
> -
> -void __attribute__((used)) expect_sigbus_handler(int signum)
> -{
> - siglongjmp(expect_sigbus_jmpbuf, 1);
> -}
> -
> /*
> * Random number generator that is usable from guest code. This is the
> * Park-Miller LCG using standard constants.
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
* Re: [PATCH v8 45/46] KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd
From: Fuad Tabba @ 2026-06-25 9:43 UTC
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-45-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Update the private memory conversions selftest to also test conversions
> that are done "in-place" via per-guest_memfd memory attributes. In-place
> conversions require the host to be able to mmap() the guest_memfd so that
> the host and guest can share the same backing physical memory.
>
> This includes several updates that are conditioned on the system
> supporting per-guest_memfd attributes (kvm_has_gmem_attributes):
>
> 1. Set up guest_memfd requesting MMAP and INIT_SHARED.
>
> 2. With in-place conversions, the host's mapping points directly to the
> guest's memory. When the guest converts a region to private, host access
> to that region is blocked. Update the test to expect a SIGBUS when
> attempting to access the host virtual address (HVA) of private memory.
>
> 3. Use vm_mem_set_memory_attributes(), which chooses how to set memory
> attributes based on whether kvm_has_gmem_attributes.
>
> Restrict the test to using VM_MEM_SRC_SHMEM because guest_memfd's required
> mmap() flags and page sizes happen to align with those of
> VM_MEM_SRC_SHMEM. As long as VM_MEM_SRC_SHMEM is used for src_type,
> vm_mem_add() works as intended.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> .../kvm/x86/private_mem_conversions_test.c | 44 ++++++++++++++++++----
> 1 file changed, 36 insertions(+), 8 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> index 289ad10063fca..4308c67952310 100644
> --- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> @@ -306,9 +306,12 @@ static void handle_exit_hypercall(struct kvm_vcpu *vcpu)
> if (do_fallocate)
> vm_guest_mem_fallocate(vm, gpa, size, map_shared);
>
> - if (set_attributes)
> - vm_set_memory_attributes(vm, gpa, size,
> - map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE);
> + if (set_attributes) {
> + u64 attrs = map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +
> + vm_mem_set_memory_attributes(vm, gpa, size, attrs);
> + }
> +
> run->hypercall.ret = 0;
> }
>
> @@ -352,8 +355,20 @@ static void *__test_mem_conversions(void *__vcpu)
> size_t nr_bytes = min_t(size_t, vm->page_size, size - i);
> u8 *hva = addr_gpa2hva(vm, gpa + i);
>
> - /* In all cases, the host should observe the shared data. */
> - memcmp_h(hva, gpa + i, uc.args[3], nr_bytes);
> + /*
> + * When using per-guest_memfd memory attributes,
> + * i.e. in-place conversion, host accesses will
> + * point at guest memory and should SIGBUS when
> + * guest memory is private. When using per-VM
> + * attributes, i.e. separate backing for shared
> + * vs. private, the host should always observe
> + * the shared data.
> + */
> + if (kvm_has_gmem_attributes &&
> + uc.args[0] == SYNC_PRIVATE)
> + TEST_EXPECT_SIGBUS(READ_ONCE(*hva));
> + else
> + memcmp_h(hva, gpa + i, uc.args[3], nr_bytes);
>
> /* For shared, write the new pattern to guest memory. */
> if (uc.args[0] == SYNC_SHARED)
> @@ -382,6 +397,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, u32 nr_v
> const size_t slot_size = memfd_size / nr_memslots;
> struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
> pthread_t threads[KVM_MAX_VCPUS];
> + u64 gmem_flags;
> struct kvm_vm *vm;
> int memfd, i;
>
> @@ -397,12 +413,17 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, u32 nr_v
>
> vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
>
> - memfd = vm_create_guest_memfd(vm, memfd_size, 0);
> + if (kvm_has_gmem_attributes)
> + gmem_flags = GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED;
> + else
> + gmem_flags = 0;
> +
> + memfd = vm_create_guest_memfd(vm, memfd_size, gmem_flags);
>
> for (i = 0; i < nr_memslots; i++)
> vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
> BASE_DATA_SLOT + i, slot_size / vm->page_size,
> - KVM_MEM_GUEST_MEMFD, memfd, slot_size * i, 0);
> + KVM_MEM_GUEST_MEMFD, memfd, slot_size * i, gmem_flags);
>
> for (i = 0; i < nr_vcpus; i++) {
> gpa_t gpa = BASE_DATA_GPA + i * per_cpu_size;
> @@ -452,17 +473,24 @@ static void usage(const char *cmd)
>
> int main(int argc, char *argv[])
> {
> - enum vm_mem_backing_src_type src_type = DEFAULT_VM_MEM_SRC;
> + enum vm_mem_backing_src_type src_type;
> u32 nr_memslots = 1;
> u32 nr_vcpus = 1;
> int opt;
>
> TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
>
> + src_type = kvm_has_gmem_attributes ? VM_MEM_SRC_SHMEM :
> + DEFAULT_VM_MEM_SRC;
> +
> while ((opt = getopt(argc, argv, "hm:s:n:")) != -1) {
> switch (opt) {
> case 's':
> src_type = parse_backing_src_type(optarg);
> + TEST_ASSERT(!kvm_has_gmem_attributes ||
> + src_type == VM_MEM_SRC_SHMEM,
> + "Testing in-place conversions, only %s mem_type supported\n",
> + vm_mem_backing_src_alias(VM_MEM_SRC_SHMEM)->name);
> break;
> case 'n':
> nr_vcpus = atoi_positive("nr_vcpus", optarg);
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
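The host-side expectation added to __test_mem_conversions() reduces to a
two-input decision. A minimal sketch, with hypothetical enum and
function names standing in for the test's SYNC_* values and the
kvm_has_gmem_attributes check:

```c
#include <assert.h> /* only needed by callers that check the result */
#include <stdbool.h>

/* Hypothetical stand-ins for the test's sync types. */
enum sketch_sync { SKETCH_SYNC_SHARED, SKETCH_SYNC_PRIVATE };

/* What the host should do when it touches the HVA for a GPA. */
enum sketch_host_check {
	SKETCH_COMPARE_SHARED_DATA, /* host reads and compares the pattern */
	SKETCH_EXPECT_SIGBUS,       /* host access must fault */
};

static enum sketch_host_check
sketch_host_check_for(bool has_gmem_attributes, enum sketch_sync sync)
{
	/* In-place conversion: the host mapping aliases guest memory. */
	if (has_gmem_attributes && sync == SKETCH_SYNC_PRIVATE)
		return SKETCH_EXPECT_SIGBUS;

	/* Separate backing, or memory currently shared. */
	return SKETCH_COMPARE_SHARED_DATA;
}
```

Only one of the four combinations expects a fault: per-guest_memfd
attributes with the range currently private.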
* Re: [PATCH v8 46/46] KVM: selftests: Update private memory exits test to work with per-gmem attributes
From: Fuad Tabba @ 2026-06-25 9:56 UTC
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-46-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Skip setting memory to private in the private memory exits test when using
> per-gmem memory attributes, as memory is initialized to private by default
> for guest_memfd, and using vm_mem_set_private() on a guest_memfd instance
> requires creating guest_memfd with GUEST_MEMFD_FLAG_MMAP (which is totally
> doable, but would need to be conditional and is ultimately unnecessary).
>
> Expect an emulated MMIO instead of a memory fault exit when attributes are
> per-gmem, as deleting the memslot effectively drops the private status,
> i.e. the GPA becomes shared and thus supports emulated MMIO.
>
> Skip the "memslot not private" test entirely, as private vs. shared state
> for x86 software-protected VMs comes from the memory attributes themselves,
> and so when doing in-place conversions there can never be a disconnect
> between the expected and actual states.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> .../selftests/kvm/x86/private_mem_kvm_exits_test.c | 36 ++++++++++++++++++----
> 1 file changed, 30 insertions(+), 6 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c b/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
> index 10db9fe6d9063..70ed16066c63e 100644
> --- a/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
> +++ b/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
> @@ -62,8 +62,9 @@ static void test_private_access_memslot_deleted(void)
>
> virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
>
> - /* Request to access page privately */
> - vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
> + /* Request to access page privately. */
> + if (!kvm_has_gmem_attributes)
> + vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
>
> pthread_create(&vm_thread, NULL,
> (void *(*)(void *))run_vcpu_get_exit_reason,
> @@ -74,10 +75,26 @@ static void test_private_access_memslot_deleted(void)
> pthread_join(vm_thread, &thread_return);
> exit_reason = (u32)(u64)thread_return;
>
> - TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
> - TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
> - TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
> - TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
> + /*
> + * If attributes are tracked per-gmem, deleting the memslot that points
> + * at the gmem instance effectively makes the memory shared, and so the
> + * read should trigger emulated MMIO.
> + *
> + * If attributes are tracked per-VM, deleting the memslot shouldn't
> + * affect the private attribute, and so KVM should generate a memory
> + * fault exit (emulated MMIO on private GPAs is disallowed).
> + */
> + if (kvm_has_gmem_attributes) {
> + TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MMIO);
> + TEST_ASSERT_EQ(vcpu->run->mmio.phys_addr, EXITS_TEST_GPA);
> + TEST_ASSERT_EQ(vcpu->run->mmio.len, sizeof(u64));
> + TEST_ASSERT_EQ(vcpu->run->mmio.is_write, false);
> + } else {
> + TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
> + TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
> + TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
> + TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
> + }
>
> kvm_vm_free(vm);
> }
> @@ -88,6 +105,13 @@ static void test_private_access_memslot_not_private(void)
> struct kvm_vcpu *vcpu;
> u32 exit_reason;
>
> + /*
> + * Accessing non-private memory as private with a software-protected VM
> + * isn't possible when doing in-place conversions.
> + */
> + if (kvm_has_gmem_attributes)
> + return;
> +
> vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
> guest_repeatedly_read);
>
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v2 02/17] x86/virt/tdx: Configure add-on features on TDX module init and update
From: Xu Yilun @ 2026-06-25 10:50 UTC (permalink / raw)
To: Chao Gao
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
tony.lindgren, peter.fang, baolu.lu, zhenzhong.duan, dave.hansen,
dave.hansen, seanjc
In-Reply-To: <ajpHRNaq+z5bdn+R@intel.com>
> >For runtime update, Linux applies a policy that no newer features should
> >be added after update to avoid disrupting live TDX operations. To adhere
> >to this, TDH.SYS.UPDATE must configure the same features as the
> >TDH.SYS.CONFIG. Record the kernel required add-on feature bitmap in a
> >global var so that both phases can use it.
>
> Actually, we do not need another global variable here. tdx_features0 is cached
> and is not updated across a runtime update, so the derived add-on feature
> bitmap will be the same before and after the update.
I think a global var "static u64 tdx_addon_feature0 *__ro_after_init*;"
better illustrates the policy that the add-on feature bitmap should be decided
at boot and never change later. It will also be used to decide whether a
specific add-on feature initialization is needed. We don't want to calculate
the bitmap again and again, even though the result would always be the same.
Maybe I should strengthen the commit message:
... both phases can use it. This actually mirrors a TDX module internal state
so that kernel knows which add-on TDX operations (for example, quoting
SEAMCALLs, which will be added in later patches) are valid.
>
>
> > static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
> > u64 global_keyid)
> > {
> >+ u64 seamcall_fn = TDH_SYS_CONFIG_V0;
> > struct tdx_module_args args = {};
> > u64 *tdmr_pa_array;
> > size_t array_sz;
> >@@ -1032,7 +1042,15 @@ static __init int config_tdx_module(struct tdmr_info_list *tdmr_list,
> > args.rcx = __pa(tdmr_pa_array);
> > args.rdx = tdmr_list->nr_consumed_tdmrs;
> > args.r8 = global_keyid;
> >- ret = seamcall_prerr(TDH_SYS_CONFIG, &args);
> >+
> >+ set_tdx_addon_features();
> >+
> >+ if (tdx_addon_feature0) {
> >+ args.r9 = tdx_addon_feature0;
>
> How about moving this r9 assignment out of the if block and placing it next to
> 'args.r8 = global_keyid;'? There is no need to guard it, because args.r9 will
> be 0 when no add-on features are enabled, which is perfectly fine.
I'd prefer to keep the r9 assignment in the block: it clearly shows which
SEAMCALL version needs which parameters, and helps people map the code to the
TDX module spec.
>
> >+ seamcall_fn = TDH_SYS_CONFIG;
> >+ }
> >+
> >+ ret = seamcall_prerr(seamcall_fn, &args);
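The guarded r9 assignment discussed above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the enum values, struct toy_seamcall_args and toy_pick_config_leaf() are hypothetical names, not the kernel's definitions, and no real SEAMCALL is issued.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for TDH_SYS_CONFIG_V0 and TDH_SYS_CONFIG. */
enum toy_config_leaf {
	TOY_CONFIG_V0,	/* legacy version, takes no add-on feature bitmap */
	TOY_CONFIG_V1,	/* newer version, takes the bitmap in r9 */
};

/* Hypothetical subset of the SEAMCALL argument registers. */
struct toy_seamcall_args {
	uint64_t rcx, rdx, r8, r9;
};

/*
 * Pick the leaf version from the add-on feature bitmap. r9 is written only
 * on the path that selects the newer version, so the code shows which
 * version consumes which parameter.
 */
static enum toy_config_leaf
toy_pick_config_leaf(struct toy_seamcall_args *args, uint64_t addon_features)
{
	assert(args);

	if (!addon_features)
		return TOY_CONFIG_V0;

	args->r9 = addon_features;
	return TOY_CONFIG_V1;
}
```

With an empty bitmap the model returns TOY_CONFIG_V0 and leaves r9 untouched. Any non-zero bitmap selects TOY_CONFIG_V1 and is passed in r9.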
^ permalink raw reply
* Re: [PATCH v2 03/17] x86/virt/tdx: Detect if the extensions initialization is required
From: Xu Yilun @ 2026-06-25 10:57 UTC (permalink / raw)
To: Tony Lindgren
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
peter.fang, baolu.lu, zhenzhong.duan, dave.hansen, dave.hansen,
seanjc
In-Reply-To: <ajy6VMlPK08K7kIT@tlindgre-MOBL1>
On Thu, Jun 25, 2026 at 08:19:16AM +0300, Tony Lindgren wrote:
> On Thu, Jun 18, 2026 at 04:13:41PM +0800, Xu Yilun wrote:
> > TDX module extensions support extension SEAMCALLs that are preemptible
> > and resumable, unlike normal SEAMCALLs that run to completion while
> > monopolizing the CPU. This allows for higher-level API constructions,
> > so better supports some add-on features that implement higher order
> > security protocols.
>
> How about "TDX module extension SEAMCALLs are preemptible and resumable..."
> above to make it easier to read?
Included, thanks.
>
> Other than that:
>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Yan Zhao @ 2026-06-25 10:57 UTC (permalink / raw)
To: Sean Christopherson, Ackerley Tng, aik, andrew.jones, binbin.wu,
brauner, chao.p.peng, david, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <ajyJhZcgfYFtGfS2@yzhao56-desk.sh.intel.com>
On Thu, Jun 25, 2026 at 09:51:01AM +0800, Yan Zhao wrote:
> On Wed, Jun 24, 2026 at 05:41:58PM -0700, Sean Christopherson wrote:
> > On Wed, Jun 24, 2026, Ackerley Tng wrote:
> > > Yan Zhao <yan.y.zhao@intel.com> writes:
> > > > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > > > MMAP flag. In such cases, shared memory is allocated from different backends.
> > > > This means this module parameter only enables per-gmem memory attribute and does
> > > > not guarantee that gmem in-place conversion will actually occur.
> >
> > KVM module params are pretty much always about what KVM supports, not what is
> > guaranteed to happen.
> >
> > - enable_mmio_caching doesn't guarantee there will actually be MMIO SPTEs,
> > because maybe the guest never accesses emulated MMIO.
> > - enable_pmu doesn't guarantee VMs will get a PMU, because userspace may elect
> > not to advertise one.
> > - and so on and so forth...
> >
> > Yes, there's a small mental jump to get from "KVM supports in-place conversion"
> > to "I need to set memory attributes on the guest_memfd instance, not the VM",
> > but I don't see that as a big hurdle, certainly not in the long term. And once
> > the VMM code is written, I really do think most people are going to care about
> > whether or not KVM supports in-place conversion, not where PRIVATE is tracked.
> Sorry, I just saw this mail after posting my reply in [1].
>
> I'm ok with gmem_in_place_conversion=true just means KVM supports in-place
> conversion, while we can still create VMs with shared memory not from gmem.
Or what about "allow_gmem_in_place_conversion" ?
> Though it still feels a bit odd to require TDX huge pages to depend on
> gmem_in_place_conversion=true when shared memory is not currently allocated from
> gmem, it should become more natural over time once gmem supports in-place
> conversions for huge page.
>
> [1] https://lore.kernel.org/all/ajyCn0PnFtQK+Nka@yzhao56-desk.sh.intel.com
>
>
> > > > To avoid confusion, could we rename this module parameter to something more
> > > > accurate, such as gmem_memory_attribute?
> > >
> > > I asked Sean about this after getting some fixes off list. Sean said
> > > gmem_in_place_conversion is named for a host admin to use, and something
> > > like gmem_memory_attributes is too much implementation details for the
> > > admin.
> > >
> > > Sean, would you reconsider since Yan also asked? If the admin compiled
> > > the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> > > admin would also be able to use a param like gmem_memory_attributes?
> >
> > No, because it's not all memory attributes, it's very specifically the PRIVATE
> > attribute that will get moved to guest_memfd. I don't want to pick a name that
> > will become stale and confusing when RWX attributes come along. The RWX bits
> > will be per-VM, while PRIVATE will be per-guest_memfd.
^ permalink raw reply
* [Invitation] bi-weekly guest_memfd upstream call on 2026-06-25
From: David Hildenbrand (Arm) @ 2026-06-25 12:12 UTC (permalink / raw)
To: linux-coco@lists.linux.dev, linux-mm@kvack.org, KVM
Hi,
very late reminder :/
Our next guest_memfd upstream call is scheduled for today, Thursday,
2026-06-25 8:00 - 9:00am (GMT-07:00) Pacific Time - Vancouver.
So far we don't have a lot of topics, so maybe this could be one of these rare
short meetings :)
If we have the right people in the call, I would like to continue the discussion
on proposed memory hot(un)plug/virtio-mem support.
We'll be using the following Google meet:
http://meet.google.com/wxp-wtju-jzw
The meeting notes can be found at [1], where we also link recordings and
collect current guest_memfd upstream proposals. If you want a Google
calendar invitation that also covers all future meetings, just write me
or Ackerley a mail.
To put something to discuss onto the agenda, reply to this mail or add
them to the "Topics/questions for next meeting(s)" section in the
meeting notes as a comment.
[1]
https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?usp=sharing
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v7 10/42] KVM: guest_memfd: Ensure pages are not in use before conversion
From: David Hildenbrand (Arm) @ 2026-06-25 12:36 UTC (permalink / raw)
To: Ackerley Tng, Vlastimil Babka (SUSE), aik, andrew.jones,
binbin.wu, brauner, chao.p.peng, ira.weiny, jmattson, jthoughton,
michael.roth, oupton, pankaj.gupta, qperret, rick.p.edgecombe,
rientjes, shivankg, steven.price, tabba, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Jason Gunthorpe
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgHM4a66Jx9++6iioQLpFY-KgPvjY5+bg_X97DfSjpXzRQ@mail.gmail.com>
On 6/19/26 02:17, Ackerley Tng wrote:
> "Vlastimil Babka (SUSE)" <vbabka@kernel.org> writes:
>
>> On 5/23/26 02:17, Ackerley Tng via B4 Relay wrote:
>>> From: Ackerley Tng <ackerleytng@google.com>
>>>
>>> When converting memory to private in guest_memfd, it is necessary to ensure
>>> that the pages are not currently being accessed by any other part of the
>>> kernel or userspace to avoid any current user writing to guest private
>>> memory.
>>>
>>> guest_memfd checks for unexpected refcounts to determine whether a page is
>>> still in use. The only expected refcounts after unmapping the range
>>> requested for conversion are those that are held by guest_memfd itself.
>>
>> Is it sufficient to only check, and not also freeze the refcount? (i.e.
>> using folio_ref_freeze()), because without freezing, anything (e.g.
>> compaction's pfn-based scanner) could do a speculative folio_try_get() and
>> the checked refcount becomes stale.
>>
>
> I believe there's no issue here, since the main thing here is to check
> for long-term pins on the folio. Perhaps David can help me verify. :)
I think I raised this in the past as well: ideally, we'd be freezing the
refcount, then, there is no need to worry about any concurrent access.
However, we could really only get additional page references through PFN walkers
(or speculative references), not through page tables or GUP pins, which is what
we care about.
So if we can tolerate a speculative bump+release of a folio reference, likely
we're good.
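To illustrate the difference between checking and freezing a refcount, here is a minimal userspace model using C11 atomics. It is a sketch, not kernel code: struct toy_folio and all toy_*() helpers are hypothetical, and they only mimic the general semantics of a speculative "get unless zero" and of a freeze that swaps the expected count for zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a folio; only the reference count is modelled. */
struct toy_folio {
	atomic_int refcount;
};

/* Speculative reference: succeeds only while the count is non-zero. */
static bool toy_try_get(struct toy_folio *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure, "old" is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void toy_put(struct toy_folio *f)
{
	int old = atomic_fetch_sub(&f->refcount, 1);

	assert(old > 0);
}

/* Check-only: a snapshot that may be stale as soon as it returns. */
static bool toy_ref_matches(struct toy_folio *f, int expected)
{
	return atomic_load(&f->refcount) == expected;
}

/* Freeze: atomically replace "expected" with 0, so later try_get calls fail. */
static bool toy_ref_freeze(struct toy_folio *f, int expected)
{
	int old = expected;

	return atomic_compare_exchange_strong(&f->refcount, &old, 0);
}

static void toy_ref_unfreeze(struct toy_folio *f, int count)
{
	atomic_store(&f->refcount, count);
}
```

In this model, a toy_try_get() that lands after toy_ref_matches() succeeds and makes the earlier check stale. After a successful toy_ref_freeze(), the same toy_try_get() fails until toy_ref_unfreeze() runs.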
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: David Hildenbrand (Arm) @ 2026-06-25 12:57 UTC (permalink / raw)
To: Sean Christopherson, Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ajx3vmNPRf-M9kR6@google.com>
On 6/25/26 02:35, Sean Christopherson wrote:
> On Wed, Jun 24, 2026, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>
>>>
>>> Under what circumstances does this happen,
>>
>> It happened 100% of the time in selftests. Perhaps it's because in the
>> selftests the pages are almost always freshly allocated and so the
>> lru_add fbatch isn't full yet? (and that the host isn't super busy so
>> lru_add fbatch doesn't get drained yet).
>
> I chatted with Ackerley about this. What I wanted to understand is why guest_memfd
> pages were getting put onto per-CPU batches for lru_add(), given that guest_memfd
> pages are unevictable. The answer (assuming I read the code right), is that
> lruvec_add_folio() updates stats and other per-lru metadata for the unevictable
> lru, and does so under a per-lru lock. I.e. we don't want to skip that stuff
> entirely.
Hm. Our pages don't participate in any LRU activity (including
isolation+migration). Isolation+migration would only apply once we'd support
page migration.
But yes, secretmem also does it like that: filemap_add_folio() will call
folio_add_lru().
Traditionally we used the unevictable LRU only for mlock purposes.
But yeah, there are "unevictable" stats involved ....
>
> One thought I had, to avoid the IPIs that draining all per-CPU caches requires,
> was to disallow putting guest_memfd pages in folio batches, e.g. by hacking
> something into folio_may_be_lru_cached(). But due to taking a per-lru lock,
> that would penalize the relatively hot path and definitely common operation of
> faulting in guest memory. On the other hand, memory conversion is already a
> relatively slow operation and is relatively uncommon compared to page faults,
> (and likely very uncommon for real world setups). I.e. having to drain all
> caches if conversion isn't safe penalizes a relatively slow, relatively uncommon
> path.
Yeah, the lru_add_drain_all is rather messy.
We have similar code in
collect_longterm_unpinnable_folios(), where we first try a lru_add_drain(), to
then escalate to a lru_add_drain_all().
Maybe we could factor that (suboptimal code) out to not have to reinvent the
same thing multiple times?
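A factored-out helper along those lines might follow the shape below. This is a rough userspace sketch under assumptions: struct toy_page and the toy_*() functions are hypothetical, with toy_drain_local() and toy_drain_all() standing in for lru_add_drain() and lru_add_drain_all().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page state: extra references held by per-CPU batches or by a pin. */
struct toy_page {
	int local_batch_refs;	/* held by the current CPU's batch */
	int remote_batch_refs;	/* held by other CPUs' batches */
	int pinned_refs;	/* e.g. a long-term pin; no drain removes it */
};

enum drain_level { DRAIN_NONE, DRAIN_LOCAL, DRAIN_ALL, DRAIN_FAILED };

static bool toy_is_safe(const struct toy_page *p)
{
	return p->local_batch_refs + p->remote_batch_refs + p->pinned_refs == 0;
}

/* Stand-in for lru_add_drain(): cheap, current CPU only. */
static void toy_drain_local(struct toy_page *p)
{
	p->local_batch_refs = 0;
}

/* Stand-in for lru_add_drain_all(): expensive, reaches every CPU. */
static void toy_drain_all(struct toy_page *p)
{
	p->local_batch_refs = 0;
	p->remote_batch_refs = 0;
}

/* Try the cheapest option first and escalate only while the check fails. */
static enum drain_level toy_check_with_escalation(struct toy_page *p)
{
	assert(p);

	if (toy_is_safe(p))
		return DRAIN_NONE;

	toy_drain_local(p);
	if (toy_is_safe(p))
		return DRAIN_LOCAL;

	toy_drain_all(p);
	if (toy_is_safe(p))
		return DRAIN_ALL;

	return DRAIN_FAILED;	/* references remain that draining cannot drop */
}
```

The return value records how far the escalation had to go, so a caller can tell a batch-only reference apart from one that survives a full drain.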
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-25 13:53 UTC (permalink / raw)
To: Lorenzo Pieralisi
Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
Will Deacon, James Morse, Oliver Upton, Suzuki K Poulose,
Zenghui Yu, linux-arm-kernel, linux-kernel, Joey Gouly,
Alexandru Elisei, Christoffer Dall, Fuad Tabba, linux-coco,
Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
Lorenzo.Pieralisi2
In-Reply-To: <aiLes2ecZSr17UwZ@lpieralisi>
On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
>>>
>>> [...]
>>>
>>>>> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
>>>>> + kvm_pfn_t pfn, unsigned long map_size,
>>>>> + enum kvm_pgtable_prot prot,
>>>>> + struct kvm_mmu_memory_cache *memcache)
>>>>> +{
>>>>> + struct realm *realm = &kvm->arch.realm;
>>>>> +
>>>>> + /*
>>>>> + * Write permission is required for now even though it's possible to
>>>>> + * map unprotected pages (granules) as read-only. It's impossible to
>>>>> + * map protected pages (granules) as read-only.
>>>>> + */
>>>>> + if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
>>>>> + return -EFAULT;
>>>>> +
>>>>
>>>> I'm a bit concerned with this. We don't have KVM_PGTABLE_PROT_W set in @prot
>>>> if the stage2 fault is raised due to memory read. With -EFAULT returned to VMM
>>>> (e.g. QEMU), the vCPU continuous execution is stopped and system won't be
>>>> working any more.
>>>>
>>>>> + ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
>>>>> + if (!kvm_realm_is_private_address(realm, ipa))
>>>>> + return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
>>>>> + memcache);
>>>>> +
>>>>> + return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
>>>>> +}
>>>>> +
>>>>> static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
>>>>> {
>>>>> switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
>>>>> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>>>>> bool write_fault, exec_fault;
>>>>> enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>>>>> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>>>>> - struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
>>>>> + struct kvm_vcpu *vcpu = s2fd->vcpu;
>>>>> + struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
>>>>> + gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>>>>> unsigned long mmu_seq;
>>>>> struct page *page;
>>>>> - struct kvm *kvm = s2fd->vcpu->kvm;
>>>>> + struct kvm *kvm = vcpu->kvm;
>>>>> void *memcache;
>>>>> kvm_pfn_t pfn;
>>>>> gfn_t gfn;
>>>>> int ret;
>>>>> - memcache = get_mmu_memcache(s2fd->vcpu);
>>>>> - ret = topup_mmu_memcache(s2fd->vcpu, memcache);
>>>>> + if (kvm_is_realm(vcpu->kvm)) {
>>>>> + /* check for memory attribute mismatch */
>>>>> + bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
>>>>> + /*
>>>>> + * For Realms, the shared address is an alias of the private
>>>>> + * PA with the top bit set. Thus if the fault address matches
>>>>> + * the GPA then it is the private alias.
>>>>> + */
>>>>> + bool is_priv_fault = (gpa == s2fd->fault_ipa);
>>>>> +
>>>>> + if (is_priv_gfn != is_priv_fault) {
>>>>> + kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>>>> + kvm_is_write_fault(vcpu),
>>>>> + false,
>>>>> + is_priv_fault);
>>>>> + /*
>>>>> + * KVM_EXIT_MEMORY_FAULT requires an return code of
>>>>> + * -EFAULT, see the API documentation
>>>>> + */
>>>>> + return -EFAULT;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + memcache = get_mmu_memcache(vcpu);
>>>>> + ret = topup_mmu_memcache(vcpu, memcache);
>>>>> if (ret)
>>>>> return ret;
>>>>> if (s2fd->nested)
>>>>> gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
>>>>> else
>>>>> - gfn = s2fd->fault_ipa >> PAGE_SHIFT;
>>>>> + gfn = gpa >> PAGE_SHIFT;
>>>>> - write_fault = kvm_is_write_fault(s2fd->vcpu);
>>>>> - exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
>>>>> + write_fault = kvm_is_write_fault(vcpu);
>>>>> + exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>>>>> VM_WARN_ON_ONCE(write_fault && exec_fault);
>>>>> @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>>>>> ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
>>>>> if (ret) {
>>>>> - kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE,
>>>>> + kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>>>> write_fault, exec_fault, false);
>>>>> return ret;
>>>>> }
>>>>> @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>>>>> kvm_fault_lock(kvm);
>>>>> if (mmu_invalidate_retry(kvm, mmu_seq)) {
>>>>> ret = -EAGAIN;
>>>>> - goto out_unlock;
>>>>> + goto out_release_page;
>>>>> + }
>>>>> +
>>>>> + if (kvm_is_realm(kvm)) {
>>>>> + ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
>>>>> + PAGE_SIZE, KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W, memcache);
>>>>> + goto out_release_page;
>>>>> }
>>>>> ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
>>>>> __pfn_to_phys(pfn), prot,
>>>>> memcache, flags);
>>>>> -out_unlock:
>>>>> +out_release_page:
>>>>> kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
>>>>> kvm_fault_unlock(kvm);
>>>>> @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd,
>>>>> * mapping size to ensure we find the right PFN and lay down the
>>>>> * mapping in the right place.
>>>>> */
>>>>> - s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT;
>>>>> + s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize)) >> PAGE_SHIFT;
>>>>> s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
>>>>> @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
>>>>> prot &= ~KVM_NV_GUEST_MAP_SZ;
>>>>> ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
>>>>> prot, flags);
>>>>> + } else if (kvm_is_realm(kvm)) {
>>>>> + ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
>>>>> + prot, memcache);
>>>>> } else {
>>>>> ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
>>>>> __pfn_to_phys(pfn), prot,
>>>>
>>>> For the case kvm_is_realm(), need we adjust 's2fd->fault_ipa' for the sake of
>>>> huge pages. In kvm_s2_fault_map(), @gfn and @pfn may have been adjusted by
>>>> transparent_hugepage_adjust() to be aligned with huge page size. If the
>>>> adjustment happened in transparent_hugepage_adjust(), we need to align
>>>> s2fd->fault_ipa down to the huge page size either.
>>>
>>> All of the above + some RMM changes are needed to get QEmu VMM going
>>> with anon pages guest memory backing - currently testing various
>>> configurations in the background.
>>>
>>
>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and found
>> that memory slots backed by THP are broken. With THP disabled on the host and
>> other fixes (mentioned in my previous replies) applied on the top of this (v14)
>> series, I'm able to boot a realm guest with rebased QEMU series [2], plus more
>> fixes on the top.
>>
>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git (branch: cca/latest)
>> [2] https://git.qemu.org/git/qemu.git (branch: cca/gavin)
>>
>> Lorenzo, You may be saying there is someone making QEMU to support ARM/CCA?
>
> Mathieu and I are working on that yes and with Steven/Suzuki to fix the THP
> issues you pointed out above.
>
>> If so, I'm not sure if there is a QEMU repository for me to try?
>
> We should be able to submit patches by end of June - we shall let you know
> whether we can make something available earlier.
>
Not sure if there are other known issues in this series. It seems the stage2
page fault handling on the shared space isn't working well. In my test, the
vring (struct vring_desc) of virtio-net-pci is updated by the guest, but the
data isn't seen by QEMU. I suspect the host page frame number isn't properly
resolved in the s2 page fault handler for shared (unprotected) space.
- I rebased Jean's latest qemu branch to the upstream qemu;
- On the host, which is emulated by qemu/tcg, the THP (transparent huge page) is
disabled.
- On the guest, I can see the virtio vring (struct vring_desc) is updated. The
S1 page-table entry looks correct because the corresponding physical address
0x10046880000 is a sane shared (unprotected) space address.
[ 52.094143] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
[ 52.289746] virtqueue_add_desc_split: desc[0]@0xffff000006880000, [00000100b983f000 00000640 0002 0001]
[ 52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
- On the host, the s2 page-table-entry is unmapped due to attribute transition (private -> shared).
A subsequent S2 page fault is raised against the address and the s2 page-table-entry is built.
[ 109.259077] ====> realm_unmap_shared_range: tracked_unprot_addr=0x10046880000
[ 109.260249] realm_unmap_shared_range: unmapped shared range at 0x10046880000
[ 109.317786] realm_unmap_shared_range: unmapped shared range at 0x10046880000
[ 109.629939] ====> kvm_handle_guest_abort: fault_ipa=0x10046880000, esr=0x92000007
[ 109.630245] realm_map_non_secure: ipa=0x10046880000, pfn=0xb8b59, size=0x1000, prot=0xf
[ 109.630331] realm_map_non_secure: ipa=0x10046880000, ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004
- On QEMU, the updated vring (struct vring_desc) at GPA 0x46880000 isn't seen. All the
data at that address is zero.
====> virtqueue_split_pop: vdev=<virtio-net>, sz=0x38, queue_index=0x0, vq->vring.num=0x100
virtqueue_split_pop: last_avail_idx=0x0, head=0x0
address_space_read_cached_slow: cache@0xffff1c036440, addr=0x0, buf=0xffffeee34880, len=0x10
address_space_read_cached_slow: cache: ptr=0x0, xlat=0x10046880000, len=0x1000, mrs=<realm-dma-region>, is_write=no
address_space_read_cached_slow: translated to mr=<mach-virt.ram>, mr_addr=0x6880000, l=0x10
flatview_read_continue_step: mr=<mach-virt.ram>, host=0xffff23e00000, mr_addr=0x6880000, ram_ptr=0xffff2a680000
virtqueue_split_pop: desc: 0000000000000000 - 00000000 - 00000000 - 00000000
qemu-system-aarch64: virtio: zero sized buffers are not allowed
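For reference, the relation between the faulting IPA 0x10046880000 and the GPA 0x46880000 in the logs above can be checked with a small sketch. It assumes the shared alias is the top IPA bit, as the patch comment describes, and a 41-bit IPA space. The width is only inferred from the addresses in these logs, and the toy_*() helpers are hypothetical, not the series' kvm_gpa_from_fault().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask for the top bit of an IPA space that is ipa_bits wide. */
static uint64_t toy_shared_alias_bit(unsigned int ipa_bits)
{
	assert(ipa_bits >= 1 && ipa_bits <= 64);
	return 1ULL << (ipa_bits - 1);
}

/* An IPA with the top bit set is the shared (unprotected) alias. */
static bool toy_ipa_is_shared(uint64_t ipa, unsigned int ipa_bits)
{
	return (ipa & toy_shared_alias_bit(ipa_bits)) != 0;
}

/* Strip the alias bit to get the GPA that the memslots are keyed on. */
static uint64_t toy_gpa_from_ipa(uint64_t ipa, unsigned int ipa_bits)
{
	return ipa & ~toy_shared_alias_bit(ipa_bits);
}
```

Under these assumptions, toy_gpa_from_ipa(0x10046880000, 41) yields 0x46880000, which matches the GPA QEMU reads the descriptor from.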
Thanks,
Gavin
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Sean Christopherson @ 2026-06-25 14:36 UTC (permalink / raw)
To: Yan Zhao
Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <aj0Jf30PS2f7x1nt@yzhao56-desk.sh.intel.com>
On Thu, Jun 25, 2026, Yan Zhao wrote:
> On Thu, Jun 25, 2026 at 09:51:01AM +0800, Yan Zhao wrote:
> > On Wed, Jun 24, 2026 at 05:41:58PM -0700, Sean Christopherson wrote:
> > > On Wed, Jun 24, 2026, Ackerley Tng wrote:
> > > > Yan Zhao <yan.y.zhao@intel.com> writes:
> > > > > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > > > > MMAP flag. In such cases, shared memory is allocated from different backends.
> > > > > This means this module parameter only enables per-gmem memory attribute and does
> > > > > not guarantee that gmem in-place conversion will actually occur.
> > >
> > > KVM module params are pretty much always about what KVM supports, not what is
> > > guaranteed to happen.
> > >
> > > - enable_mmio_caching doesn't guarantee there will actually be MMIO SPTEs,
> > > because maybe the guest never accesses emulated MMIO.
> > > - enable_pmu doesn't guarantee VMs will get a PMU, because userspace may elect
> > > not to advertise one.
> > > - and so on and so forth...
> > >
> > > Yes, there's a small mental jump to get from "KVM supports in-place conversion"
> > > to "I need to set memory attributes on the guest_memfd instance, not the VM",
> > > but I don't see that as a big hurdle, certainly not in the long term. And once
> > > the VMM code is written, I really do think most people are going to care about
> > > whether or not KVM supports in-place conversion, not where PRIVATE is tracked.
> > Sorry, I just saw this mail after posting my reply in [1].
> >
> > I'm ok with gmem_in_place_conversion=true just means KVM supports in-place
> > conversion, while we can still create VMs with shared memory not from gmem.
> Or what about "allow_gmem_in_place_conversion" ?
No, because turning on the param also disallows setting PRIVATE in the VM-scoped
KVM_SET_MEMORY_ATTRIBUTES ioctl.
> > Though it still feels a bit odd to require TDX huge pages to depend on
> > gmem_in_place_conversion=true when shared memory is not currently allocated
> > from gmem,
I fully expect that to be a transient state, and in all likelihood not something
that is *ever* shipped in production. Landing TDX hugepages without guest_memfd
hugepage support is all about avoiding unnecessary serialization of series and
features that aren't strictly dependent on each other.
> > it should become more natural over time once gmem supports in-place
> > conversions for huge page.
Yes, and I want to prioritize the steady state for end users, not the in-progress
state for developers. Once all of this settles out, I fully expect the majority
of deployments to only support in-place conversion, at which point the end user
is only going to care whether or not in-place conversion is enabled in KVM, not
the subtle detail that it's still possible to do out-of-place conversions (and
that will always hold true, it's not like VMA-based memslots are being deprecated).
> > Besides my current usage, there may be other scenarios where gmem memory
> > attributes is preferred without allocating shared memory from gmem.
> > (e.g., PAGE.ADD from a temp extra shared source memory).
> >
> > For such use cases, I'm concerned that the admins may find it confusing if they
> > enable gmem_in_place_conversion but still observe extra memory consumption for
> > shared memory.
KVM can help with documentation, but beyond that, it's not KVM's problem to solve.
If a VMM *and* platform owner chooses to deploy a setup that utilizes out-of-place
conversions, then it's on the VMM and/or platform owner to understand and communicate
the implications to the end user.
And I'm not remotely convinced that prepending allow_ to the param will help
end users diagnose "unexpected" memory consumption, in quotes because anyone that
is deploying a stack that utilizes out-of-place conversion absolutely needs to
understand and plan for the additional memory consumption. I.e. if the memory
consumption is "unexpected" to the end user, they likely have far bigger problems.
^ permalink raw reply
* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: Borislav Petkov @ 2026-06-25 15:02 UTC (permalink / raw)
To: Ashish Kalra
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <ba146ca15b7f76eee386c8c073fb3f1cc36e5781.1782336473.git.ashish.kalra@amd.com>
On Wed, Jun 24, 2026 at 09:56:49PM +0000, Ashish Kalra wrote:
> +/* Set while SNP has CPU hotplug disabled (kernel-lifetime; survives ccp reload). */
> +static bool snp_cpu_hotplug_disabled;
Do you really need this?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply
* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-25 15:40 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <6ed7d12a-c3a1-4572-8385-754e6d5b8b44@kernel.org>
On Thu, Jun 25, 2026, David Hildenbrand (Arm) wrote:
> On 6/25/26 02:35, Sean Christopherson wrote:
> > One thought I had, to avoid the IPIs that draining all per-CPU caches requires,
> > was to disallow putting guest_memfd pages in folio batches, e.g. by hacking
> > something into folio_may_be_lru_cached(). But due to taking a per-lru lock,
> > that would penalize the relatively hot path and definitely common operation of
> > faulting in guest memory. On the other hand, memory conversion is already a
> > relatively slow operation and is relatively uncommon compared to page faults,
> > (and likely very uncommon for real world setups). I.e. having to drain all
> > caches if conversion isn't safe penalizes a relatively slow, relatively uncommon
> > path.
>
> Yeah, the lru_add_drain_all is rather messy.
>
> We have similar code in
>
> collect_longterm_unpinnable_folios(), where we first try a lru_add_drain(), to
> then escalate to a lru_add_drain_all().
>
> Maybe we could factor that (suboptimal code) out to not have to reinvent the
> same thing multiple times?
As discussed in the guest_memfd call, we should do this straightaway, i.e. instead
of merging this series as-is, so that we don't export lru_add_drain_all() only to
drop the export a kernel or two later, and can instead export the helper to drain
any batches for a folio (or set of folios/pages).
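The "try a local drain first, then escalate to draining every CPU" pattern discussed above can be sketched in userspace as follows. This is only a toy model under stated assumptions, not kernel code: `pending[]`, `drain_local()`, `drain_all()`, `folio_is_batched()` and `drain_batches_for_folio()` are hypothetical stand-ins for the per-CPU folio batches, lru_add_drain(), lru_add_drain_all(), and whatever the factored-out helper ends up looking like.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Toy per-CPU "folio batch": pending[cpu] is true if that CPU still
 * holds a batched reference to the folio we care about. */
bool pending[NR_CPUS];
int all_cpu_drains;	/* how often the expensive path was taken */

/* Hypothetical stand-in for lru_add_drain(): flush the local batch only. */
void drain_local(int cpu)
{
	pending[cpu] = false;
}

/* Hypothetical stand-in for lru_add_drain_all(): flush every CPU's batch. */
void drain_all(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		pending[cpu] = false;
	all_cpu_drains++;
}

bool folio_is_batched(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (pending[cpu])
			return true;
	return false;
}

/* Cheap local drain first, escalate only if the folio is still batched.
 * Returns true if the all-CPU drain was required. */
bool drain_batches_for_folio(int this_cpu)
{
	if (!folio_is_batched())
		return false;

	drain_local(this_cpu);
	if (!folio_is_batched())
		return false;

	drain_all();
	return true;
}
```

In this model the expensive path only runs when a remote CPU holds the batched reference, which is the property a shared helper would presumably want to preserve.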
^ permalink raw reply
* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Suzuki K Poulose @ 2026-06-25 15:58 UTC (permalink / raw)
To: Gavin Shan, Lorenzo Pieralisi
Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <1e39094f-7fa3-4ef1-be54-53d7a8643506@redhat.com>
On 25/06/2026 14:53, Gavin Shan wrote:
> On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
>> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
>>>>
>>>> [...]
>>>>
>>>>>> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
>>>>>> + kvm_pfn_t pfn, unsigned long map_size,
>>>>>> + enum kvm_pgtable_prot prot,
>>>>>> + struct kvm_mmu_memory_cache *memcache)
>>>>>> +{
>>>>>> + struct realm *realm = &kvm->arch.realm;
>>>>>> +
>>>>>> + /*
>>>>>> + * Write permission is required for now even though it's
>>>>>> possible to
>>>>>> + * map unprotected pages (granules) as read-only. It's
>>>>>> impossible to
>>>>>> + * map protected pages (granules) as read-only.
>>>>>> + */
>>>>>> + if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
>>>>>> + return -EFAULT;
>>>>>> +
>>>>>
>>>>> I'm a bit concerned with this. We don't have KVM_PGTABLE_PROT_W set
>>>>> in @prot
>>>>> if the stage2 fault is raised due to memory read. With -EFAULT
>>>>> returned to VMM
>>>>> (e.g. QEMU), the vCPU continuous execution is stopped and system
>>>>> won't be
>>>>> working any more.
>>>>>
>>>>>> + ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
>>>>>> + if (!kvm_realm_is_private_address(realm, ipa))
>>>>>> + return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
>>>>>> + memcache);
>>>>>> +
>>>>>> + return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
>>>>>> +}
>>>>>> +
>>>>>> static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
>>>>>> {
>>>>>> switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma-
>>>>>> >vm_page_prot))) {
>>>>>> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct
>>>>>> kvm_s2_fault_desc *s2fd)
>>>>>> bool write_fault, exec_fault;
>>>>>> enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>>>>>> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>>>>>> - struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
>>>>>> + struct kvm_vcpu *vcpu = s2fd->vcpu;
>>>>>> + struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
>>>>>> + gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>>>>>> unsigned long mmu_seq;
>>>>>> struct page *page;
>>>>>> - struct kvm *kvm = s2fd->vcpu->kvm;
>>>>>> + struct kvm *kvm = vcpu->kvm;
>>>>>> void *memcache;
>>>>>> kvm_pfn_t pfn;
>>>>>> gfn_t gfn;
>>>>>> int ret;
>>>>>> - memcache = get_mmu_memcache(s2fd->vcpu);
>>>>>> - ret = topup_mmu_memcache(s2fd->vcpu, memcache);
>>>>>> + if (kvm_is_realm(vcpu->kvm)) {
>>>>>> + /* check for memory attribute mismatch */
>>>>>> + bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >>
>>>>>> PAGE_SHIFT);
>>>>>> + /*
>>>>>> + * For Realms, the shared address is an alias of the private
>>>>>> + * PA with the top bit set. Thus if the fault address
>>>>>> matches
>>>>>> + * the GPA then it is the private alias.
>>>>>> + */
>>>>>> + bool is_priv_fault = (gpa == s2fd->fault_ipa);
>>>>>> +
>>>>>> + if (is_priv_gfn != is_priv_fault) {
>>>>>> + kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>>>>> + kvm_is_write_fault(vcpu),
>>>>>> + false,
>>>>>> + is_priv_fault);
>>>>>> + /*
>>>>>> + * KVM_EXIT_MEMORY_FAULT requires an return code of
>>>>>> + * -EFAULT, see the API documentation
>>>>>> + */
>>>>>> + return -EFAULT;
>>>>>> + }
>>>>>> + }
>>>>>> +
>>>>>> + memcache = get_mmu_memcache(vcpu);
>>>>>> + ret = topup_mmu_memcache(vcpu, memcache);
>>>>>> if (ret)
>>>>>> return ret;
>>>>>> if (s2fd->nested)
>>>>>> gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
>>>>>> else
>>>>>> - gfn = s2fd->fault_ipa >> PAGE_SHIFT;
>>>>>> + gfn = gpa >> PAGE_SHIFT;
>>>>>> - write_fault = kvm_is_write_fault(s2fd->vcpu);
>>>>>> - exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
>>>>>> + write_fault = kvm_is_write_fault(vcpu);
>>>>>> + exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>>>>>> VM_WARN_ON_ONCE(write_fault && exec_fault);
>>>>>> @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct
>>>>>> kvm_s2_fault_desc *s2fd)
>>>>>> ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn,
>>>>>> &page, NULL);
>>>>>> if (ret) {
>>>>>> - kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd-
>>>>>> >fault_ipa, PAGE_SIZE,
>>>>>> + kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>>>>> write_fault, exec_fault, false);
>>>>>> return ret;
>>>>>> }
>>>>>> @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct
>>>>>> kvm_s2_fault_desc *s2fd)
>>>>>> kvm_fault_lock(kvm);
>>>>>> if (mmu_invalidate_retry(kvm, mmu_seq)) {
>>>>>> ret = -EAGAIN;
>>>>>> - goto out_unlock;
>>>>>> + goto out_release_page;
>>>>>> + }
>>>>>> +
>>>>>> + if (kvm_is_realm(kvm)) {
>>>>>> + ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
>>>>>> + PAGE_SIZE, KVM_PGTABLE_PROT_R |
>>>>>> KVM_PGTABLE_PROT_W, memcache);
>>>>>> + goto out_release_page;
>>>>>> }
>>>>>> ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd-
>>>>>> >fault_ipa, PAGE_SIZE,
>>>>>> __pfn_to_phys(pfn), prot,
>>>>>> memcache, flags);
>>>>>> -out_unlock:
>>>>>> +out_release_page:
>>>>>> kvm_release_faultin_page(kvm, page, !!ret, prot &
>>>>>> KVM_PGTABLE_PROT_W);
>>>>>> kvm_fault_unlock(kvm);
>>>>>> @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const
>>>>>> struct kvm_s2_fault_desc *s2fd,
>>>>>> * mapping size to ensure we find the right PFN and lay
>>>>>> down the
>>>>>> * mapping in the right place.
>>>>>> */
>>>>>> - s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize)
>>>>>> >> PAGE_SHIFT;
>>>>>> + s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd-
>>>>>> >fault_ipa, s2vi->vma_pagesize)) >> PAGE_SHIFT;
>>>>>> s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
>>>>>> @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct
>>>>>> kvm_s2_fault_desc *s2fd,
>>>>>> prot &= ~KVM_NV_GUEST_MAP_SZ;
>>>>>> ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt,
>>>>>> gfn_to_gpa(gfn),
>>>>>> prot, flags);
>>>>>> + } else if (kvm_is_realm(kvm)) {
>>>>>> + ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
>>>>>> + prot, memcache);
>>>>>> } else {
>>>>>> ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt,
>>>>>> gfn_to_gpa(gfn), mapping_size,
>>>>>> __pfn_to_phys(pfn), prot,
>>>>>
>>>>> For the case kvm_is_realm(), need we adjust 's2fd->fault_ipa' for
>>>>> the sake of
>>>>> huge pages. In kvm_s2_fault_map(), @gfn and @pfn may have been
>>>>> adjusted by
>>>>> transparent_hugepage_adjust() to be aligned with huge page size. If
>>>>> the
>>>>> adjustment happened in transparent_hugepage_adjust(), we need to align
>>>>> s2fd->fault_ipa down to the huge page size either.
>>>>
>>>> All of the above + some RMM changes are needed to get QEmu VMM going
>>>> with anon pages guest memory backing - currently testing various
>>>> configurations in the background.
>>>>
>>>
>>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and
>>> found
>>> that memory slots backed by THP are broken. With THP disabled on the
>>> host and
>>> other fixes (mentioned in my previous replies) applied on the top of
>>> this (v14)
>>> series, I'm able to boot a realm guest with rebased QEMU series [2],
>>> plus more
>>> fixes on the top.
>>>
>>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git (branch: cca/
>>> latest)
>>> [2] https://git.qemu.org/git/qemu.git (branch: cca/gavin)
>>>
>>> Lorenzo, You may be saying there is someone making QEMU to support
>>> ARM/CCA?
>>
>> Mathieu and I are working on that yes and with Steven/Suzuki to fix
>> the THP
>> issues you pointed out above.
>>
>>> If so, I'm not sure if there is a QEMU repository for me to try?
>>
>> We should be able to submit patches by end of June - we shall let you
>> know
>> whether we can make something available earlier.
>>
>
> Not sure if there are other known issues in this series. It seems the
> stage2
> page fault handling on the shared space isn't working well. In my test, the
> vring (struct vring_desc) of virtio-net-pci is updated by the guest, and
> the
> data isn't seen by QEMU, I'm suspecting if the host-page-frame-number is
> properly
> resolved in the s2 page fault handler for shared (unprotected) space.
>
> - I rebased Jean's latest qemu branch to the upstream qemu;
>
> - On the host, which is emulated by qemu/tcg, the THP (transparent huge
> page) is
> disabled.
>
> - On the guest, I can see the virtio vring (struct vring_desc) is
> updated. The
> S1 page-table entry looks correct because the corresponding physical
> address
> 0x10046880000 is a sane shared (unprotected) space address.
>
> [ 52.094143] software IO TLB: Memory encryption is active and
> system is using DMA bounce buffers
> [ 52.289746] virtqueue_add_desc_split: desc[0]@0xffff000006880000,
> [00000100b983f000 00000640 0002 0001]
> [ 52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
>
> - On the host, the s2 page-table-entry is unmapped due to attribute
> transition (private -> shared).
> A subsequent S2 page fault is raised against the address and the s2
> page-table-entry is built.
>
> [ 109.259077] ====> realm_unmap_shared_range:
> tracked_unprot_addr=0x10046880000
> [ 109.260249] realm_unmap_shared_range: unmapped shared range at
> 0x10046880000
> [ 109.317786] realm_unmap_shared_range: unmapped shared range at
> 0x10046880000
> [ 109.629939] ====> kvm_handle_guest_abort: fault_ipa=0x10046880000,
> esr=0x92000007
> [ 109.630245] realm_map_non_secure: ipa=0x10046880000, pfn=0xb8b59,
> size=0x1000, prot=0xf
> [ 109.630331] realm_map_non_secure: ipa=0x10046880000,
> ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004
Are you able to correlate the order of the transitions and the guest
access with the RMM log? We haven't seen this on our end. We are aware
of permission fault issues with Unprotected IPA when backing the memslot
with MAP_PRIVATE areas, but this looks different.
Lorenzo, have you run into this?
Suzuki
>
> - On QEMU, the updated vring (struct vring_desc) at GPA 0x46880000 isn't
> seen. All the
> data in that address are zeros.
>
> ====> virtqueue_split_pop: vdev=<virtio-net>, sz=0x38,
> queue_index=0x0, vq->vring.num=0x100
> virtqueue_split_pop: last_avail_idx=0x0, head=0x0
> address_space_read_cached_slow: cache@0xffff1c036440, addr=0x0,
> buf=0xffffeee34880, len=0x10
> address_space_read_cached_slow: cache: ptr=0x0, xlat=0x10046880000,
> len=0x1000, mrs=<realm-dma-region>, is_write=no
> address_space_read_cached_slow: translated to mr=<mach-virt.ram>,
> mr_addr=0x6880000, l=0x10
> flatview_read_continue_step: mr=<mach-virt.ram>, host=0xffff23e00000,
> mr_addr=0x6880000, ram_ptr=0xffff2a680000
> virtqueue_split_pop: desc: 0000000000000000 - 00000000 - 00000000 -
> 00000000
> qemu-system-aarch64: virtio: zero sized buffers are not allowed
>
>
> Thanks,
> Gavin
>
^ permalink raw reply
* Re: [PATCH v14 26/44] arm64: RMI: Allow populating initial contents
From: Suzuki K Poulose @ 2026-06-25 16:19 UTC (permalink / raw)
To: Steven Price, Gavin Shan, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
linux-coco, Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
Lorenzo.Pieralisi2
In-Reply-To: <9631be66-c757-488d-bb66-a62698aa26b8@arm.com>
On 08/06/2026 14:53, Steven Price wrote:
> On 08/06/2026 10:41, Suzuki K Poulose wrote:
>> On 08/06/2026 10:36, Steven Price wrote:
>>> On 28/05/2026 06:30, Gavin Shan wrote:
>>>> Hi Steve,
>>>>
>>>> On 5/13/26 11:17 PM, Steven Price wrote:
>>>>> The VMM needs to populate the realm with some data before starting
>>>>> (e.g.
>>>>> a kernel and initrd). This is measured by the RMM and used as part of
>>>>> the attestation later on.
>>>>>
>>>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>
>> ...
>>
>>>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>>>> index a89873a5eb77..209087bcf399 100644
>>>>> --- a/arch/arm64/kvm/rmi.c
>>>>> +++ b/arch/arm64/kvm/rmi.c
>>>>> @@ -486,6 +486,75 @@ void kvm_realm_unmap_range(struct kvm *kvm,
>>>>> unsigned long start,
>>>>> realm_unmap_private_range(kvm, start, end, may_block);
>>>>> }
>>>>> +static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
>>>>> + kvm_pfn_t dst_pfn, kvm_pfn_t src_pfn,
>>>>> + unsigned long flags)
>>>>> +{
>>>>> + struct realm *realm = &kvm->arch.realm;
>>>>> + phys_addr_t rd = virt_to_phys(realm->rd);
>>>>> + phys_addr_t dst_phys, src_phys;
>>>>> + int ret;
>>>>> +
>>>>> + dst_phys = __pfn_to_phys(dst_pfn);
>>>>> + src_phys = __pfn_to_phys(src_pfn);
>>>>> +
>>>>> + if (rmi_delegate_page(dst_phys))
>>>>> + return -ENXIO;
>>>>> +
>>>>> + ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys, flags);
>>>>> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>>>>> + /* Create missing RTTs and retry */
>>>>> + int level = RMI_RETURN_INDEX(ret);
>>>>> +
>>>>> + KVM_BUG_ON(level == KVM_PGTABLE_LAST_LEVEL, kvm);
>>>>
>>>> KVM_BUG_ON(level >= KVM_PGTABLE_LAST_LEVEL, kvm);
>>>
>>> Ack.
>>>
>>
>> Thinking more about this, I guess a buggy VMM can trigger this
>> by populating twice ? (level == KVM_PGTABLE_LAST_LEVEL). So, we should
>> return the error back, than warning here and suppressing the error ?
>
> Populating twice causes rmi_delegate_page() to be run twice on the same
> page and the second one will then fail. So I don't think this is
> possible (please correct me if I've missed something!)
Good point, but I think this may stop failing in the future, in order to
allow hugepages: DELEGATE_RANGE would skip the granules that are already in
the DELEGATED/DATA state. I am getting this clarified in the spec.
Suzuki
>
> Thanks,
> Steve
^ permalink raw reply
* Re: [PATCH v2 16/17] KVM: TDX: Add in-kernel Quote generation
From: Sean Christopherson @ 2026-06-25 18:01 UTC (permalink / raw)
To: Xu Yilun
Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
tony.lindgren, peter.fang, baolu.lu, zhenzhong.duan, dave.hansen,
dave.hansen
In-Reply-To: <20260618081355.3253581-17-yilun.xu@linux.intel.com>
On Thu, Jun 18, 2026, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
>
> Provide an in-kernel path for Quote generation when handling
> TDG.VP.VMCALL<GetQuote>, without requiring an exit to userspace.
Why?
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Ackerley Tng @ 2026-06-25 18:20 UTC (permalink / raw)
To: Yan Zhao
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ajyCn0PnFtQK+Nka@yzhao56-desk.sh.intel.com>
Yan Zhao <yan.y.zhao@intel.com> writes:
> On Wed, Jun 24, 2026 at 05:05:44PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>>
>> >
>> > [...snip...]
>> >
>> >>
>> >> #ifdef kvm_arch_has_private_mem
>> >> -bool __ro_after_init gmem_in_place_conversion = false;
>> >> +bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
>> >> +module_param(gmem_in_place_conversion, bool, 0444);
>> >
>> > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
>> > MMAP flag. In such cases, shared memory is allocated from different backends.
>> > This means this module parameter only enables per-gmem memory attribute and does
>> > not guarantee that gmem in-place conversion will actually occur.
>> >
>> > To avoid confusion, could we rename this module parameter to something more
>> > accurate, such as gmem_memory_attribute?
>> >
>>
>> I asked Sean about this after getting some fixes off list. Sean said
>> gmem_in_place_conversion is named for a host admin to use, and something
>> like gmem_memory_attributes is too much implementation details for the
>> admin.
> Thanks for this background.
>
> Some more context on why I'm asking:
>
> Currently, I'm testing TDX huge pages with the following two gmem components:
> 1. The gmem memory attribute in this gmem in-place conversion v8.
> 2. The gmem 2MB from buddy allocator. (for development/testing only).
>
> The gmem 2MB from buddy allocator allocates 2MB folios from buddy for private
> memory, while shared memory is allocated from a different backend.
> (To avoid fragmentation, only private mappings are split during private-to-shared
> conversions. In this approach, the 2MB folios are always retained in the gmem
> inode filemap cache without splitting.)
>
> Since shared memory is not allocated from gmem, there're no in-place conversions.
> The reason I'm using "gmem memory attribute" is that the per-VM attribute is
> being deprecated, as suggested by Sean [1].
>
v8 of the conversions series changed that slightly: per-VM attributes are
going to stay around (because of the upcoming work on RWX attributes), and
RWX will stay tracked at the VM level.
For v8 and beyond, only tracking of private/shared in per-VM attributes
is being deprecated.
By extension the entire thing about using guest_memfd for private memory
and a different backing memory for shared memory is being deprecated.
> Besides my current usage,
I think you can set up guest_memfd+2M for private memory and shared
memory from some other source, and that's the deprecated usage pattern.
> there may be other scenarios where gmem memory
> attributes is preferred without allocating shared memory from gmem.
> (e.g., PAGE.ADD from a temp extra shared source memory).
>
Is this TDH.MEM.PAGE.ADD, used indirectly from
tdx_gmem_post_populate()? This use case isn't blocked. Even if
gmem_in_place_conversion=true, you can still set src_address to
non-guest_memfd memory and load from anywhere you like.
Please let me know if that is broken! I think I accidentally used that
setup in selftests and it worked. The selftests are now defaulting to
in-place conversion.
> For such use cases, I'm concerned that the admins may find it confusing if they
> enable gmem_in_place_conversion but still observe extra memory consumptions for
> shared memory.
>
Hmm but I guess if someone enables gmem_in_place_conversion but still
allocates from elsewhere, they'd have to figure it out?
> [1] https://lore.kernel.org/kvm/aWmEegVP_A613WIr@google.com/
>
>> Sean, would you reconsider since Yan also asked? If the admin compiled
>> the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
>> admin would also be able to use a param like gmem_memory_attributes?
>>
>> There's the additional benefit that the similar naming aids in
>> understanding for both the admin and software engineers.
>>
>> Either way, in the next revision, I'll also add this documentation for
>> this module_param:
>>
>> Setting the module parameter gmem_in_place_conversion to true will
>> enable the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl and disables
>> the KVM_SET_MEMORY_ATTRIBUTES VM ioctl. If gmem_in_place_conversion is
>> true, the private/shared attribute will be tracked per-guest_memfd
>> instead of per-VM.
>>
>> Let me know what y'all think of the wording!
>>
>> >>
>> >> [...snip...]
>> >>
^ permalink raw reply
* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: Kalra, Ashish @ 2026-06-25 19:42 UTC (permalink / raw)
To: Borislav Petkov
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <20260625150253.GAaj1DHZC8ULg6PzbI@fat_crate.local>
Hello Boris,
On 6/25/2026 10:02 AM, Borislav Petkov wrote:
> On Wed, Jun 24, 2026 at 09:56:49PM +0000, Ashish Kalra wrote:
>> +/* Set while SNP has CPU hotplug disabled (kernel-lifetime; survives ccp reload). */
>> +static bool snp_cpu_hotplug_disabled;
>
> Do you really need this?
>
Yes.
cpu_hotplug_disable()/cpu_hotplug_enable() are refcounted (cpu_hotplug_disabled++/--,
with a WARN on underflow), so they have to be balanced. This flag collapses them to
exactly one outstanding disable per SNP-active window, because the disable and enable
sites are not reached a symmetric number of times:
- On firmware without SNP_X86_SHUTDOWN_SUPPORTED, __sev_snp_shutdown_locked() does not
call snp_shutdown() (it's gated on data.x86_snp_shutdown), so SNP stays enabled in
hardware — SNP_EN stays set and hotplug stays disabled — while sev->snp_initialized is
cleared. Re-init after that is routine, the SNP ioctls self-bracket init and shutdown
(e.g. SNP_COMMIT, SNP_SET_CONFIG, SNP_VLEK_LOAD):
	if (!sev->snp_initialized)
		snp_move_to_init_state(...); /* -> __sev_snp_init_locked -> snp_prepare() */
	... SNP_CMD ...
	if (shutdown_required)
		__sev_snp_shutdown_locked(...);
- So whenever SNP isn't already initialized (psp_init_on_probe off, or after a prior
legacy shutdown), every such ioctl does init -> command -> legacy shutdown. Each init
reaches snp_prepare() with SNP_EN already set, and the disable now sits at the top of
snp_prepare(), so it fires on every cycle. Without this flag that keeps bumping
cpu_hotplug_disabled while the legacy shutdown never re-enables — hotplug ends up stuck
disabled. This flag makes all but the first disable a no-op.
- Also, importantly, kvm-amd module reload on legacy firmware is the same pattern:
unload leaves SNP_EN set, reload re-inits.
- On the enable side it avoids an unbalanced cpu_hotplug_enable() when the teardown/failure
paths run without an outstanding disable (e.g. shutdown of a never-fully-initialized SNP).
So it's not redundant with cpu_hotplug_disabled — it tracks whether the outstanding disable
belongs to this SNP-active window in this kernel, which keeps the single disable/enable
balanced across the asymmetric legacy-vs-full SNP teardown paths and re-init.
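The balancing argument above can be sketched with a toy userspace model. This is only an illustration under stated assumptions: `toy_hotplug_disabled` stands in for the refcount behind cpu_hotplug_disable()/cpu_hotplug_enable(), `toy_snp_owns_disable` plays the role of snp_cpu_hotplug_disabled, and the `toy_snp_*()` functions are hypothetical simplifications of the init and shutdown paths, not the code in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the refcounted hotplug-disable counter. */
int toy_hotplug_disabled;
/* Plays the role of snp_cpu_hotplug_disabled in the quoted patch. */
bool toy_snp_owns_disable;

void toy_hotplug_disable(void)
{
	toy_hotplug_disabled++;
}

void toy_hotplug_enable(void)
{
	assert(toy_hotplug_disabled > 0);	/* the kernel WARNs on underflow */
	toy_hotplug_disabled--;
}

/* Init path: may run many times while SNP_EN stays set. */
void toy_snp_prepare(void)
{
	if (toy_snp_owns_disable)
		return;		/* our single disable is already outstanding */
	toy_hotplug_disable();
	toy_snp_owns_disable = true;
}

/* Legacy shutdown: SNP stays enabled in hardware, nothing is released. */
void toy_snp_legacy_shutdown(void)
{
}

/* Full shutdown: only undo a disable that this window actually took. */
void toy_snp_full_shutdown(void)
{
	if (!toy_snp_owns_disable)
		return;
	toy_hotplug_enable();
	toy_snp_owns_disable = false;
}
```

In this model, any number of init/legacy-shutdown cycles leaves the counter at one, and a full shutdown that runs without an outstanding disable is a no-op rather than an underflow.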
Thanks,
Ashish
^ permalink raw reply
* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: K Prateek Nayak @ 2026-06-25 22:16 UTC (permalink / raw)
To: Kalra, Ashish, Borislav Petkov
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, Tycho.Andersen, Nathan.Fontenot, ackerleytng,
jackyli, pgonda, rientjes, jacobhxu, xin, pawan.kumar.gupta,
babu.moger, dyoung, nikunj, john.allen, darwi, linux-kernel,
linux-crypto, kvm, linux-coco
In-Reply-To: <7c64d96f-f932-4db9-8119-b9e40d5b7fd9@amd.com>
Hello Ashish,
On 6/26/2026 1:12 AM, Kalra, Ashish wrote:
> Hello Boris,
>
> On 6/25/2026 10:02 AM, Borislav Petkov wrote:
>> On Wed, Jun 24, 2026 at 09:56:49PM +0000, Ashish Kalra wrote:
>>> +/* Set while SNP has CPU hotplug disabled (kernel-lifetime; survives ccp reload). */
>>> +static bool snp_cpu_hotplug_disabled;
>>
>> Do you really need this?
>>
>
> Yes.
>
> cpu_hotplug_disable()/cpu_hotplug_enable() are refcounted (cpu_hotplug_disabled++/--,
> with a WARN on underflow), so they have to be balanced. This flag collapses them to
> exactly one outstanding disable per SNP-active window, because the disable and enable
> sites are not reached a symmetric number of times:
>
> - On firmware without SNP_X86_SHUTDOWN_SUPPORTED, __sev_snp_shutdown_locked() does not
> call snp_shutdown() (it's gated on data.x86_snp_shutdown), so SNP stays enabled in
> hardware — SNP_EN stays set and hotplug stays disabled — while sev->snp_initialized is
> cleared. Re-init after that is routine, the SNP ioctls self-bracket init and shutdown
> (e.g. SNP_COMMIT, SNP_SET_CONFIG, SNP_VLEK_LOAD):
>
> if (!sev->snp_initialized)
> snp_move_to_init_state(...); /* -> __sev_snp_init_locked -> snp_prepare() */
> ... SNP_CMD ...
> if (shutdown_required)
> __sev_snp_shutdown_locked(...);
> - So whenever SNP isn't already initialized (psp_init_on_probe off, or after a prior
> legacy shutdown), every such ioctl does init -> command -> legacy shutdown. Each init
> reaches snp_prepare() with SNP_EN already set, and the disable now sits at the top of
> snp_prepare(), so it fires on every cycle. Without this flag that keeps bumping
> cpu_hotplug_disabled while the legacy shutdown never re-enables — hotplug ends up stuck
> disabled. This flag makes all but the first disable a no-op.
>
> - Also, importantly, kvm-amd module reload on legacy firmware is the same pattern:
> unload leaves SNP_EN set, reload re-inits.)
Looking at snp_prepare(), we have an early bailout for
	rdmsrq(MSR_AMD64_SYSCFG, val);
	if (val & MSR_AMD64_SYSCFG_SNP_EN)
		return;
Does executing the SHUTDOWN command lead to the firmware clearing SNP_EN in
SYSCFG on all CPUs?
If SNP_EN remains set (and Linux can't clear it since it is a
"Write-1-only" bit), then a subsequent snp_prepare() will skip setting
SYSCFG if it sees SNP_EN on the local CPU.
It can then happen that we enable hotplug at shutdown, CPUs come online
without setting SNP_EN in SYSCFG, a subsequent snp_prepare() runs on a CPU
where SNP_EN is still set and skips configuring it for the CPUs that
don't have it set, and we'll be in a pickle still.
The comment above that bailout saying "this can happen in case of kexec
boot" makes me believe that SNP_EN remains set until a full system
reset.
The only safe way to do this is to ensure all possible CPUs are online
during snp_prepare() and do snp_enable() regardless of whether local CPU
has SNP_EN or not.
Am I missing something?
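The scenario being asked about can be sketched with a toy model. It encodes the hypothesis in the question, not confirmed hardware or firmware behavior: `toy_snp_en[]` models a per-CPU write-1-only SNP_EN bit, a newly onlined CPU is assumed to come up with the bit clear, and `toy_prepare()` is a hypothetical simplification of snp_prepare() that keeps only the early bailout on the local CPU's bit.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

bool toy_snp_en[TOY_NR_CPUS];	/* per-CPU SNP_EN, can only be set */
bool toy_online[TOY_NR_CPUS];

/* Models the early bailout: only the local CPU's bit is consulted. */
void toy_prepare(int local_cpu)
{
	if (toy_snp_en[local_cpu])
		return;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_online[cpu])
			toy_snp_en[cpu] = true;
}

/* Assumption: a hotplugged CPU comes online with SNP_EN clear. */
void toy_cpu_online(int cpu)
{
	toy_online[cpu] = true;
}

/* Count online CPUs that are running without SNP_EN set. */
int toy_cpus_missing_snp_en(void)
{
	int missing = 0;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_online[cpu] && !toy_snp_en[cpu])
			missing++;
	return missing;
}
```

If a CPU is onlined between a shutdown and the next toy_prepare() call on a CPU that already has the bit set, the model ends up with an online CPU that never gets SNP_EN, which is the inconsistency described above.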
>
> - On the enable side it avoids an unbalanced cpu_hotplug_enable() when the teardown/failure
> paths run without an outstanding disable (e.g. shutdown of a never-fully-initialized SNP).
>
> So it's not redundant with cpu_hotplug_disabled — it tracks whether the outstanding disable
> belongs to this SNP-active window in this kernel, which keeps the single disable/enable
> balanced across the asymmetric legacy-vs-full SNP teardown paths and re-init.
--
Thanks and Regards,
Prateek
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Yan Zhao @ 2026-06-26 0:04 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgFfgV0FbQLzP8hhNH5hMGaQao6OFQin4cb3TAmC7SVhfA@mail.gmail.com>
On Thu, Jun 25, 2026 at 11:20:30AM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
>
> > On Wed, Jun 24, 2026 at 05:05:44PM -0700, Ackerley Tng wrote:
> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >>
> >> >
> >> > [...snip...]
> >> >
> >> >>
> >> >> #ifdef kvm_arch_has_private_mem
> >> >> -bool __ro_after_init gmem_in_place_conversion = false;
> >> >> +bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
> >> >> +module_param(gmem_in_place_conversion, bool, 0444);
> >> >
> >> > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> >> > MMAP flag. In such cases, shared memory is allocated from different backends.
> >> > This means this module parameter only enables per-gmem memory attribute and does
> >> > not guarantee that gmem in-place conversion will actually occur.
> >> >
> >> > To avoid confusion, could we rename this module parameter to something more
> >> > accurate, such as gmem_memory_attribute?
> >> >
> >>
> >> I asked Sean about this after getting some fixes off list. Sean said
> >> gmem_in_place_conversion is named for a host admin to use, and something
> >> like gmem_memory_attributes is too much implementation details for the
> >> admin.
> > Thanks for this background.
> >
> > Some more context on why I'm asking:
> >
> > Currently, I'm testing TDX huge pages with the following two gmem components:
> > 1. The gmem memory attribute in this gmem in-place conversion v8.
> > 2. The gmem 2MB from buddy allocator. (for development/testing only).
> >
> > The gmem 2MB from buddy allocator allocates 2MB folios from buddy for private
> > memory, while shared memory is allocated from a different backend.
> > (To avoid fragmentation, only private mappings are split during private-to-shared
> > conversions. In this approach, the 2MB folios are always retained in the gmem
> > inode filemap cache without splitting.)
> >
> > Since shared memory is not allocated from gmem, there're no in-place conversions.
> > The reason I'm using "gmem memory attribute" is that the per-VM attribute is
> > being deprecated, as suggested by Sean [1].
> >
>
> v8 of conversions series changed that slightly, per-VM attributes is
> going to stay around (because of work on RWX attributes, coming up) and
> RWX will stay tracked at the VM level.
>
> For v8 and beyond, only tracking of private/shared in per-VM attributes
> is being deprecated.
>
> By extension the entire thing about using guest_memfd for private memory
> and a different backing memory for shared memory is being deprecated.
Thanks for the info. I was actually referring to the per-VM shared/private
attribute, which is being deprecated. Sean hoped TDX huge page would be the
first mandated user of the per-gmem shared/private attribute.
> > Besides my current usage,
>
> I think you can set up guest_memfd+2M for private memory and shared
> memory from some other source, and that's the deprecated usage pattern.
Yes. Although this is the deprecated usage pattern, gmem_in_place_conversion=true
still allows it.
In fact, even without huge pages, v8 allows userspace to allocate shared memory
from another source when gmem_in_place_conversion=true.
(My default testing of this series for the 4KB setting uses this
configuration.)
> > there may be other scenarios where gmem memory
> > attributes is preferred without allocating shared memory from gmem.
> > (e.g., PAGE.ADD from a temp extra shared source memory).
> >
>
> Is this TDH.MEM.PAGE.ADD, used indirectly from
> tdx_gmem_post_populate()? This use case isn't blocked. Even if
> gmem_in_place_conversion=true, you can still set src_address to
> non-guest_memfd memory and load from anywhere you like.
>
> Please let me know if that is broken! I think I accidentally used that
It's not broken. I tested it with my hacked-up QEMU.
> setup in selftests and it worked. The selftests are now defaulting to
> in-place conversion.
>
> > For such use cases, I'm concerned that admins may find it confusing if they
> > enable gmem_in_place_conversion but still observe extra memory consumption for
> > shared memory.
> >
>
> Hmm but I guess if someone enables gmem_in_place_conversion but still
> allocates from elsewhere, they'd have to figure it out?
If gmem_in_place_conversion=true means gmem in-place conversion is allowed (but
not enforced), I agree.
I'm wondering if we could rename it to "allow_gmem_in_place_conversion" :)
> > [1] https://lore.kernel.org/kvm/aWmEegVP_A613WIr@google.com/
> >
> >> Sean, would you reconsider since Yan also asked? If the admin compiled
> >> the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> >> admin would also be able to use a param like gmem_memory_attributes?
> >>
> >> There's the additional benefit that the similar naming aids in
> >> understanding for both the admin and software engineers.
> >>
> >> Either way, in the next revision, I'll also add this documentation for
> >> this module_param:
> >>
> >> Setting the module parameter gmem_in_place_conversion to true will
> >> enable the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl and disables
> >> the KVM_SET_MEMORY_ATTRIBUTES VM ioctl. If gmem_in_place_conversion is
> >> true, the private/shared attribute will be tracked per-guest_memfd
> >> instead of per-VM.
> >>
> >> Let me know what y'all think of the wording!
> >>
> >> >>
> >> >> [...snip...]
> >> >>
* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Ackerley Tng @ 2026-06-26 0:07 UTC (permalink / raw)
To: Yan Zhao
Cc: Sean Christopherson, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <ajyRg3BwGu5dCfOn@yzhao56-desk.sh.intel.com>
Yan Zhao <yan.y.zhao@intel.com> writes:
> On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>
>> > On Tue, Jun 23, 2026, Yan Zhao wrote:
>> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
>> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
>> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
>> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
>> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
>> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
>> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
>> >> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> >> > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
>> >> > > > > return -EIO;
>> >> > > > >
>> >> > > > > - if (!src_page)
>> >> > > > > - return -EOPNOTSUPP;
>> >> > > > > + if (!src_page) {
>> >> > > > > + if (!gmem_in_place_conversion)
>> >> > > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
>> >> > > > without the MMAP flag, the absence of src_page should still be treated as an
>> >> > > > error.
>> >> > >
>> >> > > Why MMAP?
>> >> > Hmm, I was showing a scenario that in-place conversion couldn't occur.
>> >> > I didn't mean that with the MMAP flag, mmap() and user write must occur.
>> >> >
>> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"? Just
>> >> > > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
>> >> > > and written memory. And when write() lands, MMAP wouldn't be necessary to
>> >> > > initialize the memory.
>> >> > Do you mean using up-to-date flag as below?
>> >
>> > Yes? I didn't actually look at the implementation details.
>> >
>> >> > if (!src_page) {
>> >> > src_page = pfn_to_page(pfn);
>> >> > if (!folio_test_uptodate(page_folio(src_page)))
>> >> > return -EOPNOTSUPP;
>> >> > }
>>
>> Yan is right that with the earlier patch "Zero page while getting pfn",
>> folio_test_uptodate() here will always return true.
>>
>> Actually, this is an alternative fix for the issue Sashiko pointed out
>> on v7 where userspace can do a populate() (either TDX or SNP) without
>> first allocating the page, with src_address == NULL, and leak
>> uninitialized memory into the guest.
>>
>> Advantage of using the uptodate check in populate: if the host never
>> allocates the page, populate doesn't incur zeroing before writing the
>> page anyway in populate().
>>
>> Disadvantage: Both TDX and SNP will have to implement this uptodate
>> check. guest_memfd can't check centrally because for SNP, for a
>> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
>> firmware will zero and there's no leakage of uninitialized host memory?
> Another disadvantage: the uptodate flag is per-folio. What if the folio
> is only partially initialized by the userspace especially after huge page is
> supported?
>
Good point on huge pages!
The uptodate flag on the folio in guest_memfd means "this folio has been
written to". As of now (before patch at [1]), this happens when
+ folio is zeroed on first use by userspace
+ folio is zeroed on first use by the guest
+ folio is populated
When huge pages are supported, the folio can't be partially initialized?
On allocation, if any part is shared, we split the page. The parts are
separate folios that have their own uptodate flags.
On splitting, if the huge page is uptodate, the split pages will also be
uptodate. If the huge page is not uptodate, the split pages won't be
uptodate, but that's ok since they will be marked uptodate on first use.
On merging, the non-uptodate parts have to be zeroed and then marked
uptodate. Any parts that are in use would have been marked uptodate
already, so there's no overwriting data that is in use. I'll need to
think more about when it's safe to zero.
I'm still on the fence between the two options
1. Using uptodate check in populate to reject src_pages that have never
been written to or
2. Always zero before populate
but whether the uptodate flag is per-folio or not doesn't affect these
two options in terms of fixing the leak of uninitialized host memory,
right?
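To make the split/merge rules above concrete, here is a toy user-space
model in C. It is not kernel code: struct toy_folio and every toy_* helper
are hypothetical names invented for illustration, and the real guest_memfd
code operates on struct folio. The sketch assumes one uptodate flag per
folio and zeroing on first use, and models both populate options.

```c
/* Toy user-space model; not kernel code. All names are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE	16	/* tiny "page" so the model stays readable */
#define TOY_HUGE_PAGES	4	/* pages per toy huge folio */

struct toy_folio {
	unsigned char data[TOY_PAGE_SIZE * TOY_HUGE_PAGES];
	size_t nr_pages;
	bool uptodate;		/* one flag per folio, as discussed above */
};

/* First use (by host or guest): zero the folio once, then mark it. */
static void toy_first_use(struct toy_folio *f)
{
	if (f->uptodate)
		return;
	memset(f->data, 0, f->nr_pages * TOY_PAGE_SIZE);
	f->uptodate = true;
}

/* Split: each small folio inherits the huge folio's flag and contents. */
static void toy_split(const struct toy_folio *huge, struct toy_folio *parts)
{
	for (size_t i = 0; i < huge->nr_pages; i++) {
		memcpy(parts[i].data, huge->data + i * TOY_PAGE_SIZE,
		       TOY_PAGE_SIZE);
		parts[i].nr_pages = 1;
		parts[i].uptodate = huge->uptodate;
	}
}

/* Merge: zero only the parts never written; in-use data is kept intact. */
static void toy_merge(struct toy_folio *parts, size_t n,
		      struct toy_folio *huge)
{
	assert(n <= TOY_HUGE_PAGES);
	for (size_t i = 0; i < n; i++) {
		toy_first_use(&parts[i]);
		memcpy(huge->data + i * TOY_PAGE_SIZE, parts[i].data,
		       TOY_PAGE_SIZE);
	}
	huge->nr_pages = n;
	huge->uptodate = true;
}

/* Option 1: reject a populate with no source unless the folio was written. */
static int toy_populate_check(struct toy_folio *f, const void *src)
{
	if (!src && !f->uptodate)
		return -EOPNOTSUPP;
	if (src) {
		memcpy(f->data, src, f->nr_pages * TOY_PAGE_SIZE);
		f->uptodate = true;
	}
	return 0;
}

/* Option 2: always zero on first use before populate; never reject. */
static int toy_populate_zero(struct toy_folio *f, const void *src)
{
	toy_first_use(f);
	if (src)
		memcpy(f->data, src, f->nr_pages * TOY_PAGE_SIZE);
	return 0;
}
```

In this model neither option lets stale bytes reach the "guest": option 1
refuses the request, option 2 zeroes first. The per-folio granularity only
matters at split and merge time, where the flag is inherited or set after
zeroing.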
>
>> >> Another concern with this fix is that:
>> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
>> >> folio uptodate before reaching post_populate().
>> >>
>> >> [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
>> >>
>> >> > One concern is that TDX now does not much care about the up-to-date flag since
>> >> > TDX doesn't rely on the flag to clear pages on conversions.
>> >> > I'm not sure if the flag can be reliably checked in this case. e.g.,
>> >> > now the whole folio is marked up-to-date even if only part of it is faulted by
>> >> > user access.
>> >> > Ensuring that the up-to-date flag works correctly with huge page support seems
>> >> > to have more effort than introducing a dedicated flag for TDX.
>> >> >
>> >> > > > Additionally, to properly enable in-place copying for the TDX initial memory
>> >> > > > region, userspace must not only specify source_addr to NULL, but also follow
>> >> > > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
>> >> > > > 1. create guest_memfd with MMAP flag
>> >> > > > 2. mmap the guest_memfd.
>> >> > > > 3. convert the initial memory range to shared.
>> >> > > > 4. copy initial content to the source page.
>> >> > > > 5. convert the initial memory range to private
>> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
>> >> > > > 7. do not unmap the source backend.
>> >> > > >
>> >> > > > So, would it be reasonable to introduce a dedicated flag that allows userspace
>> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
>> >> > >
>> >> > > Why? It's userspace's responsibility to get the above right. If userspace fails
>> >> > > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
>>
>> Yan, is your concern that userspace forgot to update the code and
>> forgets to provide a src_page, and if we keep the "Zero page while
> Yes. Previously, it would be rejected after GUP fails.
>
I see, I didn't realize that it was previously rejected because GUP
failed. Did GUP fail because the page wasn't faulted into the host?
That's kind of orthogonal. I don't think a GUP failure leading to a rejected
populate was meant to help userspace catch these issues. GUP would also
fail if the user did mmap(), wrote to the memory, unmapped it using
madvise(MADV_DONTNEED), then forgot and passed 0 as src_address.
>> getting pfn" patch, ends up with the guest silently having a zero page?
>> I think that would be found quite early in userspace VMM testing...
> I actually encountered this while testing this patch.
> I updated most code paths to follow this sequence. However, there are still some
> corner cases for TDVF HOB, which are less obvious and harder to update.
> The TD just booted up and hung silently.
>
I think this is just the life of a close-to-hardware software engineer
:P no errors, got stuck somewhere, and the root cause is some uninitialized
thing.
>> >> > I mean if userspace specifies a NULL source_addr by mistake, it's better for
>> >> > kernel to detect this mistake, similar to how it validates whether source_addr
>> >> > is PAGE_ALIGNED.
>> >
>> > The alignment case is different. If userspace provides an unaligned value, KVM
>> > *can't* do what userspace is asking because hardware and thus KVM only supports
>> > converting on page boundaries.
>> >
>> > For a NULL source, KVM can still do what userspace is asking. Rejecting userspace's
>> > request would then be making assumptions about what userspace wants.
>> >
>>
>> Also, +1 on this, what if userspace, knowing that pages are zeroed on
>> allocation, actually wants to rely on that to get a zero page in the guest?
> What if 0 uaddr is a valid address? :)
>
>> >> > Since userspace already needs to perform additional steps to enable in-place
>> >> > copy, specifying a dedicated flag to indicate that the NULL source_addr is
>> >> > intentional seems like a reasonable burden.
>> >
>> > I don't see how it adds any value. I wouldn't be at all surprised if most VMMs
>> > just end up with code that does:
>> >
>> > if (in-place) {
>> > src = NULL;
>> > flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
>> > }
>>