[PATCH RFC 00/12] guest_memfd: support in-place memory conversion


* [PATCH RFC 00/12] guest_memfd: support in-place memory conversion
@ 2026-05-28  0:03 Michael Roth
  2026-05-28  0:03 ` [PATCH RFC 01/12] accel/kvm: Decouple guest_memfd checks from memory attribute checks Michael Roth
                   ` (12 more replies)
  0 siblings, 13 replies; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

This patchset is also available at:

  https://github.com/amdese/qemu/commits/snp-inplace-rfc1

which is in turn based on the following series:

  [PATCH 0/4] "guest_memfd: Fix handling for conversions of MMIO ranges"
  https://lists.gnu.org/archive/html/qemu-devel/2026-05/msg07547.html

OVERVIEW
--------

This series adds guest_memfd support for in-place conversion of memory
between private/shared, and enables it for SEV-SNP guests. It is based
on recently-added kernel support for mmap()-able guest_memfd
instances[1], which allow it to be used for shared memory, and the
following patchset[2], which adds additional guest_memfd interfaces to
allow it to be used to perform in-place conversion:

  "[PATCH v7 00/42] guest_memfd: In-place conversion support"
  https://lore.kernel.org/kvm/20260522-gmem-inplace-conversion-v7-0-2f0fae496530@google.com/

That series also introduces a new 'vm_memory_attributes' KVM
module option, which sets whether memory attributes are tracked
VM-wide by KVM (vm_memory_attributes=1: the existing 'legacy' mode),
or per-guest_memfd instance (vm_memory_attributes=0: the new mode
which allows for in-place conversion). The latter is intended to
eventually deprecate the legacy mode, at which point in-place
conversion would become the primarily-supported mode.

MOTIVATION
----------

Today, SEV-SNP guests (and other CoCo VM types using guest_memfd) keep
shared and private memory on separate physical backings: a userspace
memory-backend object for shared pages, and a kernel-allocated
guest_memfd file descriptor for private pages. KVM_SET_MEMORY_ATTRIBUTES
flips which backing the guest sees for a given GPA range, and the old
backing is typically discarded / hole-punched on conversion to avoid
doubled memory usage.

That model works, but has a number of downsides that impact certain
use-cases:

  - Each conversion involves discarding pages on one side and faulting
    them in on the other, which incurs allocation overheads in the
    host kernel for every conversion.

  - Some use-cases, like pKVM[3], rely on memory isolation rather than
    encryption and rely on in-place conversion to pass through things
    like secured framebuffer memory without needing to bounce data
    through separate shared/private HPAs, which would introduce
    unacceptable latency for that sort of workload.

  - Hugetlb support[4] for guest_memfd will rely on it, since things like
    1GB hugepages with a mix of shared/private sub-ranges would generally
    require two 1GB hugetlb pages to remain available to handle shared vs.
    private accesses, which quickly causes doubling of guest memory usage.

Recent kernel work[1][2] makes guest_memfd mmap()-able and lets the *same*
physical pages be used for both shared and private states for a given
GPA range, allowing the above pitfalls to be naturally avoided.

This series wires that support up in QEMU.

DESIGN
------

A new dedicated memory backend, memory-backend-guest-memfd, allocates
its memory via a guest_memfd file descriptor obtained from KVM with
the GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED flags. The fd
is mmap()ed so userspace can access pages directly while they are in
the shared state. For a normal/non-confidential VM, this backend can
be used in a similar fashion as the existing memory-backend-memfd.
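
As a rough illustration of the flag handling described above, here is a
minimal sketch (not code from this series). The SKETCH_* names and the
SketchGmemRequest type are hypothetical, and the bit positions are
assumed for the example; real code would take the flag values from
<linux/kvm.h> and the supported mask from KVM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Stand-ins for GUEST_MEMFD_FLAG_MMAP and GUEST_MEMFD_FLAG_INIT_SHARED.
 * The bit positions are assumptions made for this sketch only.
 */
#define SKETCH_GMEM_FLAG_MMAP        (1ULL << 0)
#define SKETCH_GMEM_FLAG_INIT_SHARED (1ULL << 1)

/* Hypothetical request description; not a KVM or QEMU structure. */
typedef struct {
    uint64_t size;
    uint64_t flags;
} SketchGmemRequest;

/* Returns true only if every flag needed for shared use is advertised. */
static bool sketch_gmem_shared_supported(uint64_t supported)
{
    const uint64_t required = SKETCH_GMEM_FLAG_MMAP |
                              SKETCH_GMEM_FLAG_INIT_SHARED;

    return (supported & required) == required;
}

/* Describes a shared, mmap()-able guest_memfd of the given size. */
static SketchGmemRequest sketch_gmem_shared_request(uint64_t size)
{
    SketchGmemRequest req;

    assert(size > 0); /* a zero-sized backend is rejected */
    req.size = size;
    req.flags = SKETCH_GMEM_FLAG_MMAP | SKETCH_GMEM_FLAG_INIT_SHARED;
    return req;
}
```

Checking both bits together means a kernel that advertises only one of
the two flags is treated as unsupported.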

For confidential VMs, a new 'convert-in-place' flag is added to switch
on in-place conversion support. When running in this mode, the user
*MUST* use memory-backend-guest-memfd for backing guest RAM. A new
RAM_GUEST_MEMFD_SHARED RAMBlock flag is added to track/enforce the
dependency. Additionally, QEMU is modified to use mmap()-able
guest_memfd and set this flag for other cases where it allocates RAM
internally. As a result, block->fd will generally always be a guest_memfd,
and when RAM_GUEST_MEMFD_SHARED is set then that block->fd will be
qemu_dup()'d as the FD handle for private memory as well (which is
currently what block->guest_memfd points to). This allows the prior
non-in-place handling around block->guest_memfd to be kept mostly
unchanged.

When running with convert-in-place=true, shared/private conversions
are no longer handled directly by KVM, but instead by a new guest_memfd
ioctl, KVM_SET_MEMORY_ATTRIBUTES2, which purposely provides similar
naming/implementation to the KVM_SET_MEMORY_ATTRIBUTES KVM ioctl that
it replaces. This series adds handling to route conversion requests to
the appropriate ioctls based on whether or not in-place conversion is
enabled.
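
The routing decision can be pictured with the following minimal sketch.
It is illustrative only: the SketchConvertTarget type and the function
name are hypothetical and do not appear in the patches, and the actual
ioctl calls are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of where a conversion request is sent. */
typedef struct {
    int  fd;          /* fd the ioctl would be issued against */
    bool per_gmem;    /* true: guest_memfd ioctl, false: VM-wide ioctl */
} SketchConvertTarget;

/*
 * Legacy mode sends KVM_SET_MEMORY_ATTRIBUTES to the VM fd; in-place
 * mode sends KVM_SET_MEMORY_ATTRIBUTES2 to the guest_memfd instead.
 */
static SketchConvertTarget sketch_convert_target(bool convert_in_place,
                                                 int vm_fd, int gmem_fd)
{
    SketchConvertTarget t;

    assert(vm_fd >= 0 && gmem_fd >= 0);
    t.per_gmem = convert_in_place;
    t.fd = convert_in_place ? gmem_fd : vm_fd;
    return t;
}
```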

Since guest_memfd ioctls need to be called against the specific
guest_memfd inode associated with each memory slot/region, some
refactoring is needed to handle conversions on a per-section basis. Much of
that is inherited from the bugfix series this patchset is based on top
of, which adds the initial logic for handling multiple sections within
a range that gets heavily re-used here.
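
To make the per-section handling concrete, here is a small sketch of
splitting one GPA range across the sections it overlaps, yielding one
(fd, offset, size) tuple per guest_memfd. All type, field, and function
names are hypothetical and only loosely modeled on the description
above; the assumption that the guest_memfd interface is addressed by
file offset is mine, and overflow of gpa + size is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one memory section backed by a guest_memfd. */
typedef struct {
    uint64_t gpa;        /* guest-physical start of the section */
    uint64_t size;       /* section length in bytes */
    int      fd;         /* guest_memfd backing this section */
    uint64_t fd_offset;  /* offset of the section start within fd */
} SketchSection;

/* One per-guest_memfd sub-range to convert. */
typedef struct {
    int      fd;
    uint64_t offset;
    uint64_t size;
} SketchGmemRange;

/*
 * Intersects [start, start + size) with each section and records the
 * overlapping part as an fd-relative range. Returns the number of
 * ranges written, which never exceeds max_out.
 */
static size_t sketch_split_by_section(const SketchSection *secs, size_t nsecs,
                                      uint64_t start, uint64_t size,
                                      SketchGmemRange *out, size_t max_out)
{
    uint64_t end = start + size;
    size_t n = 0;

    assert(out != NULL || max_out == 0);

    for (size_t i = 0; i < nsecs && n < max_out; i++) {
        uint64_t sec_end = secs[i].gpa + secs[i].size;
        uint64_t lo = start > secs[i].gpa ? start : secs[i].gpa;
        uint64_t hi = end < sec_end ? end : sec_end;

        if (lo >= hi) {
            continue; /* no overlap with this section */
        }
        out[n].fd = secs[i].fd;
        out[n].offset = secs[i].fd_offset + (lo - secs[i].gpa);
        out[n].size = hi - lo;
        n++;
    }
    return n;
}
```

Each resulting range could then be handed to whichever per-guest_memfd
call performs the conversion.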

USAGE
-----

After applying this series against a kernel with the RFC patches above
present, an SEV-SNP guest can be started with in-place conversion via:

    qemu-system-x86_64 \
        -machine q35,confidential-guest-support=sev0,memory-backend=ram0 \
        -object memory-backend-guest-memfd,id=ram0,size=8G,share=on \
        -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,\
                convert-in-place=on \
        ...

The new memory-backend-guest-memfd can also be used by normal VMs:

    qemu-system-x86_64 \
        -machine q35,memory-backend=ram0 \
        -object memory-backend-guest-memfd,id=ram0,size=8G,share=on \
        ...

This is currently mainly useful for testing, but in the future there may
be more use-cases around using guest_memfd as a general-purpose backend
for non-confidential VMs, so it is intended to work in this manner as
well.

NOTES/TODO
----------

  - the CPR handling to support resetting of confidential VMs is
    currently disabled when in-place conversion is enabled.
  - TDX testing would be great, in theory it can be enabled with this
    series (similarly to the top patch) but I'm not sure if there are
    other special requirements before we can switch it on.
  - kernel patches are still in-flight, but fairly mature at this point
    and nearing upstream

REFERENCES
----------

[1] https://lore.kernel.org/kvm/20250729225455.670324-1-seanjc@google.com/
[2] https://lore.kernel.org/kvm/20260522-gmem-inplace-conversion-v7-0-2f0fae496530@google.com/
[3] https://www.youtube.com/watch?v=MMfAGNW9RVg
[4] 1GB hugetlb v2

Thoughts, feedback, and testing are very much appreciated.

Thanks,

Mike

----------------------------------------------------------------
Michael Roth (12):
      accel/kvm: Decouple guest_memfd checks from memory attribute checks
      hostmem: Introduce dedicated memory backend for guest_memfd
      linux-headers: Update headers for v7 of in-place conversion kernel support
      accel/kvm: Add CGS option to control in-place conversion support
      system/memory: Re-use memory-backend-guest-memfd inode for private memory
      system/memory: Default to guest_memfd for RAM for in-place conversion
      accel/kvm: Move post-conversion updates to a separate helper
      accel/kvm: Re-order attribute notifications for in-place conversion
      accel/kvm: Support shared/private conversions via guest_memfd ioctls
      accel/kvm: Don't default to private attributes for in-place conversion
      i386/sev: Update SNP_LAUNCH_UPDATE for in-place conversion
      i386/sev: Allow in-place conversion for SEV-SNP guests

 accel/kvm/kvm-all.c                                | 286 +++++++++++--
 accel/stubs/kvm-stub.c                             |   9 +-
 backends/confidential-guest-support.c              |  25 ++
 backends/hostmem-guest-memfd.c                     |  93 +++++
 backends/meson.build                               |   1 +
 include/standard-headers/drm/drm_fourcc.h          |  28 +-
 include/standard-headers/linux/const.h             |  18 +
 include/standard-headers/linux/ethtool.h           |  28 +-
 include/standard-headers/linux/input-event-codes.h |  13 +
 include/standard-headers/linux/pci_regs.h          |  71 +++-
 include/standard-headers/linux/typelimits.h        |   8 +
 include/standard-headers/linux/virtio_ring.h       |   5 +-
 include/standard-headers/linux/virtio_rtc.h        | 237 +++++++++++
 include/standard-headers/linux/vmclock-abi.h       |  20 +
 include/system/confidential-guest-support.h        |  14 +
 include/system/hostmem.h                           |   1 +
 include/system/kvm.h                               |   3 +-
 include/system/memory.h                            |   8 +-
 linux-headers/asm-arm64/kvm.h                      |   1 +
 linux-headers/asm-arm64/unistd_64.h                |   1 +
 linux-headers/asm-generic/unistd.h                 |   5 +-
 linux-headers/asm-loongarch/kvm.h                  |   5 +
 linux-headers/asm-loongarch/kvm_para.h             |   1 +
 linux-headers/asm-loongarch/unistd_64.h            |   2 +
 linux-headers/asm-mips/unistd_n32.h                |   1 +
 linux-headers/asm-mips/unistd_n64.h                |   1 +
 linux-headers/asm-mips/unistd_o32.h                |   1 +
 linux-headers/asm-powerpc/unistd_32.h              |   1 +
 linux-headers/asm-powerpc/unistd_64.h              |   1 +
 linux-headers/asm-riscv/kvm.h                      |  11 +-
 linux-headers/asm-riscv/ptrace.h                   |  37 ++
 linux-headers/asm-riscv/unistd_32.h                |   1 +
 linux-headers/asm-riscv/unistd_64.h                |   1 +
 linux-headers/asm-s390/unistd_32.h                 | 446 ---------------------
 linux-headers/asm-s390/unistd_64.h                 |   1 +
 linux-headers/asm-x86/kvm.h                        |  21 +-
 linux-headers/asm-x86/unistd_32.h                  |   1 +
 linux-headers/asm-x86/unistd_64.h                  |   1 +
 linux-headers/asm-x86/unistd_x32.h                 |   1 +
 linux-headers/linux/const.h                        |  18 +
 linux-headers/linux/iommufd.h                      |  48 +++
 linux-headers/linux/kvm.h                          |  62 ++-
 linux-headers/linux/mshv.h                         |   4 +-
 linux-headers/linux/psp-sev.h                      |   2 +-
 linux-headers/linux/stddef.h                       |   4 +
 linux-headers/linux/vduse.h                        |  85 +++-
 linux-headers/linux/vfio.h                         |  30 +-
 qapi/qom.json                                      |  35 +-
 qemu-options.hx                                    |   5 +
 system/memory.c                                    |  22 +-
 system/physmem.c                                   |  50 ++-
 target/i386/sev.c                                  |  12 +-
 52 files changed, 1253 insertions(+), 533 deletions(-)
 create mode 100644 backends/hostmem-guest-memfd.c
 create mode 100644 include/standard-headers/linux/typelimits.h
 create mode 100644 include/standard-headers/linux/virtio_rtc.h
 delete mode 100644 linux-headers/asm-s390/unistd_32.h


* [PATCH RFC 01/12] accel/kvm: Decouple guest_memfd checks from memory attribute checks
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
@ 2026-05-28  0:03 ` Michael Roth
  2026-05-28  0:03 ` [PATCH RFC 02/12] hostmem: Introduce dedicated memory backend for guest_memfd Michael Roth
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

Currently QEMU supports using guest_memfd internally (separately from
user-specified memory backends) to handle private memory for
confidential VMs, and as a result has checks for guest_memfd support
merged with checks to see if KVM can handle mapping private memory (as
determined by KVM_MEMORY_ATTRIBUTE_PRIVATE).

Future QEMU support will allow using guest_memfd not just for private
memory, but as mmap()'able memory that can be used by non-confidential
guests as well.

In prep for this, split the checks for guest_memfd out from the check
for KVM_MEMORY_ATTRIBUTE_PRIVATE, and rename the current
kvm_create_guest_memfd() to kvm_create_guest_memfd_private() to
self-document current behavior/expectations and disambiguate from future
helpers intended for creating a guest_memfd to handle non-private/shared
memory. While there, fix up the missing error_setg() handling in the
stub functions.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 accel/kvm/kvm-all.c     | 20 +++++++++++++++++---
 accel/stubs/kvm-stub.c  |  3 ++-
 include/system/kvm.h    |  2 +-
 include/system/memory.h |  5 +++--
 system/physmem.c        |  8 ++++----
 5 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 585f1cea35..02911ff6e3 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -795,6 +795,11 @@ static int kvm_mem_flags(MemoryRegion *mr)
     }
     if (memory_region_has_guest_memfd(mr)) {
         assert(kvm_guest_memfd_supported);
+        /*
+         * memory_region_has_guest_memfd() is specifically pertaining to
+         * using guest_memfd to handle private memory use cases.
+         */
+        assert(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE);
         flags |= KVM_MEM_GUEST_MEMFD;
     }
     return flags;
@@ -3066,8 +3071,7 @@ static int kvm_init(AccelState *as, MachineState *ms)
     kvm_supported_memory_attributes = kvm_vm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES);
     kvm_guest_memfd_supported =
         kvm_vm_check_extension(s, KVM_CAP_GUEST_MEMFD) &&
-        kvm_vm_check_extension(s, KVM_CAP_USER_MEMORY2) &&
-        (kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE);
+        kvm_vm_check_extension(s, KVM_CAP_USER_MEMORY2);
     kvm_pre_fault_memory_supported = kvm_vm_check_extension(s, KVM_CAP_PRE_FAULT_MEMORY);
 
     if (s->kernel_irqchip_split == ON_OFF_AUTO_AUTO) {
@@ -4854,7 +4858,7 @@ void kvm_mark_guest_state_protected(void)
     kvm_state->guest_state_protected = true;
 }
 
-int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
+static int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
 {
     int fd;
     struct kvm_create_guest_memfd guest_memfd = {
@@ -4875,3 +4879,13 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
 
     return fd;
 }
+
+int kvm_create_guest_memfd_private(uint64_t size, Error **errp)
+{
+    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
+        error_setg(errp, "KVM does not support using guest_memfd for private memory");
+        return -1;
+    }
+
+    return kvm_create_guest_memfd(size, 0, errp);
+}
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index c4617caac6..1940bcbd2c 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -139,7 +139,8 @@ bool kvm_hwpoisoned_mem(void)
     return false;
 }
 
-int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
+int kvm_create_guest_memfd_private(uint64_t size, Error **errp)
 {
+    error_setg(errp, "guest_memfd is not supported for this configuration");
     return -ENOSYS;
 }
diff --git a/include/system/kvm.h b/include/system/kvm.h
index 5fa33eddda..aeb0c7ca8f 100644
--- a/include/system/kvm.h
+++ b/include/system/kvm.h
@@ -561,7 +561,7 @@ void kvm_mark_guest_state_protected(void);
  */
 bool kvm_hwpoisoned_mem(void);
 
-int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
+int kvm_create_guest_memfd_private(uint64_t size, Error **errp);
 
 int kvm_set_memory_attributes_private(hwaddr start, uint64_t size);
 int kvm_set_memory_attributes_shared(hwaddr start, uint64_t size);
diff --git a/include/system/memory.h b/include/system/memory.h
index 1417132f6d..24c68720aa 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -1745,9 +1745,10 @@ bool memory_region_is_protected(const MemoryRegion *mr);
 
 /**
  * memory_region_has_guest_memfd: check whether a memory region has guest_memfd
- *     associated
+ *     associated with it for handling private memory
  *
- * Returns %true if a memory region's ram_block has valid guest_memfd assigned.
+ * Returns %true if a memory region's ram_block has valid guest_memfd assigned
+ * for handling private memory.
  *
  * @mr: the memory region being queried
  */
diff --git a/system/physmem.c b/system/physmem.c
index 7bcbf87573..04c7c38721 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2202,8 +2202,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
             goto out_free;
         }
 
-        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
-                                                        0, errp);
+        new_block->guest_memfd = kvm_create_guest_memfd_private(new_block->max_length,
+                                                                errp);
         if (new_block->guest_memfd < 0) {
             qemu_mutex_unlock_ramlist();
             goto out_free;
@@ -2835,8 +2835,8 @@ int ram_block_rebind(Error **errp)
             if (block->guest_memfd >= 0) {
                 close(block->guest_memfd);
             }
-            block->guest_memfd = kvm_create_guest_memfd(block->max_length,
-                                                        0, errp);
+            block->guest_memfd = kvm_create_guest_memfd_private(block->max_length,
+                                                                errp);
             if (block->guest_memfd < 0) {
                 qemu_mutex_unlock_ramlist();
                 return -1;
-- 
2.43.0



* [PATCH RFC 02/12] hostmem: Introduce dedicated memory backend for guest_memfd
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
  2026-05-28  0:03 ` [PATCH RFC 01/12] accel/kvm: Decouple guest_memfd checks from memory attribute checks Michael Roth
@ 2026-05-28  0:03 ` Michael Roth
  2026-06-02  8:22   ` Markus Armbruster
  2026-05-28  0:03 ` [PATCH RFC 03/12] linux-headers: Update headers for v7 of in-place conversion kernel support Michael Roth
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

In the initial implementation of guest_memfd in the Linux kernel, it
was not possible to map memory into userspace for direct access; instead
the memory provided by the memory backend would be used for cases where
a confidential VM wants to access normal/unprotected/unencrypted memory
that can be used for shared memory use cases, and for access to private
memory a guest_memfd could be associated with the same memslot. A memory
'private' attribute set via KVM_SET_MEMORY_ATTRIBUTES could then be used
to have KVM route to the appropriate backing memory.

In that model, it didn't make sense to introduce a specific backend for
guest_memfd, since there was always a general need to have a separate
backend type to handle shared memory access/allocation. Instead, QEMU
configures the guest_memfd support for the associated memslots
internally for cases where it is running a confidential VM.

However, with recent changes in guest_memfd kernel support, it is now
possible to mmap() a guest_memfd FD into userspace and use it for shared
memory, as well as continue to use the same physical pages for the same
GPA ranges after they are converted to private ("in-place conversion").

To allow this mmap()-able/guest_memfd-provided memory to be used for
normal/shared memory instead of just for private memory,
introduce a dedicated guest_memfd memory backend that can be used both
for confidential VMs that wish to make use of in-place conversion, as
well as for non-confidential VMs that just want to make use of
guest_memfd for normal memory (which can be useful both for testing as
well as a stepping stone to things like software-protected VMs where the
host can be trusted to provide some additional degree of isolation for
the VM independently of hardware support).

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 accel/kvm/kvm-all.c            | 15 ++++++
 accel/stubs/kvm-stub.c         |  6 +++
 backends/hostmem-guest-memfd.c | 92 ++++++++++++++++++++++++++++++++++
 backends/meson.build           |  1 +
 include/system/hostmem.h       |  1 +
 include/system/kvm.h           |  1 +
 qapi/qom.json                  | 19 ++++++-
 qemu-options.hx                |  5 ++
 8 files changed, 139 insertions(+), 1 deletion(-)
 create mode 100644 backends/hostmem-guest-memfd.c

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 02911ff6e3..e6ae2e8ced 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -108,6 +108,7 @@ static bool kvm_has_guest_debug;
 static int kvm_sstep_flags;
 static bool kvm_immediate_exit;
 static uint64_t kvm_supported_memory_attributes;
+static uint64_t kvm_supported_guest_memfd_flags;
 static bool kvm_guest_memfd_supported;
 static hwaddr kvm_max_slot_size = ~0;
 
@@ -3069,6 +3070,7 @@ static int kvm_init(AccelState *as, MachineState *ms)
     }
 
     kvm_supported_memory_attributes = kvm_vm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES);
+    kvm_supported_guest_memfd_flags = kvm_vm_check_extension(s, KVM_CAP_GUEST_MEMFD_FLAGS);
     kvm_guest_memfd_supported =
         kvm_vm_check_extension(s, KVM_CAP_GUEST_MEMFD) &&
         kvm_vm_check_extension(s, KVM_CAP_USER_MEMORY2);
@@ -4889,3 +4891,16 @@ int kvm_create_guest_memfd_private(uint64_t size, Error **errp)
 
     return kvm_create_guest_memfd(size, 0, errp);
 }
+
+int kvm_create_guest_memfd_shared(uint64_t size, Error **errp)
+{
+    if (!(kvm_supported_guest_memfd_flags & GUEST_MEMFD_FLAG_MMAP) ||
+        !(kvm_supported_guest_memfd_flags & GUEST_MEMFD_FLAG_INIT_SHARED)) {
+        error_setg(errp, "KVM does not support using guest_memfd for shared memory");
+        return -1;
+    }
+
+    return kvm_create_guest_memfd(size,
+                                  GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED,
+                                  errp);
+}
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 1940bcbd2c..e50329f26e 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -144,3 +144,9 @@ int kvm_create_guest_memfd_private(uint64_t size, Error **errp)
     error_setg(errp, "guest_memfd is not supported for this configuration");
     return -ENOSYS;
 }
+
+int kvm_create_guest_memfd_shared(uint64_t size, Error **errp)
+{
+    error_setg(errp, "guest_memfd is not supported for this configuration");
+    return -ENOSYS;
+}
diff --git a/backends/hostmem-guest-memfd.c b/backends/hostmem-guest-memfd.c
new file mode 100644
index 0000000000..deb796a6bd
--- /dev/null
+++ b/backends/hostmem-guest-memfd.c
@@ -0,0 +1,92 @@
+/*
+ * QEMU guest_memfd memory backend
+ *
+ * Copyright (C) 2026 Advanced Micro Devices, Inc.
+ *
+ * Authors:
+ *   Michael Roth <michael.roth@amd.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "system/hostmem.h"
+#include "qom/object_interfaces.h"
+#include "qemu/module.h"
+#include "qapi/error.h"
+#include "qom/object.h"
+#include "migration/cpr.h"
+#include "system/kvm.h"
+
+OBJECT_DECLARE_SIMPLE_TYPE(HostMemoryBackendGuestMemfd, MEMORY_BACKEND_GUEST_MEMFD)
+
+struct HostMemoryBackendGuestMemfd {
+    HostMemoryBackend parent_obj;
+};
+
+static bool
+guest_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
+{
+    g_autofree char *name = host_memory_backend_get_name(backend);
+    int fd = cpr_find_fd(name, 0);
+    uint32_t ram_flags;
+
+    if (!backend->size) {
+        error_setg(errp, "can't create backend with size 0");
+        return false;
+    }
+
+    if (!backend->share) {
+        error_setg(errp, "can't create backend with share=off");
+        return false;
+    }
+
+    if (fd >= 0) {
+        goto have_fd;
+    }
+
+    fd = kvm_create_guest_memfd_shared(backend->size, errp);
+    if (fd < 0) {
+        return false;
+    }
+    cpr_save_fd(name, 0, fd);
+
+have_fd:
+    backend->aligned = true;
+    ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
+    ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
+    ram_flags |= backend->guest_memfd ? RAM_GUEST_MEMFD : 0;
+    return memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
+                                          backend->size, ram_flags, fd, 0, errp);
+}
+
+static void
+guest_memfd_backend_instance_init(Object *obj)
+{
+    HostMemoryBackendGuestMemfd *m = MEMORY_BACKEND_GUEST_MEMFD(obj);
+
+    MEMORY_BACKEND(m)->share = true;
+}
+
+static void
+guest_memfd_backend_class_init(ObjectClass *oc, const void *data)
+{
+    HostMemoryBackendClass *bc = MEMORY_BACKEND_CLASS(oc);
+
+    bc->alloc = guest_memfd_backend_memory_alloc;
+}
+
+static const TypeInfo guest_memfd_backend_info = {
+    .name = TYPE_MEMORY_BACKEND_GUEST_MEMFD,
+    .parent = TYPE_MEMORY_BACKEND,
+    .instance_init = guest_memfd_backend_instance_init,
+    .class_init = guest_memfd_backend_class_init,
+    .instance_size = sizeof(HostMemoryBackendGuestMemfd),
+};
+
+static void register_types(void)
+{
+    type_register_static(&guest_memfd_backend_info);
+}
+
+type_init(register_types);
diff --git a/backends/meson.build b/backends/meson.build
index 60021f45d1..6c53f4a097 100644
--- a/backends/meson.build
+++ b/backends/meson.build
@@ -20,6 +20,7 @@ endif
 if host_os == 'linux'
   system_ss.add(files('hostmem-memfd.c'))
   system_ss.add(files('host_iommu_device.c'))
+  system_ss.add(files('hostmem-guest-memfd.c'))
 endif
 if keyutils.found()
     system_ss.add(keyutils, files('cryptodev-lkcf.c'))
diff --git a/include/system/hostmem.h b/include/system/hostmem.h
index 88fa791ac7..2d0c25a43e 100644
--- a/include/system/hostmem.h
+++ b/include/system/hostmem.h
@@ -41,6 +41,7 @@ OBJECT_DECLARE_TYPE(HostMemoryBackend, HostMemoryBackendClass,
 
 #define TYPE_MEMORY_BACKEND_MEMFD "memory-backend-memfd"
 
+#define TYPE_MEMORY_BACKEND_GUEST_MEMFD "memory-backend-guest-memfd"
 
 /**
  * HostMemoryBackendClass:
diff --git a/include/system/kvm.h b/include/system/kvm.h
index aeb0c7ca8f..b959a6d3df 100644
--- a/include/system/kvm.h
+++ b/include/system/kvm.h
@@ -562,6 +562,7 @@ void kvm_mark_guest_state_protected(void);
 bool kvm_hwpoisoned_mem(void);
 
 int kvm_create_guest_memfd_private(uint64_t size, Error **errp);
+int kvm_create_guest_memfd_shared(uint64_t size, Error **errp);
 
 int kvm_set_memory_attributes_private(hwaddr start, uint64_t size);
 int kvm_set_memory_attributes_shared(hwaddr start, uint64_t size);
diff --git a/qapi/qom.json b/qapi/qom.json
index dd45ac1087..502fafeb15 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -661,7 +661,8 @@
 # @share: if false, the memory is private to QEMU; if true, it is
 #     shared (default false for backends memory-backend-file and
 #     memory-backend-ram, true for backends memory-backend-epc,
-#     memory-backend-memfd, and memory-backend-shm)
+#     memory-backend-memfd, memory-backend-shm, and
+#     memory-backend-guest-memfd)
 #
 # @reserve: if true, reserve swap space (or huge pages) if applicable
 #     (default: true) (since 6.1)
@@ -780,6 +781,18 @@
             '*seal': 'bool' },
   'if': 'CONFIG_LINUX' }
 
+##
+# @MemoryBackendGuestMemfdProperties:
+#
+# Properties for memory-backend-guest-memfd objects.
+#
+# Since: 11.1
+##
+{ 'struct': 'MemoryBackendGuestMemfdProperties',
+  'base': 'MemoryBackendProperties',
+  'data': {},
+  'if': 'CONFIG_LINUX' }
+
 ##
 # @MemoryBackendShmProperties:
 #
@@ -1234,6 +1247,8 @@
     'memory-backend-file',
     { 'name': 'memory-backend-memfd',
       'if': 'CONFIG_LINUX' },
+    { 'name': 'memory-backend-guest-memfd',
+      'if': 'CONFIG_LINUX' },
     'memory-backend-ram',
     { 'name': 'memory-backend-shm',
       'if': 'CONFIG_POSIX' },
@@ -1312,6 +1327,8 @@
       'memory-backend-file':        'MemoryBackendFileProperties',
       'memory-backend-memfd':       { 'type': 'MemoryBackendMemfdProperties',
                                       'if': 'CONFIG_LINUX' },
+      'memory-backend-guest-memfd': { 'type': 'MemoryBackendGuestMemfdProperties',
+                                      'if': 'CONFIG_LINUX' },
       'memory-backend-ram':         'MemoryBackendProperties',
       'memory-backend-shm':         { 'type': 'MemoryBackendShmProperties',
                                       'if': 'CONFIG_POSIX' },
diff --git a/qemu-options.hx b/qemu-options.hx
index 96ae41f787..3c754c149f 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -5858,6 +5858,11 @@ SRST
         off will cause a failure during allocation because it is not supported
         by this backend.
 
+    ``-object memory-backend-guest-memfd,id=id,prealloc=on|off,size=size,host-nodes=host-nodes,policy=default|preferred|bind|interleave``
+        Creates an anonymous memory file backend object that has similar
+        semantics to memfd, but is also usable as private memory when
+        running as a confidential VM. (Linux only)
+
     ``-object iommufd,id=id[,fd=fd]``
         Creates an iommufd backend which allows control of DMA mapping
         through the ``/dev/iommu`` device.
-- 
2.43.0



* Re: [PATCH RFC 02/12] hostmem: Introduce dedicated memory backend for guest_memfd
  2026-05-28  0:03 ` [PATCH RFC 02/12] hostmem: Introduce dedicated memory backend for guest_memfd Michael Roth
@ 2026-06-02  8:22   ` Markus Armbruster
  2026-06-03  6:19     ` Michael Roth
  0 siblings, 1 reply; 26+ messages in thread
From: Markus Armbruster @ 2026-06-02  8:22 UTC (permalink / raw)
  To: Michael Roth
  Cc: qemu-devel, kvm, pbonzini, berrange, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

Michael Roth <michael.roth@amd.com> writes:

> In the initial implementation of guest_memfd in the linux kernel, it
> was not possible to map memory into userspace for direct access; instead
> the memory provided by the memory backend would be used for cases where
> a confidential VM wants to access normal/unprotected/unencrypted memory
> that can be used for shared memory use cases, and for access to private
> memory a guest_memfd could be associated with the same memslot. A memory
> 'private' attribute set via KVM_SET_MEMORY_ATTRIBUTES could then be used
> to have KVM route to the approprate backing memory.
>
> In that model, it didn't make sense to introduce a specific backend for
> guest_memfd, since there was always a generally need to have a separate

a general need?

> backend type to handle shared memory access/allocation. Instead, QEMU
> configures the guest_memfd support for the associated memslots
> internally for cases where it is running a confidential VM.
>
> However, with recent changes in guest_memfd kernel support, it is now
> possible to mmap() a guest_memfd FD into userspace and use it for shared
> memory, as well as continue to use the same physical pages for the same
> GPA ranges after they are converted to private ("in-place conversion").
>
> To enable the use of this mmap()-able/guest_memfd-provided memory to be
> used for normal/shared memory instead of just for private memory,
> introduce a dedicated guest_memfd memory backend that can be used both
> for confidential VMs that wish to make use of in-place conversion, as
> well as for non-confidential VMs that just want to make use of
> guest_memfd for normal memory (which can be useful both for testing as
> well as a stepping stone to things like software-protected VMs where the
> host can be trusted to provided some additional degree of isolation for
> the VM independently of hardware support).
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>

[...]

> diff --git a/qapi/qom.json b/qapi/qom.json
> index dd45ac1087..502fafeb15 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -661,7 +661,8 @@
>  # @share: if false, the memory is private to QEMU; if true, it is
>  #     shared (default false for backends memory-backend-file and
>  #     memory-backend-ram, true for backends memory-backend-epc,
> -#     memory-backend-memfd, and memory-backend-shm)
> +#     memory-backend-memfd, memory-backend-shm, and
> +#     memory-backend-guest-memfd)
>  #
>  # @reserve: if true, reserve swap space (or huge pages) if applicable
>  #     (default: true) (since 6.1)
> @@ -780,6 +781,18 @@
>              '*seal': 'bool' },
>    'if': 'CONFIG_LINUX' }
>  
> +##
> +# @MemoryBackendGuestMemfdProperties:
> +#
> +# Properties for memory-backend-guest-memfd objects.
> +#
> +# Since: 11.1
> +##
> +{ 'struct': 'MemoryBackendGuestMemfdProperties',
> +  'base': 'MemoryBackendProperties',
> +  'data': {},
> +  'if': 'CONFIG_LINUX' }
> +

Identical to MemoryBackendProperties so far.

>  ##
>  # @MemoryBackendShmProperties:
>  #
> @@ -1234,6 +1247,8 @@
>      'memory-backend-file',
>      { 'name': 'memory-backend-memfd',
>        'if': 'CONFIG_LINUX' },
> +    { 'name': 'memory-backend-guest-memfd',
> +      'if': 'CONFIG_LINUX' },
>      'memory-backend-ram',
>      { 'name': 'memory-backend-shm',
>        'if': 'CONFIG_POSIX' },
> @@ -1312,6 +1327,8 @@
>        'memory-backend-file':        'MemoryBackendFileProperties',
>        'memory-backend-memfd':       { 'type': 'MemoryBackendMemfdProperties',
>                                        'if': 'CONFIG_LINUX' },
> +      'memory-backend-guest-memfd': { 'type': 'MemoryBackendGuestMemfdProperties',
> +                                      'if': 'CONFIG_LINUX' },

You could use MemoryBackendProperties here, and drop
MemoryBackendGuestMemfdProperties, similar to how memory-backend-ram
is done.
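
A sketch of what that union branch might look like, assuming the 'if'
condition stays on the branch since the backend is Linux-only:

      'memory-backend-guest-memfd': { 'type': 'MemoryBackendProperties',
                                      'if': 'CONFIG_LINUX' },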

>        'memory-backend-ram':         'MemoryBackendProperties',
>        'memory-backend-shm':         { 'type': 'MemoryBackendShmProperties',
>                                        'if': 'CONFIG_POSIX' },

Should we provide guidance on when to use which memory backend?  The
commit message provides some clues...

> diff --git a/qemu-options.hx b/qemu-options.hx
> index 96ae41f787..3c754c149f 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -5858,6 +5858,11 @@ SRST
>          off will cause a failure during allocation because it is not supported
>          by this backend.
>  
> +    ``-object memory-backend-guest-memfd,id=id,prealloc=on|off,size=size,host-nodes=host-nodes,policy=default|preferred|bind|interleave``
> +        Creates an anonymous memory file backend object that has similar
> +        semantics to memfd, but is also usable as private memory when
> +        running as a confidential VM. (Linux only)

There is no object type "memfd".  Do you mean "memory-backend-memfd"?

If yes, that one has additional properties @hugetlb, @hugetlbsize, and
@seal.  Why are they not needed for memory-backend-guest-memfd?

> +
>      ``-object iommufd,id=id[,fd=fd]``
>          Creates an iommufd backend which allows control of DMA mapping
>          through the ``/dev/iommu`` device.



* Re: [PATCH RFC 02/12] hostmem: Introduce dedicated memory backend for guest_memfd
  2026-06-02  8:22   ` Markus Armbruster
@ 2026-06-03  6:19     ` Michael Roth
  2026-06-08  8:20       ` Markus Armbruster
  0 siblings, 1 reply; 26+ messages in thread
From: Michael Roth @ 2026-06-03  6:19 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, kvm, pbonzini, berrange, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

On Tue, Jun 02, 2026 at 10:22:01AM +0200, Markus Armbruster wrote:
> Michael Roth <michael.roth@amd.com> writes:
> 
> > In the initial implementation of guest_memfd in the linux kernel, it
> > was not possible to map memory into userspace for direct access; instead
> > the memory provided by the memory backend would be used for cases where
> > a confidential VM wants to access normal/unprotected/unencrypted memory
> > that can be used for shared memory use cases, and for access to private
> > memory a guest_memfd could be associated with the same memslot. A memory
> > 'private' attribute set via KVM_SET_MEMORY_ATTRIBUTES could then be used
> > to have KVM route to the approprate backing memory.
> >
> > In that model, it didn't make sense to introduce a specific backend for
> > guest_memfd, since there was always a generally need to have a separate
> 
> a general need?

Much nicer :)

> 
> > backend type to handle shared memory access/allocation. Instead, QEMU
> > configures the guest_memfd support for the associated memslots
> > internally for cases where it is running a confidential VM.
> >
> > However, with recent changes in guest_memfd kernel support, it is now
> > possible to mmap() a guest_memfd FD into userspace and use it for shared
> > memory, as well as continue to use the same physical pages for the same
> > GPA ranges after they are converted to private ("in-place conversion").
> >
> > To enable the use of this mmap()-able/guest_memfd-provided memory to be
> > used for normal/shared memory instead of just for private memory,
> > introduce a dedicated guest_memfd memory backend that can be used both
> > for confidential VMs that wish to make use of in-place conversion, as
> > well as for non-confidential VMs that just want to make use of
> > guest_memfd for normal memory (which can be useful both for testing as
> > well as a stepping stone to things like software-protected VMs where the
> > host can be trusted to provided some additional degree of isolation for
> > the VM independently of hardware support).
> >
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> 
> [...]
> 
> > diff --git a/qapi/qom.json b/qapi/qom.json
> > index dd45ac1087..502fafeb15 100644
> > --- a/qapi/qom.json
> > +++ b/qapi/qom.json
> > @@ -661,7 +661,8 @@
> >  # @share: if false, the memory is private to QEMU; if true, it is
> >  #     shared (default false for backends memory-backend-file and
> >  #     memory-backend-ram, true for backends memory-backend-epc,
> > -#     memory-backend-memfd, and memory-backend-shm)
> > +#     memory-backend-memfd, memory-backend-shm, and
> > +#     memory-backend-guest-memfd)
> >  #
> >  # @reserve: if true, reserve swap space (or huge pages) if applicable
> >  #     (default: true) (since 6.1)
> > @@ -780,6 +781,18 @@
> >              '*seal': 'bool' },
> >    'if': 'CONFIG_LINUX' }
> >  
> > +##
> > +# @MemoryBackendGuestMemfdProperties:
> > +#
> > +# Properties for memory-backend-guest-memfd objects.
> > +#
> > +# Since: 11.1
> > +##
> > +{ 'struct': 'MemoryBackendGuestMemfdProperties',
> > +  'base': 'MemoryBackendProperties',
> > +  'data': {},
> > +  'if': 'CONFIG_LINUX' }
> > +
> 
> Identical to MemoryBackendProperties so far.
> 
> >  ##
> >  # @MemoryBackendShmProperties:
> >  #
> > @@ -1234,6 +1247,8 @@
> >      'memory-backend-file',
> >      { 'name': 'memory-backend-memfd',
> >        'if': 'CONFIG_LINUX' },
> > +    { 'name': 'memory-backend-guest-memfd',
> > +      'if': 'CONFIG_LINUX' },
> >      'memory-backend-ram',
> >      { 'name': 'memory-backend-shm',
> >        'if': 'CONFIG_POSIX' },
> > @@ -1312,6 +1327,8 @@
> >        'memory-backend-file':        'MemoryBackendFileProperties',
> >        'memory-backend-memfd':       { 'type': 'MemoryBackendMemfdProperties',
> >                                        'if': 'CONFIG_LINUX' },
> > +      'memory-backend-guest-memfd': { 'type': 'MemoryBackendGuestMemfdProperties',
> > +                                      'if': 'CONFIG_LINUX' },
> 
> You could use MemoryBackendProperties here, and drop
> MemoryBackendGuestMemfdProperties, similar to how memory-backend-ram
> is done.

That's true. I think I was anticipating it being warranted at some point, but
that doesn't need to happen here.

> 
> >        'memory-backend-ram':         'MemoryBackendProperties',
> >        'memory-backend-shm':         { 'type': 'MemoryBackendShmProperties',
> >                                        'if': 'CONFIG_POSIX' },
> 
> Should we provide guidance on when to use which memory backend?  The
> commit message provides some clues...

Were you thinking from a schema perspective, or something more
user-facing?

Either way, docs/system/confidential-guest-support.rst could definitely
use some sprucing up as part of this series, so I can cover this aspect
there as well.

> 
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 96ae41f787..3c754c149f 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -5858,6 +5858,11 @@ SRST
> >          off will cause a failure during allocation because it is not supported
> >          by this backend.
> >  
> > +    ``-object memory-backend-guest-memfd,id=id,prealloc=on|off,size=size,host-nodes=host-nodes,policy=default|preferred|bind|interleave``
> > +        Creates an anonymous memory file backend object that has similar
> > +        semantics to memfd, but is also usable as private memory when
> > +        running as a confidential VM. (Linux only)
> 
> There is no object type "memfd".  Do you mean "memory-backend-memfd"?

Yes, will update.

> 
> If yes, that one has additional properties @hugetlb, @hugetlbsize, and
> @seal.  Why are they not needed for memory-backend-guest-memfd?

ATM, hugetlb is not enabled for guest_memfd in the kernel. It's likely the
same set of options will apply, but there are also efforts to do things like
plumb DAX memory through guest_memfd for confidential VMs where maybe we end
up needing to be a bit more flexible/creative... not sure, but it seemed
like a good idea to give ourselves a clean slate since the support isn't
there yet anyway.

For seal, I'm not aware of any plan to support that for guest_memfd, so
it seems like unnecessary baggage to pull in.

Thanks,

Mike

> 
> > +
> >      ``-object iommufd,id=id[,fd=fd]``
> >          Creates an iommufd backend which allows control of DMA mapping
> >          through the ``/dev/iommu`` device.
> 


* Re: [PATCH RFC 02/12] hostmem: Introduce dedicated memory backend for guest_memfd
  2026-06-03  6:19     ` Michael Roth
@ 2026-06-08  8:20       ` Markus Armbruster
  2026-06-08 20:42         ` Michael Roth
  0 siblings, 1 reply; 26+ messages in thread
From: Markus Armbruster @ 2026-06-08  8:20 UTC (permalink / raw)
  To: Michael Roth
  Cc: Markus Armbruster, qemu-devel, kvm, pbonzini, berrange,
	pankaj.gupta, isaku.yamahata, xiaoyao.li, chao.p.peng, david,
	ashish.kalra, ackerleytng

Michael Roth <michael.roth@amd.com> writes:

> On Tue, Jun 02, 2026 at 10:22:01AM +0200, Markus Armbruster wrote:
>> Michael Roth <michael.roth@amd.com> writes:
>> 
>> > In the initial implementation of guest_memfd in the linux kernel, it
>> > was not possible to map memory into userspace for direct access; instead
>> > the memory provided by the memory backend would be used for cases where
>> > a confidential VM wants to access normal/unprotected/unencrypted memory
>> > that can be used for shared memory use cases, and for access to private
>> > memory a guest_memfd could be associated with the same memslot. A memory
>> > 'private' attribute set via KVM_SET_MEMORY_ATTRIBUTES could then be used
>> > to have KVM route to the approprate backing memory.
>> >
>> > In that model, it didn't make sense to introduce a specific backend for
>> > guest_memfd, since there was always a generally need to have a separate
>> 
>> a general need?
>
> Much nicer :)
>
>> 
>> > backend type to handle shared memory access/allocation. Instead, QEMU
>> > configures the guest_memfd support for the associated memslots
>> > internally for cases where it is running a confidential VM.
>> >
>> > However, with recent changes in guest_memfd kernel support, it is now
>> > possible to mmap() a guest_memfd FD into userspace and use it for shared
>> > memory, as well as continue to use the same physical pages for the same
>> > GPA ranges after they are converted to private ("in-place conversion").
>> >
>> > To enable the use of this mmap()-able/guest_memfd-provided memory to be
>> > used for normal/shared memory instead of just for private memory,
>> > introduce a dedicated guest_memfd memory backend that can be used both
>> > for confidential VMs that wish to make use of in-place conversion, as
>> > well as for non-confidential VMs that just want to make use of
>> > guest_memfd for normal memory (which can be useful both for testing as
>> > well as a stepping stone to things like software-protected VMs where the
>> > host can be trusted to provided some additional degree of isolation for
>> > the VM independently of hardware support).
>> >
>> > Signed-off-by: Michael Roth <michael.roth@amd.com>
>> 
>> [...]
>> 
>> > diff --git a/qapi/qom.json b/qapi/qom.json
>> > index dd45ac1087..502fafeb15 100644
>> > --- a/qapi/qom.json
>> > +++ b/qapi/qom.json
>> > @@ -661,7 +661,8 @@
>> >  # @share: if false, the memory is private to QEMU; if true, it is
>> >  #     shared (default false for backends memory-backend-file and
>> >  #     memory-backend-ram, true for backends memory-backend-epc,
>> > -#     memory-backend-memfd, and memory-backend-shm)
>> > +#     memory-backend-memfd, memory-backend-shm, and
>> > +#     memory-backend-guest-memfd)
>> >  #
>> >  # @reserve: if true, reserve swap space (or huge pages) if applicable
>> >  #     (default: true) (since 6.1)
>> > @@ -780,6 +781,18 @@
>> >              '*seal': 'bool' },
>> >    'if': 'CONFIG_LINUX' }
>> >  
>> > +##
>> > +# @MemoryBackendGuestMemfdProperties:
>> > +#
>> > +# Properties for memory-backend-guest-memfd objects.
>> > +#
>> > +# Since: 11.1
>> > +##
>> > +{ 'struct': 'MemoryBackendGuestMemfdProperties',
>> > +  'base': 'MemoryBackendProperties',
>> > +  'data': {},
>> > +  'if': 'CONFIG_LINUX' }
>> > +
>> 
>> Identical to MemoryBackendProperties so far.
>> 
>> >  ##
>> >  # @MemoryBackendShmProperties:
>> >  #
>> > @@ -1234,6 +1247,8 @@
>> >      'memory-backend-file',
>> >      { 'name': 'memory-backend-memfd',
>> >        'if': 'CONFIG_LINUX' },
>> > +    { 'name': 'memory-backend-guest-memfd',
>> > +      'if': 'CONFIG_LINUX' },
>> >      'memory-backend-ram',
>> >      { 'name': 'memory-backend-shm',
>> >        'if': 'CONFIG_POSIX' },
>> > @@ -1312,6 +1327,8 @@
>> >        'memory-backend-file':        'MemoryBackendFileProperties',
>> >        'memory-backend-memfd':       { 'type': 'MemoryBackendMemfdProperties',
>> >                                        'if': 'CONFIG_LINUX' },
>> > +      'memory-backend-guest-memfd': { 'type': 'MemoryBackendGuestMemfdProperties',
>> > +                                      'if': 'CONFIG_LINUX' },
>> 
>> You could use MemoryBackendProperties here, and drop
>> MemoryBackendGuestMemfdProperties, similar to how memory-backend-ram
>> is done.
>
> That's true. I think I was anticipating it being warranted at some point, but
> that doesn't need to happen here.
>
>> 
>> >        'memory-backend-ram':         'MemoryBackendProperties',
>> >        'memory-backend-shm':         { 'type': 'MemoryBackendShmProperties',
>> >                                        'if': 'CONFIG_POSIX' },
>> 
>> Should we provide guidance on when to use which memory backend?  The
>> commit message provides some clues...
>
> Were you thinking from a schema perspective, or something more
> user-facing?

The QAPI schema doc comments become the QEMU QMP Reference Manual, which
I believe is the first stop for "how do I use this?"

Sometimes, a full answer just doesn't fit there comfortably.  So we put
it elsewhere, and point to it from the QMP Reference.

> Either way, docs/system/confidential-guest-support.rst could definitely
> use some sprucing up as part of this series, so I can cover this aspect
> there as well.
>
>> 
>> > diff --git a/qemu-options.hx b/qemu-options.hx
>> > index 96ae41f787..3c754c149f 100644
>> > --- a/qemu-options.hx
>> > +++ b/qemu-options.hx
>> > @@ -5858,6 +5858,11 @@ SRST
>> >          off will cause a failure during allocation because it is not supported
>> >          by this backend.
>> >  
>> > +    ``-object memory-backend-guest-memfd,id=id,prealloc=on|off,size=size,host-nodes=host-nodes,policy=default|preferred|bind|interleave``
>> > +        Creates an anonymous memory file backend object that has similar
>> > +        semantics to memfd, but is also usable as private memory when
>> > +        running as a confidential VM. (Linux only)
>> 
>> There is no object type "memfd".  Do you mean "memory-backend-memfd"?
>
> Yes, will update.
>
>> 
>> If yes, that one has additional properties @hugetlb, @hugetlbsize, and
>> @seal.  Why are they not needed for memory-backend-guest-memfd?
>
> ATM, hugetlb is not enabled for guest_memfd in the kernel. It's likely the
> same set of options will apply, but there are also efforts to do things like
> plumb DAX memory through guest_memfd for confidential VMs where maybe we end
> up needing to be a bit more flexible/creative... not sure, but it seemed
> like a good idea to give ourselves a clean slate since the support isn't
> there yet anyway.

I gather these properties cannot work today.  I agree we shouldn't add
them until they do.

> For seal, I'm not aware of any plan to support that for guest_memfd, so
> it seems like unecessary baggage to pull in.

Likewise.

> Thanks,
>
> Mike
>
>> 
>> > +
>> >      ``-object iommufd,id=id[,fd=fd]``
>> >          Creates an iommufd backend which allows control of DMA mapping
>> >          through the ``/dev/iommu`` device.
>> 



* Re: [PATCH RFC 02/12] hostmem: Introduce dedicated memory backend for guest_memfd
  2026-06-08  8:20       ` Markus Armbruster
@ 2026-06-08 20:42         ` Michael Roth
  0 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-06-08 20:42 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, kvm, pbonzini, berrange, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

On Mon, Jun 08, 2026 at 10:20:22AM +0200, Markus Armbruster wrote:
> Michael Roth <michael.roth@amd.com> writes:
> 
> > On Tue, Jun 02, 2026 at 10:22:01AM +0200, Markus Armbruster wrote:
> >> Michael Roth <michael.roth@amd.com> writes:
> >> 
> >> > In the initial implementation of guest_memfd in the linux kernel, it
> >> > was not possible to map memory into userspace for direct access; instead
> >> > the memory provided by the memory backend would be used for cases where
> >> > a confidential VM wants to access normal/unprotected/unencrypted memory
> >> > that can be used for shared memory use cases, and for access to private
> >> > memory a guest_memfd could be associated with the same memslot. A memory
> >> > 'private' attribute set via KVM_SET_MEMORY_ATTRIBUTES could then be used
> >> > to have KVM route to the approprate backing memory.
> >> >
> >> > In that model, it didn't make sense to introduce a specific backend for
> >> > guest_memfd, since there was always a generally need to have a separate
> >> 
> >> a general need?
> >
> > Much nicer :)
> >
> >> 
> >> > backend type to handle shared memory access/allocation. Instead, QEMU
> >> > configures the guest_memfd support for the associated memslots
> >> > internally for cases where it is running a confidential VM.
> >> >
> >> > However, with recent changes in guest_memfd kernel support, it is now
> >> > possible to mmap() a guest_memfd FD into userspace and use it for shared
> >> > memory, as well as continue to use the same physical pages for the same
> >> > GPA ranges after they are converted to private ("in-place conversion").
> >> >
> >> > To enable the use of this mmap()-able/guest_memfd-provided memory to be
> >> > used for normal/shared memory instead of just for private memory,
> >> > introduce a dedicated guest_memfd memory backend that can be used both
> >> > for confidential VMs that wish to make use of in-place conversion, as
> >> > well as for non-confidential VMs that just want to make use of
> >> > guest_memfd for normal memory (which can be useful both for testing as
> >> > well as a stepping stone to things like software-protected VMs where the
> >> > host can be trusted to provided some additional degree of isolation for
> >> > the VM independently of hardware support).
> >> >
> >> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> >> 
> >> [...]
> >> 
> >> > diff --git a/qapi/qom.json b/qapi/qom.json
> >> > index dd45ac1087..502fafeb15 100644
> >> > --- a/qapi/qom.json
> >> > +++ b/qapi/qom.json
> >> > @@ -661,7 +661,8 @@
> >> >  # @share: if false, the memory is private to QEMU; if true, it is
> >> >  #     shared (default false for backends memory-backend-file and
> >> >  #     memory-backend-ram, true for backends memory-backend-epc,
> >> > -#     memory-backend-memfd, and memory-backend-shm)
> >> > +#     memory-backend-memfd, memory-backend-shm, and
> >> > +#     memory-backend-guest-memfd)
> >> >  #
> >> >  # @reserve: if true, reserve swap space (or huge pages) if applicable
> >> >  #     (default: true) (since 6.1)
> >> > @@ -780,6 +781,18 @@
> >> >              '*seal': 'bool' },
> >> >    'if': 'CONFIG_LINUX' }
> >> >  
> >> > +##
> >> > +# @MemoryBackendGuestMemfdProperties:
> >> > +#
> >> > +# Properties for memory-backend-guest-memfd objects.
> >> > +#
> >> > +# Since: 11.1
> >> > +##
> >> > +{ 'struct': 'MemoryBackendGuestMemfdProperties',
> >> > +  'base': 'MemoryBackendProperties',
> >> > +  'data': {},
> >> > +  'if': 'CONFIG_LINUX' }
> >> > +
> >> 
> >> Identical to MemoryBackendProperties so far.
> >> 
> >> >  ##
> >> >  # @MemoryBackendShmProperties:
> >> >  #
> >> > @@ -1234,6 +1247,8 @@
> >> >      'memory-backend-file',
> >> >      { 'name': 'memory-backend-memfd',
> >> >        'if': 'CONFIG_LINUX' },
> >> > +    { 'name': 'memory-backend-guest-memfd',
> >> > +      'if': 'CONFIG_LINUX' },
> >> >      'memory-backend-ram',
> >> >      { 'name': 'memory-backend-shm',
> >> >        'if': 'CONFIG_POSIX' },
> >> > @@ -1312,6 +1327,8 @@
> >> >        'memory-backend-file':        'MemoryBackendFileProperties',
> >> >        'memory-backend-memfd':       { 'type': 'MemoryBackendMemfdProperties',
> >> >                                        'if': 'CONFIG_LINUX' },
> >> > +      'memory-backend-guest-memfd': { 'type': 'MemoryBackendGuestMemfdProperties',
> >> > +                                      'if': 'CONFIG_LINUX' },
> >> 
> >> You could use MemoryBackendProperties here, and drop
> >> MemoryBackendGuestMemfdProperties, similar to how memory-backend-ram
> >> is done.
> >
> > That's true. I think I was anticipating it being warranted at some point, but
> > that doesn't need to happen here.
> >
> >> 
> >> >        'memory-backend-ram':         'MemoryBackendProperties',
> >> >        'memory-backend-shm':         { 'type': 'MemoryBackendShmProperties',
> >> >                                        'if': 'CONFIG_POSIX' },
> >> 
> >> Should we provide guidance on when to use which memory backend?  The
> >> commit message provides some clues...
> >
> > Were you thinking from a schema perspective, or something more
> > user-facing?
> 
> The QAPI schema doc comments become the QEMU QMP Reference Manual, which
> I believe is the first stop for "how do I use this?"
> 
> Sometimes, a full answer just doesn't fit there comfortably.  So we put
> it elsewhere, and point to it from the QMP Reference.

Makes sense, I'll cross reference the documentation and provide some
background on how the backends / options are used.

Thanks,

Mike

> 
> > Either way, docs/system/confidential-guest-support.rst could definitely
> > use some sprucing up as part of this series, so I can cover this aspect
> > there as well.
> >
> >> 
> >> > diff --git a/qemu-options.hx b/qemu-options.hx
> >> > index 96ae41f787..3c754c149f 100644
> >> > --- a/qemu-options.hx
> >> > +++ b/qemu-options.hx
> >> > @@ -5858,6 +5858,11 @@ SRST
> >> >          off will cause a failure during allocation because it is not supported
> >> >          by this backend.
> >> >  
> >> > +    ``-object memory-backend-guest-memfd,id=id,prealloc=on|off,size=size,host-nodes=host-nodes,policy=default|preferred|bind|interleave``
> >> > +        Creates an anonymous memory file backend object that has similar
> >> > +        semantics to memfd, but is also usable as private memory when
> >> > +        running as a confidential VM. (Linux only)
> >> 
> >> There is no object type "memfd".  Do you mean "memory-backend-memfd"?
> >
> > Yes, will update.
> >
> >> 
> >> If yes, that one has additional properties @hugetlb, @hugetlbsize, and
> >> @seal.  Why are they not needed for memory-backend-guest-memfd?
> >
> > ATM, hugetlb is not enabled for guest_memfd in the kernel. It's likely the
> > same set of options will apply, but there are also efforts to do things like
> > plumb DAX memory through guest_memfd for confidential VMs where maybe we end
> > up needing to be a bit more flexible/creative... not sure, but it seemed
> > like a good idea to give ourselves a clean slate since the support isn't
> > there yet anyway.
> 
> I gather these properties cannot work today.  I agree we shouldn't add
> them until they do.
> 
> > For seal, I'm not aware of any plan to support that for guest_memfd, so
> > it seems like unecessary baggage to pull in.
> 
> Likewise.

Sounds good, though I'm now leaning more toward the
memory-backend-memfd,guest_memfd=on approach that Peter implemented[1],
since it requires fewer assumptions about what we'll need to do later:
if we want to introduce a backend specifically for guest_memfd, we'll
still have the option. But if we do it now and then decide to go back
to re-using the existing *-memfd/*-file/*-etc backends because the
option format seems more familiar to QEMU users, the dedicated
backend is a bit more of a pain to turn around and deprecate.

Not sure yet what we'll end up doing, but hopefully for v2 we'll
have a plan for what to do initially at least.
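
For comparison, the two forms would look roughly like this on the
command line. The id and size values are placeholders, and the exact
spelling of the guest_memfd property is an assumption based on the
shorthand above rather than on the final form of [1]:

  dedicated backend (this patch):
      -object memory-backend-guest-memfd,id=mem0,size=4G

  property on an existing backend (the approach in [1]):
      -object memory-backend-memfd,id=mem0,size=4G,guest_memfd=on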

Thanks,

Mike

[1] https://lore.kernel.org/qemu-devel/aiCAFWKEAHkPLCO5@x1.local/

> 
> > Thanks,
> >
> > Mike
> >
> >> 
> >> > +
> >> >      ``-object iommufd,id=id[,fd=fd]``
> >> >          Creates an iommufd backend which allows control of DMA mapping
> >> >          through the ``/dev/iommu`` device.
> >> 
> 


* [PATCH RFC 03/12] linux-headers: Update headers for v7 of in-place conversion kernel support
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
  2026-05-28  0:03 ` [PATCH RFC 01/12] accel/kvm: Decouple guest_memfd checks from memory attribute checks Michael Roth
  2026-05-28  0:03 ` [PATCH RFC 02/12] hostmem: Introduce dedicated memory backend for guest_memfd Michael Roth
@ 2026-05-28  0:03 ` Michael Roth
  2026-05-28  0:03 ` [PATCH RFC 04/12] accel/kvm: Add CGS option to control in-place conversion support Michael Roth
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

This will also pull in kernel 7.1.0-rc2 definitions.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 include/standard-headers/drm/drm_fourcc.h     |  28 +-
 include/standard-headers/linux/const.h        |  18 +
 include/standard-headers/linux/ethtool.h      |  28 +-
 .../linux/input-event-codes.h                 |  13 +
 include/standard-headers/linux/pci_regs.h     |  71 ++-
 include/standard-headers/linux/typelimits.h   |   8 +
 include/standard-headers/linux/virtio_ring.h  |   5 +-
 include/standard-headers/linux/virtio_rtc.h   | 237 ++++++++++
 include/standard-headers/linux/vmclock-abi.h  |  20 +
 linux-headers/asm-arm64/kvm.h                 |   1 +
 linux-headers/asm-arm64/unistd_64.h           |   1 +
 linux-headers/asm-generic/unistd.h            |   5 +-
 linux-headers/asm-loongarch/kvm.h             |   5 +
 linux-headers/asm-loongarch/kvm_para.h        |   1 +
 linux-headers/asm-loongarch/unistd_64.h       |   2 +
 linux-headers/asm-mips/unistd_n32.h           |   1 +
 linux-headers/asm-mips/unistd_n64.h           |   1 +
 linux-headers/asm-mips/unistd_o32.h           |   1 +
 linux-headers/asm-powerpc/unistd_32.h         |   1 +
 linux-headers/asm-powerpc/unistd_64.h         |   1 +
 linux-headers/asm-riscv/kvm.h                 |  11 +-
 linux-headers/asm-riscv/ptrace.h              |  37 ++
 linux-headers/asm-riscv/unistd_32.h           |   1 +
 linux-headers/asm-riscv/unistd_64.h           |   1 +
 linux-headers/asm-s390/unistd_32.h            | 446 ------------------
 linux-headers/asm-s390/unistd_64.h            |   1 +
 linux-headers/asm-x86/kvm.h                   |  21 +-
 linux-headers/asm-x86/unistd_32.h             |   1 +
 linux-headers/asm-x86/unistd_64.h             |   1 +
 linux-headers/asm-x86/unistd_x32.h            |   1 +
 linux-headers/linux/const.h                   |  18 +
 linux-headers/linux/iommufd.h                 |  48 ++
 linux-headers/linux/kvm.h                     |  62 ++-
 linux-headers/linux/mshv.h                    |   4 +-
 linux-headers/linux/psp-sev.h                 |   2 +-
 linux-headers/linux/stddef.h                  |   4 +
 linux-headers/linux/vduse.h                   |  85 +++-
 linux-headers/linux/vfio.h                    |  30 +-
 38 files changed, 729 insertions(+), 493 deletions(-)
 create mode 100644 include/standard-headers/linux/typelimits.h
 create mode 100644 include/standard-headers/linux/virtio_rtc.h
 delete mode 100644 linux-headers/asm-s390/unistd_32.h

diff --git a/include/standard-headers/drm/drm_fourcc.h b/include/standard-headers/drm/drm_fourcc.h
index b39e197cc7..4bad457cc2 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -400,8 +400,8 @@ extern "C" {
  * implementation can multiply the values by 2^6=64. For that reason the padding
  * must only contain zeros.
  * index 0 = Y plane, [15:0] z:Y [6:10] little endian
- * index 1 = Cr plane, [15:0] z:Cr [6:10] little endian
- * index 2 = Cb plane, [15:0] z:Cb [6:10] little endian
+ * index 1 = Cb plane, [15:0] z:Cb [6:10] little endian
+ * index 2 = Cr plane, [15:0] z:Cr [6:10] little endian
  */
 #define DRM_FORMAT_S010	fourcc_code('S', '0', '1', '0') /* 2x2 subsampled Cb (1) and Cr (2) planes 10 bits per channel */
 #define DRM_FORMAT_S210	fourcc_code('S', '2', '1', '0') /* 2x1 subsampled Cb (1) and Cr (2) planes 10 bits per channel */
@@ -413,8 +413,8 @@ extern "C" {
  * implementation can multiply the values by 2^4=16. For that reason the padding
  * must only contain zeros.
  * index 0 = Y plane, [15:0] z:Y [4:12] little endian
- * index 1 = Cr plane, [15:0] z:Cr [4:12] little endian
- * index 2 = Cb plane, [15:0] z:Cb [4:12] little endian
+ * index 1 = Cb plane, [15:0] z:Cb [4:12] little endian
+ * index 2 = Cr plane, [15:0] z:Cr [4:12] little endian
  */
 #define DRM_FORMAT_S012	fourcc_code('S', '0', '1', '2') /* 2x2 subsampled Cb (1) and Cr (2) planes 12 bits per channel */
 #define DRM_FORMAT_S212	fourcc_code('S', '2', '1', '2') /* 2x1 subsampled Cb (1) and Cr (2) planes 12 bits per channel */
@@ -423,8 +423,8 @@ extern "C" {
 /*
  * 3 plane YCbCr
  * index 0 = Y plane, [15:0] Y little endian
- * index 1 = Cr plane, [15:0] Cr little endian
- * index 2 = Cb plane, [15:0] Cb little endian
+ * index 1 = Cb plane, [15:0] Cb little endian
+ * index 2 = Cr plane, [15:0] Cr little endian
  */
 #define DRM_FORMAT_S016	fourcc_code('S', '0', '1', '6') /* 2x2 subsampled Cb (1) and Cr (2) planes 16 bits per channel */
 #define DRM_FORMAT_S216	fourcc_code('S', '2', '1', '6') /* 2x1 subsampled Cb (1) and Cr (2) planes 16 bits per channel */
@@ -1421,6 +1421,22 @@ drm_fourcc_canonicalize_nvidia_format_mod(uint64_t modifier)
 #define DRM_FORMAT_MOD_ARM_16X16_BLOCK_U_INTERLEAVED \
 	DRM_FORMAT_MOD_ARM_CODE(DRM_FORMAT_MOD_ARM_TYPE_MISC, 1ULL)
 
+/*
+ * ARM 64k interleaved modifier
+ *
+ * This is used by ARM Mali v10+ GPUs. With this modifier, the plane is divided
+ * into 64k byte 1:1 or 2:1 -sided tiles. The 64k tiles are laid out linearly.
+ * Each 64k tile is divided into blocks of 16x16 texel blocks, which are
+ * themselves laid out linearly within a 64k tile. Then within each 16x16
+ * block, texel blocks are laid out according to U order, similar to
+ * 16X16_BLOCK_U_INTERLEAVED.
+ *
+ * Note that unlike 16X16_BLOCK_U_INTERLEAVED, the layout does not change
+ * depending on whether a format is compressed or not.
+ */
+#define DRM_FORMAT_MOD_ARM_INTERLEAVED_64K \
+	DRM_FORMAT_MOD_ARM_CODE(DRM_FORMAT_MOD_ARM_TYPE_MISC, 2ULL)
+
 /*
  * Allwinner tiled modifier
  *
diff --git a/include/standard-headers/linux/const.h b/include/standard-headers/linux/const.h
index 95ede23342..c6a9d0c983 100644
--- a/include/standard-headers/linux/const.h
+++ b/include/standard-headers/linux/const.h
@@ -50,4 +50,22 @@
 
 #define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
 
+/*
+ * Divide positive or negative dividend by positive or negative divisor
+ * and round to closest integer. Result is undefined for negative
+ * divisors if the dividend variable type is unsigned and for negative
+ * dividends if the divisor variable type is unsigned.
+ */
+#define __KERNEL_DIV_ROUND_CLOSEST(x, divisor)		\
+({							\
+	__typeof__(x) __x = x;				\
+	__typeof__(divisor) __d = divisor;		\
+							\
+	(((__typeof__(x))-1) > 0 ||			\
+	 ((__typeof__(divisor))-1) > 0 ||		\
+	 (((__x) > 0) == ((__d) > 0))) ?		\
+		(((__x) + ((__d) / 2)) / (__d)) :	\
+		(((__x) - ((__d) / 2)) / (__d));	\
+})
+
 #endif /* _LINUX_CONST_H */
diff --git a/include/standard-headers/linux/ethtool.h b/include/standard-headers/linux/ethtool.h
index d0f7a63f10..5d82126cd7 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -17,11 +17,10 @@
 #include "net/eth.h"
 
 #include "standard-headers/linux/const.h"
+#include "standard-headers/linux/typelimits.h"
 #include "standard-headers/linux/types.h"
 #include "standard-headers/linux/if_ether.h"
 
-#include <limits.h> /* for INT_MAX */
-
 /* All structures exposed to userland should be defined such that they
  * have the same layout for 32-bit and 64-bit userland.
  */
@@ -228,7 +227,7 @@ enum tunable_id {
 	ETHTOOL_ID_UNSPEC,
 	ETHTOOL_RX_COPYBREAK,
 	ETHTOOL_TX_COPYBREAK,
-	ETHTOOL_PFC_PREVENTION_TOUT, /* timeout in msecs */
+	ETHTOOL_PFC_PREVENTION_TOUT, /* both pause and pfc, see man ethtool */
 	ETHTOOL_TX_COPYBREAK_BUF_SIZE,
 	/*
 	 * Add your fresh new tunable attribute above and remember to update
@@ -603,6 +602,8 @@ enum ethtool_link_ext_state {
 	ETHTOOL_LINK_EXT_STATE_POWER_BUDGET_EXCEEDED,
 	ETHTOOL_LINK_EXT_STATE_OVERHEAT,
 	ETHTOOL_LINK_EXT_STATE_MODULE,
+	ETHTOOL_LINK_EXT_STATE_OTP_SPEED_VIOLATION,
+	ETHTOOL_LINK_EXT_STATE_BMC_REQUEST_DOWN,
 };
 
 /* More information in addition to ETHTOOL_LINK_EXT_STATE_AUTONEG. */
@@ -1094,13 +1095,20 @@ enum ethtool_module_fw_flash_status {
  * struct ethtool_gstrings - string set for data tagging
  * @cmd: Command number = %ETHTOOL_GSTRINGS
  * @string_set: String set ID; one of &enum ethtool_stringset
- * @len: On return, the number of strings in the string set
+ * @len: Number of strings in the string set
  * @data: Buffer for strings.  Each string is null-padded to a size of
  *	%ETH_GSTRING_LEN.
  *
  * Users must use %ETHTOOL_GSSET_INFO to find the number of strings in
  * the string set.  They must allocate a buffer of the appropriate
  * size immediately following this structure.
+ *
+ * Setting @len on input is optional (though preferred), but must be zeroed
+ * otherwise.
+ * When set, @len will return the requested count if it matches the actual
+ * count; otherwise, it will be zero.
+ * This prevents issues when the number of strings is different than the
+ * userspace allocation.
  */
 struct ethtool_gstrings {
 	uint32_t	cmd;
@@ -1177,13 +1185,20 @@ struct ethtool_test {
 /**
  * struct ethtool_stats - device-specific statistics
  * @cmd: Command number = %ETHTOOL_GSTATS
- * @n_stats: On return, the number of statistics
+ * @n_stats: Number of statistics
  * @data: Array of statistics
  *
  * Users must use %ETHTOOL_GSSET_INFO or %ETHTOOL_GDRVINFO to find the
  * number of statistics that will be returned.  They must allocate a
  * buffer of the appropriate size (8 * number of statistics)
  * immediately following this structure.
+ *
+ * Setting @n_stats on input is optional (though preferred), but must be zeroed
+ * otherwise.
+ * When set, @n_stats will return the requested count if it matches the actual
+ * count; otherwise, it will be zero.
+ * This prevents issues when the number of stats is different than the
+ * userspace allocation.
  */
 struct ethtool_stats {
 	uint32_t	cmd;
@@ -2190,6 +2205,7 @@ enum ethtool_link_mode_bit_indices {
 #define SPEED_40000		40000
 #define SPEED_50000		50000
 #define SPEED_56000		56000
+#define SPEED_80000		80000
 #define SPEED_100000		100000
 #define SPEED_200000		200000
 #define SPEED_400000		400000
@@ -2200,7 +2216,7 @@ enum ethtool_link_mode_bit_indices {
 
 static inline int ethtool_validate_speed(uint32_t speed)
 {
-	return speed <= INT_MAX || speed == (uint32_t)SPEED_UNKNOWN;
+	return speed <= __KERNEL_INT_MAX || speed == (uint32_t)SPEED_UNKNOWN;
 }
 
 /* Duplex, half or full. */
diff --git a/include/standard-headers/linux/input-event-codes.h b/include/standard-headers/linux/input-event-codes.h
index ede79c6ae4..dd7c986106 100644
--- a/include/standard-headers/linux/input-event-codes.h
+++ b/include/standard-headers/linux/input-event-codes.h
@@ -643,6 +643,10 @@
 #define KEY_EPRIVACY_SCREEN_ON		0x252
 #define KEY_EPRIVACY_SCREEN_OFF		0x253
 
+#define KEY_ACTION_ON_SELECTION		0x254	/* AL Action on Selection (HUTRR119) */
+#define KEY_CONTEXTUAL_INSERT		0x255	/* AL Contextual Insertion (HUTRR119) */
+#define KEY_CONTEXTUAL_QUERY		0x256	/* AL Contextual Query (HUTRR119) */
+
 #define KEY_KBDINPUTASSIST_PREV		0x260
 #define KEY_KBDINPUTASSIST_NEXT		0x261
 #define KEY_KBDINPUTASSIST_PREVGROUP		0x262
@@ -891,6 +895,7 @@
 
 #define ABS_VOLUME		0x20
 #define ABS_PROFILE		0x21
+#define ABS_SND_PROFILE		0x22
 
 #define ABS_MISC		0x28
 
@@ -1000,4 +1005,12 @@
 #define SND_MAX			0x07
 #define SND_CNT			(SND_MAX+1)
 
+/*
+ * ABS_SND_PROFILE values
+ */
+
+#define SND_PROFILE_SILENT	0x00
+#define SND_PROFILE_VIBRATE	0x01
+#define SND_PROFILE_RING	0x02
+
 #endif
diff --git a/include/standard-headers/linux/pci_regs.h b/include/standard-headers/linux/pci_regs.h
index 3add74ae25..14f634ab93 100644
--- a/include/standard-headers/linux/pci_regs.h
+++ b/include/standard-headers/linux/pci_regs.h
@@ -132,6 +132,11 @@
 #define PCI_SECONDARY_BUS	0x19	/* Secondary bus number */
 #define PCI_SUBORDINATE_BUS	0x1a	/* Highest bus number behind the bridge */
 #define PCI_SEC_LATENCY_TIMER	0x1b	/* Latency timer for secondary interface */
+/* Masks for dword-sized processing of Bus Number and Sec Latency Timer fields */
+#define  PCI_PRIMARY_BUS_MASK		0x000000ff
+#define  PCI_SECONDARY_BUS_MASK		0x0000ff00
+#define  PCI_SUBORDINATE_BUS_MASK	0x00ff0000
+#define  PCI_SEC_LATENCY_TIMER_MASK	0xff000000
 #define PCI_IO_BASE		0x1c	/* I/O range behind the bridge */
 #define PCI_IO_LIMIT		0x1d
 #define  PCI_IO_RANGE_TYPE_MASK	0x0fUL	/* I/O bridging type */
@@ -707,7 +712,7 @@
 #define  PCI_EXP_LNKCTL2_HASD		0x0020 /* HW Autonomous Speed Disable */
 #define PCI_EXP_LNKSTA2		0x32	/* Link Status 2 */
 #define  PCI_EXP_LNKSTA2_FLIT		0x0400 /* Flit Mode Status */
-#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2	0x32	/* end of v2 EPs w/ link */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2	0x34	/* end of v2 EPs w/ link */
 #define PCI_EXP_SLTCAP2		0x34	/* Slot Capabilities 2 */
 #define  PCI_EXP_SLTCAP2_IBPD	0x00000001 /* In-band PD Disable Supported */
 #define PCI_EXP_SLTCTL2		0x38	/* Slot Control 2 */
@@ -1253,11 +1258,6 @@
 #define PCI_DEV3_STA		0x0c	/* Device 3 Status Register */
 #define  PCI_DEV3_STA_SEGMENT	0x8	/* Segment Captured (end-to-end flit-mode detected) */
 
-/* Compute Express Link (CXL r3.1, sec 8.1.5) */
-#define PCI_DVSEC_CXL_PORT				3
-#define PCI_DVSEC_CXL_PORT_CTL				0x0c
-#define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
-
 /* Integrity and Data Encryption Extended Capability */
 #define PCI_IDE_CAP			0x04
 #define  PCI_IDE_CAP_LINK		0x1  /* Link IDE Stream Supported */
@@ -1338,4 +1338,63 @@
 #define  PCI_IDE_SEL_ADDR_3(x)		(28 + (x) * PCI_IDE_SEL_ADDR_BLOCK_SIZE)
 #define PCI_IDE_SEL_BLOCK_SIZE(nr_assoc)  (20 + PCI_IDE_SEL_ADDR_BLOCK_SIZE * (nr_assoc))
 
+/*
+ * Compute Express Link (CXL r4.0, sec 8.1)
+ *
+ * Note that CXL DVSEC id 3 and 7 to be ignored when the CXL link state
+ * is "disconnected" (CXL r4.0, sec 9.12.3). Re-enumerate these
+ * registers on downstream link-up events.
+ */
+
+/* CXL r4.0, 8.1.3: PCIe DVSEC for CXL Device */
+#define PCI_DVSEC_CXL_DEVICE				0
+#define  PCI_DVSEC_CXL_CAP				0xA
+#define   PCI_DVSEC_CXL_MEM_CAPABLE			_BITUL(2)
+#define   PCI_DVSEC_CXL_HDM_COUNT			__GENMASK(5, 4)
+#define  PCI_DVSEC_CXL_CTRL				0xC
+#define   PCI_DVSEC_CXL_MEM_ENABLE			_BITUL(2)
+#define  PCI_DVSEC_CXL_RANGE_SIZE_HIGH(i)		(0x18 + (i * 0x10))
+#define  PCI_DVSEC_CXL_RANGE_SIZE_LOW(i)		(0x1C + (i * 0x10))
+#define   PCI_DVSEC_CXL_MEM_INFO_VALID			_BITUL(0)
+#define   PCI_DVSEC_CXL_MEM_ACTIVE			_BITUL(1)
+#define   PCI_DVSEC_CXL_MEM_SIZE_LOW			__GENMASK(31, 28)
+#define  PCI_DVSEC_CXL_RANGE_BASE_HIGH(i)		(0x20 + (i * 0x10))
+#define  PCI_DVSEC_CXL_RANGE_BASE_LOW(i)		(0x24 + (i * 0x10))
+#define   PCI_DVSEC_CXL_MEM_BASE_LOW			__GENMASK(31, 28)
+
+#define CXL_DVSEC_RANGE_MAX				2
+
+/* CXL r4.0, 8.1.4: Non-CXL Function Map DVSEC */
+#define PCI_DVSEC_CXL_FUNCTION_MAP			2
+
+/* CXL r4.0, 8.1.5: Extensions DVSEC for Ports */
+#define PCI_DVSEC_CXL_PORT				3
+#define  PCI_DVSEC_CXL_PORT_CTL				0x0c
+#define   PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
+
+/* CXL r4.0, 8.1.6: GPF DVSEC for CXL Port */
+#define PCI_DVSEC_CXL_PORT_GPF				4
+#define  PCI_DVSEC_CXL_PORT_GPF_PHASE_1_CONTROL		0x0C
+#define   PCI_DVSEC_CXL_PORT_GPF_PHASE_1_TMO_BASE	__GENMASK(3, 0)
+#define   PCI_DVSEC_CXL_PORT_GPF_PHASE_1_TMO_SCALE	__GENMASK(11, 8)
+#define  PCI_DVSEC_CXL_PORT_GPF_PHASE_2_CONTROL		0xE
+#define   PCI_DVSEC_CXL_PORT_GPF_PHASE_2_TMO_BASE	__GENMASK(3, 0)
+#define   PCI_DVSEC_CXL_PORT_GPF_PHASE_2_TMO_SCALE	__GENMASK(11, 8)
+
+/* CXL r4.0, 8.1.7: GPF DVSEC for CXL Device */
+#define PCI_DVSEC_CXL_DEVICE_GPF			5
+
+/* CXL r4.0, 8.1.8: Flex Bus DVSEC */
+#define PCI_DVSEC_CXL_FLEXBUS_PORT			7
+#define  PCI_DVSEC_CXL_FLEXBUS_PORT_STATUS		0xE
+#define   PCI_DVSEC_CXL_FLEXBUS_PORT_STATUS_CACHE	_BITUL(0)
+#define   PCI_DVSEC_CXL_FLEXBUS_PORT_STATUS_MEM		_BITUL(2)
+
+/* CXL r4.0, 8.1.9: Register Locator DVSEC */
+#define PCI_DVSEC_CXL_REG_LOCATOR			8
+#define  PCI_DVSEC_CXL_REG_LOCATOR_BLOCK1		0xC
+#define   PCI_DVSEC_CXL_REG_LOCATOR_BIR			__GENMASK(2, 0)
+#define   PCI_DVSEC_CXL_REG_LOCATOR_BLOCK_ID		__GENMASK(15, 8)
+#define   PCI_DVSEC_CXL_REG_LOCATOR_BLOCK_OFF_LOW	__GENMASK(31, 16)
+
 #endif /* LINUX_PCI_REGS_H */
diff --git a/include/standard-headers/linux/typelimits.h b/include/standard-headers/linux/typelimits.h
new file mode 100644
index 0000000000..1304520082
--- /dev/null
+++ b/include/standard-headers/linux/typelimits.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_TYPELIMITS_H
+#define _LINUX_TYPELIMITS_H
+
+#define __KERNEL_INT_MAX ((int)(~0U >> 1))
+#define __KERNEL_INT_MIN (-__KERNEL_INT_MAX - 1)
+
+#endif /* _LINUX_TYPELIMITS_H */
diff --git a/include/standard-headers/linux/virtio_ring.h b/include/standard-headers/linux/virtio_ring.h
index 22f6eb8ca7..a0f73a1c7b 100644
--- a/include/standard-headers/linux/virtio_ring.h
+++ b/include/standard-headers/linux/virtio_ring.h
@@ -1,5 +1,7 @@
 #ifndef _LINUX_VIRTIO_RING_H
 #define _LINUX_VIRTIO_RING_H
+
+#define VIRTIO_RING_NO_LEGACY
 /* An interface for efficient virtio implementation, currently for use by KVM,
  * but hopefully others soon.  Do NOT change this since it will
  * break existing servers and clients.
@@ -31,7 +33,6 @@
  * SUCH DAMAGE.
  *
  * Copyright Rusty Russell IBM Corporation 2007. */
-#include <stdint.h>
 #include "standard-headers/linux/types.h"
 #include "standard-headers/linux/virtio_types.h"
 
@@ -200,7 +201,7 @@ static inline void vring_init(struct vring *vr, unsigned int num, void *p,
 	vr->num = num;
 	vr->desc = p;
 	vr->avail = (struct vring_avail *)((char *)p + num * sizeof(struct vring_desc));
-	vr->used = (void *)(((uintptr_t)&vr->avail->ring[num] + sizeof(__virtio16)
+	vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + sizeof(__virtio16)
 		+ align-1) & ~(align - 1));
 }
 
diff --git a/include/standard-headers/linux/virtio_rtc.h b/include/standard-headers/linux/virtio_rtc.h
new file mode 100644
index 0000000000..7e2c21ebff
--- /dev/null
+++ b/include/standard-headers/linux/virtio_rtc.h
@@ -0,0 +1,237 @@
+/* SPDX-License-Identifier: ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) */
+/*
+ * Copyright (C) 2022-2024 OpenSynergy GmbH
+ * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
+ */
+
+#ifndef _LINUX_VIRTIO_RTC_H
+#define _LINUX_VIRTIO_RTC_H
+
+#include "standard-headers/linux/types.h"
+
+/* alarm feature */
+#define VIRTIO_RTC_F_ALARM	0
+
+/* read request message types */
+
+#define VIRTIO_RTC_REQ_READ			0x0001
+#define VIRTIO_RTC_REQ_READ_CROSS		0x0002
+
+/* control request message types */
+
+#define VIRTIO_RTC_REQ_CFG			0x1000
+#define VIRTIO_RTC_REQ_CLOCK_CAP		0x1001
+#define VIRTIO_RTC_REQ_CROSS_CAP		0x1002
+#define VIRTIO_RTC_REQ_READ_ALARM		0x1003
+#define VIRTIO_RTC_REQ_SET_ALARM		0x1004
+#define VIRTIO_RTC_REQ_SET_ALARM_ENABLED	0x1005
+
+/* alarmq message types */
+
+#define VIRTIO_RTC_NOTIF_ALARM			0x2000
+
+/* Message headers */
+
+/** common request header */
+struct virtio_rtc_req_head {
+	uint16_t msg_type;
+	uint8_t reserved[6];
+};
+
+/** common response header */
+struct virtio_rtc_resp_head {
+#define VIRTIO_RTC_S_OK			0
+#define VIRTIO_RTC_S_EOPNOTSUPP		2
+#define VIRTIO_RTC_S_ENODEV		3
+#define VIRTIO_RTC_S_EINVAL		4
+#define VIRTIO_RTC_S_EIO		5
+	uint8_t status;
+	uint8_t reserved[7];
+};
+
+/** common notification header */
+struct virtio_rtc_notif_head {
+	uint16_t msg_type;
+	uint8_t reserved[6];
+};
+
+/* read requests */
+
+/* VIRTIO_RTC_REQ_READ message */
+
+struct virtio_rtc_req_read {
+	struct virtio_rtc_req_head head;
+	uint16_t clock_id;
+	uint8_t reserved[6];
+};
+
+struct virtio_rtc_resp_read {
+	struct virtio_rtc_resp_head head;
+	uint64_t clock_reading;
+};
+
+/* VIRTIO_RTC_REQ_READ_CROSS message */
+
+struct virtio_rtc_req_read_cross {
+	struct virtio_rtc_req_head head;
+	uint16_t clock_id;
+/* Arm Generic Timer Counter-timer Virtual Count Register (CNTVCT_EL0) */
+#define VIRTIO_RTC_COUNTER_ARM_VCT	0
+/* x86 Time-Stamp Counter */
+#define VIRTIO_RTC_COUNTER_X86_TSC	1
+/* Invalid */
+#define VIRTIO_RTC_COUNTER_INVALID	0xFF
+	uint8_t hw_counter;
+	uint8_t reserved[5];
+};
+
+struct virtio_rtc_resp_read_cross {
+	struct virtio_rtc_resp_head head;
+	uint64_t clock_reading;
+	uint64_t counter_cycles;
+};
+
+/* control requests */
+
+/* VIRTIO_RTC_REQ_CFG message */
+
+struct virtio_rtc_req_cfg {
+	struct virtio_rtc_req_head head;
+	/* no request params */
+};
+
+struct virtio_rtc_resp_cfg {
+	struct virtio_rtc_resp_head head;
+	/** # of clocks -> clock ids < num_clocks are valid */
+	uint16_t num_clocks;
+	uint8_t reserved[6];
+};
+
+/* VIRTIO_RTC_REQ_CLOCK_CAP message */
+
+struct virtio_rtc_req_clock_cap {
+	struct virtio_rtc_req_head head;
+	uint16_t clock_id;
+	uint8_t reserved[6];
+};
+
+struct virtio_rtc_resp_clock_cap {
+	struct virtio_rtc_resp_head head;
+#define VIRTIO_RTC_CLOCK_UTC			0
+#define VIRTIO_RTC_CLOCK_TAI			1
+#define VIRTIO_RTC_CLOCK_MONOTONIC		2
+#define VIRTIO_RTC_CLOCK_UTC_SMEARED		3
+#define VIRTIO_RTC_CLOCK_UTC_MAYBE_SMEARED	4
+	uint8_t type;
+#define VIRTIO_RTC_SMEAR_UNSPECIFIED	0
+#define VIRTIO_RTC_SMEAR_NOON_LINEAR	1
+#define VIRTIO_RTC_SMEAR_UTC_SLS	2
+	uint8_t leap_second_smearing;
+#define VIRTIO_RTC_FLAG_ALARM_CAP		(1 << 0)
+	uint8_t flags;
+	uint8_t reserved[5];
+};
+
+/* VIRTIO_RTC_REQ_CROSS_CAP message */
+
+struct virtio_rtc_req_cross_cap {
+	struct virtio_rtc_req_head head;
+	uint16_t clock_id;
+	uint8_t hw_counter;
+	uint8_t reserved[5];
+};
+
+struct virtio_rtc_resp_cross_cap {
+	struct virtio_rtc_resp_head head;
+#define VIRTIO_RTC_FLAG_CROSS_CAP	(1 << 0)
+	uint8_t flags;
+	uint8_t reserved[7];
+};
+
+/* VIRTIO_RTC_REQ_READ_ALARM message */
+
+struct virtio_rtc_req_read_alarm {
+	struct virtio_rtc_req_head head;
+	uint16_t clock_id;
+	uint8_t reserved[6];
+};
+
+struct virtio_rtc_resp_read_alarm {
+	struct virtio_rtc_resp_head head;
+	uint64_t alarm_time;
+#define VIRTIO_RTC_FLAG_ALARM_ENABLED	(1 << 0)
+	uint8_t flags;
+	uint8_t reserved[7];
+};
+
+/* VIRTIO_RTC_REQ_SET_ALARM message */
+
+struct virtio_rtc_req_set_alarm {
+	struct virtio_rtc_req_head head;
+	uint64_t alarm_time;
+	uint16_t clock_id;
+	/* flag VIRTIO_RTC_FLAG_ALARM_ENABLED */
+	uint8_t flags;
+	uint8_t reserved[5];
+};
+
+struct virtio_rtc_resp_set_alarm {
+	struct virtio_rtc_resp_head head;
+	/* no response params */
+};
+
+/* VIRTIO_RTC_REQ_SET_ALARM_ENABLED message */
+
+struct virtio_rtc_req_set_alarm_enabled {
+	struct virtio_rtc_req_head head;
+	uint16_t clock_id;
+	/* flag VIRTIO_RTC_ALARM_ENABLED */
+	uint8_t flags;
+	uint8_t reserved[5];
+};
+
+struct virtio_rtc_resp_set_alarm_enabled {
+	struct virtio_rtc_resp_head head;
+	/* no response params */
+};
+
+/** Union of request types for requestq */
+union virtio_rtc_req_requestq {
+	struct virtio_rtc_req_read read;
+	struct virtio_rtc_req_read_cross read_cross;
+	struct virtio_rtc_req_cfg cfg;
+	struct virtio_rtc_req_clock_cap clock_cap;
+	struct virtio_rtc_req_cross_cap cross_cap;
+	struct virtio_rtc_req_read_alarm read_alarm;
+	struct virtio_rtc_req_set_alarm set_alarm;
+	struct virtio_rtc_req_set_alarm_enabled set_alarm_enabled;
+};
+
+/** Union of response types for requestq */
+union virtio_rtc_resp_requestq {
+	struct virtio_rtc_resp_read read;
+	struct virtio_rtc_resp_read_cross read_cross;
+	struct virtio_rtc_resp_cfg cfg;
+	struct virtio_rtc_resp_clock_cap clock_cap;
+	struct virtio_rtc_resp_cross_cap cross_cap;
+	struct virtio_rtc_resp_read_alarm read_alarm;
+	struct virtio_rtc_resp_set_alarm set_alarm;
+	struct virtio_rtc_resp_set_alarm_enabled set_alarm_enabled;
+};
+
+/* alarmq notifications */
+
+/* VIRTIO_RTC_NOTIF_ALARM notification */
+
+struct virtio_rtc_notif_alarm {
+	struct virtio_rtc_notif_head head;
+	uint16_t clock_id;
+	uint8_t reserved[6];
+};
+
+/** Union of notification types for alarmq */
+union virtio_rtc_notif_alarmq {
+	struct virtio_rtc_notif_alarm alarm;
+};
+
+#endif /* _LINUX_VIRTIO_RTC_H */
diff --git a/include/standard-headers/linux/vmclock-abi.h b/include/standard-headers/linux/vmclock-abi.h
index 15b0316cb4..fe824badc0 100644
--- a/include/standard-headers/linux/vmclock-abi.h
+++ b/include/standard-headers/linux/vmclock-abi.h
@@ -115,6 +115,17 @@ struct vmclock_abi {
 	 * bit again after the update, using the about-to-be-valid fields.
 	 */
 #define VMCLOCK_FLAG_TIME_MONOTONIC		(1 << 7)
+	/*
+	 * If the VM_GEN_COUNTER_PRESENT flag is set, the hypervisor will
+	 * bump the vm_generation_counter field every time the guest is
+	 * loaded from some save state (restored from a snapshot).
+	 */
+#define VMCLOCK_FLAG_VM_GEN_COUNTER_PRESENT     (1 << 8)
+	/*
+	 * If the NOTIFICATION_PRESENT flag is set, the hypervisor will send
+	 * a notification every time it updates seq_count to a new even number.
+	 */
+#define VMCLOCK_FLAG_NOTIFICATION_PRESENT       (1 << 9)
 
 	uint8_t pad[2];
 	uint8_t clock_status;
@@ -177,6 +188,15 @@ struct vmclock_abi {
 	uint64_t time_frac_sec;		/* Units of 1/2^64 of a second */
 	uint64_t time_esterror_nanosec;
 	uint64_t time_maxerror_nanosec;
+
+	/*
+	 * This field changes to another non-repeating value when the guest
+	 * has been loaded from a snapshot. In addition to handling a
+	 * disruption in time (which will also be signalled through the
+	 * disruption_marker field), a guest may wish to discard UUIDs,
+	 * reset network connections, reseed entropy, etc.
+	 */
+	uint64_t vm_generation_counter;
 };
 
 #endif /*  __VMCLOCK_ABI_H__ */
diff --git a/linux-headers/asm-arm64/kvm.h b/linux-headers/asm-arm64/kvm.h
index 46ffbddab5..6aefe79738 100644
--- a/linux-headers/asm-arm64/kvm.h
+++ b/linux-headers/asm-arm64/kvm.h
@@ -416,6 +416,7 @@ enum {
 #define   KVM_DEV_ARM_ITS_RESTORE_TABLES        2
 #define   KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES	3
 #define   KVM_DEV_ARM_ITS_CTRL_RESET		4
+#define   KVM_DEV_ARM_VGIC_USERSPACE_PPIS	5
 
 /* Device Control API on vcpu fd */
 #define KVM_ARM_VCPU_PMU_V3_CTRL	0
diff --git a/linux-headers/asm-arm64/unistd_64.h b/linux-headers/asm-arm64/unistd_64.h
index 1ef9c40813..70b3754a42 100644
--- a/linux-headers/asm-arm64/unistd_64.h
+++ b/linux-headers/asm-arm64/unistd_64.h
@@ -327,6 +327,7 @@
 #define __NR_file_getattr 468
 #define __NR_file_setattr 469
 #define __NR_listns 470
+#define __NR_rseq_slice_yield 471
 
 
 #endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-generic/unistd.h b/linux-headers/asm-generic/unistd.h
index 942370b3f5..a627acc8fb 100644
--- a/linux-headers/asm-generic/unistd.h
+++ b/linux-headers/asm-generic/unistd.h
@@ -860,8 +860,11 @@ __SYSCALL(__NR_file_setattr, sys_file_setattr)
 #define __NR_listns 470
 __SYSCALL(__NR_listns, sys_listns)
 
+#define __NR_rseq_slice_yield 471
+__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+
 #undef __NR_syscalls
-#define __NR_syscalls 471
+#define __NR_syscalls 472
 
 /*
  * 32 bit systems traditionally used different
diff --git a/linux-headers/asm-loongarch/kvm.h b/linux-headers/asm-loongarch/kvm.h
index de6c3f18e4..cd0b5c11ca 100644
--- a/linux-headers/asm-loongarch/kvm.h
+++ b/linux-headers/asm-loongarch/kvm.h
@@ -105,6 +105,7 @@ struct kvm_fpu {
 #define  KVM_LOONGARCH_VM_FEAT_PV_STEALTIME	7
 #define  KVM_LOONGARCH_VM_FEAT_PTW		8
 #define  KVM_LOONGARCH_VM_FEAT_MSGINT		9
+#define  KVM_LOONGARCH_VM_FEAT_PV_PREEMPT	10
 
 /* Device Control API on vcpu fd */
 #define KVM_LOONGARCH_VCPU_CPUCFG	0
@@ -154,4 +155,8 @@ struct kvm_iocsr_entry {
 #define KVM_DEV_LOONGARCH_PCH_PIC_GRP_CTRL	        0x40000006
 #define KVM_DEV_LOONGARCH_PCH_PIC_CTRL_INIT	        0
 
+#define KVM_DEV_LOONGARCH_DMSINTC_GRP_CTRL		0x40000007
+#define KVM_DEV_LOONGARCH_DMSINTC_MSG_ADDR_BASE		0x0
+#define KVM_DEV_LOONGARCH_DMSINTC_MSG_ADDR_SIZE		0x1
+
 #endif /* __UAPI_ASM_LOONGARCH_KVM_H */
diff --git a/linux-headers/asm-loongarch/kvm_para.h b/linux-headers/asm-loongarch/kvm_para.h
index fd7f40713d..3fd87a096b 100644
--- a/linux-headers/asm-loongarch/kvm_para.h
+++ b/linux-headers/asm-loongarch/kvm_para.h
@@ -15,6 +15,7 @@
 #define CPUCFG_KVM_FEATURE		(CPUCFG_KVM_BASE + 4)
 #define  KVM_FEATURE_IPI		1
 #define  KVM_FEATURE_STEAL_TIME		2
+#define  KVM_FEATURE_PREEMPT		3
 /* BIT 24 - 31 are features configurable by user space vmm */
 #define  KVM_FEATURE_VIRT_EXTIOI	24
 #define  KVM_FEATURE_USER_HCALL		25
diff --git a/linux-headers/asm-loongarch/unistd_64.h b/linux-headers/asm-loongarch/unistd_64.h
index aa5daac4ef..3a29d86e1d 100644
--- a/linux-headers/asm-loongarch/unistd_64.h
+++ b/linux-headers/asm-loongarch/unistd_64.h
@@ -300,6 +300,7 @@
 #define __NR_landlock_create_ruleset 444
 #define __NR_landlock_add_rule 445
 #define __NR_landlock_restrict_self 446
+#define __NR_memfd_secret 447
 #define __NR_process_mrelease 448
 #define __NR_futex_waitv 449
 #define __NR_set_mempolicy_home_node 450
@@ -323,6 +324,7 @@
 #define __NR_file_getattr 468
 #define __NR_file_setattr 469
 #define __NR_listns 470
+#define __NR_rseq_slice_yield 471
 
 
 #endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-mips/unistd_n32.h b/linux-headers/asm-mips/unistd_n32.h
index a33d106dca..5fa1ee0cb4 100644
--- a/linux-headers/asm-mips/unistd_n32.h
+++ b/linux-headers/asm-mips/unistd_n32.h
@@ -399,5 +399,6 @@
 #define __NR_file_getattr (__NR_Linux + 468)
 #define __NR_file_setattr (__NR_Linux + 469)
 #define __NR_listns (__NR_Linux + 470)
+#define __NR_rseq_slice_yield (__NR_Linux + 471)
 
 #endif /* _ASM_UNISTD_N32_H */
diff --git a/linux-headers/asm-mips/unistd_n64.h b/linux-headers/asm-mips/unistd_n64.h
index 1bc251e450..e1f873d83a 100644
--- a/linux-headers/asm-mips/unistd_n64.h
+++ b/linux-headers/asm-mips/unistd_n64.h
@@ -375,5 +375,6 @@
 #define __NR_file_getattr (__NR_Linux + 468)
 #define __NR_file_setattr (__NR_Linux + 469)
 #define __NR_listns (__NR_Linux + 470)
+#define __NR_rseq_slice_yield (__NR_Linux + 471)
 
 #endif /* _ASM_UNISTD_N64_H */
diff --git a/linux-headers/asm-mips/unistd_o32.h b/linux-headers/asm-mips/unistd_o32.h
index c57175d496..8207e9ca4f 100644
--- a/linux-headers/asm-mips/unistd_o32.h
+++ b/linux-headers/asm-mips/unistd_o32.h
@@ -445,5 +445,6 @@
 #define __NR_file_getattr (__NR_Linux + 468)
 #define __NR_file_setattr (__NR_Linux + 469)
 #define __NR_listns (__NR_Linux + 470)
+#define __NR_rseq_slice_yield (__NR_Linux + 471)
 
 #endif /* _ASM_UNISTD_O32_H */
diff --git a/linux-headers/asm-powerpc/unistd_32.h b/linux-headers/asm-powerpc/unistd_32.h
index a3f4aa2fe2..1f63360120 100644
--- a/linux-headers/asm-powerpc/unistd_32.h
+++ b/linux-headers/asm-powerpc/unistd_32.h
@@ -452,6 +452,7 @@
 #define __NR_file_getattr 468
 #define __NR_file_setattr 469
 #define __NR_listns 470
+#define __NR_rseq_slice_yield 471
 
 
 #endif /* _ASM_UNISTD_32_H */
diff --git a/linux-headers/asm-powerpc/unistd_64.h b/linux-headers/asm-powerpc/unistd_64.h
index d4444557f1..87439c53c1 100644
--- a/linux-headers/asm-powerpc/unistd_64.h
+++ b/linux-headers/asm-powerpc/unistd_64.h
@@ -424,6 +424,7 @@
 #define __NR_file_getattr 468
 #define __NR_file_setattr 469
 #define __NR_listns 470
+#define __NR_rseq_slice_yield 471
 
 
 #endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-riscv/kvm.h b/linux-headers/asm-riscv/kvm.h
index 54f3ad7ed2..504e733053 100644
--- a/linux-headers/asm-riscv/kvm.h
+++ b/linux-headers/asm-riscv/kvm.h
@@ -110,6 +110,10 @@ struct kvm_riscv_timer {
 	__u64 state;
 };
 
+/* Possible states for kvm_riscv_timer */
+#define KVM_RISCV_TIMER_STATE_OFF	0
+#define KVM_RISCV_TIMER_STATE_ON	1
+
 /*
  * ISA extension IDs specific to KVM. This is not the same as the host ISA
  * extension IDs as that is internal to the host and should not be exposed
@@ -192,6 +196,9 @@ enum KVM_RISCV_ISA_EXT_ID {
 	KVM_RISCV_ISA_EXT_ZFBFMIN,
 	KVM_RISCV_ISA_EXT_ZVFBFMIN,
 	KVM_RISCV_ISA_EXT_ZVFBFWMA,
+	KVM_RISCV_ISA_EXT_ZCLSD,
+	KVM_RISCV_ISA_EXT_ZILSD,
+	KVM_RISCV_ISA_EXT_ZALASR,
 	KVM_RISCV_ISA_EXT_MAX,
 };
 
@@ -235,10 +242,6 @@ struct kvm_riscv_sbi_fwft {
 	struct kvm_riscv_sbi_fwft_feature pointer_masking;
 };
 
-/* Possible states for kvm_riscv_timer */
-#define KVM_RISCV_TIMER_STATE_OFF	0
-#define KVM_RISCV_TIMER_STATE_ON	1
-
 /* If you need to interpret the index values, here is the key: */
 #define KVM_REG_RISCV_TYPE_MASK		0x00000000FF000000
 #define KVM_REG_RISCV_TYPE_SHIFT	24
diff --git a/linux-headers/asm-riscv/ptrace.h b/linux-headers/asm-riscv/ptrace.h
index a3f8211ede..cf87642994 100644
--- a/linux-headers/asm-riscv/ptrace.h
+++ b/linux-headers/asm-riscv/ptrace.h
@@ -9,6 +9,7 @@
 #ifndef __ASSEMBLER__
 
 #include <linux/types.h>
+#include <linux/const.h>
 
 #define PTRACE_GETFDPIC		33
 
@@ -127,6 +128,42 @@ struct __riscv_v_regset_state {
  */
 #define RISCV_MAX_VLENB (8192)
 
+struct __sc_riscv_cfi_state {
+	unsigned long ss_ptr;   /* shadow stack pointer */
+};
+
+#define PTRACE_CFI_BRANCH_LANDING_PAD_EN_BIT		0
+#define PTRACE_CFI_BRANCH_LANDING_PAD_LOCK_BIT		1
+#define PTRACE_CFI_BRANCH_EXPECTED_LANDING_PAD_BIT	2
+#define PTRACE_CFI_SHADOW_STACK_EN_BIT			3
+#define PTRACE_CFI_SHADOW_STACK_LOCK_BIT		4
+#define PTRACE_CFI_SHADOW_STACK_PTR_BIT			5
+
+#define PTRACE_CFI_BRANCH_LANDING_PAD_EN_STATE		_BITUL(PTRACE_CFI_BRANCH_LANDING_PAD_EN_BIT)
+#define PTRACE_CFI_BRANCH_LANDING_PAD_LOCK_STATE	\
+	_BITUL(PTRACE_CFI_BRANCH_LANDING_PAD_LOCK_BIT)
+#define PTRACE_CFI_BRANCH_EXPECTED_LANDING_PAD_STATE	\
+	_BITUL(PTRACE_CFI_BRANCH_EXPECTED_LANDING_PAD_BIT)
+#define PTRACE_CFI_SHADOW_STACK_EN_STATE		_BITUL(PTRACE_CFI_SHADOW_STACK_EN_BIT)
+#define PTRACE_CFI_SHADOW_STACK_LOCK_STATE		_BITUL(PTRACE_CFI_SHADOW_STACK_LOCK_BIT)
+#define PTRACE_CFI_SHADOW_STACK_PTR_STATE		_BITUL(PTRACE_CFI_SHADOW_STACK_PTR_BIT)
+
+#define PTRACE_CFI_STATE_INVALID_MASK	~(PTRACE_CFI_BRANCH_LANDING_PAD_EN_STATE | \
+					  PTRACE_CFI_BRANCH_LANDING_PAD_LOCK_STATE | \
+					  PTRACE_CFI_BRANCH_EXPECTED_LANDING_PAD_STATE | \
+					  PTRACE_CFI_SHADOW_STACK_EN_STATE | \
+					  PTRACE_CFI_SHADOW_STACK_LOCK_STATE | \
+					  PTRACE_CFI_SHADOW_STACK_PTR_STATE)
+
+struct __cfi_status {
+	__u64 cfi_state;
+};
+
+struct user_cfi_state {
+	struct __cfi_status	cfi_status;
+	__u64 shstk_ptr;
+};
+
 #endif /* __ASSEMBLER__ */
 
 #endif /* _ASM_RISCV_PTRACE_H */
diff --git a/linux-headers/asm-riscv/unistd_32.h b/linux-headers/asm-riscv/unistd_32.h
index 9f33956246..828f3c2b9d 100644
--- a/linux-headers/asm-riscv/unistd_32.h
+++ b/linux-headers/asm-riscv/unistd_32.h
@@ -318,6 +318,7 @@
 #define __NR_file_getattr 468
 #define __NR_file_setattr 469
 #define __NR_listns 470
+#define __NR_rseq_slice_yield 471
 
 
 #endif /* _ASM_UNISTD_32_H */
diff --git a/linux-headers/asm-riscv/unistd_64.h b/linux-headers/asm-riscv/unistd_64.h
index c2e7258916..8fa59835a3 100644
--- a/linux-headers/asm-riscv/unistd_64.h
+++ b/linux-headers/asm-riscv/unistd_64.h
@@ -328,6 +328,7 @@
 #define __NR_file_getattr 468
 #define __NR_file_setattr 469
 #define __NR_listns 470
+#define __NR_rseq_slice_yield 471
 
 
 #endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-s390/unistd_32.h b/linux-headers/asm-s390/unistd_32.h
deleted file mode 100644
index 37b8f6f358..0000000000
--- a/linux-headers/asm-s390/unistd_32.h
+++ /dev/null
@@ -1,446 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-#ifndef _ASM_S390_UNISTD_32_H
-#define _ASM_S390_UNISTD_32_H
-
-#define __NR_exit 1
-#define __NR_fork 2
-#define __NR_read 3
-#define __NR_write 4
-#define __NR_open 5
-#define __NR_close 6
-#define __NR_restart_syscall 7
-#define __NR_creat 8
-#define __NR_link 9
-#define __NR_unlink 10
-#define __NR_execve 11
-#define __NR_chdir 12
-#define __NR_time 13
-#define __NR_mknod 14
-#define __NR_chmod 15
-#define __NR_lchown 16
-#define __NR_lseek 19
-#define __NR_getpid 20
-#define __NR_mount 21
-#define __NR_umount 22
-#define __NR_setuid 23
-#define __NR_getuid 24
-#define __NR_stime 25
-#define __NR_ptrace 26
-#define __NR_alarm 27
-#define __NR_pause 29
-#define __NR_utime 30
-#define __NR_access 33
-#define __NR_nice 34
-#define __NR_sync 36
-#define __NR_kill 37
-#define __NR_rename 38
-#define __NR_mkdir 39
-#define __NR_rmdir 40
-#define __NR_dup 41
-#define __NR_pipe 42
-#define __NR_times 43
-#define __NR_brk 45
-#define __NR_setgid 46
-#define __NR_getgid 47
-#define __NR_signal 48
-#define __NR_geteuid 49
-#define __NR_getegid 50
-#define __NR_acct 51
-#define __NR_umount2 52
-#define __NR_ioctl 54
-#define __NR_fcntl 55
-#define __NR_setpgid 57
-#define __NR_umask 60
-#define __NR_chroot 61
-#define __NR_ustat 62
-#define __NR_dup2 63
-#define __NR_getppid 64
-#define __NR_getpgrp 65
-#define __NR_setsid 66
-#define __NR_sigaction 67
-#define __NR_setreuid 70
-#define __NR_setregid 71
-#define __NR_sigsuspend 72
-#define __NR_sigpending 73
-#define __NR_sethostname 74
-#define __NR_setrlimit 75
-#define __NR_getrlimit 76
-#define __NR_getrusage 77
-#define __NR_gettimeofday 78
-#define __NR_settimeofday 79
-#define __NR_getgroups 80
-#define __NR_setgroups 81
-#define __NR_symlink 83
-#define __NR_readlink 85
-#define __NR_uselib 86
-#define __NR_swapon 87
-#define __NR_reboot 88
-#define __NR_readdir 89
-#define __NR_mmap 90
-#define __NR_munmap 91
-#define __NR_truncate 92
-#define __NR_ftruncate 93
-#define __NR_fchmod 94
-#define __NR_fchown 95
-#define __NR_getpriority 96
-#define __NR_setpriority 97
-#define __NR_statfs 99
-#define __NR_fstatfs 100
-#define __NR_ioperm 101
-#define __NR_socketcall 102
-#define __NR_syslog 103
-#define __NR_setitimer 104
-#define __NR_getitimer 105
-#define __NR_stat 106
-#define __NR_lstat 107
-#define __NR_fstat 108
-#define __NR_lookup_dcookie 110
-#define __NR_vhangup 111
-#define __NR_idle 112
-#define __NR_wait4 114
-#define __NR_swapoff 115
-#define __NR_sysinfo 116
-#define __NR_ipc 117
-#define __NR_fsync 118
-#define __NR_sigreturn 119
-#define __NR_clone 120
-#define __NR_setdomainname 121
-#define __NR_uname 122
-#define __NR_adjtimex 124
-#define __NR_mprotect 125
-#define __NR_sigprocmask 126
-#define __NR_create_module 127
-#define __NR_init_module 128
-#define __NR_delete_module 129
-#define __NR_get_kernel_syms 130
-#define __NR_quotactl 131
-#define __NR_getpgid 132
-#define __NR_fchdir 133
-#define __NR_bdflush 134
-#define __NR_sysfs 135
-#define __NR_personality 136
-#define __NR_afs_syscall 137
-#define __NR_setfsuid 138
-#define __NR_setfsgid 139
-#define __NR__llseek 140
-#define __NR_getdents 141
-#define __NR__newselect 142
-#define __NR_flock 143
-#define __NR_msync 144
-#define __NR_readv 145
-#define __NR_writev 146
-#define __NR_getsid 147
-#define __NR_fdatasync 148
-#define __NR__sysctl 149
-#define __NR_mlock 150
-#define __NR_munlock 151
-#define __NR_mlockall 152
-#define __NR_munlockall 153
-#define __NR_sched_setparam 154
-#define __NR_sched_getparam 155
-#define __NR_sched_setscheduler 156
-#define __NR_sched_getscheduler 157
-#define __NR_sched_yield 158
-#define __NR_sched_get_priority_max 159
-#define __NR_sched_get_priority_min 160
-#define __NR_sched_rr_get_interval 161
-#define __NR_nanosleep 162
-#define __NR_mremap 163
-#define __NR_setresuid 164
-#define __NR_getresuid 165
-#define __NR_query_module 167
-#define __NR_poll 168
-#define __NR_nfsservctl 169
-#define __NR_setresgid 170
-#define __NR_getresgid 171
-#define __NR_prctl 172
-#define __NR_rt_sigreturn 173
-#define __NR_rt_sigaction 174
-#define __NR_rt_sigprocmask 175
-#define __NR_rt_sigpending 176
-#define __NR_rt_sigtimedwait 177
-#define __NR_rt_sigqueueinfo 178
-#define __NR_rt_sigsuspend 179
-#define __NR_pread64 180
-#define __NR_pwrite64 181
-#define __NR_chown 182
-#define __NR_getcwd 183
-#define __NR_capget 184
-#define __NR_capset 185
-#define __NR_sigaltstack 186
-#define __NR_sendfile 187
-#define __NR_getpmsg 188
-#define __NR_putpmsg 189
-#define __NR_vfork 190
-#define __NR_ugetrlimit 191
-#define __NR_mmap2 192
-#define __NR_truncate64 193
-#define __NR_ftruncate64 194
-#define __NR_stat64 195
-#define __NR_lstat64 196
-#define __NR_fstat64 197
-#define __NR_lchown32 198
-#define __NR_getuid32 199
-#define __NR_getgid32 200
-#define __NR_geteuid32 201
-#define __NR_getegid32 202
-#define __NR_setreuid32 203
-#define __NR_setregid32 204
-#define __NR_getgroups32 205
-#define __NR_setgroups32 206
-#define __NR_fchown32 207
-#define __NR_setresuid32 208
-#define __NR_getresuid32 209
-#define __NR_setresgid32 210
-#define __NR_getresgid32 211
-#define __NR_chown32 212
-#define __NR_setuid32 213
-#define __NR_setgid32 214
-#define __NR_setfsuid32 215
-#define __NR_setfsgid32 216
-#define __NR_pivot_root 217
-#define __NR_mincore 218
-#define __NR_madvise 219
-#define __NR_getdents64 220
-#define __NR_fcntl64 221
-#define __NR_readahead 222
-#define __NR_sendfile64 223
-#define __NR_setxattr 224
-#define __NR_lsetxattr 225
-#define __NR_fsetxattr 226
-#define __NR_getxattr 227
-#define __NR_lgetxattr 228
-#define __NR_fgetxattr 229
-#define __NR_listxattr 230
-#define __NR_llistxattr 231
-#define __NR_flistxattr 232
-#define __NR_removexattr 233
-#define __NR_lremovexattr 234
-#define __NR_fremovexattr 235
-#define __NR_gettid 236
-#define __NR_tkill 237
-#define __NR_futex 238
-#define __NR_sched_setaffinity 239
-#define __NR_sched_getaffinity 240
-#define __NR_tgkill 241
-#define __NR_io_setup 243
-#define __NR_io_destroy 244
-#define __NR_io_getevents 245
-#define __NR_io_submit 246
-#define __NR_io_cancel 247
-#define __NR_exit_group 248
-#define __NR_epoll_create 249
-#define __NR_epoll_ctl 250
-#define __NR_epoll_wait 251
-#define __NR_set_tid_address 252
-#define __NR_fadvise64 253
-#define __NR_timer_create 254
-#define __NR_timer_settime 255
-#define __NR_timer_gettime 256
-#define __NR_timer_getoverrun 257
-#define __NR_timer_delete 258
-#define __NR_clock_settime 259
-#define __NR_clock_gettime 260
-#define __NR_clock_getres 261
-#define __NR_clock_nanosleep 262
-#define __NR_fadvise64_64 264
-#define __NR_statfs64 265
-#define __NR_fstatfs64 266
-#define __NR_remap_file_pages 267
-#define __NR_mbind 268
-#define __NR_get_mempolicy 269
-#define __NR_set_mempolicy 270
-#define __NR_mq_open 271
-#define __NR_mq_unlink 272
-#define __NR_mq_timedsend 273
-#define __NR_mq_timedreceive 274
-#define __NR_mq_notify 275
-#define __NR_mq_getsetattr 276
-#define __NR_kexec_load 277
-#define __NR_add_key 278
-#define __NR_request_key 279
-#define __NR_keyctl 280
-#define __NR_waitid 281
-#define __NR_ioprio_set 282
-#define __NR_ioprio_get 283
-#define __NR_inotify_init 284
-#define __NR_inotify_add_watch 285
-#define __NR_inotify_rm_watch 286
-#define __NR_migrate_pages 287
-#define __NR_openat 288
-#define __NR_mkdirat 289
-#define __NR_mknodat 290
-#define __NR_fchownat 291
-#define __NR_futimesat 292
-#define __NR_fstatat64 293
-#define __NR_unlinkat 294
-#define __NR_renameat 295
-#define __NR_linkat 296
-#define __NR_symlinkat 297
-#define __NR_readlinkat 298
-#define __NR_fchmodat 299
-#define __NR_faccessat 300
-#define __NR_pselect6 301
-#define __NR_ppoll 302
-#define __NR_unshare 303
-#define __NR_set_robust_list 304
-#define __NR_get_robust_list 305
-#define __NR_splice 306
-#define __NR_sync_file_range 307
-#define __NR_tee 308
-#define __NR_vmsplice 309
-#define __NR_move_pages 310
-#define __NR_getcpu 311
-#define __NR_epoll_pwait 312
-#define __NR_utimes 313
-#define __NR_fallocate 314
-#define __NR_utimensat 315
-#define __NR_signalfd 316
-#define __NR_timerfd 317
-#define __NR_eventfd 318
-#define __NR_timerfd_create 319
-#define __NR_timerfd_settime 320
-#define __NR_timerfd_gettime 321
-#define __NR_signalfd4 322
-#define __NR_eventfd2 323
-#define __NR_inotify_init1 324
-#define __NR_pipe2 325
-#define __NR_dup3 326
-#define __NR_epoll_create1 327
-#define __NR_preadv 328
-#define __NR_pwritev 329
-#define __NR_rt_tgsigqueueinfo 330
-#define __NR_perf_event_open 331
-#define __NR_fanotify_init 332
-#define __NR_fanotify_mark 333
-#define __NR_prlimit64 334
-#define __NR_name_to_handle_at 335
-#define __NR_open_by_handle_at 336
-#define __NR_clock_adjtime 337
-#define __NR_syncfs 338
-#define __NR_setns 339
-#define __NR_process_vm_readv 340
-#define __NR_process_vm_writev 341
-#define __NR_s390_runtime_instr 342
-#define __NR_kcmp 343
-#define __NR_finit_module 344
-#define __NR_sched_setattr 345
-#define __NR_sched_getattr 346
-#define __NR_renameat2 347
-#define __NR_seccomp 348
-#define __NR_getrandom 349
-#define __NR_memfd_create 350
-#define __NR_bpf 351
-#define __NR_s390_pci_mmio_write 352
-#define __NR_s390_pci_mmio_read 353
-#define __NR_execveat 354
-#define __NR_userfaultfd 355
-#define __NR_membarrier 356
-#define __NR_recvmmsg 357
-#define __NR_sendmmsg 358
-#define __NR_socket 359
-#define __NR_socketpair 360
-#define __NR_bind 361
-#define __NR_connect 362
-#define __NR_listen 363
-#define __NR_accept4 364
-#define __NR_getsockopt 365
-#define __NR_setsockopt 366
-#define __NR_getsockname 367
-#define __NR_getpeername 368
-#define __NR_sendto 369
-#define __NR_sendmsg 370
-#define __NR_recvfrom 371
-#define __NR_recvmsg 372
-#define __NR_shutdown 373
-#define __NR_mlock2 374
-#define __NR_copy_file_range 375
-#define __NR_preadv2 376
-#define __NR_pwritev2 377
-#define __NR_s390_guarded_storage 378
-#define __NR_statx 379
-#define __NR_s390_sthyi 380
-#define __NR_kexec_file_load 381
-#define __NR_io_pgetevents 382
-#define __NR_rseq 383
-#define __NR_pkey_mprotect 384
-#define __NR_pkey_alloc 385
-#define __NR_pkey_free 386
-#define __NR_semget 393
-#define __NR_semctl 394
-#define __NR_shmget 395
-#define __NR_shmctl 396
-#define __NR_shmat 397
-#define __NR_shmdt 398
-#define __NR_msgget 399
-#define __NR_msgsnd 400
-#define __NR_msgrcv 401
-#define __NR_msgctl 402
-#define __NR_clock_gettime64 403
-#define __NR_clock_settime64 404
-#define __NR_clock_adjtime64 405
-#define __NR_clock_getres_time64 406
-#define __NR_clock_nanosleep_time64 407
-#define __NR_timer_gettime64 408
-#define __NR_timer_settime64 409
-#define __NR_timerfd_gettime64 410
-#define __NR_timerfd_settime64 411
-#define __NR_utimensat_time64 412
-#define __NR_pselect6_time64 413
-#define __NR_ppoll_time64 414
-#define __NR_io_pgetevents_time64 416
-#define __NR_recvmmsg_time64 417
-#define __NR_mq_timedsend_time64 418
-#define __NR_mq_timedreceive_time64 419
-#define __NR_semtimedop_time64 420
-#define __NR_rt_sigtimedwait_time64 421
-#define __NR_futex_time64 422
-#define __NR_sched_rr_get_interval_time64 423
-#define __NR_pidfd_send_signal 424
-#define __NR_io_uring_setup 425
-#define __NR_io_uring_enter 426
-#define __NR_io_uring_register 427
-#define __NR_open_tree 428
-#define __NR_move_mount 429
-#define __NR_fsopen 430
-#define __NR_fsconfig 431
-#define __NR_fsmount 432
-#define __NR_fspick 433
-#define __NR_pidfd_open 434
-#define __NR_clone3 435
-#define __NR_close_range 436
-#define __NR_openat2 437
-#define __NR_pidfd_getfd 438
-#define __NR_faccessat2 439
-#define __NR_process_madvise 440
-#define __NR_epoll_pwait2 441
-#define __NR_mount_setattr 442
-#define __NR_quotactl_fd 443
-#define __NR_landlock_create_ruleset 444
-#define __NR_landlock_add_rule 445
-#define __NR_landlock_restrict_self 446
-#define __NR_memfd_secret 447
-#define __NR_process_mrelease 448
-#define __NR_futex_waitv 449
-#define __NR_set_mempolicy_home_node 450
-#define __NR_cachestat 451
-#define __NR_fchmodat2 452
-#define __NR_map_shadow_stack 453
-#define __NR_futex_wake 454
-#define __NR_futex_wait 455
-#define __NR_futex_requeue 456
-#define __NR_statmount 457
-#define __NR_listmount 458
-#define __NR_lsm_get_self_attr 459
-#define __NR_lsm_set_self_attr 460
-#define __NR_lsm_list_modules 461
-#define __NR_mseal 462
-#define __NR_setxattrat 463
-#define __NR_getxattrat 464
-#define __NR_listxattrat 465
-#define __NR_removexattrat 466
-#define __NR_open_tree_attr 467
-#define __NR_file_getattr 468
-#define __NR_file_setattr 469
-
-#endif /* _ASM_S390_UNISTD_32_H */
diff --git a/linux-headers/asm-s390/unistd_64.h b/linux-headers/asm-s390/unistd_64.h
index 8d9e579ef5..01f674c1bc 100644
--- a/linux-headers/asm-s390/unistd_64.h
+++ b/linux-headers/asm-s390/unistd_64.h
@@ -390,6 +390,7 @@
 #define __NR_file_getattr 468
 #define __NR_file_setattr 469
 #define __NR_listns 470
+#define __NR_rseq_slice_yield 471
 
 
 #endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index b804fd25a2..01d46e2929 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -197,13 +197,13 @@ struct kvm_msrs {
 	__u32 nmsrs; /* number of msrs in entries */
 	__u32 pad;
 
-	struct kvm_msr_entry entries[];
+	__DECLARE_FLEX_ARRAY(struct kvm_msr_entry, entries);
 };
 
 /* for KVM_GET_MSR_INDEX_LIST */
 struct kvm_msr_list {
 	__u32 nmsrs; /* number of msrs in entries */
-	__u32 indices[];
+	__DECLARE_FLEX_ARRAY(__u32, indices);
 };
 
 /* Maximum size of any access bitmap in bytes */
@@ -243,7 +243,7 @@ struct kvm_cpuid_entry {
 struct kvm_cpuid {
 	__u32 nent;
 	__u32 padding;
-	struct kvm_cpuid_entry entries[];
+	__DECLARE_FLEX_ARRAY(struct kvm_cpuid_entry, entries);
 };
 
 struct kvm_cpuid_entry2 {
@@ -265,7 +265,7 @@ struct kvm_cpuid_entry2 {
 struct kvm_cpuid2 {
 	__u32 nent;
 	__u32 padding;
-	struct kvm_cpuid_entry2 entries[];
+	__DECLARE_FLEX_ARRAY(struct kvm_cpuid_entry2, entries);
 };
 
 /* for KVM_GET_PIT and KVM_SET_PIT */
@@ -396,7 +396,7 @@ struct kvm_xsave {
 	 * the contents of CPUID leaf 0xD on the host.
 	 */
 	__u32 region[1024];
-	__u32 extra[];
+	__DECLARE_FLEX_ARRAY(__u32, extra);
 };
 
 #define KVM_MAX_XCRS	16
@@ -474,6 +474,7 @@ struct kvm_sync_regs {
 #define KVM_X86_QUIRK_SLOT_ZAP_ALL		(1 << 7)
 #define KVM_X86_QUIRK_STUFF_FEATURE_MSRS	(1 << 8)
 #define KVM_X86_QUIRK_IGNORE_GUEST_PAT		(1 << 9)
+#define KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM (1 << 10)
 
 #define KVM_STATE_NESTED_FORMAT_VMX	0
 #define KVM_STATE_NESTED_FORMAT_SVM	1
@@ -501,6 +502,7 @@ struct kvm_sync_regs {
 #define KVM_X86_GRP_SEV			1
 #  define KVM_X86_SEV_VMSA_FEATURES	0
 #  define KVM_X86_SNP_POLICY_BITS	1
+#  define KVM_X86_SEV_SNP_REQ_CERTS	2
 
 struct kvm_vmx_nested_state_data {
 	__u8 vmcs12[KVM_STATE_NESTED_VMX_VMCS_SIZE];
@@ -562,7 +564,7 @@ struct kvm_pmu_event_filter {
 	__u32 fixed_counter_bitmap;
 	__u32 flags;
 	__u32 pad[4];
-	__u64 events[];
+	__DECLARE_FLEX_ARRAY(__u64, events);
 };
 
 #define KVM_PMU_EVENT_ALLOW 0
@@ -741,6 +743,7 @@ enum sev_cmd_id {
 	KVM_SEV_SNP_LAUNCH_START = 100,
 	KVM_SEV_SNP_LAUNCH_UPDATE,
 	KVM_SEV_SNP_LAUNCH_FINISH,
+	KVM_SEV_SNP_ENABLE_REQ_CERTS,
 
 	KVM_SEV_NR_MAX,
 };
@@ -912,8 +915,10 @@ struct kvm_sev_snp_launch_finish {
 	__u64 pad1[4];
 };
 
-#define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
-#define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
+#define KVM_X2APIC_API_USE_32BIT_IDS			_BITULL(0)
+#define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK		_BITULL(1)
+#define KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST	_BITULL(2)
+#define KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST	_BITULL(3)
 
 struct kvm_hyperv_eventfd {
 	__u32 conn_id;
diff --git a/linux-headers/asm-x86/unistd_32.h b/linux-headers/asm-x86/unistd_32.h
index 34255aac64..e945468829 100644
--- a/linux-headers/asm-x86/unistd_32.h
+++ b/linux-headers/asm-x86/unistd_32.h
@@ -461,6 +461,7 @@
 #define __NR_file_getattr 468
 #define __NR_file_setattr 469
 #define __NR_listns 470
+#define __NR_rseq_slice_yield 471
 
 
 #endif /* _ASM_UNISTD_32_H */
diff --git a/linux-headers/asm-x86/unistd_64.h b/linux-headers/asm-x86/unistd_64.h
index 07f242a5fa..3c49b00ed1 100644
--- a/linux-headers/asm-x86/unistd_64.h
+++ b/linux-headers/asm-x86/unistd_64.h
@@ -385,6 +385,7 @@
 #define __NR_file_getattr 468
 #define __NR_file_setattr 469
 #define __NR_listns 470
+#define __NR_rseq_slice_yield 471
 
 
 #endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-x86/unistd_x32.h b/linux-headers/asm-x86/unistd_x32.h
index 08fc9da2fa..bd2af9ad08 100644
--- a/linux-headers/asm-x86/unistd_x32.h
+++ b/linux-headers/asm-x86/unistd_x32.h
@@ -338,6 +338,7 @@
 #define __NR_file_getattr (__X32_SYSCALL_BIT + 468)
 #define __NR_file_setattr (__X32_SYSCALL_BIT + 469)
 #define __NR_listns (__X32_SYSCALL_BIT + 470)
+#define __NR_rseq_slice_yield (__X32_SYSCALL_BIT + 471)
 #define __NR_rt_sigaction (__X32_SYSCALL_BIT + 512)
 #define __NR_rt_sigreturn (__X32_SYSCALL_BIT + 513)
 #define __NR_ioctl (__X32_SYSCALL_BIT + 514)
diff --git a/linux-headers/linux/const.h b/linux-headers/linux/const.h
index 95ede23342..c6a9d0c983 100644
--- a/linux-headers/linux/const.h
+++ b/linux-headers/linux/const.h
@@ -50,4 +50,22 @@
 
 #define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
 
+/*
+ * Divide positive or negative dividend by positive or negative divisor
+ * and round to closest integer. Result is undefined for negative
+ * divisors if the dividend variable type is unsigned and for negative
+ * dividends if the divisor variable type is unsigned.
+ */
+#define __KERNEL_DIV_ROUND_CLOSEST(x, divisor)		\
+({							\
+	__typeof__(x) __x = x;				\
+	__typeof__(divisor) __d = divisor;		\
+							\
+	(((__typeof__(x))-1) > 0 ||			\
+	 ((__typeof__(divisor))-1) > 0 ||		\
+	 (((__x) > 0) == ((__d) > 0))) ?		\
+		(((__x) + ((__d) / 2)) / (__d)) :	\
+		(((__x) - ((__d) / 2)) / (__d));	\
+})
+
 #endif /* _LINUX_CONST_H */
diff --git a/linux-headers/linux/iommufd.h b/linux-headers/linux/iommufd.h
index 384183a403..82587c7d62 100644
--- a/linux-headers/linux/iommufd.h
+++ b/linux-headers/linux/iommufd.h
@@ -465,16 +465,27 @@ struct iommu_hwpt_arm_smmuv3 {
 	__aligned_le64 ste[2];
 };
 
+/**
+ * struct iommu_hwpt_amd_guest - AMD IOMMU guest I/O page table data
+ *				 (IOMMU_HWPT_DATA_AMD_GUEST)
+ * @dte: Guest Device Table Entry (DTE)
+ */
+struct iommu_hwpt_amd_guest {
+	__aligned_u64 dte[4];
+};
+
 /**
  * enum iommu_hwpt_data_type - IOMMU HWPT Data Type
  * @IOMMU_HWPT_DATA_NONE: no data
  * @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
  * @IOMMU_HWPT_DATA_ARM_SMMUV3: ARM SMMUv3 Context Descriptor Table
+ * @IOMMU_HWPT_DATA_AMD_GUEST: AMD IOMMU guest page table
  */
 enum iommu_hwpt_data_type {
 	IOMMU_HWPT_DATA_NONE = 0,
 	IOMMU_HWPT_DATA_VTD_S1 = 1,
 	IOMMU_HWPT_DATA_ARM_SMMUV3 = 2,
+	IOMMU_HWPT_DATA_AMD_GUEST = 3,
 };
 
 /**
@@ -623,6 +634,32 @@ struct iommu_hw_info_tegra241_cmdqv {
 	__u8 __reserved;
 };
 
+/**
+ * struct iommu_hw_info_amd - AMD IOMMU device info
+ *
+ * @efr : Value of AMD IOMMU Extended Feature Register (EFR)
+ * @efr2: Value of AMD IOMMU Extended Feature 2 Register (EFR2)
+ *
+ * Please see the description of these registers in the following sections of
+ * the AMD I/O Virtualization Technology (IOMMU) Specification.
+ * (https://docs.amd.com/v/u/en-US/48882_3.10_PUB)
+ *
+ * - MMIO Offset 0030h IOMMU Extended Feature Register
+ * - MMIO Offset 01A0h IOMMU Extended Feature 2 Register
+ *
+ * Note: The EFR and EFR2 are raw values reported by hardware.
+ * The VMM is responsible for determining the appropriate flags to expose to
+ * the VM, since certain features are not currently supported by the kernel
+ * for HW-vIOMMU.
+ *
+ * The current VMM-allowed feature flags are:
+ * - EFR[GTSup, GASup, GioSup, PPRSup, EPHSup, GATS, GLX, PASmax]
+ */
+struct iommu_hw_info_amd {
+	__aligned_u64 efr;
+	__aligned_u64 efr2;
+};
+
 /**
  * enum iommu_hw_info_type - IOMMU Hardware Info Types
  * @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
@@ -632,6 +669,7 @@ struct iommu_hw_info_tegra241_cmdqv {
  * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
  * @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
  *                                     SMMUv3) info type
+ * @IOMMU_HW_INFO_TYPE_AMD: AMD IOMMU info type
  */
 enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_NONE = 0,
@@ -639,6 +677,7 @@ enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
 	IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
 	IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV = 3,
+	IOMMU_HW_INFO_TYPE_AMD = 4,
 };
 
 /**
@@ -656,11 +695,15 @@ enum iommu_hw_info_type {
  * @IOMMU_HW_CAP_PCI_PASID_PRIV: Privileged Mode Supported, user ignores it
  *                               when the struct
  *                               iommu_hw_info::out_max_pasid_log2 is zero.
+ * @IOMMU_HW_CAP_PCI_ATS_NOT_SUPPORTED: ATS is not supported or cannot be used
+ *                                      on this device (absence implies ATS
+ *                                      may be enabled)
  */
 enum iommufd_hw_capabilities {
 	IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
 	IOMMU_HW_CAP_PCI_PASID_EXEC = 1 << 1,
 	IOMMU_HW_CAP_PCI_PASID_PRIV = 1 << 2,
+	IOMMU_HW_CAP_PCI_ATS_NOT_SUPPORTED = 1 << 3,
 };
 
 /**
@@ -1013,6 +1056,11 @@ struct iommu_fault_alloc {
 enum iommu_viommu_type {
 	IOMMU_VIOMMU_TYPE_DEFAULT = 0,
 	IOMMU_VIOMMU_TYPE_ARM_SMMUV3 = 1,
+	/*
+	 * TEGRA241_CMDQV requirements (otherwise, VCMDQs will not work)
+	 * - Kernel will allocate a VINTF (HYP_OWN=0) to back this VIOMMU. So,
+	 *   VMM must wire the HYP_OWN bit to 0 in guest VINTF_CONFIG register
+	 */
 	IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV = 2,
 };
 
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index a4ab42dcba..c1baca4302 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -11,9 +11,11 @@
 #include <linux/const.h>
 #include <linux/types.h>
 
+#include <linux/stddef.h>
 #include <linux/ioctl.h>
 #include <asm/kvm.h>
 
+
 #define KVM_API_VERSION 12
 
 /*
@@ -135,6 +137,12 @@ struct kvm_xen_exit {
 	} u;
 };
 
+struct kvm_exit_snp_req_certs {
+	__u64 gpa;
+	__u64 npages;
+	__u64 ret;
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -180,6 +188,8 @@ struct kvm_xen_exit {
 #define KVM_EXIT_MEMORY_FAULT     39
 #define KVM_EXIT_TDX              40
 #define KVM_EXIT_ARM_SEA          41
+#define KVM_EXIT_ARM_LDST64B      42
+#define KVM_EXIT_SNP_REQ_CERTS    43
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -394,7 +404,7 @@ struct kvm_run {
 		} eoi;
 		/* KVM_EXIT_HYPERV */
 		struct kvm_hyperv_exit hyperv;
-		/* KVM_EXIT_ARM_NISV */
+		/* KVM_EXIT_ARM_NISV / KVM_EXIT_ARM_LDST64B */
 		struct {
 			__u64 esr_iss;
 			__u64 fault_ipa;
@@ -474,6 +484,8 @@ struct kvm_run {
 			__u64 gva;
 			__u64 gpa;
 		} arm_sea;
+		/* KVM_EXIT_SNP_REQ_CERTS */
+		struct kvm_exit_snp_req_certs snp_req_certs;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -520,7 +532,7 @@ struct kvm_coalesced_mmio {
 
 struct kvm_coalesced_mmio_ring {
 	__u32 first, last;
-	struct kvm_coalesced_mmio coalesced_mmio[];
+	__DECLARE_FLEX_ARRAY(struct kvm_coalesced_mmio, coalesced_mmio);
 };
 
 #define KVM_COALESCED_MMIO_MAX \
@@ -570,7 +582,7 @@ struct kvm_clear_dirty_log {
 /* for KVM_SET_SIGNAL_MASK */
 struct kvm_signal_mask {
 	__u32 len;
-	__u8  sigset[];
+	__DECLARE_FLEX_ARRAY(__u8, sigset);
 };
 
 /* for KVM_TPR_ACCESS_REPORTING */
@@ -681,6 +693,11 @@ struct kvm_enable_cap {
 #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK	0xffULL
 #define KVM_VM_TYPE_ARM_IPA_SIZE(x)		\
 	((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
+
+#define KVM_VM_TYPE_ARM_PROTECTED	(1UL << 31)
+#define KVM_VM_TYPE_ARM_MASK		(KVM_VM_TYPE_ARM_IPA_SIZE_MASK | \
+					 KVM_VM_TYPE_ARM_PROTECTED)
+
 /*
  * ioctls for /dev/kvm fds:
  */
@@ -966,6 +983,9 @@ struct kvm_enable_cap {
 #define KVM_CAP_GUEST_MEMFD_FLAGS 244
 #define KVM_CAP_ARM_SEA_TO_USER 245
 #define KVM_CAP_S390_USER_OPEREXEC 246
+#define KVM_CAP_S390_KEYOP 247
+#define KVM_CAP_S390_VSIE_ESAMODE 248
+#define KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES 249
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
@@ -1028,7 +1048,7 @@ struct kvm_irq_routing_entry {
 struct kvm_irq_routing {
 	__u32 nr;
 	__u32 flags;
-	struct kvm_irq_routing_entry entries[];
+	__DECLARE_FLEX_ARRAY(struct kvm_irq_routing_entry, entries);
 };
 
 #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
@@ -1119,7 +1139,7 @@ struct kvm_dirty_tlb {
 
 struct kvm_reg_list {
 	__u64 n; /* number of regs */
-	__u64 reg[];
+	__DECLARE_FLEX_ARRAY(__u64, reg);
 };
 
 struct kvm_one_reg {
@@ -1201,6 +1221,10 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_LOONGARCH_EIOINTC	KVM_DEV_TYPE_LOONGARCH_EIOINTC
 	KVM_DEV_TYPE_LOONGARCH_PCHPIC,
 #define KVM_DEV_TYPE_LOONGARCH_PCHPIC	KVM_DEV_TYPE_LOONGARCH_PCHPIC
+	KVM_DEV_TYPE_LOONGARCH_DMSINTC,
+#define KVM_DEV_TYPE_LOONGARCH_DMSINTC	KVM_DEV_TYPE_LOONGARCH_DMSINTC
+	KVM_DEV_TYPE_ARM_VGIC_V5,
+#define KVM_DEV_TYPE_ARM_VGIC_V5	KVM_DEV_TYPE_ARM_VGIC_V5
 
 	KVM_DEV_TYPE_MAX,
 
@@ -1211,6 +1235,16 @@ struct kvm_vfio_spapr_tce {
 	__s32	tablefd;
 };
 
+#define KVM_S390_KEYOP_ISKE 0x01
+#define KVM_S390_KEYOP_RRBE 0x02
+#define KVM_S390_KEYOP_SSKE 0x03
+struct kvm_s390_keyop {
+	__u64 guest_addr;
+	__u8  key;
+	__u8  operation;
+	__u8  pad[6];
+};
+
 /*
  * KVM_CREATE_VCPU receives as a parameter the vcpu slot, and returns
  * a vcpu fd.
@@ -1230,6 +1264,7 @@ struct kvm_vfio_spapr_tce {
 #define KVM_S390_UCAS_MAP        _IOW(KVMIO, 0x50, struct kvm_s390_ucas_mapping)
 #define KVM_S390_UCAS_UNMAP      _IOW(KVMIO, 0x51, struct kvm_s390_ucas_mapping)
 #define KVM_S390_VCPU_FAULT	 _IOW(KVMIO, 0x52, unsigned long)
+#define KVM_S390_KEYOP           _IOWR(KVMIO, 0x53, struct kvm_s390_keyop)
 
 /* Device model IOC */
 #define KVM_CREATE_IRQCHIP        _IO(KVMIO,   0x60)
@@ -1571,7 +1606,7 @@ struct kvm_stats_desc {
 	__u16 size;
 	__u32 offset;
 	__u32 bucket_size;
-	char name[];
+	__DECLARE_FLEX_ARRAY(char, name);
 };
 
 #define KVM_GET_STATS_FD  _IO(KVMIO,  0xce)
@@ -1599,6 +1634,21 @@ struct kvm_memory_attributes {
 	__u64 flags;
 };
 
+/* Available with KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES */
+#define KVM_SET_MEMORY_ATTRIBUTES2              _IOWR(KVMIO,  0xd2, struct kvm_memory_attributes2)
+
+struct kvm_memory_attributes2 {
+	union {
+		__u64 address;
+		__u64 offset;
+	};
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+	__u64 error_offset;
+	__u64 reserved[11];
+};
+
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
diff --git a/linux-headers/linux/mshv.h b/linux-headers/linux/mshv.h
index acceeddc1c..6c7d3a9316 100644
--- a/linux-headers/linux/mshv.h
+++ b/linux-headers/linux/mshv.h
@@ -27,6 +27,8 @@ enum {
 	MSHV_PT_BIT_X2APIC,
 	MSHV_PT_BIT_GPA_SUPER_PAGES,
 	MSHV_PT_BIT_CPU_AND_XSAVE_FEATURES,
+	MSHV_PT_BIT_NESTED_VIRTUALIZATION,
+	MSHV_PT_BIT_SMT_ENABLED_GUEST,
 	MSHV_PT_BIT_COUNT,
 };
 
@@ -355,7 +357,7 @@ struct mshv_vtl_sint_post_msg {
 
 struct mshv_vtl_ram_disposition {
 	__u64 start_pfn;
-	__u64 last_pfn;
+	__u64 last_pfn; /* last_pfn is excluded from the range [start_pfn, last_pfn) */
 };
 
 struct mshv_vtl_set_poll_file {
diff --git a/linux-headers/linux/psp-sev.h b/linux-headers/linux/psp-sev.h
index 9479928a4a..7df5002259 100644
--- a/linux-headers/linux/psp-sev.h
+++ b/linux-headers/linux/psp-sev.h
@@ -277,7 +277,7 @@ struct sev_user_data_snp_wrapped_vlek_hashstick {
  * struct sev_issue_cmd - SEV ioctl parameters
  *
  * @cmd: SEV commands to execute
- * @opaque: pointer to the command structure
+ * @data: pointer to the command structure
  * @error: SEV FW return code on failure
  */
 struct sev_issue_cmd {
diff --git a/linux-headers/linux/stddef.h b/linux-headers/linux/stddef.h
index 48ee4438e0..4574982594 100644
--- a/linux-headers/linux/stddef.h
+++ b/linux-headers/linux/stddef.h
@@ -69,6 +69,10 @@
 #define __counted_by_be(m)
 #endif
 
+#ifndef __counted_by_ptr
+#define __counted_by_ptr(m)
+#endif
+
 #define __kernel_nonstring
 
 #endif /* _LINUX_STDDEF_H */
diff --git a/linux-headers/linux/vduse.h b/linux-headers/linux/vduse.h
index da6ac89af1..e19b3c0f51 100644
--- a/linux-headers/linux/vduse.h
+++ b/linux-headers/linux/vduse.h
@@ -10,6 +10,10 @@
 
 #define VDUSE_API_VERSION	0
 
+/* VQ groups and ASID support */
+
+#define VDUSE_API_VERSION_1	1
+
 /*
  * Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
  * This is used for future extension.
@@ -27,6 +31,8 @@
  * @features: virtio features
  * @vq_num: the number of virtqueues
  * @vq_align: the allocation alignment of virtqueue's metadata
+ * @ngroups: number of vq groups that VDUSE device declares
+ * @nas: number of address spaces that VDUSE device declares
  * @reserved: for future use, needs to be initialized to zero
  * @config_size: the size of the configuration space
  * @config: the buffer of the configuration space
@@ -41,7 +47,9 @@ struct vduse_dev_config {
 	__u64 features;
 	__u32 vq_num;
 	__u32 vq_align;
-	__u32 reserved[13];
+	__u32 ngroups; /* if VDUSE_API_VERSION >= 1 */
+	__u32 nas; /* if VDUSE_API_VERSION >= 1 */
+	__u32 reserved[11];
 	__u32 config_size;
 	__u8 config[];
 };
@@ -118,14 +126,18 @@ struct vduse_config_data {
  * struct vduse_vq_config - basic configuration of a virtqueue
  * @index: virtqueue index
  * @max_size: the max size of virtqueue
- * @reserved: for future use, needs to be initialized to zero
+ * @reserved1: for future use, needs to be initialized to zero
+ * @group: virtqueue group
+ * @reserved2: for future use, needs to be initialized to zero
  *
  * Structure used by VDUSE_VQ_SETUP ioctl to setup a virtqueue.
  */
 struct vduse_vq_config {
 	__u32 index;
 	__u16 max_size;
-	__u16 reserved[13];
+	__u16 reserved1;
+	__u32 group;
+	__u16 reserved2[10];
 };
 
 /*
@@ -156,6 +168,16 @@ struct vduse_vq_state_packed {
 	__u16 last_used_idx;
 };
 
+/**
+ * struct vduse_vq_group_asid - virtqueue group ASID
+ * @group: Index of the virtqueue group
+ * @asid: Address space ID of the group
+ */
+struct vduse_vq_group_asid {
+	__u32 group;
+	__u32 asid;
+};
+
 /**
  * struct vduse_vq_info - information of a virtqueue
  * @index: virtqueue index
@@ -215,6 +237,7 @@ struct vduse_vq_eventfd {
  * @uaddr: start address of userspace memory, it must be aligned to page size
  * @iova: start of the IOVA region
  * @size: size of the IOVA region
+ * @asid: Address space ID of the IOVA region
  * @reserved: for future use, needs to be initialized to zero
  *
  * Structure used by VDUSE_IOTLB_REG_UMEM and VDUSE_IOTLB_DEREG_UMEM
@@ -224,7 +247,8 @@ struct vduse_iova_umem {
 	__u64 uaddr;
 	__u64 iova;
 	__u64 size;
-	__u64 reserved[3];
+	__u32 asid;
+	__u32 reserved[5];
 };
 
 /* Register userspace memory for IOVA regions */
@@ -238,6 +262,7 @@ struct vduse_iova_umem {
  * @start: start of the IOVA region
  * @last: last of the IOVA region
  * @capability: capability of the IOVA region
+ * @asid: Address space ID of the IOVA region, only if device API version >= 1
  * @reserved: for future use, needs to be initialized to zero
  *
  * Structure used by VDUSE_IOTLB_GET_INFO ioctl to get information of
@@ -248,7 +273,8 @@ struct vduse_iova_info {
 	__u64 last;
 #define VDUSE_IOVA_CAP_UMEM (1 << 0)
 	__u64 capability;
-	__u64 reserved[3];
+	__u32 asid; /* Only if device API version >= 1 */
+	__u32 reserved[5];
 };
 
 /*
@@ -257,6 +283,32 @@ struct vduse_iova_info {
  */
 #define VDUSE_IOTLB_GET_INFO	_IOWR(VDUSE_BASE, 0x1a, struct vduse_iova_info)
 
+/**
+ * struct vduse_iotlb_entry_v2 - entry of IOTLB to describe one IOVA region
+ *
+ * @v1: the original vduse_iotlb_entry
+ * @asid: address space ID of the IOVA region
+ * @reserved: for future use, needs to be initialized to zero
+ *
+ * Structure used by VDUSE_IOTLB_GET_FD2 ioctl to find an overlapped IOVA region.
+ */
+struct vduse_iotlb_entry_v2 {
+	__u64 offset;
+	__u64 start;
+	__u64 last;
+	__u8 perm;
+	__u8 padding[7];
+	__u32 asid;
+	__u32 reserved[11];
+};
+
+/*
+ * Same as VDUSE_IOTLB_GET_FD but with a vduse_iotlb_entry_v2 argument that
+ * supports extra fields.
+ */
+#define VDUSE_IOTLB_GET_FD2	_IOWR(VDUSE_BASE, 0x1b, struct vduse_iotlb_entry_v2)
+
+
 /* The control messages definition for read(2)/write(2) on /dev/vduse/$NAME */
 
 /**
@@ -265,11 +317,14 @@ struct vduse_iova_info {
  * @VDUSE_SET_STATUS: set the device status
  * @VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping for
  *                      specified IOVA range via VDUSE_IOTLB_GET_FD ioctl
+ * @VDUSE_SET_VQ_GROUP_ASID: Notify userspace to update the address space of a
+ *                           virtqueue group.
  */
 enum vduse_req_type {
 	VDUSE_GET_VQ_STATE,
 	VDUSE_SET_STATUS,
 	VDUSE_UPDATE_IOTLB,
+	VDUSE_SET_VQ_GROUP_ASID,
 };
 
 /**
@@ -304,6 +359,19 @@ struct vduse_iova_range {
 	__u64 last;
 };
 
+/**
+ * struct vduse_iova_range_v2 - IOVA range [start, last] if API_VERSION >= 1
+ * @start: start of the IOVA range
+ * @last: last of the IOVA range
+ * @asid: address space ID of the IOVA range
+ */
+struct vduse_iova_range_v2 {
+	__u64 start;
+	__u64 last;
+	__u32 asid;
+	__u32 padding;
+};
+
 /**
  * struct vduse_dev_request - control request
  * @type: request type
@@ -312,6 +380,8 @@ struct vduse_iova_range {
  * @vq_state: virtqueue state, only index field is available
  * @s: device status
  * @iova: IOVA range for updating
+ * @iova_v2: IOVA range for updating if API_VERSION >= 1
+ * @vq_group_asid: ASID of a virtqueue group
  * @padding: padding
  *
  * Structure used by read(2) on /dev/vduse/$NAME.
@@ -324,6 +394,11 @@ struct vduse_dev_request {
 		struct vduse_vq_state vq_state;
 		struct vduse_dev_status s;
 		struct vduse_iova_range iova;
+		/* Following members but padding exist only if vduse api
+		 * version >= 1
+		 */
+		struct vduse_iova_range_v2 iova_v2;
+		struct vduse_vq_group_asid vq_group_asid;
 		__u32 padding[32];
 	};
 };
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 720edfee7a..f3282b8e86 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -141,7 +141,7 @@ struct vfio_info_cap_header {
  *
  * Retrieve information about the group.  Fills in provided
  * struct vfio_group_info.  Caller sets argsz.
- * Return: 0 on succes, -errno on failure.
+ * Return: 0 on success, -errno on failure.
  * Availability: Always
  */
 struct vfio_group_status {
@@ -964,6 +964,10 @@ struct vfio_device_bind_iommufd {
  * hwpt corresponding to the given pt_id.
  *
  * Return: 0 on success, -errno on failure.
+ *
+ * When a device is resetting, -EBUSY will be returned to reject any concurrent
+ * attachment to the resetting device itself or any sibling device in the IOMMU
+ * group having the resetting device.
  */
 struct vfio_device_attach_iommufd_pt {
 	__u32	argsz;
@@ -1262,6 +1266,19 @@ enum vfio_device_mig_state {
  * The initial_bytes field indicates the amount of initial precopy
  * data available from the device. This field should have a non-zero initial
  * value and decrease as migration data is read from the device.
+ * The presence of the VFIO_PRECOPY_INFO_REINIT output flag indicates
+ * that new initial data is present on the stream.
+ * The new initial data may result, for example, from device reconfiguration
+ * during migration that requires additional initialization data.
+ * In that case initial_bytes may report a non-zero value irrespective of
+ * any previously reported values, which progresses towards zero as precopy
+ * data is read from the data stream. dirty_bytes is also reset
+ * to zero and represents the state change of the device relative to the new
+ * initial_bytes.
+ * VFIO_PRECOPY_INFO_REINIT can be reported only after userspace opts in to
+ * VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2. Without this opt-in, the flags field
+ * of struct vfio_precopy_info is reserved for bug-compatibility reasons.
+ *
  * It is recommended to leave PRE_COPY for STOP_COPY only after this field
  * reaches zero. Leaving PRE_COPY earlier might make things slower.
  *
@@ -1297,6 +1314,7 @@ enum vfio_device_mig_state {
 struct vfio_precopy_info {
 	__u32 argsz;
 	__u32 flags;
+#define VFIO_PRECOPY_INFO_REINIT (1 << 0) /* output - new initial data is present */
 	__aligned_u64 initial_bytes;
 	__aligned_u64 dirty_bytes;
 };
@@ -1506,6 +1524,16 @@ struct vfio_device_feature_dma_buf {
 	struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
 };
 
+/*
+ * Enables the migration precopy_info_v2 behaviour.
+ *
+ * VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2.
+ *
+ * On SET, enables the v2 pre_copy_info behaviour, where the
+ * vfio_precopy_info.flags is a valid output field.
+ */
+#define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2  12
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.43.0
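
For anyone double-checking the VDUSE UAPI layout changes in the header
update above, here is a minimal sketch (not part of the patch) of
compile-time checks showing that swapping '__u64 reserved[3]' for
'__u32 asid' plus '__u32 reserved[5]' keeps the struct size and the
offset of the first formerly-reserved byte unchanged. The struct names
are hypothetical local stand-ins, fixed-width <stdint.h> types are used
in place of __u64/__u32/__u8, and the leading 'start' member of
vduse_iova_info is assumed since it sits above the quoted hunk context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of vduse_iova_info before the update (stand-in name). */
struct iova_info_old {
    uint64_t start;
    uint64_t last;
    uint64_t capability;
    uint64_t reserved[3];
};

/* Layout after the update: asid carved out of the reserved area. */
struct iova_info_new {
    uint64_t start;
    uint64_t last;
    uint64_t capability;
    uint32_t asid;
    uint32_t reserved[5];
};

/* Stand-in for vduse_iotlb_entry_v2 as defined in the hunk above. */
struct iotlb_entry_v2_sketch {
    uint64_t offset;
    uint64_t start;
    uint64_t last;
    uint8_t perm;
    uint8_t padding[7];
    uint32_t asid;
    uint32_t reserved[11];
};

/* Same total size, so an ioctl number derived from sizeof() is unchanged. */
static_assert(sizeof(struct iova_info_old) == sizeof(struct iova_info_new),
              "iova_info size must not change");

/* asid occupies the first bytes of what used to be reserved[]. */
static_assert(offsetof(struct iova_info_new, asid) ==
              offsetof(struct iova_info_old, reserved),
              "asid must start where reserved[] used to start");

/* Explicit padding keeps asid on a naturally aligned offset. */
static_assert(offsetof(struct iotlb_entry_v2_sketch, asid) == 32,
              "asid follows perm + padding");
static_assert(sizeof(struct iotlb_entry_v2_sketch) == 80,
              "no implicit tail padding expected");
```

Since _IOWR() encodes the argument size in the ioctl number, keeping
sizeof() stable is what lets VDUSE_IOTLB_GET_INFO keep its number, while
the larger v2 entry gets a separate ioctl (VDUSE_IOTLB_GET_FD2).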



* [PATCH RFC 04/12] accel/kvm: Add CGS option to control in-place conversion support
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
                   ` (2 preceding siblings ...)
  2026-05-28  0:03 ` [PATCH RFC 03/12] linux-headers: Update headers for v7 of in-place conversion kernel support Michael Roth
@ 2026-05-28  0:03 ` Michael Roth
  2026-06-02  8:23   ` Markus Armbruster
  2026-05-28  0:03 ` [PATCH RFC 05/12] system/memory: Re-use memory-backend-guest-memfd inode for private memory Michael Roth
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

For confidential guests, guest_memfd is currently used only for private
guest memory, and normal guest memory comes from the configured memory
backend just as it does for a non-confidential guest. It is now possible
to use the same physical memory to back a particular GPA regardless of
whether it is in a shared or private state. This avoids the need to
rely on discarding memory between shared/private conversions (to avoid
doubled memory usage), and is intended to be the primary mode of using
guest_memfd for confidential guests moving forward, and future features
like hugepage support will likely require it.

Add an option to enable this support. Since ConfidentialGuestSupport is
already used to track some guest_memfd-related functionality (e.g.
whether it is required for the configured machine), similarly introduce
this option as a property of ConfidentialGuestSupport.

Also add the KVM-specific checks to enable this support, but leave the
option disabled until other required changes are implemented for
CGS variants that intend to make use of KVM's in-place conversion
support.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 accel/kvm/kvm-all.c                         | 21 +++++++++++++++++
 backends/confidential-guest-support.c       | 25 +++++++++++++++++++++
 include/system/confidential-guest-support.h | 14 ++++++++++++
 qapi/qom.json                               | 16 +++++++++++++
 4 files changed, 76 insertions(+)
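
As a rough illustration of the KVM-side check described above, the
following standalone sketch (not taken from the patch) models the
decision kvm_init() makes: in-place conversion is only accepted when the
attribute mask reported for guest_memfd includes the private bit. The
function name and parameters are hypothetical, ATTR_PRIVATE_SKETCH is a
local copy of the KVM_MEMORY_ATTRIBUTE_PRIVATE value from <linux/kvm.h>,
and clamping a negative capability result to "nothing supported" is an
extra defensive assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copy of KVM_MEMORY_ATTRIBUTE_PRIVATE so the sketch builds standalone. */
#define ATTR_PRIVATE_SKETCH (1ULL << 3)

/*
 * cap_ret stands in for the value returned when querying the
 * guest_memfd memory-attributes capability. On success, *supported is
 * overwritten with the per-guest_memfd attribute mask.
 */
static int resolve_memory_attributes(bool convert_in_place, int cap_ret,
                                     uint64_t *supported)
{
    uint64_t gmem_attrs;

    if (!convert_in_place) {
        return 0; /* keep whatever VM-wide mask was already detected */
    }

    /* Treat an error return as "no attributes supported". */
    gmem_attrs = cap_ret > 0 ? (uint64_t)cap_ret : 0;

    if (!(gmem_attrs & ATTR_PRIVATE_SKETCH)) {
        return -EINVAL; /* private attribute not settable via guest_memfd */
    }

    *supported = gmem_attrs;
    return 0;
}
```

With vm_memory_attributes=1 the capability would be expected to report
no private bit, so a call such as resolve_memory_attributes(true, 0, &m)
fails with -EINVAL and leaves m untouched.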

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index e6ae2e8ced..a1832712a4 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -52,6 +52,7 @@
 #include "kvm-cpus.h"
 #include "system/dirtylimit.h"
 #include "qemu/range.h"
+#include "system/confidential-guest-support.h"
 
 #include "hw/core/boards.h"
 #include "system/stats.h"
@@ -2901,6 +2902,7 @@ static int kvm_reset_vmfd(MachineState *ms)
 static int kvm_init(AccelState *as, MachineState *ms)
 {
     MachineClass *mc = MACHINE_GET_CLASS(ms);
+    ConfidentialGuestSupport *cgs = ms->cgs;
     static const char upgrade_note[] =
         "Please upgrade to at least kernel 4.5.\n";
     const struct {
@@ -3076,6 +3078,25 @@ static int kvm_init(AccelState *as, MachineState *ms)
         kvm_vm_check_extension(s, KVM_CAP_USER_MEMORY2);
     kvm_pre_fault_memory_supported = kvm_vm_check_extension(s, KVM_CAP_PRE_FAULT_MEMORY);
 
+    if (cgs && cgs->convert_in_place) {
+        uint64_t guest_memfd_supported_memory_attributes;
+
+        guest_memfd_supported_memory_attributes =
+            kvm_vm_check_extension(s, KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES);
+
+        if (!(guest_memfd_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
+            ret = -EINVAL;
+            error_report("In-place conversion is only supported if private "
+                         "memory attributes can be set via guest_memfd. "
+                         "Please ensure the 'vm_memory_attributes' KVM module "
+                         "parameter is set to 0.");
+            goto err;
+        }
+
+        assert(kvm_guest_memfd_supported);
+        kvm_supported_memory_attributes = guest_memfd_supported_memory_attributes;
+    }
+
     if (s->kernel_irqchip_split == ON_OFF_AUTO_AUTO) {
         s->kernel_irqchip_split = mc->default_kernel_irqchip_split ? ON_OFF_AUTO_ON : ON_OFF_AUTO_OFF;
     }
diff --git a/backends/confidential-guest-support.c b/backends/confidential-guest-support.c
index 156dd15e66..c89bcf3cb3 100644
--- a/backends/confidential-guest-support.c
+++ b/backends/confidential-guest-support.c
@@ -21,6 +21,24 @@ OBJECT_DEFINE_ABSTRACT_TYPE(ConfidentialGuestSupport,
                             CONFIDENTIAL_GUEST_SUPPORT,
                             OBJECT)
 
+static bool
+cgs_get_convert_in_place(Object *obj, Error **errp)
+{
+    return CONFIDENTIAL_GUEST_SUPPORT(obj)->convert_in_place;
+}
+
+static void
+cgs_set_convert_in_place(Object *obj, bool value, Error **errp)
+{
+    ConfidentialGuestSupport *cgs = CONFIDENTIAL_GUEST_SUPPORT(obj);
+
+    if (!cgs->allow_convert_in_place && value) {
+        error_setg(errp, "In-place conversion is not supported for this guest configuration");
+        return;
+    }
+    cgs->convert_in_place = value;
+}
+
 static bool check_support(ConfidentialGuestPlatformType platform,
                          uint16_t platform_version, uint8_t highest_vtl,
                          uint64_t shared_gpa_boundary)
@@ -70,6 +88,13 @@ static void confidential_guest_support_class_init(ObjectClass *oc,
 
 static void confidential_guest_support_init(Object *obj)
 {
+    ConfidentialGuestSupport *cgs = CONFIDENTIAL_GUEST_SUPPORT(obj);
+
+    object_property_add_bool(obj, "convert-in-place", cgs_get_convert_in_place,
+                             cgs_set_convert_in_place);
+
+    cgs->convert_in_place = false;
+    cgs->allow_convert_in_place = false;
 }
 
 static void confidential_guest_support_finalize(Object *obj)
diff --git a/include/system/confidential-guest-support.h b/include/system/confidential-guest-support.h
index 5dca717308..c1e9c41ad2 100644
--- a/include/system/confidential-guest-support.h
+++ b/include/system/confidential-guest-support.h
@@ -20,6 +20,7 @@
 
 #include "qom/object.h"
 #include "exec/hwaddr.h"
+#include "qapi/qapi-visit-qom.h"
 
 #define TYPE_CONFIDENTIAL_GUEST_SUPPORT "confidential-guest-support"
 OBJECT_DECLARE_TYPE(ConfidentialGuestSupport,
@@ -92,6 +93,19 @@ struct ConfidentialGuestSupport {
      * so 'ready' is not set, we'll abort.
      */
     bool ready;
+
+    /*
+     * True if the machine re-uses physical pages when converting
+     * between shared/private (as opposed to using different
+     * physical pages depending on the access type).
+     */
+    bool convert_in_place;
+
+    /*
+     * CGS implementations will use this to indicate whether or not
+     * in-place conversion can be enabled by users.
+     */
+    bool allow_convert_in_place;
 };
 
 typedef struct ConfidentialGuestSupportClass {
diff --git a/qapi/qom.json b/qapi/qom.json
index 502fafeb15..037c078799 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -1014,6 +1014,21 @@
   'if': 'CONFIG_IGVM',
   'data': { 'file': 'str' } }
 
+##
+# @ConfidentialGuestSupportProperties:
+#
+# Properties for ConfidentialGuestSupport base class.
+#
+# @convert-in-place: If true, the same physical pages are reused
+#     when memory is converted between shared and private states.
+#     If false (default), separate allocations are used depending
+#     on whether the page is private or shared.
+#
+# Since: 11.1
+##
+{ 'struct': 'ConfidentialGuestSupportProperties',
+  'data': { '*convert-in-place': 'bool' } }
+
 ##
 # @SevCommonProperties:
 #
@@ -1038,6 +1053,7 @@
 # Since: 9.1
 ##
 { 'struct': 'SevCommonProperties',
+  'base': 'ConfidentialGuestSupportProperties',
   'data': { '*sev-device': 'str',
             '*cbitpos': 'uint32',
             'reduced-phys-bits': 'uint32',
-- 
2.43.0
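
A property setter gated by a capability flag, like the convert-in-place
setter in this patch, usually follows a validate-then-store pattern: the
value is rejected before anything is written, so a failed set leaves the
object unchanged. Below is a minimal standalone sketch of that pattern.
It does not use QOM or Error; the struct, the function name and the
string-based error reporting are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the two fields added to ConfidentialGuestSupport. */
struct cgs_sketch {
    bool convert_in_place;       /* user-visible setting */
    bool allow_convert_in_place; /* set by the CGS implementation */
};

/*
 * Returns true on success. On failure, stores a message in *err (if err
 * is non-NULL) and leaves cgs->convert_in_place untouched.
 */
static bool cgs_sketch_set_convert_in_place(struct cgs_sketch *cgs,
                                            bool value, const char **err)
{
    if (value && !cgs->allow_convert_in_place) {
        if (err) {
            *err = "in-place conversion is not supported "
                   "for this guest configuration";
        }
        return false; /* reject before storing anything */
    }

    cgs->convert_in_place = value;
    return true;
}
```

Setting the value to false is always accepted in this sketch, which
matches the '&& value' condition in the patch.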




* Re: [PATCH RFC 04/12] accel/kvm: Add CGS option to control in-place conversion support
  2026-05-28  0:03 ` [PATCH RFC 04/12] accel/kvm: Add CGS option to control in-place conversion support Michael Roth
@ 2026-06-02  8:23   ` Markus Armbruster
  2026-06-03  6:39     ` Michael Roth
  0 siblings, 1 reply; 26+ messages in thread
From: Markus Armbruster @ 2026-06-02  8:23 UTC (permalink / raw)
  To: Michael Roth
  Cc: qemu-devel, kvm, pbonzini, berrange, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

Michael Roth <michael.roth@amd.com> writes:

> For confidential guests, guest_memfd is currently used only for private
> guest memory, and normal guest memory comes from the configured memory
> backend just as it does for a non-confidential guest. It is now possible
> to use the same physical memory to back a particular GPA regardless of
> whether it is in a shared or private state. This avoids the need to
> rely on discarding memory between shared/private conversions (to avoid
> doubled memory usage), and is intended to be the primary mode of using
> guest_memfd for confidential guests moving forward, and future features
> like hugepage support will likely require it.
>
> Add an option to enable this support. Since ConfidentialGuestSupport is
> already used to track some guest_memfd-related functionality (e.g.
> whether it is required for the configured machine), similarly introduce
> this option as a property of ConfidentialGuestSupport.
>
> Also add the KVM-specific checks to enable this support, but leave the
> option disabled until other required changes are implemented for
> CGS variants that intend to make use of KVM's in-place conversion
> support.
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>

[...]

> diff --git a/qapi/qom.json b/qapi/qom.json
> index 502fafeb15..037c078799 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -1014,6 +1014,21 @@
>    'if': 'CONFIG_IGVM',
>    'data': { 'file': 'str' } }
>  
> +##
> +# @ConfidentialGuestSupportProperties:
> +#
> +# Properties for ConfidentialGuestSupport base class.
> +#
> +# @convert-in-place: If true, the same physical pages are reused
> +#     when memory is converted between shared and private states.
> +#     If false (default), separate allocations are used depending
> +#     on whether the page is private or shared.
> +#
> +# Since: 11.1
> +##
> +{ 'struct': 'ConfidentialGuestSupportProperties',
> +  'data': { '*convert-in-place': 'bool' } }
> +
>  ##
>  # @SevCommonProperties:
>  #
> @@ -1038,6 +1053,7 @@
>  # Since: 9.1
>  ##
>  { 'struct': 'SevCommonProperties',
> +  'base': 'ConfidentialGuestSupportProperties',
>    'data': { '*sev-device': 'str',
>              '*cbitpos': 'uint32',
>              'reduced-phys-bits': 'uint32',

Why use a base type instead of simply adding @convert-in-place to
SevCommonProperties?




* Re: [PATCH RFC 04/12] accel/kvm: Add CGS option to control in-place conversion support
  2026-06-02  8:23   ` Markus Armbruster
@ 2026-06-03  6:39     ` Michael Roth
  2026-06-08  8:15       ` Markus Armbruster
  0 siblings, 1 reply; 26+ messages in thread
From: Michael Roth @ 2026-06-03  6:39 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, kvm, pbonzini, berrange, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

On Tue, Jun 02, 2026 at 10:23:40AM +0200, Markus Armbruster wrote:
> Michael Roth <michael.roth@amd.com> writes:
> 
> > For confidential guests, guest_memfd is currently used only for private
> > guest memory, and normal guest memory comes from the configured memory
> > backend just as it does for a non-confidential guest. It is now possible
> > to use the same physical memory to back a particular GPA regardless of
> > whether it is in a shared or private state. This avoids the need to
> > rely on discarding memory between shared/private conversions (to avoid
> > doubled memory usage), and is intended to be the primary mode of using
> > guest_memfd for confidential guests moving forward, and future features
> > like hugepage support will likely require it.
> >
> > Add an option to enable this support. Since ConfidentialGuestSupport is
> > already used to track some guest_memfd-related functionality (e.g.
> > whether it is required for the configured machine), similarly introduce
> > this option as a property of ConfidentialGuestSupport.
> >
> > Also add the KVM-specific checks to enable this support, but leave the
> > option disabled until other required changes are implemented for
> > CGS variants that intend to make use of KVM's in-place conversion
> > support.
> >
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> 
> [...]
> 
> > diff --git a/qapi/qom.json b/qapi/qom.json
> > index 502fafeb15..037c078799 100644
> > --- a/qapi/qom.json
> > +++ b/qapi/qom.json
> > @@ -1014,6 +1014,21 @@
> >    'if': 'CONFIG_IGVM',
> >    'data': { 'file': 'str' } }
> >  
> > +##
> > +# @ConfidentialGuestSupportProperties:
> > +#
> > +# Properties for ConfidentialGuestSupport base class.
> > +#
> > +# @convert-in-place: If true, the same physical pages are reused
> > +#     when memory is converted between shared and private states.
> > +#     If false (default), separate allocations are used depending
> > +#     on whether the page is private or shared.
> > +#
> > +# Since: 11.1
> > +##
> > +{ 'struct': 'ConfidentialGuestSupportProperties',
> > +  'data': { '*convert-in-place': 'bool' } }
> > +
> >  ##
> >  # @SevCommonProperties:
> >  #
> > @@ -1038,6 +1053,7 @@
> >  # Since: 9.1
> >  ##
> >  { 'struct': 'SevCommonProperties',
> > +  'base': 'ConfidentialGuestSupportProperties',
> >    'data': { '*sev-device': 'str',
> >              '*cbitpos': 'uint32',
> >              'reduced-phys-bits': 'uint32',
> 
> Why use a base type instead of simply adding @convert-in-place to
> SevCommonProperties?
> 

My thinking was that TDX and other implementations would similarly enable
this through their CGS implementation, so I went ahead and carved out a
set of common properties so that ConfidentialGuestSupport implementations
could all use the same ,convert-in-place=true option (or set it by default
for newer implementations).

It is sort of tied to the 'allow_convert_in_place' flag that is part of
the common ConfidentialGuestSupport object struct, so the property
handling is sort of tied to the common ConfidentialGuestSupport base
class as well rather than something implementation-specific.

Not sure if there are better ways to handle all that though.

Thanks,

Mike



* Re: [PATCH RFC 04/12] accel/kvm: Add CGS option to control in-place conversion support
  2026-06-03  6:39     ` Michael Roth
@ 2026-06-08  8:15       ` Markus Armbruster
  2026-06-08 20:21         ` Michael Roth
  0 siblings, 1 reply; 26+ messages in thread
From: Markus Armbruster @ 2026-06-08  8:15 UTC (permalink / raw)
  To: Michael Roth
  Cc: qemu-devel, kvm, pbonzini, berrange, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

Michael Roth <michael.roth@amd.com> writes:

> On Tue, Jun 02, 2026 at 10:23:40AM +0200, Markus Armbruster wrote:
>> Michael Roth <michael.roth@amd.com> writes:
>> 
>> > For confidential guests, guest_memfd is currently used only for private
>> > guest memory, and normal guest memory comes from the configured memory
>> > backend just as it does for a non-confidential guest. It is now possible
>> > to use the same physical memory to back a particular GPA regardless of
>> > whether it is in a shared or private state. This avoids the need to
>> > rely on discarding memory between shared/private conversions (to avoid
>> > doubled memory usage), and is intended to be the primary mode of using
>> > guest_memfd for confidential guests moving forward, and future features
>> > like hugepage support will likely require it.
>> >
>> > Add an option to enable this support. Since ConfidentialGuestSupport is
>> > already used to track some guest_memfd-related functionality (e.g.
>> > whether it is required for the configured machine), similarly introduce
>> > this option as a property of ConfidentialGuestSupport.
>> >
>> > Also add the KVM-specific checks to enable this support, but leave the
>> > option disabled until other required changes are implemented for
>> > CGS variants that intend to make use of KVM's in-place conversion
>> > support.
>> >
>> > Signed-off-by: Michael Roth <michael.roth@amd.com>
>> 
>> [...]
>> 
>> > diff --git a/qapi/qom.json b/qapi/qom.json
>> > index 502fafeb15..037c078799 100644
>> > --- a/qapi/qom.json
>> > +++ b/qapi/qom.json
>> > @@ -1014,6 +1014,21 @@
>> >    'if': 'CONFIG_IGVM',
>> >    'data': { 'file': 'str' } }
>> >  
>> > +##
>> > +# @ConfidentialGuestSupportProperties:
>> > +#
>> > +# Properties for ConfidentialGuestSupport base class.
>> > +#
>> > +# @convert-in-place: If true, the same physical pages are reused
>> > +#     when memory is converted between shared and private states.
>> > +#     If false (default), separate allocations are used depending
>> > +#     on whether the page is private or shared.
>> > +#
>> > +# Since: 11.1
>> > +##
>> > +{ 'struct': 'ConfidentialGuestSupportProperties',
>> > +  'data': { '*convert-in-place': 'bool' } }
>> > +
>> >  ##
>> >  # @SevCommonProperties:
>> >  #
>> > @@ -1038,6 +1053,7 @@
>> >  # Since: 9.1
>> >  ##
>> >  { 'struct': 'SevCommonProperties',
>> > +  'base': 'ConfidentialGuestSupportProperties',
>> >    'data': { '*sev-device': 'str',
>> >              '*cbitpos': 'uint32',
>> >              'reduced-phys-bits': 'uint32',
>> 
>> Why use a base type instead of simply adding @convert-in-place to
>> SevCommonProperties?
>> 
>
> My thinking was that TDX and other implementations would similarly enable
> this through their CGS implementation, so I went ahead and carved out a
> set of common properties that ConfidentialGuestSupport implementations
> could use the same ,convert-in-place=true option (or set it by default
> for newer implementations)

How confident are we in future reuse by TDX and others?

If there are doubts, refactoring for reuse when reuse happens would be
smarter.  The refactoring would be a bit of churn, but not all that
much.

If it's something like "pretty much inevitable", preparing the reuse now
saves us that churn, and makes sense.

Judgement call, i.e. you decide.

> It is sort of tied to the 'allow_convert_in_place' flag that is part of
> the common ConfidentialGuestSupport object struct, so the property
> handling is sort of tied to the common ConfidentialGuestSupport base
> class as well rather than something implementation-specific.

Valid point.  Not sure how much weight to assign to it, though.

> Not sure if there are better ways to handle all that though.

Work your rationale into the commit message, please.



* Re: [PATCH RFC 04/12] accel/kvm: Add CGS option to control in-place conversion support
  2026-06-08  8:15       ` Markus Armbruster
@ 2026-06-08 20:21         ` Michael Roth
  0 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-06-08 20:21 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, kvm, pbonzini, berrange, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

On Mon, Jun 08, 2026 at 10:15:41AM +0200, Markus Armbruster wrote:
> Michael Roth <michael.roth@amd.com> writes:
> 
> > On Tue, Jun 02, 2026 at 10:23:40AM +0200, Markus Armbruster wrote:
> >> Michael Roth <michael.roth@amd.com> writes:
> >> 
> >> > For confidential guests, guest_memfd is currently used only for private
> >> > guest memory, and normal guest memory comes from the configured memory
> >> > backend just as it does for a non-confidential guest. It is now possible
> >> > to use the same physical memory to back a particular GPA regardless of
> >> > whether it is in a shared or private state. This avoids the need to
> >> > rely on discarding memory between shared/private conversions (to avoid
> >> > doubled memory usage), and is intended to be the primary mode of using
> >> > guest_memfd for confidential guests moving forward, and future features
> >> > like hugepage support will likely require it.
> >> >
> >> > Add an option to enable this support. Since ConfidentialGuestSupport is
> >> > already used to track some guest_memfd-related functionality (e.g.
> >> > whether it is required for the configured machine), similarly introduce
> >> > this option as a property of ConfidentialGuestSupport.
> >> >
> >> > Also add the KVM-specific checks to enable this support, but leave the
> >> > option disabled until other required changes are implemented for
> >> > CGS variants that intend to make use of KVM's in-place conversion
> >> > support.
> >> >
> >> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> >> 
> >> [...]
> >> 
> >> > diff --git a/qapi/qom.json b/qapi/qom.json
> >> > index 502fafeb15..037c078799 100644
> >> > --- a/qapi/qom.json
> >> > +++ b/qapi/qom.json
> >> > @@ -1014,6 +1014,21 @@
> >> >    'if': 'CONFIG_IGVM',
> >> >    'data': { 'file': 'str' } }
> >> >  
> >> > +##
> >> > +# @ConfidentialGuestSupportProperties:
> >> > +#
> >> > +# Properties for ConfidentialGuestSupport base class.
> >> > +#
> >> > +# @convert-in-place: If true, the same physical pages are reused
> >> > +#     when memory is converted between shared and private states.
> >> > +#     If false (default), separate allocations are used depending
> >> > +#     on whether the page is private or shared.
> >> > +#
> >> > +# Since: 11.1
> >> > +##
> >> > +{ 'struct': 'ConfidentialGuestSupportProperties',
> >> > +  'data': { '*convert-in-place': 'bool' } }
> >> > +
> >> >  ##
> >> >  # @SevCommonProperties:
> >> >  #
> >> > @@ -1038,6 +1053,7 @@
> >> >  # Since: 9.1
> >> >  ##
> >> >  { 'struct': 'SevCommonProperties',
> >> > +  'base': 'ConfidentialGuestSupportProperties',
> >> >    'data': { '*sev-device': 'str',
> >> >              '*cbitpos': 'uint32',
> >> >              'reduced-phys-bits': 'uint32',
> >> 
> >> Why use a base type instead of simply adding @convert-in-place to
> >> SevCommonProperties?
> >> 
> >
> > My thinking was that TDX and other implementations would similarly enable
> > this through their CGS implementation, so I went ahead and carved out a
> > set of common properties that ConfidentialGuestSupport implementations
> > could use the same ,convert-in-place=true option (or set it by default
> > for newer implementations)
> 
> How confident are we in future reuse by TDX and others?
> 
> If there are doubts, refactoring for reuse when reuse happens would be
> smarter.  The refactoring would be a bit of churn, but not all that
> much.
> 
> If it's something like "pretty much inevitable", preparing the reuse now
> saves us that churn, and makes sense.
> 
> Judgement call, i.e. you decide.

Hoping to hear back from the TDX folks on whether there's anything
missing to switch things on there too, at which point maybe it won't be
premature to have a common type. But yah, until then I can plan to keep
the option specific to SEV/SevCommonProperties. As you said, not a big
deal to move it out to a common base after-the-fact.

> 
> > It is sort of tied to the 'allow_convert_in_place' flag that is part of
> > the common ConfidentialGuestSupport object struct, so the property
> > handling is sort of tied to the common ConfidentialGuestSupport base
> > class as well rather than something implementation-specific.
> 
> Valid point.  Not sure how much weight to assign to it, though.

Yah, I was purposely trying to make it easy for other platforms to switch
it on, but if we do for the time being keep it limited to SNP, then
there's probably other ways to go about it.

> 
> > Not sure if there are better ways to handle all that though.
> 
> Work your rationale into the commit message, please.
> 

Will do!

Thanks,

Mike


* [PATCH RFC 05/12] system/memory: Re-use memory-backend-guest-memfd inode for private memory
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
                   ` (3 preceding siblings ...)
  2026-05-28  0:03 ` [PATCH RFC 04/12] accel/kvm: Add CGS option to control in-place conversion support Michael Roth
@ 2026-05-28  0:03 ` Michael Roth
  2026-05-28  0:03 ` [PATCH RFC 06/12] system/memory: Default to guest_memfd for RAM for in-place conversion Michael Roth
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

When convert-in-place=true, the shared memory provided by the
guest-memfd memory backend should also be used internally for private
memory. Do this by dup()'ing the backend's guest_memfd FD, so that the
shared and private FDs keep separate cleanup paths and can be managed
the same way they currently are for convert-in-place=false (where shared
memory comes from some other backend like memory-backend-memfd).

Since it only currently makes sense to allow a
memory-backend-guest-memfd FD to be used for private memory, introduce a
new RAM_GUEST_MEMFD_SHARED flag that can be used to limit dup()'ing to
specific backend types like memory-backend-guest-memfd.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 backends/hostmem-guest-memfd.c |  1 +
 include/system/memory.h        |  3 +++
 system/physmem.c               | 46 +++++++++++++++++++++++++++++++---
 3 files changed, 47 insertions(+), 3 deletions(-)
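
To illustrate why dup()'ing the FD gives two independently closable
handles on the same backing object, here is a small standalone POSIX
sketch (not part of the patch). It uses plain dup() rather than QEMU's
qemu_dup(), and the helper names are hypothetical. Any file descriptor
works for the demonstration; no guest_memfd is required.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if both descriptors refer to the same inode on the same device. */
static bool same_inode(int fd_a, int fd_b)
{
    struct stat a, b;

    if (fstat(fd_a, &a) != 0 || fstat(fd_b, &b) != 0) {
        return false;
    }
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/*
 * Duplicate shared_fd for use as the "private" handle. Returns the new
 * descriptor, or -1 on failure. The two descriptors can be closed
 * independently; the backing object stays alive until both are closed.
 */
static int dup_for_private(int shared_fd)
{
    int fd = dup(shared_fd);

    if (fd >= 0 && !same_inode(shared_fd, fd)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Closing the original descriptor after the dup leaves the duplicate fully
usable, which is the property that lets the existing, separate cleanup
paths for the shared and private FDs stay as they are.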

diff --git a/backends/hostmem-guest-memfd.c b/backends/hostmem-guest-memfd.c
index deb796a6bd..8ab8242892 100644
--- a/backends/hostmem-guest-memfd.c
+++ b/backends/hostmem-guest-memfd.c
@@ -56,6 +56,7 @@ have_fd:
     ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     ram_flags |= backend->guest_memfd ? RAM_GUEST_MEMFD : 0;
+    ram_flags |= RAM_GUEST_MEMFD_SHARED;
     return memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
                                           backend->size, ram_flags, fd, 0, errp);
 }
diff --git a/include/system/memory.h b/include/system/memory.h
index 24c68720aa..0a371b686a 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -282,6 +282,9 @@ typedef struct IOMMUTLBEvent {
  */
 #define RAM_PRIVATE (1 << 13)
 
+/* RAM is backed by a KVM guest_memfd that can also be used for shared memory */
+#define RAM_GUEST_MEMFD_SHARED   (1 << 14)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
diff --git a/system/physmem.c b/system/physmem.c
index 04c7c38721..ebec7ae7a4 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -59,6 +59,7 @@
 #include "system/hostmem.h"
 #include "system/hw_accel.h"
 #include "system/xen-mapcache.h"
+#include "system/confidential-guest-support.h"
 #include "trace.h"
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -2187,11 +2188,14 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
     if (new_block->flags & RAM_GUEST_MEMFD) {
         int ret;
 
+        assert(current_machine->cgs);
+
         if (!kvm_enabled()) {
             error_setg(errp, "cannot set up private guest memory for %s: KVM required",
                        object_get_typename(OBJECT(current_machine->cgs)));
             goto out_free;
         }
+
         assert(new_block->guest_memfd < 0);
 
         ret = ram_block_coordinated_discard_require(true);
@@ -2202,8 +2206,38 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
             goto out_free;
         }
 
-        new_block->guest_memfd = kvm_create_guest_memfd_private(new_block->max_length,
-                                                                errp);
+        /*
+         * If both shared/private memory are handled by guest_memfd, make sure to
+         * re-use the guest_memfd inode that should have already been created for
+         * handling shared memory.
+         */
+        if (current_machine->cgs->convert_in_place) {
+            if (!(new_block->flags & RAM_GUEST_MEMFD_SHARED)) {
+                error_setg(errp, "configured memory backend is not compatible with in-place conversion");
+                qemu_mutex_unlock_ramlist();
+                goto out_free;
+            }
+            assert(new_block->fd >= 0);
+
+            /*
+             * Current logic calculates guest_memfd_offset on the assumption that
+             * offset 0 corresponds to the first GPA that is backed by the RAM
+             * block/backend. For cases where the guest_memfd is only used for
+             * private memory and created internally as-needed this is always the
+             * case, but when re-using a guest_memfd that's also usable for shared
+             * memory (e.g. via memory-backend-guest-memfd) it's possible that
+             * guest_memfd might be mmap()'d starting at some non-zero offset. For
+             * now, this isn't a reachable condition, but assert this in case this
+             * ever changes and the logic needs to be updated to account for this.
+             */
+            assert(new_block->fd_offset == 0);
+
+            new_block->guest_memfd = qemu_dup(new_block->fd);
+        } else {
+            new_block->guest_memfd =
+                kvm_create_guest_memfd_private(new_block->max_length, errp);
+        }
+
         if (new_block->guest_memfd < 0) {
             qemu_mutex_unlock_ramlist();
             goto out_free;
@@ -2315,7 +2349,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, ram_addr_t max_size,
     assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE |
                           RAM_PROTECTED | RAM_NAMED_FILE | RAM_READONLY |
                           RAM_READONLY_FD | RAM_GUEST_MEMFD |
-                          RAM_RESIZEABLE)) == 0);
+                          RAM_RESIZEABLE | RAM_GUEST_MEMFD_SHARED)) == 0);
     assert(max_size >= size);
 
     if (xen_enabled()) {
@@ -2828,6 +2862,12 @@ int ram_block_rebind(Error **errp)
 {
     RAMBlock *block;
 
+    if (current_machine->cgs && current_machine->cgs->convert_in_place) {
+        error_setg(errp,
+                   "reset support is not yet enabled for in-place conversion");
+        return -1;
+    }
+
     qemu_mutex_lock_ramlist();
 
     RAMBLOCK_FOREACH(block) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 06/12] system/memory: Default to guest_memfd for RAM for in-place conversion
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
                   ` (4 preceding siblings ...)
  2026-05-28  0:03 ` [PATCH RFC 05/12] system/memory: Re-use memory-backend-guest-memfd inode for private memory Michael Roth
@ 2026-05-28  0:03 ` Michael Roth
  2026-05-28  0:03 ` [PATCH RFC 07/12] accel/kvm: Move post-conversion updates to a separate helper Michael Roth
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

memory_region_init_ram_guest_memfd() is called in some cases (legacy
BIOS regions / IGVM regions) to allocate a new RAM region along with a
guest_memfd FD under the covers. The guest_memfd handles private memory,
since the GPA range can be converted between shared and private guest RAM.

When in-place conversion is enabled, the conversions happen with the
guest_memfd inode itself, so the same inode must be used for both shared
and private memory. Handle this accordingly when convert-in-place=true.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
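Note on "the same inode": a duplicated file descriptor refers to the same
underlying inode as the original. The sketch below illustrates only that
general property using plain POSIX dup() and fstat(); it does not use
qemu_dup() or a real guest_memfd, and the sketch_* names are hypothetical
helpers made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* True when both descriptors refer to the same inode on the same device. */
static bool sketch_same_inode(int fd_a, int fd_b)
{
    struct stat a, b;

    if (fstat(fd_a, &a) != 0 || fstat(fd_b, &b) != 0) {
        return false;
    }
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/*
 * Duplicate fd and confirm the duplicate aliases the same inode.
 * Returns the new descriptor, or -1 on failure.
 */
static int sketch_dup_checked(int fd)
{
    int dup_fd = dup(fd);

    if (dup_fd < 0) {
        return -1;
    }
    if (!sketch_same_inode(fd, dup_fd)) {
        close(dup_fd);
        return -1;
    }
    return dup_fd;
}
```

Calling sketch_dup_checked() on any open descriptor should return a second
descriptor for the same inode, whereas two independently created files or
pipes would compare as different inodes.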
 system/memory.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/system/memory.c b/system/memory.c
index 739ba11da6..f6c695fd23 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -35,6 +35,7 @@
 #include "hw/core/boards.h"
 #include "migration/vmstate.h"
 #include "system/address-spaces.h"
+#include "system/confidential-guest-support.h"
 
 #include "memory-internal.h"
 
@@ -3674,10 +3675,25 @@ bool memory_region_init_ram_guest_memfd(MemoryRegion *mr, Object *owner,
                                         const char *name, uint64_t size,
                                         Error **errp)
 {
-    if (!memory_region_init_ram_flags_nomigrate(mr, owner, name, size,
-                                                RAM_GUEST_MEMFD, errp)) {
-        return false;
+    if (current_machine->cgs && current_machine->cgs->convert_in_place) {
+        int fd = kvm_create_guest_memfd_shared(size, errp);
+        if (fd < 0) {
+            return false;
+        }
+
+        if (!memory_region_init_ram_from_fd(mr, owner, name, size,
+                                            RAM_SHARED | RAM_GUEST_MEMFD |
+                                            RAM_GUEST_MEMFD_SHARED,
+                                            fd, 0, errp)) {
+                return false;
+        }
+    } else {
+        if (!memory_region_init_ram_flags_nomigrate(mr, owner, name, size,
+                                                    RAM_GUEST_MEMFD, errp)) {
+            return false;
+        }
     }
+
     memory_region_register_ram(mr, owner);
     return true;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 07/12] accel/kvm: Move post-conversion updates to a separate helper
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
                   ` (5 preceding siblings ...)
  2026-05-28  0:03 ` [PATCH RFC 06/12] system/memory: Default to guest_memfd for RAM for in-place conversion Michael Roth
@ 2026-05-28  0:03 ` Michael Roth
  2026-05-28  0:03 ` [PATCH RFC 08/12] accel/kvm: Re-order attribute notifications for in-place conversion Michael Roth
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

Currently memory attribute conversions are followed up by other
bookkeeping tasks like discarding unused memory or issuing iommufd
notifications. Move these tasks to a separate post-conversion helper to
better compartmentalize and track these tasks, and in doing so lay the
groundwork for a pre-conversion helper which will be needed in the
future.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 accel/kvm/kvm-all.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index a1832712a4..0e6ff2de4b 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3445,20 +3445,26 @@ static int kvm_convert_section(MemoryRegionSection *section, bool to_private)
 {
     hwaddr start = section->offset_within_address_space;
     hwaddr size = int128_get64(section->size);
-    MemoryRegion *mr = section->mr;
-    ram_addr_t offset;
-    RAMBlock *rb;
-    void *addr;
-    int ret = -EINVAL;
+    int ret;
 
     if (to_private) {
         ret = kvm_set_memory_attributes_private(start, size);
     } else {
         ret = kvm_set_memory_attributes_shared(start, size);
     }
-    if (ret) {
-        return ret;
-    }
+
+    return ret;
+}
+
+static int kvm_post_convert_section(MemoryRegionSection *section, bool to_private)
+{
+    hwaddr start = section->offset_within_address_space;
+    hwaddr size = int128_get64(section->size);
+    MemoryRegion *mr = section->mr;
+    ram_addr_t offset;
+    RAMBlock *rb;
+    void *addr;
+    int ret;
 
     addr = memory_region_get_ram_ptr(mr) + section->offset_within_region;
     rb = qemu_ram_block_from_host(addr, false, &offset);
@@ -3485,7 +3491,7 @@ static int kvm_convert_section(MemoryRegionSection *section, bool to_private)
         ret = ram_block_discard_guest_memfd_range(rb, offset, size);
     }
 
-    return ret;
+    return 0;
 }
 
 int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
@@ -3533,6 +3539,12 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
         }
 
         ret = kvm_convert_section(&section, to_private);
+        if (ret) {
+            memory_region_unref(section.mr);
+            break;
+        }
+
+        ret = kvm_post_convert_section(&section, to_private);
         memory_region_unref(section.mr);
 
         if (ret) {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 08/12] accel/kvm: Re-order attribute notifications for in-place conversion
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
                   ` (6 preceding siblings ...)
  2026-05-28  0:03 ` [PATCH RFC 07/12] accel/kvm: Move post-conversion updates to a separate helper Michael Roth
@ 2026-05-28  0:03 ` Michael Roth
  2026-05-28  0:03 ` [PATCH RFC 09/12] accel/kvm: Support shared/private conversions via guest_memfd ioctls Michael Roth
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

ram-block-attribute update notifications are currently sent after
conversions from/to private pages to trigger DMA maps/unmaps of shared
GPA ranges (respectively). However, with in-place conversion additional
requirements on the kernel side come into play which require this
behavior to be adjusted.

For shared->private conversions: the attributes need to be set to
private *after* the notification, since when using VFIO it may not be
possible to update the attribute while it remains pinned due to the
IOMMU mapping, so issue the notification first to ensure unmappings are
done in advance.

For private->shared conversions: the attributes need to be set to shared
*before* the notification, since it will possibly result in the page
being mapped into an IOMMU and trigger guest_memfd's fault handler,
which will expect the page to have its attributes set to shared and will
otherwise raise SIGBUS.

Implement this to enable passthrough support for CoCo guests with
in-place conversion support enabled. For non-inplace conversion, pages
mapped into the IOMMU are not the same physical pages as the ones used
for private accesses by the guest, so neither order risks DMA accesses
to private memory and that path can be consolidated to use the same
handling as well.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
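The ordering described in the commit message can be sketched in isolation as
follows. This is a minimal model, not the code in this patch: the sketch_*
functions are hypothetical stand-ins for the listener notification and the
attribute update, and they only record the order in which they are called.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event log so the call order can be inspected. */
enum sketch_event { EV_NOTIFY, EV_SET_ATTR };

struct sketch_log {
    enum sketch_event events[4];
    int count;
};

static void sketch_record(struct sketch_log *log, enum sketch_event ev)
{
    assert(log->count < 4);
    log->events[log->count++] = ev;
}

/* Stand-in for the listener notification (DMA map/unmap of shared ranges). */
static int sketch_notify(struct sketch_log *log, bool to_private)
{
    (void)to_private;
    sketch_record(log, EV_NOTIFY);
    return 0;
}

/* Stand-in for the memory attribute update itself. */
static int sketch_set_attr(struct sketch_log *log, bool to_private)
{
    (void)to_private;
    sketch_record(log, EV_SET_ATTR);
    return 0;
}

static int sketch_convert(struct sketch_log *log, bool to_private)
{
    int ret;

    /* shared->private: unmap from the IOMMU before flipping the attribute. */
    if (to_private) {
        ret = sketch_notify(log, to_private);
        if (ret) {
            return ret;
        }
    }

    ret = sketch_set_attr(log, to_private);
    if (ret) {
        return ret;
    }

    /* private->shared: flip first, so later faults see a shared page. */
    if (!to_private) {
        ret = sketch_notify(log, to_private);
        if (ret) {
            return ret;
        }
    }
    return 0;
}
```

With this model, a to_private=true conversion records the notification before
the attribute update, and a to_private=false conversion records them in the
opposite order.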
 accel/kvm/kvm-all.c | 70 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 7 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 0e6ff2de4b..62f2e8aa15 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3456,6 +3456,47 @@ static int kvm_convert_section(MemoryRegionSection *section, bool to_private)
     return ret;
 }
 
+static int kvm_pre_convert_section(MemoryRegionSection *section, bool to_private)
+{
+    hwaddr start = section->offset_within_address_space;
+    hwaddr size = int128_get64(section->size);
+    MemoryRegion *mr = section->mr;
+    ram_addr_t offset;
+    RAMBlock *rb;
+    void *addr;
+    int ret;
+
+    addr = memory_region_get_ram_ptr(mr) + section->offset_within_region;
+    rb = qemu_ram_block_from_host(addr, false, &offset);
+
+    /*
+     * The attributes need to be set to private *after* the notification
+     * of a shared->private conversion, since when using VFIO it may not
+     * be possible to update the attribute while it remains pinned due
+     * to the IOMMU mapping, so issue the notification first to ensure
+     * unmappings are done in advance.
+     *
+     * There is an asymmetry here in that if the subsequent memory
+     * attribute update fails, this notification is out of sync with the
+     * state as tracked by guest_memfd, which isn't ideal, but memory
+     * attribute failures are not expected to be recoverable anyway, so
+     * it would be a waste of time to roll back the notification and
+     * re-trigger things like mapping the page via iommufd.
+     */
+    if (to_private) {
+        ret = ram_block_attributes_state_change(rb->attributes,
+                                                offset, size, to_private);
+        if (ret) {
+            error_report("Failed to notify the listener the state change of "
+                         "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s, ret %d",
+                         start, size, to_private ? "private" : "shared", ret);
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 static int kvm_post_convert_section(MemoryRegionSection *section, bool to_private)
 {
     hwaddr start = section->offset_within_address_space;
@@ -3469,13 +3510,22 @@ static int kvm_post_convert_section(MemoryRegionSection *section, bool to_privat
     addr = memory_region_get_ram_ptr(mr) + section->offset_within_region;
     rb = qemu_ram_block_from_host(addr, false, &offset);
 
-    ret = ram_block_attributes_state_change(rb->attributes,
-                                            offset, size, to_private);
-    if (ret) {
-        error_report("Failed to notify the listener the state change of "
-                     "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s, ret %d",
-                     start, size, to_private ? "private" : "shared", ret);
-        return ret;
+    /*
+     * The attributes need to have been set to shared *before* the notification
+     * of a private->shared conversion, since it will possibly result in the
+     * page being mapped into an IOMMU when using VFIO and trigger
+     * guest_memfd's fault handler, which will expect the page to have its
+     * attributes set to shared.
+     */
+    if (!to_private) {
+        ret = ram_block_attributes_state_change(rb->attributes,
+                                                offset, size, to_private);
+        if (ret) {
+            error_report("Failed to notify the listener the state change of "
+                         "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s, ret %d",
+                         start, size, to_private ? "private" : "shared", ret);
+            return ret;
+        }
     }
 
     if (to_private) {
@@ -3538,6 +3588,12 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
             continue;
         }
 
+        ret = kvm_pre_convert_section(&section, to_private);
+        if (ret) {
+            memory_region_unref(section.mr);
+            break;
+        }
+
         ret = kvm_convert_section(&section, to_private);
         if (ret) {
             memory_region_unref(section.mr);
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 09/12] accel/kvm: Support shared/private conversions via guest_memfd ioctls
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
                   ` (7 preceding siblings ...)
  2026-05-28  0:03 ` [PATCH RFC 08/12] accel/kvm: Re-order attribute notifications for in-place conversion Michael Roth
@ 2026-05-28  0:03 ` Michael Roth
  2026-06-04 13:19   ` Gupta, Pankaj
  2026-05-28  0:03 ` [PATCH RFC 10/12] accel/kvm: Don't default to private attributes for in-place conversion Michael Roth
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

When using guest_memfd with support for shared memory / in-place
conversion, it is necessary to use the guest_memfd ioctls to handle
conversions instead of KVM ioctls. Implement support for this by looping
through all the sections within a conversion range. Implement everything
in terms of the kvm_convert_memory() loop, which already deals with some
special considerations regarding various holes / region types that might
be encountered.

Also update kvm_set_memory_attributes_*() to use the same common path
when convert-in-place=false. This potentially results in a small change
in behavior due to the additional MMIO checks/skips now being applied in
that case (generally qemu-triggered during setup) rather than only for
kvm_convert_memory() (generally guest-triggered), but this is arguably
safer, and it provides similar behavior between convert-in-place=false
vs. convert-in-place=true, the latter of which *must* skip MMIO holes
because the regions (and associated guest_memfds) themselves track
shared/private state internally and passing the whole conversion range
through to KVM is not an option in that case.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
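The error-conversion and retry pattern used for the guest_memfd ioctl can be
illustrated on its own. The sketch below is an assumption-laden model rather
than the patch's code: a hypothetical callback type stands in for ioctl(), and
two test doubles simulate a transiently busy call and a hard failure.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback type standing in for an ioctl() on a guest_memfd. */
typedef int (*sketch_ioctl_fn)(void *opaque);

/* Convert the libc convention (-1 plus errno) into a negative errno value. */
static int sketch_call(sketch_ioctl_fn fn, void *opaque)
{
    int ret = fn(opaque);

    if (ret == -1) {
        ret = -errno;
    }
    return ret;
}

/* Retry for as long as the callee reports a transient -EAGAIN. */
static int sketch_call_retry(sketch_ioctl_fn fn, void *opaque)
{
    int ret;

    do {
        ret = sketch_call(fn, opaque);
    } while (ret == -EAGAIN);

    return ret;
}

/* Test double: fails with EAGAIN until *remaining reaches zero. */
static int sketch_busy_then_ok(void *opaque)
{
    int *remaining = opaque;

    if (*remaining > 0) {
        (*remaining)--;
        errno = EAGAIN;
        return -1;
    }
    return 0;
}

/* Test double: always fails with EINVAL, a non-transient error. */
static int sketch_always_einval(void *opaque)
{
    (void)opaque;
    errno = EINVAL;
    return -1;
}
```

As in the commit's own comment, an unbounded loop like this only makes sense
if -EAGAIN is genuinely transient; a caller that cannot rely on that would
need a bound or a backoff instead.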
 accel/kvm/kvm-all.c | 131 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 114 insertions(+), 17 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 62f2e8aa15..fd01435a0f 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1626,14 +1626,78 @@ static int kvm_set_memory_attributes(hwaddr start, uint64_t size, uint64_t attr)
     return r;
 }
 
-int kvm_set_memory_attributes_private(hwaddr start, uint64_t size)
+static int kvm_gmem_ioctl(int guest_memfd, unsigned long type, ...)
 {
-    return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+    int ret;
+    void *arg;
+    va_list ap;
+
+    va_start(ap, type);
+    arg = va_arg(ap, void *);
+    va_end(ap);
+
+    ret = ioctl(guest_memfd, type, arg);
+    if (ret == -1) {
+        ret = -errno;
+    }
+    return ret;
 }
 
-int kvm_set_memory_attributes_shared(hwaddr start, uint64_t size)
+static int guest_memfd_set_memory_attributes_fd(int guest_memfd, hwaddr offset,
+                                                uint64_t size, uint64_t attr)
 {
-    return kvm_set_memory_attributes(start, size, 0);
+    struct kvm_memory_attributes2 attrs;
+    int r;
+
+    assert((attr & kvm_supported_memory_attributes) == attr);
+    attrs.attributes = attr;
+    attrs.offset = offset;
+    attrs.size = size;
+    attrs.flags = 0;
+
+    /*
+     * guest_memfd may need to delay conversion requests due to
+     * the memory being in-use by the kernel. In most cases these
+     * will be transient uses. In some cases, userspace itself may
+     * be the cause of the memory being considered in-use, though
+     * QEMU currently takes steps to avoid this (e.g. via
+     * RamBlockAttributes). On that basis, this code loops
+     * indefinitely with the assumption that only transient cases
+     * will block, and that those will be for relatively short
+     * periods vs. the overall conversion path.
+     * If those assumptions at some point prove false, most likely
+     * this will manifest as guest-side lockups on their conversion
+     * path, which seems like the appropriate way to surface this
+     * situation to the guest owner rather than some hard timeout.
+     */
+    do {
+        r = kvm_gmem_ioctl(guest_memfd, KVM_SET_MEMORY_ATTRIBUTES2, &attrs);
+    } while (r == -EAGAIN);
+
+    if (r) {
+        error_report("failed to set memory (0x%" HWADDR_PRIx "+0x%" PRIx64 ") "
+                     "with attr 0x%" PRIx64 " error '%s'",
+                     offset, size, attr, strerror(-r));
+    }
+    return r;
+}
+
+static int guest_memfd_set_memory_section_attributes(MemoryRegionSection *section, uint64_t attr)
+{
+    hwaddr convert_offset, convert_size;
+    MemoryRegion *mr = section->mr;
+    RAMBlock *rb;
+
+    assert(mr);
+    rb = mr->ram_block;
+    assert(rb->guest_memfd);
+    convert_offset = section->offset_within_region;
+    convert_size = int128_get64(section->size);
+
+    return guest_memfd_set_memory_attributes_fd(rb->guest_memfd,
+                                                convert_offset,
+                                                convert_size,
+                                                attr);
 }
 
 /* Called with KVMMemoryListener.slots_lock held */
@@ -3447,10 +3511,18 @@ static int kvm_convert_section(MemoryRegionSection *section, bool to_private)
     hwaddr size = int128_get64(section->size);
     int ret;
 
-    if (to_private) {
-        ret = kvm_set_memory_attributes_private(start, size);
+    if (current_machine->cgs && current_machine->cgs->convert_in_place) {
+        ret = guest_memfd_set_memory_section_attributes(section,
+                                                        to_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE
+                                                                   : 0);
     } else {
-        ret = kvm_set_memory_attributes_shared(start, size);
+        /*
+         * Without in-place conversion, attribute-tracking is handled by KVM
+         * across all guest memory rather than on a per-section/slot basis.
+         */
+        ret = kvm_set_memory_attributes(start, size,
+                                        to_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE
+                                                   : 0);
     }
 
     return ret;
@@ -3544,7 +3616,8 @@ static int kvm_post_convert_section(MemoryRegionSection *section, bool to_privat
     return 0;
 }
 
-int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
+static int kvm_convert_memory_full(hwaddr start, hwaddr size, bool to_private,
+                                   bool pre_hooks, bool post_hooks)
 {
     int ret = -EINVAL;
 
@@ -3588,10 +3661,12 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
             continue;
         }
 
-        ret = kvm_pre_convert_section(&section, to_private);
-        if (ret) {
-            memory_region_unref(section.mr);
-            break;
+        if (pre_hooks) {
+            ret = kvm_pre_convert_section(&section, to_private);
+            if (ret) {
+                memory_region_unref(section.mr);
+                break;
+            }
         }
 
         ret = kvm_convert_section(&section, to_private);
@@ -3600,13 +3675,15 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
             break;
         }
 
-        ret = kvm_post_convert_section(&section, to_private);
-        memory_region_unref(section.mr);
-
-        if (ret) {
-            break;
+        if (post_hooks) {
+            ret = kvm_post_convert_section(&section, to_private);
+            if (ret) {
+                memory_region_unref(section.mr);
+                break;
+            }
         }
 
+        memory_region_unref(section.mr);
         size -= section_end - start;
         start = section_end;
     }
@@ -3614,6 +3691,26 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
     return ret;
 }
 
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
+{
+    return kvm_convert_memory_full(start, size, to_private, true, true);
+}
+
+static int kvm_convert_memory_attributes(hwaddr start, hwaddr size, bool to_private)
+{
+    return kvm_convert_memory_full(start, size, to_private, false, false);
+}
+
+int kvm_set_memory_attributes_private(hwaddr start, uint64_t size)
+{
+    return kvm_convert_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+int kvm_set_memory_attributes_shared(hwaddr start, uint64_t size)
+{
+    return kvm_convert_memory_attributes(start, size, 0);
+}
+
 int kvm_cpu_exec(CPUState *cpu)
 {
     struct kvm_run *run = cpu->kvm_run;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 09/12] accel/kvm: Support shared/private conversions via guest_memfd ioctls
  2026-05-28  0:03 ` [PATCH RFC 09/12] accel/kvm: Support shared/private conversions via guest_memfd ioctls Michael Roth
@ 2026-06-04 13:19   ` Gupta, Pankaj
  2026-06-04 23:36       ` Michael Roth via qemu development
  0 siblings, 1 reply; 26+ messages in thread
From: Gupta, Pankaj @ 2026-06-04 13:19 UTC (permalink / raw)
  To: Michael Roth, qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, isaku.yamahata, xiaoyao.li,
	chao.p.peng, david, ashish.kalra, ackerleytng


> When using guest_memfd with support for shared memory / in-place
> conversion, it is necessary to use the guest_memfd ioctls to handle
> conversions instead of KVM ioctls. Implement support for this by looping
> through all the sections within a conversion range. Implement everything
> in terms of the kvm_convert_memory() loop, which already deals with some
> special considerations regarding various holes / region types that might
> be encountered.
>
> Also update kvm_set_memory_attributes_*() to use the same common path
> when convert-in-place=false. This potentially results in a small change
> in behavior due to the additional MMIO checks/skips now being applied in
> that case (generally qemu-triggered during setup) rather than only for
> kvm_convert_memory() (generally guest-triggered), but this is arguably
> safer, and it provides similar behavior between convert-in-place=false
> vs. convert-in-place=true, the latter of which *must* skip MMIO holes
> because the regions (and associated guest_memfds) themselves track
> shared/private state internally and passing the whole conversion range
> through to KVM is not an option in that case.
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> ---
>   accel/kvm/kvm-all.c | 131 ++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 114 insertions(+), 17 deletions(-)
>
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 62f2e8aa15..fd01435a0f 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1626,14 +1626,78 @@ static int kvm_set_memory_attributes(hwaddr start, uint64_t size, uint64_t attr)
>       return r;
>   }
>   
> -int kvm_set_memory_attributes_private(hwaddr start, uint64_t size)
> +static int kvm_gmem_ioctl(int guest_memfd, unsigned long type, ...)
>   {
> -    return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +    int ret;
> +    void *arg;
> +    va_list ap;
> +
> +    va_start(ap, type);
> +    arg = va_arg(ap, void *);
> +    va_end(ap);
> +
> +    ret = ioctl(guest_memfd, type, arg);
> +    if (ret == -1) {
> +        ret = -errno;
> +    }
> +    return ret;
>   }
>   
> -int kvm_set_memory_attributes_shared(hwaddr start, uint64_t size)
> +static int guest_memfd_set_memory_attributes_fd(int guest_memfd, hwaddr offset,
> +                                                uint64_t size, uint64_t attr)
>   {
> -    return kvm_set_memory_attributes(start, size, 0);
> +    struct kvm_memory_attributes2 attrs;

-    struct kvm_memory_attributes2 attrs;
+    struct kvm_memory_attributes2 attrs = {0};

Zero-initializing 'attrs' fixed an '-EINVAL' error caused by the kernel's
'attrs.reserved' check failing in 'kvm_gmem_set_attributes()'.

Thanks,

Pankaj
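
The effect of zero-initialization described above can be reproduced with a
small stand-alone sketch. Note the assumptions: struct sketch_attrs is a
hypothetical stand-in and does NOT reflect the real layout of
struct kvm_memory_attributes2, and sketch_check_reserved() only mimics the
kind of reserved-field check mentioned in this reply.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in; NOT the real struct kvm_memory_attributes2 layout. */
struct sketch_attrs {
    uint64_t offset;
    uint64_t size;
    uint64_t attributes;
    uint64_t flags;
    uint64_t reserved[4];
};

/* Mimics a kernel-side check that rejects non-zero reserved fields. */
static int sketch_check_reserved(const struct sketch_attrs *a)
{
    size_t n = sizeof(a->reserved) / sizeof(a->reserved[0]);

    for (size_t i = 0; i < n; i++) {
        if (a->reserved[i]) {
            return -EINVAL;
        }
    }
    return 0;
}

static struct sketch_attrs sketch_make_attrs(uint64_t offset, uint64_t size,
                                             uint64_t attributes)
{
    /* Designated initializer: every member not named is zero-initialized. */
    struct sketch_attrs a = {
        .offset = offset,
        .size = size,
        .attributes = attributes,
    };

    return a;
}
```

A struct declared without an initializer and filled in field by field leaves
any unnamed members (such as reserved[] here) with indeterminate contents,
which is why either "= {0}" or a designated initializer avoids the failure.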

> +    int r;
> +
> +    assert((attr & kvm_supported_memory_attributes) == attr);
> +    attrs.attributes = attr;
> +    attrs.offset = offset;
> +    attrs.size = size;
> +    attrs.flags = 0;
> +
> +    /*
> +     * guest_memfd may need to delay conversion requests due to
> +     * the memory being in-use by the kernel. In most cases these
> +     * will be transient uses. In some cases, userspace itself may
> +     * be the cause of the memory being considered in-use, though
> +     * QEMU currently takes steps to avoid this (e.g. via
> +     * RamBlockAttributes). On that basis, this code loops
> +     * indefinitely with the assumption that only transient cases
> +     * will block, and that those will be for relatively short
> +     * periods vs. the overall conversion path.
> +     * If those assumptions at some point prove false, most likely
> +     * this will manifest as guest-side lockups on their conversion
> +     * path, which seems like the appropriate way to surface this
> +     * situation to the guest owner rather than some hard timeout.
> +     */
> +    do {
> +        r = kvm_gmem_ioctl(guest_memfd, KVM_SET_MEMORY_ATTRIBUTES2, &attrs);
> +    } while (r == -EAGAIN);
> +
> +    if (r) {
> +        error_report("failed to set memory (0x%" HWADDR_PRIx "+0x%" PRIx64 ") "
> +                     "with attr 0x%" PRIx64 " error '%s'",
> +                     offset, size, attr, strerror(-r));
> +    }
> +    return r;
> +}
> +
> +static int guest_memfd_set_memory_section_attributes(MemoryRegionSection *section, uint64_t attr)
> +{
> +    hwaddr convert_offset, convert_size;
> +    MemoryRegion *mr = section->mr;
> +    RAMBlock *rb;
> +
> +    assert(mr);
> +    rb = mr->ram_block;
> +    assert(rb->guest_memfd);
> +    convert_offset = section->offset_within_region;
> +    convert_size = int128_get64(section->size);
> +
> +    return guest_memfd_set_memory_attributes_fd(rb->guest_memfd,
> +                                                convert_offset,
> +                                                convert_size,
> +                                                attr);
>   }
>   
>   /* Called with KVMMemoryListener.slots_lock held */
> @@ -3447,10 +3511,18 @@ static int kvm_convert_section(MemoryRegionSection *section, bool to_private)
>       hwaddr size = int128_get64(section->size);
>       int ret;
>   
> -    if (to_private) {
> -        ret = kvm_set_memory_attributes_private(start, size);
> +    if (current_machine->cgs && current_machine->cgs->convert_in_place) {
> +        ret = guest_memfd_set_memory_section_attributes(section,
> +                                                        to_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE
> +                                                                   : 0);
>       } else {
> -        ret = kvm_set_memory_attributes_shared(start, size);
> +        /*
> +         * Without in-place conversion, attribute-tracking is handled by KVM
> +         * across all guest memory rather than on a per-section/slot basis.
> +         */
> +        ret = kvm_set_memory_attributes(start, size,
> +                                        to_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE
> +                                                   : 0);
>       }
>   
>       return ret;
> @@ -3544,7 +3616,8 @@ static int kvm_post_convert_section(MemoryRegionSection *section, bool to_privat
>       return 0;
>   }
>   
> -int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> +static int kvm_convert_memory_full(hwaddr start, hwaddr size, bool to_private,
> +                                   bool pre_hooks, bool post_hooks)
>   {
>       int ret = -EINVAL;
>   
> @@ -3588,10 +3661,12 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
>               continue;
>           }
>   
> -        ret = kvm_pre_convert_section(&section, to_private);
> -        if (ret) {
> -            memory_region_unref(section.mr);
> -            break;
> +        if (pre_hooks) {
> +            ret = kvm_pre_convert_section(&section, to_private);
> +            if (ret) {
> +                memory_region_unref(section.mr);
> +                break;
> +            }
>           }
>   
>           ret = kvm_convert_section(&section, to_private);
> @@ -3600,13 +3675,15 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
>               break;
>           }
>   
> -        ret = kvm_post_convert_section(&section, to_private);
> -        memory_region_unref(section.mr);
> -
> -        if (ret) {
> -            break;
> +        if (post_hooks) {
> +            ret = kvm_post_convert_section(&section, to_private);
> +            if (ret) {
> +                memory_region_unref(section.mr);
> +                break;
> +            }
>           }
>   
> +        memory_region_unref(section.mr);
>           size -= section_end - start;
>           start = section_end;
>       }
> @@ -3614,6 +3691,26 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
>       return ret;
>   }
>   
> +int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> +{
> +    return kvm_convert_memory_full(start, size, to_private, true, true);
> +}
> +
> +static int kvm_convert_memory_attributes(hwaddr start, hwaddr size, bool to_private)
> +{
> +    return kvm_convert_memory_full(start, size, to_private, false, false);
> +}
> +
> +int kvm_set_memory_attributes_private(hwaddr start, uint64_t size)
> +{
> +    return kvm_convert_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +}
> +
> +int kvm_set_memory_attributes_shared(hwaddr start, uint64_t size)
> +{
> +    return kvm_convert_memory_attributes(start, size, 0);
> +}
> +
>   int kvm_cpu_exec(CPUState *cpu)
>   {
>       struct kvm_run *run = cpu->kvm_run;
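
For readers following the refactor quoted above, here is a minimal, self-contained sketch of the control flow in kvm_convert_memory_full(), assuming a simplified model in which each section carries a reference count and a stage at which it fails. The names demo_section, demo_stage() and demo_convert_full() are hypothetical and are not QEMU APIs; the only thing the sketch tries to show is that the reference taken for a section is dropped exactly once on every path, whether or not the pre/post hooks are enabled.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a section: refs stands in for the MemoryRegion
 * reference; fail_stage selects which step fails (0 = none, 1 = pre hook,
 * 2 = conversion, 3 = post hook). */
struct demo_section {
    int refs;
    int fail_stage;
};

/* Returns -EIO if this section is configured to fail at the given stage. */
static int demo_stage(const struct demo_section *s, int stage)
{
    return s->fail_stage == stage ? -EIO : 0;
}

/* Mirrors the shape of the patched loop: hooks are optional, and every exit
 * path drops the reference taken at the top of the iteration. */
static int demo_convert_full(struct demo_section *sections, int n,
                             bool pre_hooks, bool post_hooks)
{
    int ret = -EINVAL; /* same initial value as the patch: empty range fails */

    for (int i = 0; i < n; i++) {
        struct demo_section *s = &sections[i];

        s->refs++; /* stands in for the reference taken by the lookup */

        if (pre_hooks) {
            ret = demo_stage(s, 1);
            if (ret) {
                s->refs--;
                break;
            }
        }

        ret = demo_stage(s, 2);
        if (ret) {
            s->refs--;
            break;
        }

        if (post_hooks) {
            ret = demo_stage(s, 3);
            if (ret) {
                s->refs--;
                break;
            }
        }

        s->refs--;
    }

    return ret;
}
```

Calling demo_convert_full() with both hook flags false corresponds to the attribute-only path used by kvm_set_memory_attributes_private()/_shared() in the patch, while passing true for both corresponds to kvm_convert_memory().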

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 09/12] accel/kvm: Support shared/private conversions via guest_memfd ioctls
  2026-06-04 13:19   ` Gupta, Pankaj
@ 2026-06-04 23:36       ` Michael Roth via qemu development
  0 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-06-04 23:36 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: qemu-devel, kvm, pbonzini, berrange, armbru, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

On Thu, Jun 04, 2026 at 03:19:17PM +0200, Gupta, Pankaj wrote:
> 
> > When using guest_memfd with support for shared memory / in-place
> > conversion, it is necessary to use the guest_memfd ioctls to handle
> > conversions instead of KVM ioctls. Implement support for this by looping
> > through all the sections within a conversion range. Implement everything
> > in terms of the kvm_convert_memory() loop, which already deals with some
> > special considerations regarding various holes / region types that might
> > be encountered.
> > 
> > Also update kvm_set_memory_attributes_*() to use the same common path
> > when convert-in-place=false. This potentially results in a small change
> > in behavior due to the additional MMIO checks/skips now being applied in
> > that case (generally qemu-triggered during setup) rather than only for
> > kvm_convert_memory() (generally guest-triggered), but this is arguably
> > safer, and it provides similar behavior between convert-in-place=false
> > vs. convert-in-place=true, the latter of which *must* skip MMIO holes
> > because the regions (and associated guest_memfds) themselves track
> > shared/private state internally and passing the whole conversion range
> > through to KVM is not an option in that case.
> > 
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> > ---
> >   accel/kvm/kvm-all.c | 131 ++++++++++++++++++++++++++++++++++++++------
> >   1 file changed, 114 insertions(+), 17 deletions(-)
> > 
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> > index 62f2e8aa15..fd01435a0f 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -1626,14 +1626,78 @@ static int kvm_set_memory_attributes(hwaddr start, uint64_t size, uint64_t attr)
> >       return r;
> >   }
> > -int kvm_set_memory_attributes_private(hwaddr start, uint64_t size)
> > +static int kvm_gmem_ioctl(int guest_memfd, unsigned long type, ...)
> >   {
> > -    return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> > +    int ret;
> > +    void *arg;
> > +    va_list ap;
> > +
> > +    va_start(ap, type);
> > +    arg = va_arg(ap, void *);
> > +    va_end(ap);
> > +
> > +    ret = ioctl(guest_memfd, type, arg);
> > +    if (ret == -1) {
> > +        ret = -errno;
> > +    }
> > +    return ret;
> >   }
> > -int kvm_set_memory_attributes_shared(hwaddr start, uint64_t size)
> > +static int guest_memfd_set_memory_attributes_fd(int guest_memfd, hwaddr offset,
> > +                                                uint64_t size, uint64_t attr)
> >   {
> > -    return kvm_set_memory_attributes(start, size, 0);
> > +    struct kvm_memory_attributes2 attrs;
> 
> -    struct kvm_memory_attributes2 attrs;
> +    struct kvm_memory_attributes2 attrs = {0};
> 
> Zero-initializing 'attrs' fixed an '-EINVAL' error caused by the kernel's
> 'attrs.reserved' check failing in 'kvm_gmem_set_attributes()'.

Indeed, thanks for the catch!

-Mike

> 
> Thanks,
> 
> Pankaj
> 
> > +    int r;
> > +
> > +    assert((attr & kvm_supported_memory_attributes) == attr);
> > +    attrs.attributes = attr;
> > +    attrs.offset = offset;
> > +    attrs.size = size;
> > +    attrs.flags = 0;
> > +
> > +    /*
> > +     * guest_memfd may need to delay conversion requests due to
> > +     * the memory being in-use by the kernel. In most cases these
> > +     * will be transient uses. In some cases, userspace itself may
> > +     * be the cause of the memory being considered in-use, though
> > +     * QEMU currently takes steps to avoid this (e.g. via
> > +     * RamBlockAttributes). On that basis, this code loops
> > +     * indefinitely with the assumption that only transient cases
> > +     * will block, and that those will be for relatively short
> > +     * periods vs. the overall conversion path.
> > +     * If those assumptions at some point prove false, most likely
> > +     * this will manifest as guest-side lockups on their conversion
> > +     * path, which seems like the appropriate way to surface this
> > +     * situation to the guest owner rather than some hard timeout.
> > +     */
> > +    do {
> > +        r = kvm_gmem_ioctl(guest_memfd, KVM_SET_MEMORY_ATTRIBUTES2, &attrs);
> > +    } while (r == -EAGAIN);
> > +
> > +    if (r) {
> > +        error_report("failed to set memory (0x%" HWADDR_PRIx "+0x%" PRIx64 ") "
> > +                     "with attr 0x%" PRIx64 " error '%s'",
> > +                     offset, size, attr, strerror(-r));
> > +    }
> > +    return r;
> > +}
> > +
> > +static int guest_memfd_set_memory_section_attributes(MemoryRegionSection *section, uint64_t attr)
> > +{
> > +    hwaddr convert_offset, convert_size;
> > +    MemoryRegion *mr = section->mr;
> > +    RAMBlock *rb;
> > +
> > +    assert(mr);
> > +    rb = mr->ram_block;
> > +    assert(rb->guest_memfd);
> > +    convert_offset = section->offset_within_region;
> > +    convert_size = int128_get64(section->size);
> > +
> > +    return guest_memfd_set_memory_attributes_fd(rb->guest_memfd,
> > +                                                convert_offset,
> > +                                                convert_size,
> > +                                                attr);
> >   }
> >   /* Called with KVMMemoryListener.slots_lock held */
> > @@ -3447,10 +3511,18 @@ static int kvm_convert_section(MemoryRegionSection *section, bool to_private)
> >       hwaddr size = int128_get64(section->size);
> >       int ret;
> > -    if (to_private) {
> > -        ret = kvm_set_memory_attributes_private(start, size);
> > +    if (current_machine->cgs && current_machine->cgs->convert_in_place) {
> > +        ret = guest_memfd_set_memory_section_attributes(section,
> > +                                                        to_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE
> > +                                                                   : 0);
> >       } else {
> > -        ret = kvm_set_memory_attributes_shared(start, size);
> > +        /*
> > +         * Without in-place conversion, attribute-tracking is handled by KVM
> > +         * across all guest memory rather than on a per-section/slot basis.
> > +         */
> > +        ret = kvm_set_memory_attributes(start, size,
> > +                                        to_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE
> > +                                                   : 0);
> >       }
> >       return ret;
> > @@ -3544,7 +3616,8 @@ static int kvm_post_convert_section(MemoryRegionSection *section, bool to_privat
> >       return 0;
> >   }
> > -int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> > +static int kvm_convert_memory_full(hwaddr start, hwaddr size, bool to_private,
> > +                                   bool pre_hooks, bool post_hooks)
> >   {
> >       int ret = -EINVAL;
> > @@ -3588,10 +3661,12 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> >               continue;
> >           }
> > -        ret = kvm_pre_convert_section(&section, to_private);
> > -        if (ret) {
> > -            memory_region_unref(section.mr);
> > -            break;
> > +        if (pre_hooks) {
> > +            ret = kvm_pre_convert_section(&section, to_private);
> > +            if (ret) {
> > +                memory_region_unref(section.mr);
> > +                break;
> > +            }
> >           }
> >           ret = kvm_convert_section(&section, to_private);
> > @@ -3600,13 +3675,15 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> >               break;
> >           }
> > -        ret = kvm_post_convert_section(&section, to_private);
> > -        memory_region_unref(section.mr);
> > -
> > -        if (ret) {
> > -            break;
> > +        if (post_hooks) {
> > +            ret = kvm_post_convert_section(&section, to_private);
> > +            if (ret) {
> > +                memory_region_unref(section.mr);
> > +                break;
> > +            }
> >           }
> > +        memory_region_unref(section.mr);
> >           size -= section_end - start;
> >           start = section_end;
> >       }
> > @@ -3614,6 +3691,26 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> >       return ret;
> >   }
> > +int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> > +{
> > +    return kvm_convert_memory_full(start, size, to_private, true, true);
> > +}
> > +
> > +static int kvm_convert_memory_attributes(hwaddr start, hwaddr size, bool to_private)
> > +{
> > +    return kvm_convert_memory_full(start, size, to_private, false, false);
> > +}
> > +
> > +int kvm_set_memory_attributes_private(hwaddr start, uint64_t size)
> > +{
> > +    return kvm_convert_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> > +}
> > +
> > +int kvm_set_memory_attributes_shared(hwaddr start, uint64_t size)
> > +{
> > +    return kvm_convert_memory_attributes(start, size, 0);
> > +}
> > +
> >   int kvm_cpu_exec(CPUState *cpu)
> >   {
> >       struct kvm_run *run = cpu->kvm_run;
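
To illustrate the two points discussed in this sub-thread (zero-initializing the argument struct so reserved fields pass the kernel's check, and retrying while the request reports -EAGAIN), here is a minimal sketch that runs without KVM. Everything in it is an assumption made for illustration: struct demo_attrs is only a stand-in for struct kvm_memory_attributes2 (its real layout is not reproduced here), and demo_set_attrs() is a mock of the kernel side rather than the real ioctl.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the ioctl argument; layout is illustrative only. */
struct demo_attrs {
    uint64_t offset;
    uint64_t size;
    uint64_t attributes;
    uint64_t flags;
    uint64_t reserved[4];
};

/* How many times the mock reports a transient -EAGAIN before succeeding. */
static int demo_busy_count;

/* Mock of the kernel side: rejects nonzero flags/reserved fields with
 * -EINVAL, then simulates memory that is transiently in use. */
static int demo_set_attrs(const struct demo_attrs *a)
{
    if (a->flags) {
        return -EINVAL;
    }
    for (int i = 0; i < 4; i++) {
        if (a->reserved[i]) {
            return -EINVAL;
        }
    }
    if (demo_busy_count > 0) {
        demo_busy_count--;
        return -EAGAIN;
    }
    return 0;
}

/* Zero-initialize first, fill in the fields that matter, then retry for as
 * long as the request is reported as transiently blocked. */
static int demo_convert(uint64_t offset, uint64_t size, uint64_t attr)
{
    struct demo_attrs attrs = {0}; /* also clears reserved[] */
    int r;

    attrs.offset = offset;
    attrs.size = size;
    attrs.attributes = attr;

    do {
        r = demo_set_attrs(&attrs);
    } while (r == -EAGAIN);

    return r;
}
```

Without the "= {0}" initializer, reserved[] would hold whatever happened to be on the stack, which is the kind of failure the review comment describes. The unbounded retry mirrors the choice explained in the patch comment; a bounded retry would be a different design and is not what the patch does.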

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH RFC 10/12] accel/kvm: Don't default to private attributes for in-place conversion
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
                   ` (8 preceding siblings ...)
  2026-05-28  0:03 ` [PATCH RFC 09/12] accel/kvm: Support shared/private conversions via guest_memfd ioctls Michael Roth
@ 2026-05-28  0:03 ` Michael Roth
  2026-05-28  0:03 ` [PATCH RFC 11/12] i386/sev: Update SNP_LAUNCH_UPDATE " Michael Roth
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

Without in-place conversion, QEMU can still access shared memory to load
initial state into guest memory prior to launch even if the GPA's memory
attributes default to private, since userspace is accessing a completely
separate pool of memory. With in-place conversion, the memory behind each
of these accesses would first need to be converted to shared, then back to
private, since it all comes from guest_memfd and only shared memory can be
accessed by userspace.

To avoid sprinkling these differences in behavior throughout QEMU when
in-place conversion is enabled, just default to shared. This does not
compromise guest security, since Confidential VMs necessarily enforce
their expected page states via trusted entities, and simply generate
implicit page state changes if their default expectations don't match
KVM's. However,
in most cases a guest will explicitly convert memory to a particular
state before actually using it, so even these implicit conversion
requests should be rare.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 accel/kvm/kvm-all.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index fd01435a0f..c3d399517d 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1808,7 +1808,26 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
             abort();
         }
 
-        if (memory_region_has_guest_memfd(mr)) {
+        /*
+         * Without in-place conversion, QEMU can still access shared memory
+         * to load initial state into guest memory prior to launch even if
+         * the GPA's memory attributes default to private, since userspace
+         * is accessing a completely separate pool of memory. With in-place
+         * conversion, all these accesses would need to first be converted
+         * to shared, then back to private, since the memory all comes from
+         * guest_memfd and only shared memory can be accessed by userspace.
+         *
+         * To avoid sprinkling these differences in behavior throughout QEMU
+         * when in-place conversion is enabled, just default to shared. This
+         * does not compromise guest security, since Confidential VMs will
+         * necessarily enforce this via trusted entities, and simply generate
+         * implicit page state changes if their default expectations don't
+         * match KVM's. However, in most cases a guest will explicitly
+         * convert memory to a particular state before actually using it, so
+         * even these implicit conversion requests should be rare.
+         */
+        if (memory_region_has_guest_memfd(mr) &&
+            !(current_machine->cgs && current_machine->cgs->convert_in_place)) {
             err = kvm_set_memory_attributes_private(start_addr, slot_size);
             if (err) {
                 error_report("%s: failed to set memory attribute private: %s",
-- 
2.43.0
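
As a compact restatement of the condition added in the hunk above, the decision can be sketched as a pure predicate. This is only an illustration: demo_slot_defaults_private() is a hypothetical helper, not part of the patch, and it folds the "current_machine->cgs is NULL" case into convert_in_place being false.

```c
#include <stdbool.h>

/* Hypothetical helper: should a newly registered memory slot have its
 * attributes defaulted to private? Only when the region is backed by
 * guest_memfd and in-place conversion is not enabled. */
static bool demo_slot_defaults_private(bool has_guest_memfd,
                                       bool convert_in_place)
{
    return has_guest_memfd && !convert_in_place;
}
```

With in-place conversion enabled the helper returns false, so memory stays in the shared state and QEMU can write the initial payload directly, without a shared/private round trip.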


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 11/12] i386/sev: Update SNP_LAUNCH_UPDATE for in-place conversion
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
                   ` (9 preceding siblings ...)
  2026-05-28  0:03 ` [PATCH RFC 10/12] accel/kvm: Don't default to private attributes for in-place conversion Michael Roth
@ 2026-05-28  0:03 ` Michael Roth
  2026-05-28  0:03 ` [PATCH RFC 12/12] i386/sev: Allow in-place conversion for SEV-SNP guests Michael Roth
  2026-05-28  5:44 ` [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Xiaoyao Li
  12 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

For in-place conversion, the source pointer is expected to be NULL since
the data has already been written directly to guest memory and doesn't
need to be copied in prior to encrypting it in place as part of the
initial guest memory payload.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 target/i386/sev.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index b44b5a1c2b..32a5e605bf 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -1186,6 +1186,8 @@ sev_snp_launch_update(SevSnpGuestState *sev_snp_guest,
     int ret, fw_error;
     SnpCpuidInfo snp_cpuid_info;
     struct kvm_sev_snp_launch_update update = {0};
+    ConfidentialGuestSupport *cgs =
+        CONFIDENTIAL_GUEST_SUPPORT(OBJECT(sev_snp_guest));
 
     if (!data->hva || !data->len) {
         error_report("SNP_LAUNCH_UPDATE called with invalid address"
@@ -1199,7 +1201,14 @@ sev_snp_launch_update(SevSnpGuestState *sev_snp_guest,
         memcpy(&snp_cpuid_info, data->hva, sizeof(snp_cpuid_info));
     }
 
-    update.uaddr = (__u64)(unsigned long)data->hva;
+    /*
+     * For in-place conversion, the source pointer is expected to be NULL
+     * since the data has already been written directly to guest memory
+     * and only needs to be encrypted in-place for secure access.
+     */
+    if (!cgs->convert_in_place) {
+        update.uaddr = (__u64)(unsigned long)data->hva;
+    }
     update.gfn_start = data->gpa >> TARGET_PAGE_BITS;
     update.len = data->len;
     update.type = data->type;
-- 
2.43.0
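
The effect of the hunk above on the launch-update arguments can be sketched as follows. This is a simplified model rather than the real code: struct demo_launch_update is a hypothetical stand-in for struct kvm_sev_snp_launch_update with only the fields touched here, and DEMO_PAGE_BITS assumes 4 KiB pages in place of TARGET_PAGE_BITS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed 4 KiB pages; stands in for TARGET_PAGE_BITS. */
#define DEMO_PAGE_BITS 12

/* Hypothetical stand-in carrying only the fields set in the patch. */
struct demo_launch_update {
    uint64_t gfn_start;
    uint64_t uaddr;
    uint64_t len;
    uint8_t type;
};

/* Builds the update arguments. The struct starts zeroed, so uaddr stays 0
 * (a NULL source) when in-place conversion is enabled and the payload is
 * already in guest memory. */
static struct demo_launch_update demo_build_update(const void *hva,
                                                   uint64_t gpa, uint64_t len,
                                                   uint8_t type,
                                                   bool convert_in_place)
{
    struct demo_launch_update update = {0};

    if (!convert_in_place) {
        update.uaddr = (uint64_t)(uintptr_t)hva;
    }
    update.gfn_start = gpa >> DEMO_PAGE_BITS;
    update.len = len;
    update.type = type;

    return update;
}
```

Note that the NULL source relies on the zero-initialization of the struct, just as the patch relies on "update = {0}".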



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 12/12] i386/sev: Allow in-place conversion for SEV-SNP guests
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
                   ` (10 preceding siblings ...)
  2026-05-28  0:03 ` [PATCH RFC 11/12] i386/sev: Update SNP_LAUNCH_UPDATE " Michael Roth
@ 2026-05-28  0:03 ` Michael Roth
  2026-05-28  5:44 ` [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Xiaoyao Li
  12 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-05-28  0:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	xiaoyao.li, chao.p.peng, david, ashish.kalra, ackerleytng

All the necessary changes are now in place for an SNP guest to be able
to leverage in-place conversion support. Allow it to be switched on by
users. KVM-specific checks will still gate whether or not the option is
ultimately allowed; this just allows the option to be set via the
command line.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 target/i386/sev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index 32a5e605bf..a56367aa5e 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -3198,6 +3198,7 @@ sev_snp_guest_instance_init(Object *obj)
     SevSnpGuestState *sev_snp_guest = SEV_SNP_GUEST(obj);
 
     cgs->require_guest_memfd = true;
+    cgs->allow_convert_in_place = true;
 
     /* default init/start/finish params for kvm */
     sev_snp_guest->kvm_start_conf.policy = DEFAULT_SEV_SNP_POLICY;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] guest_memfd: support in-place memory conversion
  2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
                   ` (11 preceding siblings ...)
  2026-05-28  0:03 ` [PATCH RFC 12/12] i386/sev: Allow in-place conversion for SEV-SNP guests Michael Roth
@ 2026-05-28  5:44 ` Xiaoyao Li
  2026-06-02 22:20   ` Michael Roth
  12 siblings, 1 reply; 26+ messages in thread
From: Xiaoyao Li @ 2026-05-28  5:44 UTC (permalink / raw)
  To: Michael Roth, qemu-devel, Peter Xu
  Cc: kvm, pbonzini, berrange, armbru, pankaj.gupta, isaku.yamahata,
	chao.p.peng, david, ashish.kalra, ackerleytng

On 5/28/2026 8:03 AM, Michael Roth wrote:
> This patchset is also available at:
> 
>    https://github.com/amdese/qemu/commits/snp-inplace-rfc1
> 
> which is in turn based on the following series:
> 
>    [PATCH 0/4] "guest_memfd: Fix handling for conversions of MMIO ranges"
>    https://lists.gnu.org/archive/html/qemu-devel/2026-05/msg07547.html
> 
> 
> OVERVIEW
> --------
> 
> This series adds guest_memfd support for in-place conversion of memory
> between private/shared, and enables it for SEV-SNP guests. It is based
> on recently-added kernel support for mmap()-able guest_memfd
> instances[1], which allow it to be used for shared memory, and the
> following patchset[2], which adds additional guest_memfd interfaces to
> allow it to be used to perform in-place conversion:
> 
>    "[PATCH v7 00/42] guest_memfd: In-place conversion support"
>    https://lore.kernel.org/kvm/20260522-gmem-inplace-conversion-v7-0-2f0fae496530@google.com/
> 
> That series also introduces a new 'vm_memory_attributes' KVM
> module option, which sets whether memory attributes are tracked
> VM-wide by KVM (vm_memory_attributes=1: the existing 'legacy' mode),
> or per-guest_memfd instance (vm_memory_attributes=0: the new mode
> which allows for in-place conversion). The latter is intended to
> eventually deprecate the legacy mode, at which point in-place
> conversion would become the primarily-supported mode.
> 
> 
> MOTIVATION
> ----------
> 
> Today, SEV-SNP guests (and other CoCo VM types using guest_memfd) keep
> shared and private memory on separate physical backings: a userspace
> memory-backend object for shared pages, and a kernel-allocated
> guest_memfd file descriptor for private pages. KVM_SET_MEMORY_ATTRIBUTES
> flips which backing the guest sees for a given GPA range, and the old
> backing is typically discarded / hole-punched on conversion to avoid
> doubled memory usage.
> 
> That model works, but has a number of downsides that impact certain
> use-cases:
> 
>    - Each conversion involves discarding pages on one side and faulting
>      them in on the other, which incurs allocation overheads in the
>      host kernel for every conversion.
> 
>    - Some use-cases, like pKVM[3], rely on memory isolation rather than
>      encryption and rely on in-place conversion to pass through things
>      like secured framebuffer memory without needing to bounce data
>      through separate shared/private HPAs, which would introduce
>      unacceptable latency for that sort of workload.
> 
>    - Hugetlb support[4] for guest_memfd will rely on it, since things like
>      1GB hugepages with a mix of shared/private sub-ranges would generally
>      require 2 1GB hugetlb pages to remain available to handle shared vs.
>      private accesses, which quickly causes doubling of guest memory usage.
> 
> Recent kernel work[2] makes guest_memfd mmap()-able and lets the *same*
> physical pages be used for both shared and private states for a given
> GPA range, allowing the above pitfalls to be naturally avoided.
> 
> This series wires that support up in QEMU.

+ Peter,

Peter had a series[*] to enable mmap() of guest memfd and allow it to
serve as unencrypted memory for VMs. I believe there is some overlapping
effort.

[*] 
https://lore.kernel.org/qemu-devel/20251215205203.1185099-1-peterx@redhat.com/

> 
> DESIGN
> ------
> 
> A new dedicated memory backend, memory-backend-guest-memfd, allocates
> its memory via a guest_memfd file descriptor obtained from KVM with
> the GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED flags. 

Some quick feedback:

The design choice in Peter's series was to extend the existing
hostmem-memfd backend to support guest_memfd instead of adding a new
dedicated backend.
I think we need to evaluate the pros and cons of each approach and make
a choice.

(I will go read the other part later and provide more feedback)


* Re: [PATCH RFC 00/12] guest_memfd: support in-place memory conversion
  2026-05-28  5:44 ` [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Xiaoyao Li
@ 2026-06-02 22:20   ` Michael Roth
  0 siblings, 0 replies; 26+ messages in thread
From: Michael Roth @ 2026-06-02 22:20 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: qemu-devel, Peter Xu, kvm, pbonzini, berrange, armbru,
	pankaj.gupta, isaku.yamahata, chao.p.peng, david, ashish.kalra,
	ackerleytng

On Thu, May 28, 2026 at 01:44:39PM +0800, Xiaoyao Li wrote:
> On 5/28/2026 8:03 AM, Michael Roth wrote:
> > This patchset is also available at:
> > 
> >    https://github.com/amdese/qemu/commits/snp-inplace-rfc1
> > 
> > which is in turn based on the following series:
> > 
> >    [PATCH 0/4] "guest_memfd: Fix handling for conversions of MMIO ranges"
> >    https://lists.gnu.org/archive/html/qemu-devel/2026-05/msg07547.html
> > 
> > 
> > OVERVIEW
> > --------
> > 
> > This series adds guest_memfd support for in-place conversion of memory
> > between private/shared, and enables it for SEV-SNP guests. It is based
> > on recently-added kernel support for mmap()-able guest_memfd
> > instances[1], which allow it to be used for shared memory, and the
> > following patchset[2], which adds additional guest_memfd interfaces to
> > allow it to be used to perform in-place conversion:
> > 
> >    "[PATCH v7 00/42] guest_memfd: In-place conversion support"
> >    https://lore.kernel.org/kvm/20260522-gmem-inplace-conversion-v7-0-2f0fae496530@google.com/
> > 
> > That series also introduces a new 'vm_memory_attributes' KVM
> > module option, which sets whether memory attributes are tracked
> > VM-wide by KVM (vm_memory_attributes=1: the existing 'legacy' mode),
> > or per-guest_memfd instance (vm_memory_attributes=0: the new mode
> > which allows for in-place conversion). The latter is intended to
> > eventually deprecate the legacy mode, at which point in-place
> > conversion would become the primarily-supported mode.
> > 
> > 
> > MOTIVATION
> > ----------
> > 
> > Today, SEV-SNP guests (and other CoCo VM types using guest_memfd) keep
> > shared and private memory on separate physical backings: a userspace
> > memory-backend object for shared pages, and a kernel-allocated
> > guest_memfd file descriptor for private pages. KVM_SET_MEMORY_ATTRIBUTES
> > flips which backing the guest sees for a given GPA range, and the old
> > backing is typically discarded / hole-punched on conversion to avoid
> > doubled memory usage.
> > 
> > That model works, but has a number of downsides that impact certain
> > use-cases:
> > 
> >    - Each conversion involves discarding pages on one side and faulting
> >      them in on the other, which incurs allocation overheads in the
> >      host kernel for every conversion.
> > 
> >    - Some use-cases, like pKVM[3], rely on memory isolation rather than
> >      encryption and rely on in-place conversion to pass through things
> >      like secured framebuffer memory without needing to bounce data
> >      through separate shared/private HPAs, which would introduce
> >      unacceptable latency for that sort of workload.
> > 
> >    - Hugetlb support[4] for guest_memfd will rely on it, since things like
> >      1GB hugepages with a mix of shared/private sub-ranges would generally
> >      require 2 1GB hugetlb pages to remain available to handle shared vs.
> >      private accesses, which quickly causes doubling of guest memory usage.
> > 
> > Recent kernel work[2] makes guest_memfd mmap()-able and lets the *same*
> > physical pages be used for both shared and private states for a given
> > GPA range, allowing the above pitfalls to be naturally avoided.
> > 
> > This series wires that support up in QEMU.
> 
> + Peter,
> 
> Peter had a series[*] to enable mmap() of guest_memfd and allow it to
> serve as unencrypted memory for VMs. I believe there is some overlapping
> effort.
> 
> [*] https://lore.kernel.org/qemu-devel/20251215205203.1185099-1-peterx@redhat.com/

Thanks, I wasn't aware of that series, but it definitely seems like a
good idea to take that as the base mmap()-able guest_memfd support for
normal VMs and then rebase my in-place conversion / confidential VM
patches on top.

I do think it would be a good idea to introduce a dedicated backend,
however. I brought that up in that thread, but I think the outcome
mostly only calls patch #2 from this series into question, and most of
the other patches still seem like they'll be needed for confidential
VMs.

Thanks for pointing this out.

-Mike

> 
> > 
> > DESIGN
> > ------
> > 
> > A new dedicated memory backend, memory-backend-guest-memfd, allocates
> > its memory via a guest_memfd file descriptor obtained from KVM with
> > the GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED flags.
> 
> Some quick feedback:
> 
> The design choice in Peter's series was to extend the existing
> hostmem-memfd backend to support guest_memfd instead of adding a new
> dedicated backend.
> I think we need to evaluate the pros and cons of each approach and make
> a choice.
> 
> (I will go read the other part later and provide more feedback)


Thread overview: 26+ messages
2026-05-28  0:03 [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Michael Roth
2026-05-28  0:03 ` [PATCH RFC 01/12] accel/kvm: Decouple guest_memfd checks from memory attribute checks Michael Roth
2026-05-28  0:03 ` [PATCH RFC 02/12] hostmem: Introduce dedicated memory backend for guest_memfd Michael Roth
2026-06-02  8:22   ` Markus Armbruster
2026-06-03  6:19     ` Michael Roth
2026-06-08  8:20       ` Markus Armbruster
2026-06-08 20:42         ` Michael Roth
2026-05-28  0:03 ` [PATCH RFC 03/12] linux-headers: Update headers for v7 of in-place conversion kernel support Michael Roth
2026-05-28  0:03 ` [PATCH RFC 04/12] accel/kvm: Add CGS option to control in-place conversion support Michael Roth
2026-06-02  8:23   ` Markus Armbruster
2026-06-03  6:39     ` Michael Roth
2026-06-08  8:15       ` Markus Armbruster
2026-06-08 20:21         ` Michael Roth
2026-05-28  0:03 ` [PATCH RFC 05/12] system/memory: Re-use memory-backend-guest-memfd inode for private memory Michael Roth
2026-05-28  0:03 ` [PATCH RFC 06/12] system/memory: Default to guest_memfd for RAM for in-place conversion Michael Roth
2026-05-28  0:03 ` [PATCH RFC 07/12] accel/kvm: Move post-conversion updates to a separate helper Michael Roth
2026-05-28  0:03 ` [PATCH RFC 08/12] accel/kvm: Re-order attribute notifications for in-place conversion Michael Roth
2026-05-28  0:03 ` [PATCH RFC 09/12] accel/kvm: Support shared/private conversions via guest_memfd ioctls Michael Roth
2026-06-04 13:19   ` Gupta, Pankaj
2026-06-04 23:36     ` Michael Roth
2026-06-04 23:36       ` Michael Roth via qemu development
2026-05-28  0:03 ` [PATCH RFC 10/12] accel/kvm: Don't default to private attributes for in-place conversion Michael Roth
2026-05-28  0:03 ` [PATCH RFC 11/12] i386/sev: Update SNP_LAUNCH_UPDATE " Michael Roth
2026-05-28  0:03 ` [PATCH RFC 12/12] i386/sev: Allow in-place conversion for SEV-SNP guests Michael Roth
2026-05-28  5:44 ` [PATCH RFC 00/12] guest_memfd: support in-place memory conversion Xiaoyao Li
2026-06-02 22:20   ` Michael Roth
