public inbox for linux-trace-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC v3 00/43] guest_memfd: In-place conversion support
@ 2026-03-13  6:12 Ackerley Tng
  2026-03-13  6:12 ` [PATCH RFC v3 01/43] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings Ackerley Tng
                   ` (43 more replies)
  0 siblings, 44 replies; 45+ messages in thread
From: Ackerley Tng @ 2026-03-13  6:12 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	Ackerley Tng

Hi,

(Here's the motivation for this series, which I realized was missing from
the earlier revisions of this series)

Up till now, guest_memfd supports the entire inode worth of memory being
used as all-shared, or all-private. CoCo VMs may request guest memory to be
converted between private and shared states, and the only way to support
that currently would be to have the userspace VMM provide two sources of
backing memory from completely different areas of physical memory.

pKVM has a use case for in-place sharing: the guest and host may be
cooperating on given data, and pKVM doesn't protect data through
encryption, so copying that given data between different areas of physical
memory as part of conversions would be unnecessary work.

This series also serves as a foundation for guest_memfd huge page
support. Now, guest_memfd only supports PAGE_SIZE pages, so if two sources
of backing memory are used, the userspace VMM could maintain a steady total
memory utilized by punching out the pages that are not used. When huge
pages are available in guest_memfd, even if the backing memory source
supports hole punching within a huge page, punching out pages to maintain
the total memory utilized by a VM would be introducing lots of
fragmentation.

In-place conversion avoids fragmentation by allowing the same physical
memory to be used for both shared and private memory, with guest_memfd
tracks the shared/private status of all the pages at a per-page
granularity.

The central principle, which guest_memfd continues to uphold, is that any
guest-private page will not be mappable to host userspace. All pages will
be mmap()-able in host userspace, but accesses to guest-private pages (as
tracked by guest_memfd) will result in a SIGBUS.

This series introduces a guest_memfd ioctl (not kvm, vm or vcpu, but
guest_memfd ioctl) that allows userspace to set memory
attributes (shared/private) directly through the guest_memfd. This is the
appropriate interface because shared/private-ness is a property of memory
and hence the request should be sent directly to the memory provider -
guest_memfd.

I'm intending RFC (v3) as a basis for discussion of flags/content
modes (name TBD) to allow userspace to request guarantees on how the memory
contents will look like after setting memory attributes. The last 6 patches
implement content mode support. These patches will be reordered, and some
of them could be absorbed into earlier patches, in later revisions.

Here are the discussion points I can think of (please add on):

1. (Might hopefully resolve soon?) Should ZERO be supported on shared to
   private conversions? Discussion is at [6].

2. Do we need a CAP for userspace to query the flags/modes supported?

   It seems like there won't be anything dynamic about the flags/modes
   supported.

   The userspace code can check what platform it is running on, and then
   decide ZERO or PRESERVE based on the platform:

   If the VM is running on TDX, it would want to specify ZERO all the
   time. If the VM were running on pKVM it would want to specify PRESERVE
   if it wants to enable in-place sharing, and ZERO if it wants to zero the
   memory.

   If someday TDX supports PRESERVE, then there's room for discovery of
   which algorithm to choose when running the guest. Perhaps that's when
   the CAP should be introduced?

3. What do people think of the structure of how various content modes are
   checked for support or applied? I used overridable weak functions for
   architectures that haven't defined support, and defined overrides for
   x86 to show how I think it would work. For CoCo platforms, I only
   implemented TDX for illustration purposes and might need help with the
   other platforms. Should I have used kvm_x86_ops? I tried and found
   myself defining lots of boilerplate.

4. enum for ZERO and PRESERVE?

   Pros:

   * No way to define both ZERO and PRESERVE (make impossible states
     unrepresentable)
       * e.g. enum kvm_device_type in __u32 type in struct
         kvm_create_device
       * But maybe someday some modes can be used together?
   * Content modes is a defined axis/aspect of setting memory attributes,
     having a separate field avoids having different axes of configuration
     in one field. e.g. MAP_HUGETLB for mmap() is on a different axis from
     MAP_PRIVATE for example
       * I just used flags for this RFC since it's the most common
         approach.

TODOs:

+ Let architectures override content mode handlers on a per-inode basis
  since per-folio overrides means even no-ops, like ZERO on TDX, would
  require iterating all the folios.

Also, in RFC v3, TEST_EXPECT_SIGBUS() is updated to assert that the default
signal handler is installed, so that developers get a clear, explicit
failure if/when something goes wrong.

This series is based on kvm/next, and here's the tree for your convenience:

https://github.com/googleprodkernel/linux-cc/commits/guest_memfd-inplace-conversion-v3

Older series:

+ RFCv2 is at [5]
+ RFCv1 is at [4]
+ Previous versions of this feature, part of other series, are available at
  [1][2][3].

[1] https://lore.kernel.org/all/bd163de3118b626d1005aa88e71ef2fb72f0be0f.1726009989.git.ackerleytng@google.com/
[2] https://lore.kernel.org/all/20250117163001.2326672-6-tabba@google.com/
[3] https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com/
[4] https://lore.kernel.org/all/cover.1760731772.git.ackerleytng@google.com/T/
[5] https://lore.kernel.org/all/cover.1770071243.git.ackerleytng@google.com/T/
[6] https://lore.kernel.org/all/CAEvNRgFUc+9xCoN9Yo5NThHrvbccWAhPwp9nNM2fvx7QqrcJsg@mail.gmail.com/

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
Ackerley Tng (25):
      KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
      KVM: Introduce KVM_SET_MEMORY_ATTRIBUTES2
      KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
      KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
      KVM: selftests: Update framework to use KVM_SET_MEMORY_ATTRIBUTES2
      KVM: selftests: Test using guest_memfd for guest private memory
      KVM: selftests: Test basic single-page conversion flow
      KVM: selftests: Test conversion flow when INIT_SHARED
      KVM: selftests: Test indexing in guest_memfd
      KVM: selftests: Test conversion before allocation
      KVM: selftests: Convert with allocated folios in different layouts
      KVM: selftests: Test precision of conversion
      KVM: selftests: Test that truncation does not change shared/private status
      KVM: selftests: Test conversion with elevated page refcount
      KVM: selftests: Reset shared memory after hole-punching
      KVM: selftests: Provide function to look up guest_memfd details from gpa
      KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
      KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd
      KVM: selftests: Add script to exercise private_mem_conversions_test
      KVM: guest_memfd: Introduce default handlers for content modes
      KVM: guest_memfd: Apply content modes while setting memory attributes
      KVM: x86: Add support for applying content modes
      KVM: x86: Support content mode ZERO for TDX
      KVM: selftests: Allow flags to be specified in set_memory_attributes functions
      KVM: selftests: Update tests to use flag-enabled library functions

Sean Christopherson (18):
      KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
      KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
      KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
      KVM: Stub in ability to disable per-VM memory attribute tracking
      KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
      KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
      KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
      KVM: Let userspace disable per-VM mem attributes, enable per-gmem attributes
      KVM: selftests: Create gmem fd before "regular" fd when adding memslot
      KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset}
      KVM: selftests: Add support for mmap() on guest_memfd in core library
      KVM: selftests: Add selftests global for guest memory attributes capability
      KVM: selftests: Add helpers for calling ioctls on guest_memfd
      KVM: selftests: Test that shared/private status is consistent across processes
      KVM: selftests: Provide common function to set memory attributes
      KVM: selftests: Check fd/flags provided to mmap() when setting up memslot
      KVM: selftests: Update pre-fault test to work with per-guest_memfd attributes
      KVM: selftests: Update private memory exits test work with per-gmem attributes

 Documentation/virt/kvm/api.rst                     | 112 ++++-
 arch/x86/include/asm/kvm_host.h                    |   2 +-
 arch/x86/kvm/Kconfig                               |  15 +-
 arch/x86/kvm/mmu/mmu.c                             |   4 +-
 arch/x86/kvm/x86.c                                 |  86 +++-
 include/linux/kvm_host.h                           |  62 ++-
 include/trace/events/kvm.h                         |   4 +-
 include/uapi/linux/kvm.h                           |  21 +
 tools/testing/selftests/kvm/.gitignore             |   1 +
 tools/testing/selftests/kvm/Makefile.kvm           |   1 +
 .../selftests/kvm/guest_memfd_conversions_test.c   | 496 +++++++++++++++++++++
 tools/testing/selftests/kvm/guest_memfd_test.c     |  57 ++-
 tools/testing/selftests/kvm/include/kvm_util.h     | 136 +++++-
 tools/testing/selftests/kvm/include/test_util.h    |  32 +-
 tools/testing/selftests/kvm/lib/kvm_util.c         | 130 +++---
 tools/testing/selftests/kvm/lib/test_util.c        |   7 -
 .../testing/selftests/kvm/pre_fault_memory_test.c  |   2 +-
 .../kvm/x86/private_mem_conversions_test.c         |  54 ++-
 .../kvm/x86/private_mem_conversions_test.py        | 152 +++++++
 .../selftests/kvm/x86/private_mem_kvm_exits_test.c |  36 +-
 virt/kvm/Kconfig                                   |   3 +-
 virt/kvm/guest_memfd.c                             | 488 +++++++++++++++++++-
 virt/kvm/kvm_main.c                                | 104 ++++-
 23 files changed, 1831 insertions(+), 174 deletions(-)
---
base-commit: d2ea4ff1ce50787a98a3900b3fb1636f3620b7cf
change-id: 20260225-gmem-inplace-conversion-bd0dbd39753a

Best regards,
--
Ackerley Tng <ackerleytng@google.com>


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2026-03-13 15:45 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-13  6:12 [PATCH RFC v3 00/43] guest_memfd: In-place conversion support Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 01/43] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 02/43] KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 03/43] KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 04/43] KVM: Stub in ability to disable per-VM memory attribute tracking Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 06/43] KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 07/43] KVM: Introduce KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 08/43] KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 09/43] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 10/43] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 11/43] KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86 Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 12/43] KVM: Let userspace disable per-VM mem attributes, enable per-gmem attributes Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 13/43] KVM: selftests: Create gmem fd before "regular" fd when adding memslot Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 14/43] KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset} Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 15/43] KVM: selftests: Add support for mmap() on guest_memfd in core library Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 16/43] KVM: selftests: Add selftests global for guest memory attributes capability Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 17/43] KVM: selftests: Update framework to use KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 18/43] KVM: selftests: Add helpers for calling ioctls on guest_memfd Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 19/43] KVM: selftests: Test using guest_memfd for guest private memory Ackerley Tng
2026-03-13  6:12 ` [PATCH RFC v3 20/43] KVM: selftests: Test basic single-page conversion flow Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 21/43] KVM: selftests: Test conversion flow when INIT_SHARED Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 22/43] KVM: selftests: Test indexing in guest_memfd Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 23/43] KVM: selftests: Test conversion before allocation Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 24/43] KVM: selftests: Convert with allocated folios in different layouts Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 25/43] KVM: selftests: Test precision of conversion Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 26/43] KVM: selftests: Test that truncation does not change shared/private status Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 27/43] KVM: selftests: Test that shared/private status is consistent across processes Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 28/43] KVM: selftests: Test conversion with elevated page refcount Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 29/43] KVM: selftests: Reset shared memory after hole-punching Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 30/43] KVM: selftests: Provide function to look up guest_memfd details from gpa Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 31/43] KVM: selftests: Provide common function to set memory attributes Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 32/43] KVM: selftests: Check fd/flags provided to mmap() when setting up memslot Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 33/43] KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 34/43] KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 35/43] KVM: selftests: Add script to exercise private_mem_conversions_test Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 36/43] KVM: selftests: Update pre-fault test to work with per-guest_memfd attributes Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 37/43] KVM: selftests: Update private memory exits test work with per-gmem attributes Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 38/43] KVM: guest_memfd: Introduce default handlers for content modes Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 39/43] KVM: guest_memfd: Apply content modes while setting memory attributes Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 40/43] KVM: x86: Add support for applying content modes Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 41/43] KVM: x86: Support content mode ZERO for TDX Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 42/43] KVM: selftests: Allow flags to be specified in set_memory_attributes functions Ackerley Tng
2026-03-13  6:13 ` [PATCH RFC v3 43/43] KVM: selftests: Update tests to use flag-enabled library functions Ackerley Tng
2026-03-13 15:45 ` [PATCH RFC v3 00/43] guest_memfd: In-place conversion support Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox