* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-20 20:46 ` Peter Xu
@ 2024-06-24 16:31 ` Xu Yilun
2025-01-21 15:18 ` Peter Xu
0 siblings, 1 reply; 98+ messages in thread
From: Xu Yilun @ 2024-06-24 16:31 UTC (permalink / raw)
To: Peter Xu
Cc: Alexey Kardashevskiy, Chenyi Qiang, David Hildenbrand,
Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Mon, Jan 20, 2025 at 03:46:15PM -0500, Peter Xu wrote:
> On Mon, Jan 20, 2025 at 09:22:50PM +1100, Alexey Kardashevskiy wrote:
> > > It is still uncertain how to implement the private MMIO. Our assumption
> > > is the private MMIO would also create a memory region with
> > > guest_memfd-like backend. Its mr->ram is true and should be managed by
> > > RamdDiscardManager which can skip doing DMA_MAP in VFIO's region_add
> > > listener.
> >
> > My current working approach is to leave it as is in QEMU and VFIO.
>
> Agreed. Setting ram=true to even private MMIO sounds hackish, at least
The private MMIO here refers to assigned MMIO, not emulated MMIO. IIUC,
normal assigned MMIO always has ram=true set:
void memory_region_init_ram_device_ptr(MemoryRegion *mr,
Object *owner,
const char *name,
uint64_t size,
void *ptr)
{
memory_region_init(mr, owner, name, size);
mr->ram = true;
So I don't think ram=true is a problem here.
Thanks,
Yilun
> currently QEMU heavily rely on that flag for any possible direct accesses.
> E.g., in memory_access_is_direct().
>
> --
> Peter Xu
>
>
* [PATCH 0/7] Enable shared device assignment
@ 2024-12-13 7:08 Chenyi Qiang
2024-12-13 7:08 ` [PATCH 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
` (7 more replies)
0 siblings, 8 replies; 98+ messages in thread
From: Chenyi Qiang @ 2024-12-13 7:08 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
discard") effectively disables device assignment when using guest_memfd.
This poses a significant challenge as guest_memfd is essential for
confidential guests, thereby blocking device assignment to these VMs.
The initial rationale for disabling device assignment was the stale
IOMMU mappings problem (see the Problem section) and the assumption that
TEE I/O (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the
device-assignment problem for confidential guests [1]. However, this
assumption has proven to be incorrect. TEE I/O relies on the ability to
operate devices against "shared" or untrusted memory, which is crucial for
device initialization and error recovery scenarios. As a result, the current
implementation does not adequately support device assignment for confidential
guests, and the approach needs to be reevaluated.
This series enables shared device assignment by notifying VFIO of page
conversions using an existing framework named RamDiscardListener.
Additionally, there is an ongoing patch set [2] that aims to add 1G page
support for guest_memfd. This patch set introduces in-place page conversion,
where private and shared memory share the same physical pages as the backend.
This development may impact our solution.
We presented our solution in the guest_memfd meeting to discuss its
compatibility with the new changes and potential future directions (see [3]
for more details). The conclusion was that, although our solution may not be
the most elegant (see the Limitation section), it is sufficient for now and
can be easily adapted to future changes.
We are re-posting the patch series with some cleanup and have removed the RFC
label for the main enabling patches (1-6). The newly-added patch 7 is still
marked as RFC as it tries to resolve some extension concerns related to
RamDiscardManager for future usage.
The overview of the patches:
- Patch 1: Export a helper to get intersection of a MemoryRegionSection
with a given range.
- Patch 2-6: Introduce a new object to manage the guest-memfd with
RamDiscardManager, and notify the shared/private state change during
conversion.
- Patch 7: Try to resolve a semantics concern related to RamDiscardManager,
i.e. RamDiscardManager is used to manage memory plug/unplug state
instead of shared/private state. It would affect future users of
RamDiscardManager in confidential VMs. It is attached at the end as an RFC patch [4].
Changes since last version:
- Add a patch to export some generic helper functions from virtio-mem code.
- Change the bitmap in guest_memfd_manager from default shared to default
private. This aligns with virtio-mem, where a set bit in the bitmap
represents the populated state, and may help to export more generic code
if necessary.
- Add helpers to initialize/uninitialize the guest_memfd_manager instance
to make the lifecycle clearer.
- Add a patch to distinguish between the shared/private state change and
the memory plug/unplug state change in RamDiscardManager.
- RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-chenyi.qiang@intel.com/
---
Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared/private at runtime.
"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. The key differences between guest_memfd and normal memfd
are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
cannot be mapped, read or written by userspace.
In QEMU's implementation, shared memory is allocated with normal methods
(e.g. mmap or fallocate) while private memory is allocated from
guest_memfd. When a VM performs memory conversions, QEMU frees pages via
madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
allocates new pages from the other side.
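As a rough illustration of that lifecycle (not part of this series, and
assuming a kernel whose UAPI headers provide KVM_CREATE_GUEST_MEMFD; error
handling omitted), creating a guest_memfd from userspace looks roughly
like this:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
    int kvm_fd = open("/dev/kvm", O_RDWR);
    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

    /* guest_memfd is spawned by a KVM ioctl and bound to this VM. */
    struct kvm_create_guest_memfd gmem = {
        .size = 0x200000,   /* 2 MiB of guest-private memory */
        .flags = 0,
    };
    int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

    /*
     * Unlike a normal memfd, this fd cannot be mapped, read or written
     * by userspace; it only backs guest private memory.
     */
    printf("guest_memfd = %d\n", gmem_fd);
    return 0;
}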
Problem
=======
Device assignment in QEMU is implemented via the VFIO subsystem. In a
normal VM, VM memory is pinned by VFIO up front. In a confidential VM,
the VM can convert memory at runtime, and when that happens nothing
currently tells VFIO that its mappings are stale. This means that page
conversion leaks memory and leaves stale IOMMU mappings. For example, a
sequence like the following can result in stale IOMMU mappings:
1. allocate shared page
2. convert page shared->private
3. discard shared page
4. convert page private->shared
5. allocate shared page
6. issue DMA operations against that shared page
After step 3, VFIO is still pinning the page. However, DMA operations in
step 6 will hit the old mapping that was allocated in step 1, which
causes the device to access invalid data.
Solution
========
The key to enable shared device assignment is to update the IOMMU mappings
on page conversion.
Given the constraints and assumptions, here is a solution that satisfies
the use cases. RamDiscardManager, an existing interface currently
utilized by virtio-mem, offers a means to modify IOMMU mappings in
accordance with VM page assignment. Page conversion is similar to
hot-removing a page in one mode and adding it back in the other.
This series implements a RamDiscardManager for confidential VMs and
utilizes its infrastructure to notify VFIO of page conversions.
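For readers less familiar with that interface, below is a simplified,
illustrative sketch of how a listener hooks into RamDiscardManager
notifications. It uses the existing API already exercised by
virtio-mem/VFIO; the demo_* names are made up and the VFIO calls mentioned
in comments are only examples:

#include "qemu/osdep.h"
#include "exec/memory.h"

/* React to "populate" (shared) and "discard" (private) transitions. */
static int demo_notify_populate(RamDiscardListener *rdl,
                                MemoryRegionSection *section)
{
    /* e.g. DMA map the now-shared range (vfio_container_dma_map()) */
    return 0;
}

static void demo_notify_discard(RamDiscardListener *rdl,
                                MemoryRegionSection *section)
{
    /* e.g. DMA unmap the now-private range (vfio_container_dma_unmap()) */
}

static RamDiscardListener demo_rdl;

static void demo_hook_conversions(MemoryRegionSection *section)
{
    RamDiscardManager *rdm =
        memory_region_get_ram_discard_manager(section->mr);

    ram_discard_listener_init(&demo_rdl, demo_notify_populate,
                              demo_notify_discard, true);
    /* Registration replays "populate" for the currently populated parts. */
    ram_discard_manager_register_listener(rdm, &demo_rdl, section);
}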
Another possible attempt [5] was to not discard shared pages in step 3
above. This was an incomplete band-aid because guests would consume
twice the memory since shared pages wouldn't be freed even after they
were converted to private.
w/ in-place page conversion
===========================
To support 1G pages for guest_memfd, the current direction is to allow
mmap() of guest_memfd to userspace so that both private and shared memory
can use the same physical pages as the backend. This in-place page
conversion design eliminates the need to discard pages during shared/private
conversions. However, device assignment will still be blocked because the
in-place page conversion will reject the conversion when the page is pinned
by VFIO.
To address this, the key change is the ordering of VFIO map/unmap
operations relative to the page conversion. This series can be adjusted to
achieve unmap-before-conversion-to-private and map-after-conversion-to-shared,
ensuring compatibility with guest_memfd.
Additionally, with in-place page conversion, the previously mentioned
solution to disable the discard of shared pages is not feasible because
shared and private memory share the same backend, and no discard operation
is performed. Retaining the old mappings in the IOMMU would result in
unsafe DMA access to protected memory.
Limitation
==========
One limitation (also discussed in the guest_memfd meeting) is that VFIO
expects the DMA mapping for a specific IOVA to be mapped and unmapped with
the same granularity. The guest may perform partial conversions, such as
converting a small region within a larger region. To prevent such invalid
cases, all operations are performed with 4K granularity. The possible
solutions we can think of are either to enable VFIO to support partial unmap
or to implement an enlightened guest to avoid partial conversion. The former
requires complex changes in VFIO, while the latter requires the page
conversion to be a guest-enlightened behavior. It is still uncertain which
option is preferred.
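To make the granularity constraint concrete, here is a hypothetical sketch
against the VFIO type1 UAPI (container setup and error handling omitted;
the comments describe the assumed type1 behavior that motivates the 4K
choice):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map a 2 MiB shared range as a single DMA mapping... */
static void demo_map_2m(int container_fd, uint64_t iova, uint64_t vaddr)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = vaddr,
        .iova  = iova,
        .size  = 2 * 1024 * 1024,
    };
    ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

/*
 * ...then a 4 KiB unmap inside that mapping is rejected, because type1
 * does not split existing mappings. Hence this series maps and unmaps
 * everything at 4K granularity instead.
 */
static int demo_partial_unmap(int container_fd, uint64_t iova)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = iova + 4096,
        .size  = 4096,
    };
    return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}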
Testing
=======
This patch series is tested with the KVM/QEMU branch:
KVM: https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20
QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2024-12-13
To facilitate shared device assignment with the NIC, employ the legacy
type1 VFIO with the QEMU command:
qemu-system-x86_64 [...]
-device vfio-pci,host=XX:XX.X
The dma_entry_limit module parameter needs to be adjusted. For example, a
16GB guest needs something like
vfio_iommu_type1.dma_entry_limit=4194304.
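(The value follows from the 4K mapping granularity used by this series: a
16 GiB guest mapped in 4 KiB chunks needs 16 GiB / 4 KiB = 4194304 DMA
entries.)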
If using the iommufd-backed VFIO with the QEMU command:
qemu-system-x86_64 [...]
-object iommufd,id=iommufd0 \
-device vfio-pci,host=XX:XX.X,iommufd=iommufd0
no additional adjustment is required.
Following the bootup of the TD guest, the guest's IP address becomes
visible, and iperf is able to successfully send and receive data.
Related link
============
[1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
[2] https://lore.kernel.org/lkml/cover.1726009989.git.ackerleytng@google.com/
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.jr4csfgw1uql
[4] https://lore.kernel.org/qemu-devel/d299bbad-81bc-462e-91b5-a6d9c27ffe3a@redhat.com/
[5] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/
Chenyi Qiang (7):
memory: Export a helper to get intersection of a MemoryRegionSection
with a given range
guest_memfd: Introduce an object to manage the guest-memfd with
RamDiscardManager
guest_memfd: Introduce a callback to notify the shared/private state
change
KVM: Notify the state change event during shared/private conversion
memory: Register the RamDiscardManager instance upon guest_memfd
creation
RAMBlock: make guest_memfd require coordinated discard
memory: Add a new argument to indicate the request attribute in
RamDiscardManager helpers
accel/kvm/kvm-all.c | 4 +
hw/vfio/common.c | 22 +-
hw/virtio/virtio-mem.c | 55 ++--
include/exec/memory.h | 36 ++-
include/sysemu/guest-memfd-manager.h | 91 ++++++
migration/ram.c | 14 +-
system/guest-memfd-manager.c | 456 +++++++++++++++++++++++++++
system/memory.c | 30 +-
system/memory_mapping.c | 4 +-
system/meson.build | 1 +
system/physmem.c | 9 +-
11 files changed, 659 insertions(+), 63 deletions(-)
create mode 100644 include/sysemu/guest-memfd-manager.h
create mode 100644 system/guest-memfd-manager.c
--
2.43.5
* [PATCH 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range
2024-12-13 7:08 [PATCH 0/7] Enable shared device assignment Chenyi Qiang
@ 2024-12-13 7:08 ` Chenyi Qiang
2024-12-18 12:33 ` David Hildenbrand
2025-01-08 4:47 ` Alexey Kardashevskiy
2024-12-13 7:08 ` [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Chenyi Qiang
` (6 subsequent siblings)
7 siblings, 2 replies; 98+ messages in thread
From: Chenyi Qiang @ 2024-12-13 7:08 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
Rename virtio_mem_intersect_memory_section() to
memory_region_section_intersect_range() and move it to system/memory.c so
that it can be used as a generic helper.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
hw/virtio/virtio-mem.c | 32 +++++---------------------------
include/exec/memory.h | 13 +++++++++++++
system/memory.c | 17 +++++++++++++++++
3 files changed, 35 insertions(+), 27 deletions(-)
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 80ada89551..e3d1ccaeeb 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -242,28 +242,6 @@ static int virtio_mem_for_each_plugged_range(VirtIOMEM *vmem, void *arg,
return ret;
}
-/*
- * Adjust the memory section to cover the intersection with the given range.
- *
- * Returns false if the intersection is empty, otherwise returns true.
- */
-static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
- uint64_t offset, uint64_t size)
-{
- uint64_t start = MAX(s->offset_within_region, offset);
- uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
- offset + size);
-
- if (end <= start) {
- return false;
- }
-
- s->offset_within_address_space += start - s->offset_within_region;
- s->offset_within_region = start;
- s->size = int128_make64(end - start);
- return true;
-}
-
typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void *arg);
static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
@@ -285,7 +263,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
first_bit + 1) - 1;
size = (last_bit - first_bit + 1) * vmem->block_size;
- if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
break;
}
ret = cb(&tmp, arg);
@@ -317,7 +295,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
first_bit + 1) - 1;
size = (last_bit - first_bit + 1) * vmem->block_size;
- if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
break;
}
ret = cb(&tmp, arg);
@@ -353,7 +331,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
MemoryRegionSection tmp = *rdl->section;
- if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
continue;
}
rdl->notify_discard(rdl, &tmp);
@@ -369,7 +347,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
MemoryRegionSection tmp = *rdl->section;
- if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
continue;
}
ret = rdl->notify_populate(rdl, &tmp);
@@ -386,7 +364,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
if (rdl2 == rdl) {
break;
}
- if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
continue;
}
rdl2->notify_discard(rdl2, &tmp);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index e5e865d1a9..ec7bc641e8 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1196,6 +1196,19 @@ MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s);
*/
void memory_region_section_free_copy(MemoryRegionSection *s);
+/**
+ * memory_region_section_intersect_range: Adjust the memory section to cover
+ * the intersection with the given range.
+ *
+ * @s: the #MemoryRegionSection to be adjusted
+ * @offset: the offset of the given range in the memory region
+ * @size: the size of the given range
+ *
+ * Returns false if the intersection is empty, otherwise returns true.
+ */
+bool memory_region_section_intersect_range(MemoryRegionSection *s,
+ uint64_t offset, uint64_t size);
+
/**
* memory_region_init: Initialize a memory region
*
diff --git a/system/memory.c b/system/memory.c
index 85f6834cb3..ddcec90f5e 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2898,6 +2898,23 @@ void memory_region_section_free_copy(MemoryRegionSection *s)
g_free(s);
}
+bool memory_region_section_intersect_range(MemoryRegionSection *s,
+ uint64_t offset, uint64_t size)
+{
+ uint64_t start = MAX(s->offset_within_region, offset);
+ uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
+ offset + size);
+
+ if (end <= start) {
+ return false;
+ }
+
+ s->offset_within_address_space += start - s->offset_within_region;
+ s->offset_within_region = start;
+ s->size = int128_make64(end - start);
+ return true;
+}
+
bool memory_region_present(MemoryRegion *container, hwaddr addr)
{
MemoryRegion *mr;
--
2.43.5
* [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2024-12-13 7:08 [PATCH 0/7] Enable shared device assignment Chenyi Qiang
2024-12-13 7:08 ` [PATCH 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
@ 2024-12-13 7:08 ` Chenyi Qiang
2024-12-18 6:45 ` Chenyi Qiang
` (2 more replies)
2024-12-13 7:08 ` [PATCH 3/7] guest_memfd: Introduce a callback to notify the shared/private state change Chenyi Qiang
` (5 subsequent siblings)
7 siblings, 3 replies; 98+ messages in thread
From: Chenyi Qiang @ 2024-12-13 7:08 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
uncoordinated discard") highlighted, some subsystems like VFIO might
disable ram block discard. However, guest_memfd relies on the discard
operation to perform page conversion between private and shared memory.
This can lead to a stale IOMMU mapping issue when assigning a hardware
device to a confidential VM via shared memory (unprotected memory
pages). Blocking shared page discard can solve this problem, but it
could cause guests to consume twice the memory with VFIO, which is not
acceptable in some cases. An alternative solution is to have other
subsystems like VFIO refresh their outdated IOMMU mappings.
RamDiscardManager is an existing concept (used by virtio-mem) to adjust
VFIO mappings in relation to VM page assignment. Effectively page
conversion is similar to hot-removing a page in one mode and adding it
back in the other, so the similar work that needs to happen in response
to virtio-mem changes needs to happen for page conversion events.
Introduce the RamDiscardManager to guest_memfd to achieve it.
However, guest_memfd is not an object so it cannot directly implement
the RamDiscardManager interface.
One solution is to implement the interface in HostMemoryBackend. Any
guest_memfd-backed host memory backend can register itself in the target
MemoryRegion. However, this solution doesn't cover the scenario where a
guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
the virtual BIOS MemoryRegion.
Thus, choose the second option, i.e. define an object type named
guest_memfd_manager that implements the RamDiscardManager interface. Upon
creation of guest_memfd, a new guest_memfd_manager object can be
instantiated and registered with the managed guest_memfd MemoryRegion to
handle the page conversion events.
In the context of guest_memfd, the discarded state signifies that the
page is private, while the populated state indicates that the page is
shared. The state of the memory is tracked at the granularity of the
host page size (i.e. block_size), as the minimum conversion size can be
one page per request.
In addition, VFIO expects the DMA mapping for a specific iova to be
mapped and unmapped with the same granularity. However, a confidential
VM may do partial conversions, e.g. a conversion of a small region
within a larger region. To prevent such invalid cases, and before any
potential optimization comes out, all operations are performed with 4K
granularity.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
include/sysemu/guest-memfd-manager.h | 46 +++++
system/guest-memfd-manager.c | 250 +++++++++++++++++++++++++++
system/meson.build | 1 +
3 files changed, 297 insertions(+)
create mode 100644 include/sysemu/guest-memfd-manager.h
create mode 100644 system/guest-memfd-manager.c
diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
new file mode 100644
index 0000000000..ba4a99b614
--- /dev/null
+++ b/include/sysemu/guest-memfd-manager.h
@@ -0,0 +1,46 @@
+/*
+ * QEMU guest memfd manager
+ *
+ * Copyright Intel
+ *
+ * Author:
+ * Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
+#define SYSEMU_GUEST_MEMFD_MANAGER_H
+
+#include "sysemu/hostmem.h"
+
+#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
+
+OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass, GUEST_MEMFD_MANAGER)
+
+struct GuestMemfdManager {
+ Object parent;
+
+ /* Managed memory region. */
+ MemoryRegion *mr;
+
+ /*
+ * 1-setting of the bit represents the memory is populated (shared).
+ */
+ int32_t bitmap_size;
+ unsigned long *bitmap;
+
+ /* block size and alignment */
+ uint64_t block_size;
+
+ /* listeners to notify on populate/discard activity. */
+ QLIST_HEAD(, RamDiscardListener) rdl_list;
+};
+
+struct GuestMemfdManagerClass {
+ ObjectClass parent_class;
+};
+
+#endif
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
new file mode 100644
index 0000000000..d7e105fead
--- /dev/null
+++ b/system/guest-memfd-manager.c
@@ -0,0 +1,250 @@
+/*
+ * QEMU guest memfd manager
+ *
+ * Copyright Intel
+ *
+ * Author:
+ * Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "sysemu/guest-memfd-manager.h"
+
+OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(GuestMemfdManager,
+ guest_memfd_manager,
+ GUEST_MEMFD_MANAGER,
+ OBJECT,
+ { TYPE_RAM_DISCARD_MANAGER },
+ { })
+
+static bool guest_memfd_rdm_is_populated(const RamDiscardManager *rdm,
+ const MemoryRegionSection *section)
+{
+ const GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+ uint64_t first_bit = section->offset_within_region / gmm->block_size;
+ uint64_t last_bit = first_bit + int128_get64(section->size) / gmm->block_size - 1;
+ unsigned long first_discard_bit;
+
+ first_discard_bit = find_next_zero_bit(gmm->bitmap, last_bit + 1, first_bit);
+ return first_discard_bit > last_bit;
+}
+
+typedef int (*guest_memfd_section_cb)(MemoryRegionSection *s, void *arg);
+
+static int guest_memfd_notify_populate_cb(MemoryRegionSection *section, void *arg)
+{
+ RamDiscardListener *rdl = arg;
+
+ return rdl->notify_populate(rdl, section);
+}
+
+static int guest_memfd_notify_discard_cb(MemoryRegionSection *section, void *arg)
+{
+ RamDiscardListener *rdl = arg;
+
+ rdl->notify_discard(rdl, section);
+
+ return 0;
+}
+
+static int guest_memfd_for_each_populated_section(const GuestMemfdManager *gmm,
+ MemoryRegionSection *section,
+ void *arg,
+ guest_memfd_section_cb cb)
+{
+ unsigned long first_one_bit, last_one_bit;
+ uint64_t offset, size;
+ int ret = 0;
+
+ first_one_bit = section->offset_within_region / gmm->block_size;
+ first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, first_one_bit);
+
+ while (first_one_bit < gmm->bitmap_size) {
+ MemoryRegionSection tmp = *section;
+
+ offset = first_one_bit * gmm->block_size;
+ last_one_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
+ first_one_bit + 1) - 1;
+ size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
+
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+ break;
+ }
+
+ ret = cb(&tmp, arg);
+ if (ret) {
+ break;
+ }
+
+ first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
+ last_one_bit + 2);
+ }
+
+ return ret;
+}
+
+static int guest_memfd_for_each_discarded_section(const GuestMemfdManager *gmm,
+ MemoryRegionSection *section,
+ void *arg,
+ guest_memfd_section_cb cb)
+{
+ unsigned long first_zero_bit, last_zero_bit;
+ uint64_t offset, size;
+ int ret = 0;
+
+ first_zero_bit = section->offset_within_region / gmm->block_size;
+ first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
+ first_zero_bit);
+
+ while (first_zero_bit < gmm->bitmap_size) {
+ MemoryRegionSection tmp = *section;
+
+ offset = first_zero_bit * gmm->block_size;
+ last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
+ first_zero_bit + 1) - 1;
+ size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
+
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+ break;
+ }
+
+ ret = cb(&tmp, arg);
+ if (ret) {
+ break;
+ }
+
+ first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
+ last_zero_bit + 2);
+ }
+
+ return ret;
+}
+
+static uint64_t guest_memfd_rdm_get_min_granularity(const RamDiscardManager *rdm,
+ const MemoryRegion *mr)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+
+ g_assert(mr == gmm->mr);
+ return gmm->block_size;
+}
+
+static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
+ RamDiscardListener *rdl,
+ MemoryRegionSection *section)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+ int ret;
+
+ g_assert(section->mr == gmm->mr);
+ rdl->section = memory_region_section_new_copy(section);
+
+ QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
+
+ ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
+ guest_memfd_notify_populate_cb);
+ if (ret) {
+ error_report("%s: Failed to register RAM discard listener: %s", __func__,
+ strerror(-ret));
+ }
+}
+
+static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm,
+ RamDiscardListener *rdl)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+ int ret;
+
+ g_assert(rdl->section);
+ g_assert(rdl->section->mr == gmm->mr);
+
+ ret = guest_memfd_for_each_populated_section(gmm, rdl->section, rdl,
+ guest_memfd_notify_discard_cb);
+ if (ret) {
+ error_report("%s: Failed to unregister RAM discard listener: %s", __func__,
+ strerror(-ret));
+ }
+
+ memory_region_section_free_copy(rdl->section);
+ rdl->section = NULL;
+ QLIST_REMOVE(rdl, next);
+
+}
+
+typedef struct GuestMemfdReplayData {
+ void *fn;
+ void *opaque;
+} GuestMemfdReplayData;
+
+static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
+{
+ struct GuestMemfdReplayData *data = arg;
+ ReplayRamPopulate replay_fn = data->fn;
+
+ return replay_fn(section, data->opaque);
+}
+
+static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
+ MemoryRegionSection *section,
+ ReplayRamPopulate replay_fn,
+ void *opaque)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+ struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+ g_assert(section->mr == gmm->mr);
+ return guest_memfd_for_each_populated_section(gmm, section, &data,
+ guest_memfd_rdm_replay_populated_cb);
+}
+
+static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
+{
+ struct GuestMemfdReplayData *data = arg;
+ ReplayRamDiscard replay_fn = data->fn;
+
+ replay_fn(section, data->opaque);
+
+ return 0;
+}
+
+static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
+ MemoryRegionSection *section,
+ ReplayRamDiscard replay_fn,
+ void *opaque)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+ struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+ g_assert(section->mr == gmm->mr);
+ guest_memfd_for_each_discarded_section(gmm, section, &data,
+ guest_memfd_rdm_replay_discarded_cb);
+}
+
+static void guest_memfd_manager_init(Object *obj)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
+
+ QLIST_INIT(&gmm->rdl_list);
+}
+
+static void guest_memfd_manager_finalize(Object *obj)
+{
+ g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
+}
+
+static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
+{
+ RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
+
+ rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
+ rdmc->register_listener = guest_memfd_rdm_register_listener;
+ rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
+ rdmc->is_populated = guest_memfd_rdm_is_populated;
+ rdmc->replay_populated = guest_memfd_rdm_replay_populated;
+ rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
+}
diff --git a/system/meson.build b/system/meson.build
index 4952f4b2c7..ed4e1137bd 100644
--- a/system/meson.build
+++ b/system/meson.build
@@ -15,6 +15,7 @@ system_ss.add(files(
'dirtylimit.c',
'dma-helpers.c',
'globals.c',
+ 'guest-memfd-manager.c',
'memory_mapping.c',
'qdev-monitor.c',
'qtest.c',
--
2.43.5
* [PATCH 3/7] guest_memfd: Introduce a callback to notify the shared/private state change
2024-12-13 7:08 [PATCH 0/7] Enable shared device assignment Chenyi Qiang
2024-12-13 7:08 ` [PATCH 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
2024-12-13 7:08 ` [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Chenyi Qiang
@ 2024-12-13 7:08 ` Chenyi Qiang
2024-12-13 7:08 ` [PATCH 4/7] KVM: Notify the state change event during shared/private conversion Chenyi Qiang
` (4 subsequent siblings)
7 siblings, 0 replies; 98+ messages in thread
From: Chenyi Qiang @ 2024-12-13 7:08 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
Introduce a new state_change() callback in GuestMemfdManagerClass to
efficiently notify all registered RamDiscardListeners, including VFIO
listeners, about memory conversion events in guest_memfd. The
existing VFIO listener can dynamically DMA map/unmap the shared pages
based on the conversion type:
- For conversions from shared to private, the VFIO listener discards the
shared mapping from the IOMMU.
- For conversions from private to shared, it populates the shared
mapping into the IOMMU.
Additionally, there could be some special conversion requests:
- When a conversion request is made for a page already in the desired
state, the helper simply returns success.
- For requests involving a range partially in the desired state, only
the necessary segments are converted, ensuring the entire range
complies with the request efficiently.
- In scenarios where a conversion request is declined by other systems,
such as a failure from VFIO during notify_populate(), the helper will
roll back the request, maintaining consistency.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
include/sysemu/guest-memfd-manager.h | 3 +
system/guest-memfd-manager.c | 144 +++++++++++++++++++++++++++
2 files changed, 147 insertions(+)
diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
index ba4a99b614..f4b175529b 100644
--- a/include/sysemu/guest-memfd-manager.h
+++ b/include/sysemu/guest-memfd-manager.h
@@ -41,6 +41,9 @@ struct GuestMemfdManager {
struct GuestMemfdManagerClass {
ObjectClass parent_class;
+
+ int (*state_change)(GuestMemfdManager *gmm, uint64_t offset, uint64_t size,
+ bool shared_to_private);
};
#endif
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index d7e105fead..6601df5f3f 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -225,6 +225,147 @@ static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
guest_memfd_rdm_replay_discarded_cb);
}
+static bool guest_memfd_is_valid_range(GuestMemfdManager *gmm,
+ uint64_t offset, uint64_t size)
+{
+ MemoryRegion *mr = gmm->mr;
+
+ g_assert(mr);
+
+ uint64_t region_size = memory_region_size(mr);
+ if (!QEMU_IS_ALIGNED(offset, gmm->block_size)) {
+ return false;
+ }
+ if (offset + size < offset || !size) {
+ return false;
+ }
+ if (offset >= region_size || offset + size > region_size) {
+ return false;
+ }
+ return true;
+}
+
+static void guest_memfd_notify_discard(GuestMemfdManager *gmm,
+ uint64_t offset, uint64_t size)
+{
+ RamDiscardListener *rdl;
+
+ QLIST_FOREACH(rdl, &gmm->rdl_list, next) {
+ MemoryRegionSection tmp = *rdl->section;
+
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+ continue;
+ }
+
+ guest_memfd_for_each_populated_section(gmm, &tmp, rdl,
+ guest_memfd_notify_discard_cb);
+ }
+}
+
+
+static int guest_memfd_notify_populate(GuestMemfdManager *gmm,
+ uint64_t offset, uint64_t size)
+{
+ RamDiscardListener *rdl, *rdl2;
+ int ret = 0;
+
+ QLIST_FOREACH(rdl, &gmm->rdl_list, next) {
+ MemoryRegionSection tmp = *rdl->section;
+
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+ continue;
+ }
+
+ ret = guest_memfd_for_each_discarded_section(gmm, &tmp, rdl,
+ guest_memfd_notify_populate_cb);
+ if (ret) {
+ break;
+ }
+ }
+
+ if (ret) {
+ /* Notify all already-notified listeners. */
+ QLIST_FOREACH(rdl2, &gmm->rdl_list, next) {
+ MemoryRegionSection tmp = *rdl2->section;
+
+ if (rdl2 == rdl) {
+ break;
+ }
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+ continue;
+ }
+
+ guest_memfd_for_each_discarded_section(gmm, &tmp, rdl2,
+ guest_memfd_notify_discard_cb);
+ }
+ }
+ return ret;
+}
+
+static bool guest_memfd_is_range_populated(GuestMemfdManager *gmm,
+ uint64_t offset, uint64_t size)
+{
+ const unsigned long first_bit = offset / gmm->block_size;
+ const unsigned long last_bit = first_bit + (size / gmm->block_size) - 1;
+ unsigned long found_bit;
+
+ /* We fake a shorter bitmap to avoid searching too far. */
+ found_bit = find_next_zero_bit(gmm->bitmap, last_bit + 1, first_bit);
+ return found_bit > last_bit;
+}
+
+static bool guest_memfd_is_range_discarded(GuestMemfdManager *gmm,
+ uint64_t offset, uint64_t size)
+{
+ const unsigned long first_bit = offset / gmm->block_size;
+ const unsigned long last_bit = first_bit + (size / gmm->block_size) - 1;
+ unsigned long found_bit;
+
+ /* We fake a shorter bitmap to avoid searching too far. */
+ found_bit = find_next_bit(gmm->bitmap, last_bit + 1, first_bit);
+ return found_bit > last_bit;
+}
+
+static int guest_memfd_state_change(GuestMemfdManager *gmm, uint64_t offset,
+ uint64_t size, bool shared_to_private)
+{
+ int ret = 0;
+
+ if (!guest_memfd_is_valid_range(gmm, offset, size)) {
+ error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
+ __func__, offset, size);
+ return -1;
+ }
+
+ if ((shared_to_private && guest_memfd_is_range_discarded(gmm, offset, size)) ||
+ (!shared_to_private && guest_memfd_is_range_populated(gmm, offset, size))) {
+ return 0;
+ }
+
+ if (shared_to_private) {
+ guest_memfd_notify_discard(gmm, offset, size);
+ } else {
+ ret = guest_memfd_notify_populate(gmm, offset, size);
+ }
+
+ if (!ret) {
+ unsigned long first_bit = offset / gmm->block_size;
+ unsigned long nbits = size / gmm->block_size;
+
+ g_assert((first_bit + nbits) <= gmm->bitmap_size);
+
+ if (shared_to_private) {
+ bitmap_clear(gmm->bitmap, first_bit, nbits);
+ } else {
+ bitmap_set(gmm->bitmap, first_bit, nbits);
+ }
+
+ return 0;
+ }
+
+ return ret;
+}
+
static void guest_memfd_manager_init(Object *obj)
{
GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
@@ -239,8 +380,11 @@ static void guest_memfd_manager_finalize(Object *obj)
static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
{
+ GuestMemfdManagerClass *gmmc = GUEST_MEMFD_MANAGER_CLASS(oc);
RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
+ gmmc->state_change = guest_memfd_state_change;
+
rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
rdmc->register_listener = guest_memfd_rdm_register_listener;
rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
--
2.43.5
* [PATCH 4/7] KVM: Notify the state change event during shared/private conversion
2024-12-13 7:08 [PATCH 0/7] Enable shared device assignment Chenyi Qiang
` (2 preceding siblings ...)
2024-12-13 7:08 ` [PATCH 3/7] guest_memfd: Introduce a callback to notify the shared/private state change Chenyi Qiang
@ 2024-12-13 7:08 ` Chenyi Qiang
2024-12-13 7:08 ` [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation Chenyi Qiang
` (3 subsequent siblings)
7 siblings, 0 replies; 98+ messages in thread
From: Chenyi Qiang @ 2024-12-13 7:08 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
Introduce a helper to trigger the state_change() callback of the class.
Once KVM exits to userspace to convert a page from private to shared or
vice versa at runtime, notify the event via the helper so that other
registered subsystems like VFIO can react accordingly.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
accel/kvm/kvm-all.c | 4 ++++
include/sysemu/guest-memfd-manager.h | 15 +++++++++++++++
2 files changed, 19 insertions(+)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 52425af534..38f41a98a5 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -48,6 +48,7 @@
#include "kvm-cpus.h"
#include "sysemu/dirtylimit.h"
#include "qemu/range.h"
+#include "sysemu/guest-memfd-manager.h"
#include "hw/boards.h"
#include "sysemu/stats.h"
@@ -3080,6 +3081,9 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
rb = qemu_ram_block_from_host(addr, false, &offset);
+ guest_memfd_manager_state_change(GUEST_MEMFD_MANAGER(mr->rdm), offset,
+ size, to_private);
+
if (to_private) {
if (rb->page_size != qemu_real_host_page_size()) {
/*
diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
index f4b175529b..9dc4e0346d 100644
--- a/include/sysemu/guest-memfd-manager.h
+++ b/include/sysemu/guest-memfd-manager.h
@@ -46,4 +46,19 @@ struct GuestMemfdManagerClass {
bool shared_to_private);
};
+static inline int guest_memfd_manager_state_change(GuestMemfdManager *gmm, uint64_t offset,
+ uint64_t size, bool shared_to_private)
+{
+ GuestMemfdManagerClass *klass;
+
+ g_assert(gmm);
+ klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
+
+ if (klass->state_change) {
+ return klass->state_change(gmm, offset, size, shared_to_private);
+ }
+
+ return 0;
+}
+
#endif
--
2.43.5
* [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation
2024-12-13 7:08 [PATCH 0/7] Enable shared device assignment Chenyi Qiang
` (3 preceding siblings ...)
2024-12-13 7:08 ` [PATCH 4/7] KVM: Notify the state change event during shared/private conversion Chenyi Qiang
@ 2024-12-13 7:08 ` Chenyi Qiang
2025-01-08 4:47 ` Alexey Kardashevskiy
2025-01-09 8:14 ` Zhao Liu
2024-12-13 7:08 ` [PATCH 6/7] RAMBlock: make guest_memfd require coordinated discard Chenyi Qiang
` (2 subsequent siblings)
7 siblings, 2 replies; 98+ messages in thread
From: Chenyi Qiang @ 2024-12-13 7:08 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
Introduce the realize()/unrealize() callbacks to initialize/uninitialize
the new guest_memfd_manager object and register/unregister it in the
target MemoryRegion.
guest_memfd memory was initially set to shared until commit bd3bcf6962
("kvm/memory: Make memory type private by default if it has guest memfd
backend"). To align with that change, the default state in
guest_memfd_manager is set to private (the bitmap is cleared to 0).
Additionally, defaulting to private reduces the overhead of VFIO mapping
shared pages into the IOMMU during the bootup stage.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
include/sysemu/guest-memfd-manager.h | 27 +++++++++++++++++++++++++++
system/guest-memfd-manager.c | 28 +++++++++++++++++++++++++++-
system/physmem.c | 7 +++++++
3 files changed, 61 insertions(+), 1 deletion(-)
diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
index 9dc4e0346d..d1e7f698e8 100644
--- a/include/sysemu/guest-memfd-manager.h
+++ b/include/sysemu/guest-memfd-manager.h
@@ -42,6 +42,8 @@ struct GuestMemfdManager {
struct GuestMemfdManagerClass {
ObjectClass parent_class;
+ void (*realize)(GuestMemfdManager *gmm, MemoryRegion *mr, uint64_t region_size);
+ void (*unrealize)(GuestMemfdManager *gmm);
int (*state_change)(GuestMemfdManager *gmm, uint64_t offset, uint64_t size,
bool shared_to_private);
};
@@ -61,4 +63,29 @@ static inline int guest_memfd_manager_state_change(GuestMemfdManager *gmm, uint6
return 0;
}
+static inline void guest_memfd_manager_realize(GuestMemfdManager *gmm,
+ MemoryRegion *mr, uint64_t region_size)
+{
+ GuestMemfdManagerClass *klass;
+
+ g_assert(gmm);
+ klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
+
+ if (klass->realize) {
+ klass->realize(gmm, mr, region_size);
+ }
+}
+
+static inline void guest_memfd_manager_unrealize(GuestMemfdManager *gmm)
+{
+ GuestMemfdManagerClass *klass;
+
+ g_assert(gmm);
+ klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
+
+ if (klass->unrealize) {
+ klass->unrealize(gmm);
+ }
+}
+
#endif
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index 6601df5f3f..b6a32f0bfb 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -366,6 +366,31 @@ static int guest_memfd_state_change(GuestMemfdManager *gmm, uint64_t offset,
return ret;
}
+static void guest_memfd_manager_realizefn(GuestMemfdManager *gmm, MemoryRegion *mr,
+ uint64_t region_size)
+{
+ uint64_t bitmap_size;
+
+ gmm->block_size = qemu_real_host_page_size();
+ bitmap_size = ROUND_UP(region_size, gmm->block_size) / gmm->block_size;
+
+ gmm->mr = mr;
+ gmm->bitmap_size = bitmap_size;
+ gmm->bitmap = bitmap_new(bitmap_size);
+
+ memory_region_set_ram_discard_manager(gmm->mr, RAM_DISCARD_MANAGER(gmm));
+}
+
+static void guest_memfd_manager_unrealizefn(GuestMemfdManager *gmm)
+{
+ memory_region_set_ram_discard_manager(gmm->mr, NULL);
+
+ g_free(gmm->bitmap);
+ gmm->bitmap = NULL;
+ gmm->bitmap_size = 0;
+ gmm->mr = NULL;
+}
+
static void guest_memfd_manager_init(Object *obj)
{
GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
@@ -375,7 +400,6 @@ static void guest_memfd_manager_init(Object *obj)
static void guest_memfd_manager_finalize(Object *obj)
{
- g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
}
static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
@@ -384,6 +408,8 @@ static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
gmmc->state_change = guest_memfd_state_change;
+ gmmc->realize = guest_memfd_manager_realizefn;
+ gmmc->unrealize = guest_memfd_manager_unrealizefn;
rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
rdmc->register_listener = guest_memfd_rdm_register_listener;
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..532182a6dd 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -53,6 +53,7 @@
#include "sysemu/hostmem.h"
#include "sysemu/hw_accel.h"
#include "sysemu/xen-mapcache.h"
+#include "sysemu/guest-memfd-manager.h"
#include "trace.h"
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -1885,6 +1886,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
qemu_mutex_unlock_ramlist();
goto out_free;
}
+
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
+ guest_memfd_manager_realize(gmm, new_block->mr, new_block->mr->size);
}
ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
@@ -2139,6 +2143,9 @@ static void reclaim_ramblock(RAMBlock *block)
if (block->guest_memfd >= 0) {
close(block->guest_memfd);
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(block->mr->rdm);
+ guest_memfd_manager_unrealize(gmm);
+ object_unref(OBJECT(gmm));
ram_block_discard_require(false);
}
--
2.43.5
* [PATCH 6/7] RAMBlock: make guest_memfd require coordinated discard
2024-12-13 7:08 [PATCH 0/7] Enable shared device assignment Chenyi Qiang
` (4 preceding siblings ...)
2024-12-13 7:08 ` [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation Chenyi Qiang
@ 2024-12-13 7:08 ` Chenyi Qiang
2025-01-13 10:56 ` David Hildenbrand
2024-12-13 7:08 ` [RFC PATCH 7/7] memory: Add a new argument to indicate the request attribute in RamDiscardManager helpers Chenyi Qiang
2025-01-08 4:47 ` [PATCH 0/7] Enable shared device assignment Alexey Kardashevskiy
7 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2024-12-13 7:08 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
As guest_memfd is now managed by guest_memfd_manager with
RamDiscardManager, only block uncoordinated discard.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
system/physmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/system/physmem.c b/system/physmem.c
index 532182a6dd..585090b063 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1872,7 +1872,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
assert(kvm_enabled());
assert(new_block->guest_memfd < 0);
- ret = ram_block_discard_require(true);
+ ret = ram_block_coordinated_discard_require(true);
if (ret < 0) {
error_setg_errno(errp, -ret,
"cannot set up private guest memory: discard currently blocked");
--
2.43.5
* [RFC PATCH 7/7] memory: Add a new argument to indicate the request attribute in RamDiscardManager helpers
2024-12-13 7:08 [PATCH 0/7] Enable shared device assignment Chenyi Qiang
` (5 preceding siblings ...)
2024-12-13 7:08 ` [PATCH 6/7] RAMBlock: make guest_memfd require coordinated discard Chenyi Qiang
@ 2024-12-13 7:08 ` Chenyi Qiang
2025-01-08 4:47 ` [PATCH 0/7] Enable shared device assignment Alexey Kardashevskiy
7 siblings, 0 replies; 98+ messages in thread
From: Chenyi Qiang @ 2024-12-13 7:08 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
For each ram_discard_manager helper, add a new argument 'is_private' to
indicate the requested attribute. If is_private is true, the operation
targets the private range in the section. For example,
replay_populate(true) will replay the populate operation on the private
part of the MemoryRegionSection, while replay_populate(false) will replay
population on the shared part.
This helps to distinguish between the states of private/shared and
discarded/populated. It is essential for guest_memfd_manager, which uses
the RamDiscardManager interface but can't treat private memory as
discarded memory. This is because it does not align with the expectation
of current RamDiscardManager users (e.g. live migration), who expect that
discarded memory is hot-removed and can be skipped when processing guest
memory. Treating private memory as discarded won't work in the future if
live migration needs to handle (e.g. migrate) private memory.
The user of the helper needs to figure out which attribute to
manipulate. For the legacy VM case, is_private=false is used by default
(as in the callers converted below), since the private attribute is only
valid in a guest_memfd-based VM.
Opportunistically rename guest_memfd_for_each_{discarded,
populated}_section() to guest_memfd_for_each_{private, shared}_section()
to distinguish between private/shared and discarded/populated at the
same time.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
hw/vfio/common.c | 22 ++++++--
hw/virtio/virtio-mem.c | 23 ++++----
include/exec/memory.h | 23 ++++++--
migration/ram.c | 14 ++---
system/guest-memfd-manager.c | 106 +++++++++++++++++++++++------------
system/memory.c | 13 +++--
system/memory_mapping.c | 4 +-
7 files changed, 135 insertions(+), 70 deletions(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index dcef44fe55..a6f49e6450 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -345,7 +345,8 @@ out:
}
static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
- MemoryRegionSection *section)
+ MemoryRegionSection *section,
+ bool is_private)
{
VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
listener);
@@ -354,6 +355,11 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
const hwaddr iova = section->offset_within_address_space;
int ret;
+ if (is_private) {
+ /* Not support discard private memory yet. */
+ return;
+ }
+
/* Unmap with a single call. */
ret = vfio_container_dma_unmap(bcontainer, iova, size , NULL);
if (ret) {
@@ -363,7 +369,8 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
}
static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
- MemoryRegionSection *section)
+ MemoryRegionSection *section,
+ bool is_private)
{
VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
listener);
@@ -374,6 +381,11 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
void *vaddr;
int ret;
+ if (is_private) {
+ /* Not support discard private memory yet. */
+ return 0;
+ }
+
/*
* Map in (aligned within memory region) minimum granularity, so we can
* unmap in minimum granularity later.
@@ -390,7 +402,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
vaddr, section->readonly);
if (ret) {
/* Rollback */
- vfio_ram_discard_notify_discard(rdl, section);
+ vfio_ram_discard_notify_discard(rdl, section, false);
return ret;
}
}
@@ -1248,7 +1260,7 @@ out:
}
static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section,
- void *opaque)
+ bool is_private, void *opaque)
{
const hwaddr size = int128_get64(section->size);
const hwaddr iova = section->offset_within_address_space;
@@ -1293,7 +1305,7 @@ vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainerBase *bcontainer,
* We only want/can synchronize the bitmap for actually mapped parts -
* which correspond to populated parts. Replay all populated parts.
*/
- return ram_discard_manager_replay_populated(rdm, section,
+ return ram_discard_manager_replay_populated(rdm, section, false,
vfio_ram_discard_get_dirty_bitmap,
&vrdl);
}
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index e3d1ccaeeb..e7304c7e47 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -312,14 +312,14 @@ static int virtio_mem_notify_populate_cb(MemoryRegionSection *s, void *arg)
{
RamDiscardListener *rdl = arg;
- return rdl->notify_populate(rdl, s);
+ return rdl->notify_populate(rdl, s, false);
}
static int virtio_mem_notify_discard_cb(MemoryRegionSection *s, void *arg)
{
RamDiscardListener *rdl = arg;
- rdl->notify_discard(rdl, s);
+ rdl->notify_discard(rdl, s, false);
return 0;
}
@@ -334,7 +334,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
if (!memory_region_section_intersect_range(&tmp, offset, size)) {
continue;
}
- rdl->notify_discard(rdl, &tmp);
+ rdl->notify_discard(rdl, &tmp, false);
}
}
@@ -350,7 +350,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
if (!memory_region_section_intersect_range(&tmp, offset, size)) {
continue;
}
- ret = rdl->notify_populate(rdl, &tmp);
+ ret = rdl->notify_populate(rdl, &tmp, false);
if (ret) {
break;
}
@@ -367,7 +367,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
if (!memory_region_section_intersect_range(&tmp, offset, size)) {
continue;
}
- rdl2->notify_discard(rdl2, &tmp);
+ rdl2->notify_discard(rdl2, &tmp, false);
}
}
return ret;
@@ -383,7 +383,7 @@ static void virtio_mem_notify_unplug_all(VirtIOMEM *vmem)
QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
if (rdl->double_discard_supported) {
- rdl->notify_discard(rdl, rdl->section);
+ rdl->notify_discard(rdl, rdl->section, false);
} else {
virtio_mem_for_each_plugged_section(vmem, rdl->section, rdl,
virtio_mem_notify_discard_cb);
@@ -1685,7 +1685,8 @@ static uint64_t virtio_mem_rdm_get_min_granularity(const RamDiscardManager *rdm,
}
static bool virtio_mem_rdm_is_populated(const RamDiscardManager *rdm,
- const MemoryRegionSection *s)
+ const MemoryRegionSection *s,
+ bool is_private)
{
const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
uint64_t start_gpa = vmem->addr + s->offset_within_region;
@@ -1712,11 +1713,12 @@ static int virtio_mem_rdm_replay_populated_cb(MemoryRegionSection *s, void *arg)
{
struct VirtIOMEMReplayData *data = arg;
- return ((ReplayRamPopulate)data->fn)(s, data->opaque);
+ return ((ReplayRamPopulate)data->fn)(s, false, data->opaque);
}
static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm,
MemoryRegionSection *s,
+ bool is_private,
ReplayRamPopulate replay_fn,
void *opaque)
{
@@ -1736,12 +1738,13 @@ static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s,
{
struct VirtIOMEMReplayData *data = arg;
- ((ReplayRamDiscard)data->fn)(s, data->opaque);
+ ((ReplayRamDiscard)data->fn)(s, false, data->opaque);
return 0;
}
static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
MemoryRegionSection *s,
+ bool is_private,
ReplayRamDiscard replay_fn,
void *opaque)
{
@@ -1783,7 +1786,7 @@ static void virtio_mem_rdm_unregister_listener(RamDiscardManager *rdm,
g_assert(rdl->section->mr == &vmem->memdev->mr);
if (vmem->size) {
if (rdl->double_discard_supported) {
- rdl->notify_discard(rdl, rdl->section);
+ rdl->notify_discard(rdl, rdl->section, false);
} else {
virtio_mem_for_each_plugged_section(vmem, rdl->section, rdl,
virtio_mem_notify_discard_cb);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index ec7bc641e8..8aac61af08 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -508,9 +508,11 @@ struct IOMMUMemoryRegionClass {
typedef struct RamDiscardListener RamDiscardListener;
typedef int (*NotifyRamPopulate)(RamDiscardListener *rdl,
- MemoryRegionSection *section);
+ MemoryRegionSection *section,
+ bool is_private);
typedef void (*NotifyRamDiscard)(RamDiscardListener *rdl,
- MemoryRegionSection *section);
+ MemoryRegionSection *section,
+ bool is_private);
struct RamDiscardListener {
/*
@@ -566,8 +568,8 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
rdl->double_discard_supported = double_discard_supported;
}
-typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
-typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);
+typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, bool is_private, void *opaque);
+typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, bool is_private, void *opaque);
/*
* RamDiscardManagerClass:
@@ -632,11 +634,13 @@ struct RamDiscardManagerClass {
*
* @rdm: the #RamDiscardManager
* @section: the #MemoryRegionSection
+ * @is_private: the attribute of the request section
*
* Returns whether the given range is completely populated.
*/
bool (*is_populated)(const RamDiscardManager *rdm,
- const MemoryRegionSection *section);
+ const MemoryRegionSection *section,
+ bool is_private);
/**
* @replay_populated:
@@ -648,6 +652,7 @@ struct RamDiscardManagerClass {
*
* @rdm: the #RamDiscardManager
* @section: the #MemoryRegionSection
+ * @is_private: the attribute of the populated parts
* @replay_fn: the #ReplayRamPopulate callback
* @opaque: pointer to forward to the callback
*
@@ -655,6 +660,7 @@ struct RamDiscardManagerClass {
*/
int (*replay_populated)(const RamDiscardManager *rdm,
MemoryRegionSection *section,
+ bool is_private,
ReplayRamPopulate replay_fn, void *opaque);
/**
@@ -665,11 +671,13 @@ struct RamDiscardManagerClass {
*
* @rdm: the #RamDiscardManager
* @section: the #MemoryRegionSection
+ * @is_private: the attribute of the discarded parts
* @replay_fn: the #ReplayRamDiscard callback
* @opaque: pointer to forward to the callback
*/
void (*replay_discarded)(const RamDiscardManager *rdm,
MemoryRegionSection *section,
+ bool is_private,
ReplayRamDiscard replay_fn, void *opaque);
/**
@@ -709,15 +717,18 @@ uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
const MemoryRegion *mr);
bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
- const MemoryRegionSection *section);
+ const MemoryRegionSection *section,
+ bool is_private);
int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
MemoryRegionSection *section,
+ bool is_private,
ReplayRamPopulate replay_fn,
void *opaque);
void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
MemoryRegionSection *section,
+ bool is_private,
ReplayRamDiscard replay_fn,
void *opaque);
diff --git a/migration/ram.c b/migration/ram.c
index 05ff9eb328..b9efba1d14 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -838,7 +838,7 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
}
static void dirty_bitmap_clear_section(MemoryRegionSection *section,
- void *opaque)
+ bool is_private, void *opaque)
{
const hwaddr offset = section->offset_within_region;
const hwaddr size = int128_get64(section->size);
@@ -884,7 +884,7 @@ static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb)
.size = int128_make64(qemu_ram_get_used_length(rb)),
};
- ram_discard_manager_replay_discarded(rdm, &section,
+ ram_discard_manager_replay_discarded(rdm, &section, false,
dirty_bitmap_clear_section,
&cleared_bits);
}
@@ -907,7 +907,7 @@ bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start)
.size = int128_make64(qemu_ram_pagesize(rb)),
};
- return !ram_discard_manager_is_populated(rdm, &section);
+ return !ram_discard_manager_is_populated(rdm, &section, false);
}
return false;
}
@@ -1539,7 +1539,7 @@ static inline void populate_read_range(RAMBlock *block, ram_addr_t offset,
}
static inline int populate_read_section(MemoryRegionSection *section,
- void *opaque)
+ bool is_private, void *opaque)
{
const hwaddr size = int128_get64(section->size);
hwaddr offset = section->offset_within_region;
@@ -1579,7 +1579,7 @@ static void ram_block_populate_read(RAMBlock *rb)
.size = rb->mr->size,
};
- ram_discard_manager_replay_populated(rdm, &section,
+ ram_discard_manager_replay_populated(rdm, &section, false,
populate_read_section, NULL);
} else {
populate_read_range(rb, 0, rb->used_length);
@@ -1614,7 +1614,7 @@ void ram_write_tracking_prepare(void)
}
static inline int uffd_protect_section(MemoryRegionSection *section,
- void *opaque)
+ bool is_private, void *opaque)
{
const hwaddr size = int128_get64(section->size);
const hwaddr offset = section->offset_within_region;
@@ -1638,7 +1638,7 @@ static int ram_block_uffd_protect(RAMBlock *rb, int uffd_fd)
.size = rb->mr->size,
};
- return ram_discard_manager_replay_populated(rdm, &section,
+ return ram_discard_manager_replay_populated(rdm, &section, false,
uffd_protect_section,
(void *)(uintptr_t)uffd_fd);
}
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index b6a32f0bfb..50802b34d7 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -23,39 +23,51 @@ OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(GuestMemfdManager,
{ })
static bool guest_memfd_rdm_is_populated(const RamDiscardManager *rdm,
- const MemoryRegionSection *section)
+ const MemoryRegionSection *section,
+ bool is_private)
{
const GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
uint64_t first_bit = section->offset_within_region / gmm->block_size;
uint64_t last_bit = first_bit + int128_get64(section->size) / gmm->block_size - 1;
unsigned long first_discard_bit;
- first_discard_bit = find_next_zero_bit(gmm->bitmap, last_bit + 1, first_bit);
+ if (is_private) {
+ /* Check if the private section is populated */
+ first_discard_bit = find_next_bit(gmm->bitmap, last_bit + 1, first_bit);
+ } else {
+ /* Check if the shared section is populated */
+ first_discard_bit = find_next_zero_bit(gmm->bitmap, last_bit + 1, first_bit);
+ }
+
return first_discard_bit > last_bit;
}
-typedef int (*guest_memfd_section_cb)(MemoryRegionSection *s, void *arg);
+typedef int (*guest_memfd_section_cb)(MemoryRegionSection *s, bool is_private,
+ void *arg);
-static int guest_memfd_notify_populate_cb(MemoryRegionSection *section, void *arg)
+static int guest_memfd_notify_populate_cb(MemoryRegionSection *section, bool is_private,
+ void *arg)
{
RamDiscardListener *rdl = arg;
- return rdl->notify_populate(rdl, section);
+ return rdl->notify_populate(rdl, section, is_private);
}
-static int guest_memfd_notify_discard_cb(MemoryRegionSection *section, void *arg)
+static int guest_memfd_notify_discard_cb(MemoryRegionSection *section, bool is_private,
+ void *arg)
{
RamDiscardListener *rdl = arg;
- rdl->notify_discard(rdl, section);
+ rdl->notify_discard(rdl, section, is_private);
return 0;
}
-static int guest_memfd_for_each_populated_section(const GuestMemfdManager *gmm,
- MemoryRegionSection *section,
- void *arg,
- guest_memfd_section_cb cb)
+static int guest_memfd_for_each_shared_section(const GuestMemfdManager *gmm,
+ MemoryRegionSection *section,
+ bool is_private,
+ void *arg,
+ guest_memfd_section_cb cb)
{
unsigned long first_one_bit, last_one_bit;
uint64_t offset, size;
@@ -76,7 +88,7 @@ static int guest_memfd_for_each_populated_section(const GuestMemfdManager *gmm,
break;
}
- ret = cb(&tmp, arg);
+ ret = cb(&tmp, is_private, arg);
if (ret) {
break;
}
@@ -88,10 +100,11 @@ static int guest_memfd_for_each_populated_section(const GuestMemfdManager *gmm,
return ret;
}
-static int guest_memfd_for_each_discarded_section(const GuestMemfdManager *gmm,
- MemoryRegionSection *section,
- void *arg,
- guest_memfd_section_cb cb)
+static int guest_memfd_for_each_private_section(const GuestMemfdManager *gmm,
+ MemoryRegionSection *section,
+ bool is_private,
+ void *arg,
+ guest_memfd_section_cb cb)
{
unsigned long first_zero_bit, last_zero_bit;
uint64_t offset, size;
@@ -113,7 +126,7 @@ static int guest_memfd_for_each_discarded_section(const GuestMemfdManager *gmm,
break;
}
- ret = cb(&tmp, arg);
+ ret = cb(&tmp, is_private, arg);
if (ret) {
break;
}
@@ -146,8 +159,9 @@ static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
- ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
- guest_memfd_notify_populate_cb);
+ /* Populate shared part */
+ ret = guest_memfd_for_each_shared_section(gmm, section, false, rdl,
+ guest_memfd_notify_populate_cb);
if (ret) {
error_report("%s: Failed to register RAM discard listener: %s", __func__,
strerror(-ret));
@@ -163,8 +177,9 @@ static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm,
g_assert(rdl->section);
g_assert(rdl->section->mr == gmm->mr);
- ret = guest_memfd_for_each_populated_section(gmm, rdl->section, rdl,
- guest_memfd_notify_discard_cb);
+ /* Discard shared part */
+ ret = guest_memfd_for_each_shared_section(gmm, rdl->section, false, rdl,
+ guest_memfd_notify_discard_cb);
if (ret) {
error_report("%s: Failed to unregister RAM discard listener: %s", __func__,
strerror(-ret));
@@ -181,16 +196,18 @@ typedef struct GuestMemfdReplayData {
void *opaque;
} GuestMemfdReplayData;
-static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
+static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section,
+ bool is_private, void *arg)
{
struct GuestMemfdReplayData *data = arg;
ReplayRamPopulate replay_fn = data->fn;
- return replay_fn(section, data->opaque);
+ return replay_fn(section, is_private, data->opaque);
}
static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
MemoryRegionSection *section,
+ bool is_private,
ReplayRamPopulate replay_fn,
void *opaque)
{
@@ -198,22 +215,31 @@ static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
g_assert(section->mr == gmm->mr);
- return guest_memfd_for_each_populated_section(gmm, section, &data,
- guest_memfd_rdm_replay_populated_cb);
+ if (is_private) {
+ /* Replay populate on private section */
+ return guest_memfd_for_each_private_section(gmm, section, is_private, &data,
+ guest_memfd_rdm_replay_populated_cb);
+ } else {
+ /* Replay populate on shared section */
+ return guest_memfd_for_each_shared_section(gmm, section, is_private, &data,
+ guest_memfd_rdm_replay_populated_cb);
+ }
}
-static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
+static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section,
+ bool is_private, void *arg)
{
struct GuestMemfdReplayData *data = arg;
ReplayRamDiscard replay_fn = data->fn;
- replay_fn(section, data->opaque);
+ replay_fn(section, is_private, data->opaque);
return 0;
}
static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
MemoryRegionSection *section,
+ bool is_private,
ReplayRamDiscard replay_fn,
void *opaque)
{
@@ -221,8 +247,16 @@ static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
g_assert(section->mr == gmm->mr);
- guest_memfd_for_each_discarded_section(gmm, section, &data,
- guest_memfd_rdm_replay_discarded_cb);
+
+ if (is_private) {
+ /* Replay discard on private section */
+ guest_memfd_for_each_private_section(gmm, section, is_private, &data,
+ guest_memfd_rdm_replay_discarded_cb);
+ } else {
+ /* Replay discard on shared section */
+ guest_memfd_for_each_shared_section(gmm, section, is_private, &data,
+ guest_memfd_rdm_replay_discarded_cb);
+ }
}
static bool guest_memfd_is_valid_range(GuestMemfdManager *gmm,
@@ -257,8 +291,9 @@ static void guest_memfd_notify_discard(GuestMemfdManager *gmm,
continue;
}
- guest_memfd_for_each_populated_section(gmm, &tmp, rdl,
- guest_memfd_notify_discard_cb);
+ /* For current shared section, notify to discard shared parts */
+ guest_memfd_for_each_shared_section(gmm, &tmp, false, rdl,
+ guest_memfd_notify_discard_cb);
}
}
@@ -276,8 +311,9 @@ static int guest_memfd_notify_populate(GuestMemfdManager *gmm,
continue;
}
- ret = guest_memfd_for_each_discarded_section(gmm, &tmp, rdl,
- guest_memfd_notify_populate_cb);
+ /* For current private section, notify to populate the shared parts */
+ ret = guest_memfd_for_each_private_section(gmm, &tmp, false, rdl,
+ guest_memfd_notify_populate_cb);
if (ret) {
break;
}
@@ -295,8 +331,8 @@ static int guest_memfd_notify_populate(GuestMemfdManager *gmm,
continue;
}
- guest_memfd_for_each_discarded_section(gmm, &tmp, rdl2,
- guest_memfd_notify_discard_cb);
+ guest_memfd_for_each_private_section(gmm, &tmp, false, rdl2,
+ guest_memfd_notify_discard_cb);
}
}
return ret;
diff --git a/system/memory.c b/system/memory.c
index ddcec90f5e..d3d5a04f98 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2133,34 +2133,37 @@ uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
}
bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
- const MemoryRegionSection *section)
+ const MemoryRegionSection *section,
+ bool is_private)
{
RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
g_assert(rdmc->is_populated);
- return rdmc->is_populated(rdm, section);
+ return rdmc->is_populated(rdm, section, is_private);
}
int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
MemoryRegionSection *section,
+ bool is_private,
ReplayRamPopulate replay_fn,
void *opaque)
{
RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
g_assert(rdmc->replay_populated);
- return rdmc->replay_populated(rdm, section, replay_fn, opaque);
+ return rdmc->replay_populated(rdm, section, is_private, replay_fn, opaque);
}
void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
MemoryRegionSection *section,
+ bool is_private,
ReplayRamDiscard replay_fn,
void *opaque)
{
RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
g_assert(rdmc->replay_discarded);
- rdmc->replay_discarded(rdm, section, replay_fn, opaque);
+ rdmc->replay_discarded(rdm, section, is_private, replay_fn, opaque);
}
void ram_discard_manager_register_listener(RamDiscardManager *rdm,
@@ -2221,7 +2224,7 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
* Disallow that. vmstate priorities make sure any RamDiscardManager
* were already restored before IOMMUs are restored.
*/
- if (!ram_discard_manager_is_populated(rdm, &tmp)) {
+ if (!ram_discard_manager_is_populated(rdm, &tmp, false)) {
error_setg(errp, "iommu map to discarded memory (e.g., unplugged"
" via virtio-mem): %" HWADDR_PRIx "",
iotlb->translated_addr);
diff --git a/system/memory_mapping.c b/system/memory_mapping.c
index ca2390eb80..c55c0c0c93 100644
--- a/system/memory_mapping.c
+++ b/system/memory_mapping.c
@@ -249,7 +249,7 @@ static void guest_phys_block_add_section(GuestPhysListener *g,
}
static int guest_phys_ram_populate_cb(MemoryRegionSection *section,
- void *opaque)
+ bool is_private, void *opaque)
{
GuestPhysListener *g = opaque;
@@ -274,7 +274,7 @@ static void guest_phys_blocks_region_add(MemoryListener *listener,
RamDiscardManager *rdm;
rdm = memory_region_get_ram_discard_manager(section->mr);
- ram_discard_manager_replay_populated(rdm, section,
+ ram_discard_manager_replay_populated(rdm, section, false,
guest_phys_ram_populate_cb, g);
return;
}
--
2.43.5
^ permalink raw reply related [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2024-12-13 7:08 ` [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Chenyi Qiang
@ 2024-12-18 6:45 ` Chenyi Qiang
2025-01-08 4:48 ` Alexey Kardashevskiy
2025-01-20 18:09 ` Peter Xu
2 siblings, 0 replies; 98+ messages in thread
From: Chenyi Qiang @ 2024-12-18 6:45 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 12/13/2024 3:08 PM, Chenyi Qiang wrote:
> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
> uncoordinated discard") highlighted, some subsystems like VFIO might
> disable ram block discard. However, guest_memfd relies on the discard
> operation to perform page conversion between private and shared memory.
> This can lead to a stale IOMMU mapping issue when assigning a hardware
> device to a confidential VM via shared memory (unprotected memory
> pages). Blocking shared page discard can solve this problem, but it
> could cause guests to consume twice the memory with VFIO, which is not
> acceptable in some cases. An alternative solution is to notify other
> systems like VFIO to refresh their outdated IOMMU mappings.
>
> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
> VFIO mappings in relation to VM page assignment. Effectively page
> conversion is similar to hot-removing a page in one mode and adding it
> back in the other, so the work that needs to happen in response to
> virtio-mem changes also needs to happen for page conversion events.
> Introduce the RamDiscardManager to guest_memfd to achieve it.
>
> However, guest_memfd is not an object so it cannot directly implement
> the RamDiscardManager interface.
>
> One solution is to implement the interface in HostMemoryBackend. Any
> guest_memfd-backed host memory backend can register itself in the target
> MemoryRegion. However, this solution doesn't cover the scenario where a
> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
> the virtual BIOS MemoryRegion.
>
> Thus, choose the second option, i.e. define an object type named
> guest_memfd_manager with RamDiscardManager interface. Upon creation of
> guest_memfd, a new guest_memfd_manager object can be instantiated and
> registered to the managed guest_memfd MemoryRegion to handle the page
> conversion events.
>
> In the context of guest_memfd, the discarded state signifies that the
> page is private, while the populated state indicates that the page is
> shared. The state of the memory is tracked at the granularity of the
> host page size (i.e. block_size), as the minimum conversion size can be
> one page per request.
>
> In addition, VFIO expects the DMA mapping for a specific iova to be
> mapped and unmapped with the same granularity. However, the confidential
> VMs may do partial conversion, e.g. conversion happens on a small region
> within a large region. To prevent such invalid cases and before any
> potential optimization comes out, all operations are performed with 4K
> granularity.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> include/sysemu/guest-memfd-manager.h | 46 +++++
> system/guest-memfd-manager.c | 250 +++++++++++++++++++++++++++
> system/meson.build | 1 +
> 3 files changed, 297 insertions(+)
> create mode 100644 include/sysemu/guest-memfd-manager.h
> create mode 100644 system/guest-memfd-manager.c
>
> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
> new file mode 100644
> index 0000000000..ba4a99b614
> --- /dev/null
> +++ b/include/sysemu/guest-memfd-manager.h
> @@ -0,0 +1,46 @@
> +/*
> + * QEMU guest memfd manager
> + *
> + * Copyright Intel
> + *
> + * Author:
> + * Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
> +
> +#include "sysemu/hostmem.h"
> +
> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
> +
> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass, GUEST_MEMFD_MANAGER)
> +
> +struct GuestMemfdManager {
> + Object parent;
> +
> + /* Managed memory region. */
> + MemoryRegion *mr;
> +
> + /*
> + * 1-setting of the bit represents the memory is populated (shared).
> + */
> + int32_t bitmap_size;
> + unsigned long *bitmap;
> +
> + /* block size and alignment */
> + uint64_t block_size;
> +
> + /* listeners to notify on populate/discard activity. */
> + QLIST_HEAD(, RamDiscardListener) rdl_list;
> +};
> +
> +struct GuestMemfdManagerClass {
> + ObjectClass parent_class;
> +};
> +
> +#endif
> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
> new file mode 100644
> index 0000000000..d7e105fead
> --- /dev/null
> +++ b/system/guest-memfd-manager.c
> @@ -0,0 +1,250 @@
> +/*
> + * QEMU guest memfd manager
> + *
> + * Copyright Intel
> + *
> + * Author:
> + * Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "sysemu/guest-memfd-manager.h"
> +
> +OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(GuestMemfdManager,
> + guest_memfd_manager,
> + GUEST_MEMFD_MANAGER,
> + OBJECT,
> + { TYPE_RAM_DISCARD_MANAGER },
> + { })
> +
Fixup: Use OBJECT_DEFINE_TYPE_WITH_INTERFACES() instead of
OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES() as we define a class struct.
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index 50802b34d7..f7dc93071a 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -15,12 +15,12 @@
#include "qemu/error-report.h"
#include "sysemu/guest-memfd-manager.h"
-OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(GuestMemfdManager,
- guest_memfd_manager,
- GUEST_MEMFD_MANAGER,
- OBJECT,
- { TYPE_RAM_DISCARD_MANAGER },
- { })
+OBJECT_DEFINE_TYPE_WITH_INTERFACES(GuestMemfdManager,
+ guest_memfd_manager,
+ GUEST_MEMFD_MANAGER,
+ OBJECT,
+ { TYPE_RAM_DISCARD_MANAGER },
+ { })
static bool guest_memfd_rdm_is_populated(const RamDiscardManager *rdm,
^ permalink raw reply related [flat|nested] 98+ messages in thread
* Re: [PATCH 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range
2024-12-13 7:08 ` [PATCH 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
@ 2024-12-18 12:33 ` David Hildenbrand
2025-01-08 4:47 ` Alexey Kardashevskiy
1 sibling, 0 replies; 98+ messages in thread
From: David Hildenbrand @ 2024-12-18 12:33 UTC (permalink / raw)
To: Chenyi Qiang, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 13.12.24 08:08, Chenyi Qiang wrote:
> Rename the helper to memory_region_section_intersect_range() to make it
> more generic.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> hw/virtio/virtio-mem.c | 32 +++++---------------------------
> include/exec/memory.h | 13 +++++++++++++
> system/memory.c | 17 +++++++++++++++++
> 3 files changed, 35 insertions(+), 27 deletions(-)
>
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 80ada89551..e3d1ccaeeb 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -242,28 +242,6 @@ static int virtio_mem_for_each_plugged_range(VirtIOMEM *vmem, void *arg,
> return ret;
> }
>
> -/*
> - * Adjust the memory section to cover the intersection with the given range.
> - *
> - * Returns false if the intersection is empty, otherwise returns true.
> - */
> -static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
> - uint64_t offset, uint64_t size)
> -{
> - uint64_t start = MAX(s->offset_within_region, offset);
> - uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
> - offset + size);
> -
> - if (end <= start) {
> - return false;
> - }
> -
> - s->offset_within_address_space += start - s->offset_within_region;
> - s->offset_within_region = start;
> - s->size = int128_make64(end - start);
> - return true;
> -}
> -
> typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void *arg);
>
> static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
> @@ -285,7 +263,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
> first_bit + 1) - 1;
> size = (last_bit - first_bit + 1) * vmem->block_size;
>
> - if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> break;
> }
> ret = cb(&tmp, arg);
> @@ -317,7 +295,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
> first_bit + 1) - 1;
> size = (last_bit - first_bit + 1) * vmem->block_size;
>
> - if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> break;
> }
> ret = cb(&tmp, arg);
> @@ -353,7 +331,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
> QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
> MemoryRegionSection tmp = *rdl->section;
>
> - if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> continue;
> }
> rdl->notify_discard(rdl, &tmp);
> @@ -369,7 +347,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
> QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
> MemoryRegionSection tmp = *rdl->section;
>
> - if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> continue;
> }
> ret = rdl->notify_populate(rdl, &tmp);
> @@ -386,7 +364,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
> if (rdl2 == rdl) {
> break;
> }
> - if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> continue;
> }
> rdl2->notify_discard(rdl2, &tmp);
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index e5e865d1a9..ec7bc641e8 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -1196,6 +1196,19 @@ MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s);
> */
> void memory_region_section_free_copy(MemoryRegionSection *s);
>
> +/**
> + * memory_region_section_intersect_range: Adjust the memory section to cover
> + * the intersection with the given range.
> + *
> + * @s: the #MemoryRegionSection to be adjusted
> + * @offset: the offset of the given range in the memory region
> + * @size: the size of the given range
> + *
> + * Returns false if the intersection is empty, otherwise returns true.
> + */
> +bool memory_region_section_intersect_range(MemoryRegionSection *s,
> + uint64_t offset, uint64_t size);
> +
> /**
> * memory_region_init: Initialize a memory region
Maybe it could simply be an inline function.
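Untested, just moving the body from this patch into include/exec/memory.h
as-is:

    static inline bool
    memory_region_section_intersect_range(MemoryRegionSection *s,
                                          uint64_t offset, uint64_t size)
    {
        uint64_t start = MAX(s->offset_within_region, offset);
        uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
                           offset + size);

        if (end <= start) {
            return false;
        }

        s->offset_within_address_space += start - s->offset_within_region;
        s->offset_within_region = start;
        s->size = int128_make64(end - start);
        return true;
    }

In any case, LGTM: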
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2024-12-13 7:08 [PATCH 0/7] Enable shared device assignment Chenyi Qiang
` (6 preceding siblings ...)
2024-12-13 7:08 ` [RFC PATCH 7/7] memory: Add a new argument to indicate the request attribute in RamDiscardManager helpers Chenyi Qiang
@ 2025-01-08 4:47 ` Alexey Kardashevskiy
2025-01-08 6:28 ` Chenyi Qiang
7 siblings, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-08 4:47 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 13/12/24 18:08, Chenyi Qiang wrote:
> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> discard") effectively disables device assignment when using guest_memfd.
> This poses a significant challenge as guest_memfd is essential for
> confidential guests, thereby blocking device assignment to these VMs.
> The initial rationale for disabling device assignment was due to stale
> IOMMU mappings (see Problem section) and the assumption that TEE I/O
> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
> problem for confidential guests [1]. However, this assumption has proven
> to be incorrect. TEE I/O relies on the ability to operate devices against
> "shared" or untrusted memory, which is crucial for device initialization
> and error recovery scenarios. As a result, the current implementation does
> not adequately support device assignment for confidential guests, necessitating
> a reevaluation of the approach to ensure compatibility and functionality.
>
> This series enables shared device assignment by notifying VFIO of page
> conversions using an existing framework named RamDiscardListener.
> Additionally, there is an ongoing patch set [2] that aims to add 1G page
> support for guest_memfd. This patch set introduces in-place page conversion,
> where private and shared memory share the same physical pages as the backend.
> This development may impact our solution.
>
> We presented our solution in the guest_memfd meeting to discuss its
> compatibility with the new changes and potential future directions (see [3]
> for more details). The conclusion was that, although our solution may not be
> the most elegant (see the Limitation section), it is sufficient for now and
> can be easily adapted to future changes.
>
> We are re-posting the patch series with some cleanup and have removed the RFC
> label for the main enabling patches (1-6). The newly-added patch 7 is still
> marked as RFC as it tries to resolve some extension concerns related to
> RamDiscardManager for future usage.
>
> The overview of the patches:
> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
> with a given range.
> - Patch 2-6: Introduce a new object to manage the guest-memfd with
> RamDiscardManager, and notify the shared/private state change during
> conversion.
> - Patch 7: Try to resolve a semantics concern related to RamDiscardManager
> i.e. RamDiscardManager is used to manage memory plug/unplug state
> instead of shared/private state. It would affect future users of
> RamDiscardManager in confidential VMs. Attach it at the end as an RFC patch[4].
>
> Changes since last version:
> - Add a patch to export some generic helper functions from virtio-mem code.
> - Change the bitmap in guest_memfd_manager from default shared to default
> private. This keeps alignment with virtio-mem that 1-setting in bitmap
> represents the populated state and may help to export more generic code
> if necessary.
> - Add the helpers to initialize/uninitialize the guest_memfd_manager instance
> to make it more clear.
> - Add a patch to distinguish between the shared/private state change and
> the memory plug/unplug state change in RamDiscardManager.
> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-chenyi.qiang@intel.com/
>
> ---
>
> Background
> ==========
> Confidential VMs have two classes of memory: shared and private memory.
> Shared memory is accessible from the host/VMM while private memory is
> not. Confidential VMs can decide which memory is shared/private and
> convert memory between shared/private at runtime.
>
> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
> private memory. The key differences between guest_memfd and normal memfd
> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
> cannot be mapped, read or written by userspace.
The "cannot be mapped" seems to be not true soon anymore (if not already).
https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/
>
> In QEMU's implementation, shared memory is allocated with normal methods
> (e.g. mmap or fallocate) while private memory is allocated from
> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
> allocates new pages from the other side.
>
> Problem
> =======
> Device assignment in QEMU is implemented via VFIO system. In the normal
> VM, VM memory is pinned at the beginning of time by VFIO. In the
> confidential VM, the VM can convert memory and when that happens
> nothing currently tells VFIO that its mappings are stale. This means
> that page conversion leaks memory and leaves stale IOMMU mappings. For
> example, sequence like the following can result in stale IOMMU mappings:
>
> 1. allocate shared page
> 2. convert page shared->private
> 3. discard shared page
> 4. convert page private->shared
> 5. allocate shared page
> 6. issue DMA operations against that shared page
>
> After step 3, VFIO is still pinning the page. However, DMA operations in
> step 6 will hit the old mapping that was allocated in step 1, which
> causes the device to access the invalid data.
>
> Solution
> ========
> The key to enable shared device assignment is to update the IOMMU mappings
> on page conversion.
>
> Given the constraints and assumptions, here is a solution that satisfies
> the use cases. RamDiscardManager, an existing interface currently
> utilized by virtio-mem, offers a means to modify IOMMU mappings in
> accordance with VM page assignment. Page conversion is similar to
> hot-removing a page in one mode and adding it back in the other.
>
> This series implements a RamDiscardManager for confidential VMs and
> utilizes its infrastructure to notify VFIO of page conversions.
>
> Another possible attempt [5] was to not discard shared pages in step 3
> above. This was an incomplete band-aid because guests would consume
> twice the memory since shared pages wouldn't be freed even after they
> were converted to private.
>
> w/ in-place page conversion
> ===========================
> To support 1G page support for guest_memfd, the current direction is to
> allow mmap() of guest_memfd to userspace so that both private and shared
> memory can use the same physical pages as the backend. This in-place page
> conversion design eliminates the need to discard pages during shared/private
> conversions. However, device assignment will still be blocked because the
> in-place page conversion will reject the conversion when the page is pinned
> by VFIO.
>
> To address this, the key difference lies in the sequence of VFIO map/unmap
> operations and the page conversion. This series can be adjusted to achieve
> unmap-before-conversion-to-private and map-after-conversion-to-shared,
> ensuring compatibility with guest_memfd.
>
> Additionally, with in-place page conversion, the previously mentioned
> solution to disable the discard of shared pages is not feasible because
> shared and private memory share the same backend, and no discard operation
> is performed. Retaining the old mappings in the IOMMU would result in
> unsafe DMA access to protected memory.
>
> Limitation
> ==========
>
> One limitation (also discussed in the guest_memfd meeting) is that VFIO
> expects the DMA mapping for a specific IOVA to be mapped and unmapped with
> the same granularity. The guest may perform partial conversions, such as
> converting a small region within a larger region. To prevent such invalid
> cases, all operations are performed with 4K granularity. The possible
> solutions we can think of are either to enable VFIO to support partial unmap
> or to implement an enlightened guest to avoid partial conversion. The former
> requires complex changes in VFIO, while the latter requires the page
> conversion to be a guest-enlightened behavior. It is still uncertain which
> option is a preferred one.
in-place memory conversion is :)
>
> Testing
> =======
> This patch series is tested with the KVM/QEMU branch:
> KVM: https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20
> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2024-12-13
The branch is gone now? tdx-upstream-snapshot-2024-12-18 seems to have
these though. Thanks,
>
> To facilitate shared device assignment with the NIC, employ the legacy
> type1 VFIO with the QEMU command:
>
> qemu-system-x86_64 [...]
> -device vfio-pci,host=XX:XX.X
>
> The dma_entry_limit parameter needs to be adjusted. For example, a
> 16GB guest needs something like
> vfio_iommu_type1.dma_entry_limit=4194304.
>
> If using the iommufd-backed VFIO, use the QEMU command:
>
> qemu-system-x86_64 [...]
> -object iommufd,id=iommufd0 \
> -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>
> No additional adjustment required.
>
> Following the bootup of the TD guest, the guest's IP address becomes
> visible, and iperf is able to successfully send and receive data.
>
> Related link
> ============
> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
> [2] https://lore.kernel.org/lkml/cover.1726009989.git.ackerleytng@google.com/
> [3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.jr4csfgw1uql
> [4] https://lore.kernel.org/qemu-devel/d299bbad-81bc-462e-91b5-a6d9c27ffe3a@redhat.com/
> [5] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/
>
> Chenyi Qiang (7):
> memory: Export a helper to get intersection of a MemoryRegionSection
> with a given range
> guest_memfd: Introduce an object to manage the guest-memfd with
> RamDiscardManager
> guest_memfd: Introduce a callback to notify the shared/private state
> change
> KVM: Notify the state change event during shared/private conversion
> memory: Register the RamDiscardManager instance upon guest_memfd
> creation
> RAMBlock: make guest_memfd require coordinate discard
> memory: Add a new argument to indicate the request attribute in
> RamDiscardManager helpers
>
> accel/kvm/kvm-all.c | 4 +
> hw/vfio/common.c | 22 +-
> hw/virtio/virtio-mem.c | 55 ++--
> include/exec/memory.h | 36 ++-
> include/sysemu/guest-memfd-manager.h | 91 ++++++
> migration/ram.c | 14 +-
> system/guest-memfd-manager.c | 456 +++++++++++++++++++++++++++
> system/memory.c | 30 +-
> system/memory_mapping.c | 4 +-
> system/meson.build | 1 +
> system/physmem.c | 9 +-
> 11 files changed, 659 insertions(+), 63 deletions(-)
> create mode 100644 include/sysemu/guest-memfd-manager.h
> create mode 100644 system/guest-memfd-manager.c
>
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range
2024-12-13 7:08 ` [PATCH 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
2024-12-18 12:33 ` David Hildenbrand
@ 2025-01-08 4:47 ` Alexey Kardashevskiy
2025-01-08 6:41 ` Chenyi Qiang
1 sibling, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-08 4:47 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 13/12/24 18:08, Chenyi Qiang wrote:
> Rename the helper to memory_region_section_intersect_range() to make it
> more generic.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> hw/virtio/virtio-mem.c | 32 +++++---------------------------
> include/exec/memory.h | 13 +++++++++++++
> system/memory.c | 17 +++++++++++++++++
> 3 files changed, 35 insertions(+), 27 deletions(-)
>
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 80ada89551..e3d1ccaeeb 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -242,28 +242,6 @@ static int virtio_mem_for_each_plugged_range(VirtIOMEM *vmem, void *arg,
> return ret;
> }
>
> -/*
> - * Adjust the memory section to cover the intersection with the given range.
> - *
> - * Returns false if the intersection is empty, otherwise returns true.
> - */
> -static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
> - uint64_t offset, uint64_t size)
> -{
> - uint64_t start = MAX(s->offset_within_region, offset);
> - uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
> - offset + size);
> -
> - if (end <= start) {
> - return false;
> - }
> -
> - s->offset_within_address_space += start - s->offset_within_region;
> - s->offset_within_region = start;
> - s->size = int128_make64(end - start);
> - return true;
> -}
> -
> typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void *arg);
>
> static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
> @@ -285,7 +263,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
> first_bit + 1) - 1;
> size = (last_bit - first_bit + 1) * vmem->block_size;
>
> - if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> break;
> }
> ret = cb(&tmp, arg);
> @@ -317,7 +295,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
> first_bit + 1) - 1;
> size = (last_bit - first_bit + 1) * vmem->block_size;
>
> - if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> break;
> }
> ret = cb(&tmp, arg);
> @@ -353,7 +331,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
> QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
> MemoryRegionSection tmp = *rdl->section;
>
> - if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> continue;
> }
> rdl->notify_discard(rdl, &tmp);
> @@ -369,7 +347,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
> QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
> MemoryRegionSection tmp = *rdl->section;
>
> - if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> continue;
> }
> ret = rdl->notify_populate(rdl, &tmp);
> @@ -386,7 +364,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
> if (rdl2 == rdl) {
> break;
> }
> - if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> continue;
> }
> rdl2->notify_discard(rdl2, &tmp);
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index e5e865d1a9..ec7bc641e8 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -1196,6 +1196,19 @@ MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s);
> */
> void memory_region_section_free_copy(MemoryRegionSection *s);
>
> +/**
> + * memory_region_section_intersect_range: Adjust the memory section to cover
> + * the intersection with the given range.
> + *
> + * @s: the #MemoryRegionSection to be adjusted
> + * @offset: the offset of the given range in the memory region
> + * @size: the size of the given range
> + *
> + * Returns false if the intersection is empty, otherwise returns true.
> + */
> +bool memory_region_section_intersect_range(MemoryRegionSection *s,
> + uint64_t offset, uint64_t size);
> +
> /**
> * memory_region_init: Initialize a memory region
> *
> diff --git a/system/memory.c b/system/memory.c
> index 85f6834cb3..ddcec90f5e 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2898,6 +2898,23 @@ void memory_region_section_free_copy(MemoryRegionSection *s)
> g_free(s);
> }
>
> +bool memory_region_section_intersect_range(MemoryRegionSection *s,
> + uint64_t offset, uint64_t size)
> +{
> + uint64_t start = MAX(s->offset_within_region, offset);
> + uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
> + offset + size);
imho @end needs to be Int128 and s/MIN/int128_min/, etc. to be totally
correct (although it is going to look horrendous). Maybe it was alright
when it was just virtio, but now it is a wider API. I understand this is
cut-n-paste and an unlikely scenario of offset+size crossing 1<<64, but
still.
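Something along these lines maybe (completely untested, just to illustrate
the idea; assuming the usual int128_* helpers from qemu/int128.h):

    Int128 start = int128_max(int128_make64(s->offset_within_region),
                              int128_make64(offset));
    Int128 end = int128_min(int128_add(int128_make64(s->offset_within_region),
                                       s->size),
                            int128_add(int128_make64(offset),
                                       int128_make64(size)));

    if (int128_le(end, start)) {
        return false;
    }

    s->offset_within_address_space +=
        int128_get64(int128_sub(start, int128_make64(s->offset_within_region)));
    s->offset_within_region = int128_get64(start);
    s->size = int128_sub(end, start);
    return true;

Thanks,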
> +
> + if (end <= start) {
> + return false;
> + }
> +
> + s->offset_within_address_space += start - s->offset_within_region;
> + s->offset_within_region = start;
> + s->size = int128_make64(end - start);
> + return true;
> +}
> +
> bool memory_region_present(MemoryRegion *container, hwaddr addr)
> {
> MemoryRegion *mr;
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation
2024-12-13 7:08 ` [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation Chenyi Qiang
@ 2025-01-08 4:47 ` Alexey Kardashevskiy
2025-01-09 5:34 ` Chenyi Qiang
2025-01-09 8:14 ` Zhao Liu
1 sibling, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-08 4:47 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 13/12/24 18:08, Chenyi Qiang wrote:
> Introduce the realize()/unrealize() callbacks to initialize/uninitialize
> the new guest_memfd_manager object and register/unregister it in the
> target MemoryRegion.
>
> Guest_memfd was initially set to shared until the commit bd3bcf6962
> ("kvm/memory: Make memory type private by default if it has guest memfd
> backend"). To align with this change, the default state in
> guest_memfd_manager is set to private. (The bitmap is cleared to 0).
> Additionally, setting the default to private can also reduce the
> overhead of mapping shared pages into IOMMU by VFIO during the bootup stage.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> include/sysemu/guest-memfd-manager.h | 27 +++++++++++++++++++++++++++
> system/guest-memfd-manager.c | 28 +++++++++++++++++++++++++++-
> system/physmem.c | 7 +++++++
> 3 files changed, 61 insertions(+), 1 deletion(-)
>
> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
> index 9dc4e0346d..d1e7f698e8 100644
> --- a/include/sysemu/guest-memfd-manager.h
> +++ b/include/sysemu/guest-memfd-manager.h
> @@ -42,6 +42,8 @@ struct GuestMemfdManager {
> struct GuestMemfdManagerClass {
> ObjectClass parent_class;
>
> + void (*realize)(GuestMemfdManager *gmm, MemoryRegion *mr, uint64_t region_size);
> + void (*unrealize)(GuestMemfdManager *gmm);
> int (*state_change)(GuestMemfdManager *gmm, uint64_t offset, uint64_t size,
> bool shared_to_private);
> };
> @@ -61,4 +63,29 @@ static inline int guest_memfd_manager_state_change(GuestMemfdManager *gmm, uint6
> return 0;
> }
>
> +static inline void guest_memfd_manager_realize(GuestMemfdManager *gmm,
> + MemoryRegion *mr, uint64_t region_size)
> +{
> + GuestMemfdManagerClass *klass;
> +
> + g_assert(gmm);
> + klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
> +
> + if (klass->realize) {
> + klass->realize(gmm, mr, region_size);
Ditch realize() hook and call guest_memfd_manager_realizefn() directly?
Not clear why these new hooks are needed.
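i.e. in ram_block_add() simply do (sketch, assuming
guest_memfd_manager_realizefn() gets exported from guest-memfd-manager.c
instead of being a class hook):

    GuestMemfdManager *gmm =
        GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));

    guest_memfd_manager_realizefn(gmm, new_block->mr, new_block->mr->size);

and the same for the unrealize side in reclaim_ramblock().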
> + }
> +}
> +
> +static inline void guest_memfd_manager_unrealize(GuestMemfdManager *gmm)
> +{
> + GuestMemfdManagerClass *klass;
> +
> + g_assert(gmm);
> + klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
> +
> + if (klass->unrealize) {
> + klass->unrealize(gmm);
> + }
> +}
guest_memfd_manager_unrealizefn()?
> +
> #endif
> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
> index 6601df5f3f..b6a32f0bfb 100644
> --- a/system/guest-memfd-manager.c
> +++ b/system/guest-memfd-manager.c
> @@ -366,6 +366,31 @@ static int guest_memfd_state_change(GuestMemfdManager *gmm, uint64_t offset,
> return ret;
> }
>
> +static void guest_memfd_manager_realizefn(GuestMemfdManager *gmm, MemoryRegion *mr,
> + uint64_t region_size)
> +{
> + uint64_t bitmap_size;
> +
> + gmm->block_size = qemu_real_host_page_size();
> + bitmap_size = ROUND_UP(region_size, gmm->block_size) / gmm->block_size;
imho unaligned region_size should be an assert.
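e.g. (sketch, right after block_size is set):

    g_assert(QEMU_IS_ALIGNED(region_size, gmm->block_size));

and then the ROUND_UP() would not be needed.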
> +
> + gmm->mr = mr;
> + gmm->bitmap_size = bitmap_size;
> + gmm->bitmap = bitmap_new(bitmap_size);
> +
> + memory_region_set_ram_discard_manager(gmm->mr, RAM_DISCARD_MANAGER(gmm));
> +}
This belongs to 2/7.
> +
> +static void guest_memfd_manager_unrealizefn(GuestMemfdManager *gmm)
> +{
> + memory_region_set_ram_discard_manager(gmm->mr, NULL);
> +
> + g_free(gmm->bitmap);
> + gmm->bitmap = NULL;
> + gmm->bitmap_size = 0;
> + gmm->mr = NULL;
@gmm is being destroyed here, why bother zeroing?
> +}
> +
This function belongs to 2/7.
> static void guest_memfd_manager_init(Object *obj)
> {
> GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
> @@ -375,7 +400,6 @@ static void guest_memfd_manager_init(Object *obj)
>
> static void guest_memfd_manager_finalize(Object *obj)
> {
> - g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
> }
>
> static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
> @@ -384,6 +408,8 @@ static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
> RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>
> gmmc->state_change = guest_memfd_state_change;
> + gmmc->realize = guest_memfd_manager_realizefn;
> + gmmc->unrealize = guest_memfd_manager_unrealizefn;
>
> rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
> rdmc->register_listener = guest_memfd_rdm_register_listener;
> diff --git a/system/physmem.c b/system/physmem.c
> index dc1db3a384..532182a6dd 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -53,6 +53,7 @@
> #include "sysemu/hostmem.h"
> #include "sysemu/hw_accel.h"
> #include "sysemu/xen-mapcache.h"
> +#include "sysemu/guest-memfd-manager.h"
> #include "trace.h"
>
> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> @@ -1885,6 +1886,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> qemu_mutex_unlock_ramlist();
> goto out_free;
> }
> +
> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
> + guest_memfd_manager_realize(gmm, new_block->mr, new_block->mr->size);
Wow. Quite invasive.
> }
>
> ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
> @@ -2139,6 +2143,9 @@ static void reclaim_ramblock(RAMBlock *block)
>
> if (block->guest_memfd >= 0) {
> close(block->guest_memfd);
> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(block->mr->rdm);
> + guest_memfd_manager_unrealize(gmm);
> + object_unref(OBJECT(gmm));
Likely doesn't matter, but I'd do the cleanup before close() or do
block->guest_memfd = -1 before the cleanup.
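For the first variant, roughly (untested sketch):

    if (block->guest_memfd >= 0) {
        GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(block->mr->rdm);

        /* drop the manager (and its use of block->mr) first... */
        guest_memfd_manager_unrealize(gmm);
        object_unref(OBJECT(gmm));

        /* ...then close and invalidate the fd */
        close(block->guest_memfd);
        block->guest_memfd = -1;
        ram_block_discard_require(false);
    }

Thanks,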
> ram_block_discard_require(false);
> }
>
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2024-12-13 7:08 ` [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Chenyi Qiang
2024-12-18 6:45 ` Chenyi Qiang
@ 2025-01-08 4:48 ` Alexey Kardashevskiy
2025-01-08 10:56 ` Chenyi Qiang
2025-01-20 18:09 ` Peter Xu
2 siblings, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-08 4:48 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 13/12/24 18:08, Chenyi Qiang wrote:
> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
> uncoordinated discard") highlighted, some subsystems like VFIO might
> disable ram block discard. However, guest_memfd relies on the discard
> operation to perform page conversion between private and shared memory.
> This can lead to a stale IOMMU mapping issue when assigning a hardware
> device to a confidential VM via shared memory (unprotected memory
> pages). Blocking shared page discard can solve this problem, but it
> could cause guests to consume twice the memory with VFIO, which is not
> acceptable in some cases. An alternative solution is to notify other
> systems like VFIO to refresh their outdated IOMMU mappings.
>
> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
> VFIO mappings in relation to VM page assignment. Effectively page
> conversion is similar to hot-removing a page in one mode and adding it
> back in the other, so the work that needs to happen in response to
> virtio-mem changes also needs to happen for page conversion events.
> Introduce the RamDiscardManager to guest_memfd to achieve it.
>
> However, guest_memfd is not an object so it cannot directly implement
> the RamDiscardManager interface.
>
> One solution is to implement the interface in HostMemoryBackend. Any
This sounds about right.
> guest_memfd-backed host memory backend can register itself in the target
> MemoryRegion. However, this solution doesn't cover the scenario where a
> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
> the virtual BIOS MemoryRegion.
What is this virtual BIOS MemoryRegion exactly? What does it look like
in "info mtree -f"? Do we really want this memory to be DMAable?
> Thus, choose the second option, i.e. define an object type named
> guest_memfd_manager with RamDiscardManager interface. Upon creation of
> guest_memfd, a new guest_memfd_manager object can be instantiated and
> registered to the managed guest_memfd MemoryRegion to handle the page
> conversion events.
>
> In the context of guest_memfd, the discarded state signifies that the
> page is private, while the populated state indicates that the page is
> shared. The state of the memory is tracked at the granularity of the
> host page size (i.e. block_size), as the minimum conversion size can be
> one page per request.
>
> In addition, VFIO expects the DMA mapping for a specific iova to be
> mapped and unmapped with the same granularity. However, the confidential
> VMs may do partial conversion, e.g. conversion happens on a small region
> within a large region. To prevent such invalid cases and before any
> potential optimization comes out, all operations are performed with 4K
> granularity.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> include/sysemu/guest-memfd-manager.h | 46 +++++
> system/guest-memfd-manager.c | 250 +++++++++++++++++++++++++++
> system/meson.build | 1 +
> 3 files changed, 297 insertions(+)
> create mode 100644 include/sysemu/guest-memfd-manager.h
> create mode 100644 system/guest-memfd-manager.c
>
> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
> new file mode 100644
> index 0000000000..ba4a99b614
> --- /dev/null
> +++ b/include/sysemu/guest-memfd-manager.h
> @@ -0,0 +1,46 @@
> +/*
> + * QEMU guest memfd manager
> + *
> + * Copyright Intel
> + *
> + * Author:
> + * Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
> +
> +#include "sysemu/hostmem.h"
> +
> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
> +
> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass, GUEST_MEMFD_MANAGER)
> +
> +struct GuestMemfdManager {
> + Object parent;
> +
> + /* Managed memory region. */
Do not need this comment. And the period.
> + MemoryRegion *mr;
> +
> + /*
> + * 1-setting of the bit represents the memory is populated (shared).
> + */
Could be 1 line comment.
> + int32_t bitmap_size;
int or unsigned
> + unsigned long *bitmap;
> +
> + /* block size and alignment */
> + uint64_t block_size;
unsigned?
(u)int(32|64)_t make sense for migrations which is not the case (yet?).
Thanks,
> +
> + /* listeners to notify on populate/discard activity. */
Do not really need this comment either imho.
> + QLIST_HEAD(, RamDiscardListener) rdl_list;
> +};
> +
> +struct GuestMemfdManagerClass {
> + ObjectClass parent_class;
> +};
> +
> +#endif
> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
> new file mode 100644
> index 0000000000..d7e105fead
> --- /dev/null
> +++ b/system/guest-memfd-manager.c
> @@ -0,0 +1,250 @@
> +/*
> + * QEMU guest memfd manager
> + *
> + * Copyright Intel
> + *
> + * Author:
> + * Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "sysemu/guest-memfd-manager.h"
> +
> +OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(GuestMemfdManager,
> + guest_memfd_manager,
> + GUEST_MEMFD_MANAGER,
> + OBJECT,
> + { TYPE_RAM_DISCARD_MANAGER },
> + { })
> +
> +static bool guest_memfd_rdm_is_populated(const RamDiscardManager *rdm,
> + const MemoryRegionSection *section)
> +{
> + const GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
> + uint64_t first_bit = section->offset_within_region / gmm->block_size;
> + uint64_t last_bit = first_bit + int128_get64(section->size) / gmm->block_size - 1;
> + unsigned long first_discard_bit;
> +
> + first_discard_bit = find_next_zero_bit(gmm->bitmap, last_bit + 1, first_bit);
> + return first_discard_bit > last_bit;
> +}
> +
> +typedef int (*guest_memfd_section_cb)(MemoryRegionSection *s, void *arg);
> +
> +static int guest_memfd_notify_populate_cb(MemoryRegionSection *section, void *arg)
> +{
> + RamDiscardListener *rdl = arg;
> +
> + return rdl->notify_populate(rdl, section);
> +}
> +
> +static int guest_memfd_notify_discard_cb(MemoryRegionSection *section, void *arg)
> +{
> + RamDiscardListener *rdl = arg;
> +
> + rdl->notify_discard(rdl, section);
> +
> + return 0;
> +}
> +
> +static int guest_memfd_for_each_populated_section(const GuestMemfdManager *gmm,
> + MemoryRegionSection *section,
> + void *arg,
> + guest_memfd_section_cb cb)
> +{
> + unsigned long first_one_bit, last_one_bit;
> + uint64_t offset, size;
> + int ret = 0;
> +
> + first_one_bit = section->offset_within_region / gmm->block_size;
> + first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, first_one_bit);
> +
> + while (first_one_bit < gmm->bitmap_size) {
> + MemoryRegionSection tmp = *section;
> +
> + offset = first_one_bit * gmm->block_size;
> + last_one_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
> + first_one_bit + 1) - 1;
> + size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
This tries calling cb() on bigger chunks even though we say from the
beginning that only page size is supported?
Maybe simplify this for now and extend if/when VFIO learns to split
mappings, or just drop it when we get in-place page state conversion
(which will make this all irrelevant)?
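The simplified version could be roughly (untested sketch, one cb() call per
block instead of coalescing ranges):

    first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
                                  section->offset_within_region / gmm->block_size);

    while (first_one_bit < gmm->bitmap_size) {
        MemoryRegionSection tmp = *section;

        offset = first_one_bit * gmm->block_size;
        if (!memory_region_section_intersect_range(&tmp, offset,
                                                   gmm->block_size)) {
            break;
        }

        ret = cb(&tmp, arg);
        if (ret) {
            break;
        }

        first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
                                      first_one_bit + 1);
    }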
> +
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> + break;
> + }
> +
> + ret = cb(&tmp, arg);
> + if (ret) {
> + break;
> + }
> +
> + first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
> + last_one_bit + 2);
> + }
> +
> + return ret;
> +}
> +
> +static int guest_memfd_for_each_discarded_section(const GuestMemfdManager *gmm,
> + MemoryRegionSection *section,
> + void *arg,
> + guest_memfd_section_cb cb)
> +{
> + unsigned long first_zero_bit, last_zero_bit;
> + uint64_t offset, size;
> + int ret = 0;
> +
> + first_zero_bit = section->offset_within_region / gmm->block_size;
> + first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
> + first_zero_bit);
> +
> + while (first_zero_bit < gmm->bitmap_size) {
> + MemoryRegionSection tmp = *section;
> +
> + offset = first_zero_bit * gmm->block_size;
> + last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
> + first_zero_bit + 1) - 1;
> + size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
> +
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> + break;
> + }
> +
> + ret = cb(&tmp, arg);
> + if (ret) {
> + break;
> + }
> +
> + first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
> + last_zero_bit + 2);
> + }
> +
> + return ret;
> +}
> +
> +static uint64_t guest_memfd_rdm_get_min_granularity(const RamDiscardManager *rdm,
> + const MemoryRegion *mr)
> +{
> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
> +
> + g_assert(mr == gmm->mr);
> + return gmm->block_size;
> +}
> +
> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
> + RamDiscardListener *rdl,
> + MemoryRegionSection *section)
> +{
> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
> + int ret;
> +
> + g_assert(section->mr == gmm->mr);
> + rdl->section = memory_region_section_new_copy(section);
> +
> + QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
> +
> + ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
> + guest_memfd_notify_populate_cb);
> + if (ret) {
> + error_report("%s: Failed to register RAM discard listener: %s", __func__,
> + strerror(-ret));
> + }
> +}
> +
> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm,
> + RamDiscardListener *rdl)
> +{
> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
> + int ret;
> +
> + g_assert(rdl->section);
> + g_assert(rdl->section->mr == gmm->mr);
> +
> + ret = guest_memfd_for_each_populated_section(gmm, rdl->section, rdl,
> + guest_memfd_notify_discard_cb);
> + if (ret) {
> + error_report("%s: Failed to unregister RAM discard listener: %s", __func__,
> + strerror(-ret));
> + }
> +
> + memory_region_section_free_copy(rdl->section);
> + rdl->section = NULL;
> + QLIST_REMOVE(rdl, next);
> +
> +}
> +
> +typedef struct GuestMemfdReplayData {
> + void *fn;
s/void */ReplayRamPopulate/
> + void *opaque;
> +} GuestMemfdReplayData;
> +
> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
> +{
> + struct GuestMemfdReplayData *data = arg;
Drop "struct" here and below.
> + ReplayRamPopulate replay_fn = data->fn;
> +
> + return replay_fn(section, data->opaque);
> +}
> +
> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
> + MemoryRegionSection *section,
> + ReplayRamPopulate replay_fn,
> + void *opaque)
> +{
> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
> +
> + g_assert(section->mr == gmm->mr);
> + return guest_memfd_for_each_populated_section(gmm, section, &data,
> + guest_memfd_rdm_replay_populated_cb);
> +}
> +
> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
> +{
> + struct GuestMemfdReplayData *data = arg;
> + ReplayRamDiscard replay_fn = data->fn;
> +
> + replay_fn(section, data->opaque);
guest_memfd_rdm_replay_populated_cb() checks for errors though.
> +
> + return 0;
> +}
> +
> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
> + MemoryRegionSection *section,
> + ReplayRamDiscard replay_fn,
> + void *opaque)
> +{
> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
> +
> + g_assert(section->mr == gmm->mr);
> + guest_memfd_for_each_discarded_section(gmm, section, &data,
> + guest_memfd_rdm_replay_discarded_cb);
> +}
> +
> +static void guest_memfd_manager_init(Object *obj)
> +{
> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
> +
> + QLIST_INIT(&gmm->rdl_list);
> +}
> +
> +static void guest_memfd_manager_finalize(Object *obj)
> +{
> + g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
> +}
> +
> +static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
> +{
> + RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
> +
> + rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
> + rdmc->register_listener = guest_memfd_rdm_register_listener;
> + rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
> + rdmc->is_populated = guest_memfd_rdm_is_populated;
> + rdmc->replay_populated = guest_memfd_rdm_replay_populated;
> + rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
> +}
> diff --git a/system/meson.build b/system/meson.build
> index 4952f4b2c7..ed4e1137bd 100644
> --- a/system/meson.build
> +++ b/system/meson.build
> @@ -15,6 +15,7 @@ system_ss.add(files(
> 'dirtylimit.c',
> 'dma-helpers.c',
> 'globals.c',
> + 'guest-memfd-manager.c',
> 'memory_mapping.c',
> 'qdev-monitor.c',
> 'qtest.c',
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-08 4:47 ` [PATCH 0/7] Enable shared device assignment Alexey Kardashevskiy
@ 2025-01-08 6:28 ` Chenyi Qiang
2025-01-08 11:38 ` Alexey Kardashevskiy
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-08 6:28 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
Thanks Alexey for your review!
On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
> On 13/12/24 18:08, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") effectively disables device assignment when using guest_memfd.
>> This poses a significant challenge as guest_memfd is essential for
>> confidential guests, thereby blocking device assignment to these VMs.
>> The initial rationale for disabling device assignment was due to stale
>> IOMMU mappings (see Problem section) and the assumption that TEE I/O
>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
>> problem for confidential guests [1]. However, this assumption has proven
>> to be incorrect. TEE I/O relies on the ability to operate devices against
>> "shared" or untrusted memory, which is crucial for device initialization
>> and error recovery scenarios. As a result, the current implementation
>> does
>> not adequately support device assignment for confidential guests,
>> necessitating
>> a reevaluation of the approach to ensure compatibility and functionality.
>>
>> This series enables shared device assignment by notifying VFIO of page
>> conversions using an existing framework named RamDiscardListener.
>> Additionally, there is an ongoing patch set [2] that aims to add 1G page
>> support for guest_memfd. This patch set introduces in-place page
>> conversion,
>> where private and shared memory share the same physical pages as the
>> backend.
>> This development may impact our solution.
>>
>> We presented our solution in the guest_memfd meeting to discuss its
>> compatibility with the new changes and potential future directions
>> (see [3]
>> for more details). The conclusion was that, although our solution may
>> not be
>> the most elegant (see the Limitation section), it is sufficient for
>> now and
>> can be easily adapted to future changes.
>>
>> We are re-posting the patch series with some cleanup and have removed
>> the RFC
>> label for the main enabling patches (1-6). The newly-added patch 7 is
>> still
>> marked as RFC as it tries to resolve some extension concerns related to
>> RamDiscardManager for future usage.
>>
>> The overview of the patches:
>> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
>> with a given range.
>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>> RamDiscardManager, and notify the shared/private state change during
>> conversion.
>> - Patch 7: Try to resolve a semantics concern related to
>> RamDiscardManager
>> i.e. RamDiscardManager is used to manage memory plug/unplug state
>> instead of shared/private state. It would affect future users of
>> RamDiscardManger in confidential VMs. Attach it behind as a RFC
>> patch[4].
>>
>> Changes since last version:
>> - Add a patch to export some generic helper functions from virtio-mem
>> code.
>> - Change the bitmap in guest_memfd_manager from default shared to default
>> private. This keeps alignment with virtio-mem that 1-setting in bitmap
>> represents the populated state and may help to export more generic
>> code
>> if necessary.
>> - Add the helpers to initialize/uninitialize the guest_memfd_manager
>> instance
>> to make it more clear.
>> - Add a patch to distinguish between the shared/private state change and
>> the memory plug/unplug state change in RamDiscardManager.
>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>> chenyi.qiang@intel.com/
>>
>> ---
>>
>> Background
>> ==========
>> Confidential VMs have two classes of memory: shared and private memory.
>> Shared memory is accessible from the host/VMM while private memory is
>> not. Confidential VMs can decide which memory is shared/private and
>> convert memory between shared/private at runtime.
>>
>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>> private memory. The key differences between guest_memfd and normal memfd
>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
>> cannot be mapped, read or written by userspace.
>
> The "cannot be mapped" seems to be not true soon anymore (if not already).
>
> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/
Exactly, allowing guest_memfd to be mmap'ed is the direction. I mentioned
it below together with in-place page conversion. Maybe I should move it up
here to make it clearer.
>
>
>>
>> In QEMU's implementation, shared memory is allocated with normal methods
>> (e.g. mmap or fallocate) while private memory is allocated from
>> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>> allocates new pages from the other side.
>>
[...]
>>
>> One limitation (also discussed in the guest_memfd meeting) is that VFIO
>> expects the DMA mapping for a specific IOVA to be mapped and unmapped
>> with
>> the same granularity. The guest may perform partial conversions, such as
>> converting a small region within a larger region. To prevent such invalid
>> cases, all operations are performed with 4K granularity. The possible
>> solutions we can think of are either to enable VFIO to support partial
>> unmap
>> or to implement an enlightened guest to avoid partial conversion. The
>> former
>> requires complex changes in VFIO, while the latter requires the page
>> conversion to be a guest-enlightened behavior. It is still uncertain
>> which
>> option is a preferred one.
>
> in-place memory conversion is :)
>
>>
>> Testing
>> =======
>> This patch series is tested with the KVM/QEMU branch:
>> KVM: https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20
>> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-
>> snapshot-2024-12-13
>
>
> The branch is gone now? tdx-upstream-snapshot-2024-12-18 seems to have
> these though. Thanks,
Thanks for pointing it out. You're right,
tdx-upstream-snapshot-2024-12-18 is the latest branch. I added the fixup
for patch 1 and forgot to update the reference here.
>
>>
>> To facilitate shared device assignment with the NIC, employ the legacy
>> type1 VFIO with the QEMU command:
>>
>> qemu-system-x86_64 [...]
>> -device vfio-pci,host=XX:XX.X
>>
>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>> 16GB guest needs to adjust the parameter like
>> vfio_iommu_type1.dma_entry_limit=4194304.
>>
>> If use the iommufd-backed VFIO with the qemu command:
>>
>> qemu-system-x86_64 [...]
>> -object iommufd,id=iommufd0 \
>> -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>>
>> No additional adjustment required.
>>
>> Following the bootup of the TD guest, the guest's IP address becomes
>> visible, and iperf is able to successfully send and receive data.
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range
2025-01-08 4:47 ` Alexey Kardashevskiy
@ 2025-01-08 6:41 ` Chenyi Qiang
0 siblings, 0 replies; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-08 6:41 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
> On 13/12/24 18:08, Chenyi Qiang wrote:
>> Rename the helper to memory_region_section_intersect_range() to make it
>> more generic.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> hw/virtio/virtio-mem.c | 32 +++++---------------------------
>> include/exec/memory.h | 13 +++++++++++++
>> system/memory.c | 17 +++++++++++++++++
>> 3 files changed, 35 insertions(+), 27 deletions(-)
>>
>> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
>> index 80ada89551..e3d1ccaeeb 100644
>> --- a/hw/virtio/virtio-mem.c
>> +++ b/hw/virtio/virtio-mem.c
>> @@ -242,28 +242,6 @@ static int
>> virtio_mem_for_each_plugged_range(VirtIOMEM *vmem, void *arg,
>> return ret;
>> }
>> -/*
>> - * Adjust the memory section to cover the intersection with the given
>> range.
>> - *
>> - * Returns false if the intersection is empty, otherwise returns true.
>> - */
>> -static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
>> - uint64_t offset,
>> uint64_t size)
>> -{
>> - uint64_t start = MAX(s->offset_within_region, offset);
>> - uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
>> - offset + size);
>> -
>> - if (end <= start) {
>> - return false;
>> - }
>> -
>> - s->offset_within_address_space += start - s->offset_within_region;
>> - s->offset_within_region = start;
>> - s->size = int128_make64(end - start);
>> - return true;
>> -}
>> -
>> typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void
>> *arg);
>> static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
>> @@ -285,7 +263,7 @@ static int
>> virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
>> first_bit + 1) - 1;
>> size = (last_bit - first_bit + 1) * vmem->block_size;
>> - if (!virtio_mem_intersect_memory_section(&tmp, offset,
>> size)) {
>> + if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> break;
>> }
>> ret = cb(&tmp, arg);
>> @@ -317,7 +295,7 @@ static int
>> virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
>> first_bit + 1) - 1;
>> size = (last_bit - first_bit + 1) * vmem->block_size;
>> - if (!virtio_mem_intersect_memory_section(&tmp, offset,
>> size)) {
>> + if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> break;
>> }
>> ret = cb(&tmp, arg);
>> @@ -353,7 +331,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM
>> *vmem, uint64_t offset,
>> QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
>> MemoryRegionSection tmp = *rdl->section;
>> - if (!virtio_mem_intersect_memory_section(&tmp, offset,
>> size)) {
>> + if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> continue;
>> }
>> rdl->notify_discard(rdl, &tmp);
>> @@ -369,7 +347,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem,
>> uint64_t offset,
>> QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
>> MemoryRegionSection tmp = *rdl->section;
>> - if (!virtio_mem_intersect_memory_section(&tmp, offset,
>> size)) {
>> + if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> continue;
>> }
>> ret = rdl->notify_populate(rdl, &tmp);
>> @@ -386,7 +364,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem,
>> uint64_t offset,
>> if (rdl2 == rdl) {
>> break;
>> }
>> - if (!virtio_mem_intersect_memory_section(&tmp, offset,
>> size)) {
>> + if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> continue;
>> }
>> rdl2->notify_discard(rdl2, &tmp);
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index e5e865d1a9..ec7bc641e8 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -1196,6 +1196,19 @@ MemoryRegionSection
>> *memory_region_section_new_copy(MemoryRegionSection *s);
>> */
>> void memory_region_section_free_copy(MemoryRegionSection *s);
>> +/**
>> + * memory_region_section_intersect_range: Adjust the memory section
>> to cover
>> + * the intersection with the given range.
>> + *
>> + * @s: the #MemoryRegionSection to be adjusted
>> + * @offset: the offset of the given range in the memory region
>> + * @size: the size of the given range
>> + *
>> + * Returns false if the intersection is empty, otherwise returns true.
>> + */
>> +bool memory_region_section_intersect_range(MemoryRegionSection *s,
>> + uint64_t offset, uint64_t
>> size);
>> +
>> /**
>> * memory_region_init: Initialize a memory region
>> *
>> diff --git a/system/memory.c b/system/memory.c
>> index 85f6834cb3..ddcec90f5e 100644
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -2898,6 +2898,23 @@ void
>> memory_region_section_free_copy(MemoryRegionSection *s)
>> g_free(s);
>> }
>> +bool memory_region_section_intersect_range(MemoryRegionSection *s,
>> + uint64_t offset, uint64_t
>> size)
>> +{
>> + uint64_t start = MAX(s->offset_within_region, offset);
>> + uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
>> + offset + size);
>
> imho @end needs to be Int128 and s/MIN/int128_min/, etc to be totally
> correct (although it is going to look horrendous). May be it was alright
> when it was just virtio but now it is a wider API. I understand this is
> cut-n-paste and unlikely scenario of offset+size crossing 1<<64 but
> still. Thanks,
Makes sense. I'll change it in the next version.
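Something along these lines, I suppose (untested, just to capture the
idea of doing the comparison in Int128):

bool memory_region_section_intersect_range(MemoryRegionSection *s,
                                           uint64_t offset, uint64_t size)
{
    Int128 start = int128_max(int128_make64(s->offset_within_region),
                              int128_make64(offset));
    Int128 end = int128_min(int128_add(int128_make64(s->offset_within_region),
                                       s->size),
                            int128_add(int128_make64(offset),
                                       int128_make64(size)));

    if (int128_le(end, start)) {
        return false;
    }

    s->offset_within_address_space +=
        int128_get64(int128_sub(start, int128_make64(s->offset_within_region)));
    s->offset_within_region = int128_get64(start);
    s->size = int128_sub(end, start);
    return true;
}
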
>
>
>> +
>> + if (end <= start) {
>> + return false;
>> + }
>> +
>> + s->offset_within_address_space += start - s->offset_within_region;
>> + s->offset_within_region = start;
>> + s->size = int128_make64(end - start);
>> + return true;
>> +}
>> +
>> bool memory_region_present(MemoryRegion *container, hwaddr addr)
>> {
>> MemoryRegion *mr;
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-08 4:48 ` Alexey Kardashevskiy
@ 2025-01-08 10:56 ` Chenyi Qiang
2025-01-08 11:20 ` Alexey Kardashevskiy
2025-01-13 10:54 ` David Hildenbrand
0 siblings, 2 replies; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-08 10:56 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
> On 13/12/24 18:08, Chenyi Qiang wrote:
>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>> uncoordinated discard") highlighted, some subsystems like VFIO might
>> disable ram block discard. However, guest_memfd relies on the discard
>> operation to perform page conversion between private and shared memory.
>> This can lead to stale IOMMU mapping issue when assigning a hardware
>> device to a confidential VM via shared memory (unprotected memory
>> pages). Blocking shared page discard can solve this problem, but it
>> could cause guests to consume twice the memory with VFIO, which is not
>> acceptable in some cases. An alternative solution is to convey other
>> systems like VFIO to refresh its outdated IOMMU mappings.
>>
>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>> VFIO mappings in relation to VM page assignment. Effectively page
>> conversion is similar to hot-removing a page in one mode and adding it
>> back in the other, so the similar work that needs to happen in response
>> to virtio-mem changes needs to happen for page conversion events.
>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>
>> However, guest_memfd is not an object so it cannot directly implement
>> the RamDiscardManager interface.
>>
>> One solution is to implement the interface in HostMemoryBackend. Any
>
> This sounds about right.
>
>> guest_memfd-backed host memory backend can register itself in the target
>> MemoryRegion. However, this solution doesn't cover the scenario where a
>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>> the virtual BIOS MemoryRegion.
>
> What is this virtual BIOS MemoryRegion exactly? What does it look like
> in "info mtree -f"? Do we really want this memory to be DMAable?
The virtual BIOS shows up in a separate region:
Root memory region: system
0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
...
00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
  0000000100000000-000000017fffffff (prio 0, ram): pc.ram @0000000080000000 KVM
We also considered implementing the interface in HostMemoryBackend, but
implementing it on the guest_memfd MemoryRegion seems more general. We
don't know whether all DMAable memory will belong to a HostMemoryBackend,
although at present it does.
If it is more appropriate to implement it in HostMemoryBackend, I can
change to that approach.
>
>
>> Thus, choose the second option, i.e. define an object type named
>> guest_memfd_manager with RamDiscardManager interface. Upon creation of
>> guest_memfd, a new guest_memfd_manager object can be instantiated and
>> registered to the managed guest_memfd MemoryRegion to handle the page
>> conversion events.
>>
>> In the context of guest_memfd, the discarded state signifies that the
>> page is private, while the populated state indicated that the page is
>> shared. The state of the memory is tracked at the granularity of the
>> host page size (i.e. block_size), as the minimum conversion size can be
>> one page per request.
>>
>> In addition, VFIO expects the DMA mapping for a specific iova to be
>> mapped and unmapped with the same granularity. However, the confidential
>> VMs may do partial conversion, e.g. conversion happens on a small region
>> within a large region. To prevent such invalid cases and before any
>> potential optimization comes out, all operations are performed with 4K
>> granularity.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> include/sysemu/guest-memfd-manager.h | 46 +++++
>> system/guest-memfd-manager.c | 250 +++++++++++++++++++++++++++
>> system/meson.build | 1 +
>> 3 files changed, 297 insertions(+)
>> create mode 100644 include/sysemu/guest-memfd-manager.h
>> create mode 100644 system/guest-memfd-manager.c
>>
>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>> guest-memfd-manager.h
>> new file mode 100644
>> index 0000000000..ba4a99b614
>> --- /dev/null
>> +++ b/include/sysemu/guest-memfd-manager.h
>> @@ -0,0 +1,46 @@
>> +/*
>> + * QEMU guest memfd manager
>> + *
>> + * Copyright Intel
>> + *
>> + * Author:
>> + * Chenyi Qiang <chenyi.qiang@intel.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> later.
>> + * See the COPYING file in the top-level directory
>> + *
>> + */
>> +
>> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
>> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
>> +
>> +#include "sysemu/hostmem.h"
>> +
>> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
>> +
>> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass,
>> GUEST_MEMFD_MANAGER)
>> +
>> +struct GuestMemfdManager {
>> + Object parent;
>> +
>> + /* Managed memory region. */
>
> Do not need this comment. And the period.
[...]
>
>> + MemoryRegion *mr;
>> +
>> + /*
>> + * 1-setting of the bit represents the memory is populated (shared).
>> + */
Will fix it.
>
> Could be 1 line comment.
>
>> + int32_t bitmap_size;
>
> int or unsigned
>
>> + unsigned long *bitmap;
>> +
>> + /* block size and alignment */
>> + uint64_t block_size;
>
> unsigned?
>
> (u)int(32|64)_t make sense for migrations which is not the case (yet?).
> Thanks,
I think these fields would be helpful for future migration support.
Maybe defining them this way is more straightforward.
>
>> +
>> + /* listeners to notify on populate/discard activity. */
>
> Do not really need this comment either imho.
>
I prefer to provide a comment for each field, as virtio-mem does. If it
is not necessary, I will remove the obvious ones.
>> + QLIST_HEAD(, RamDiscardListener) rdl_list;
>> +};
>> +
>> +struct GuestMemfdManagerClass {
>> + ObjectClass parent_class;
>> +};
>> +
>> +#endif
[...]
void *arg,
>> +
>> guest_memfd_section_cb cb)
>> +{
>> + unsigned long first_one_bit, last_one_bit;
>> + uint64_t offset, size;
>> + int ret = 0;
>> +
>> + first_one_bit = section->offset_within_region / gmm->block_size;
>> + first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>> first_one_bit);
>> +
>> + while (first_one_bit < gmm->bitmap_size) {
>> + MemoryRegionSection tmp = *section;
>> +
>> + offset = first_one_bit * gmm->block_size;
>> + last_one_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>> + first_one_bit + 1) - 1;
>> + size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
>
> This tries calling cb() on bigger chunks even though we say from the
> beginning that only page size is supported?
>
> May be simplify this for now and extend if/when VFIO learns to split
> mappings, or just drop it when we get in-place page state convertion
> (which will make this all irrelevant)?
The cb() will be called with big chunks, but it actually does the split
at block_size granularity inside the cb(). See
vfio_ram_discard_notify_populate(), which does the DMA_MAP at the
listener's granularity.
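Roughly, the loop there looks like this (paraphrasing hw/vfio/common.c
from memory, so please double-check against the actual code):

    for (start = section->offset_within_region; start < end; start = next) {
        next = ROUND_UP(start + 1, vrdl->granularity);
        next = MIN(next, end);

        iova = start - section->offset_within_region +
               section->offset_within_address_space;
        vaddr = memory_region_get_ram_ptr(section->mr) + start;

        ret = vfio_dma_map(vrdl->container, iova, next - start,
                           vaddr, section->readonly);
        /* on failure it unmaps what was mapped so far and bails out */
    }

So the DMA mappings VFIO ends up with are always of block_size, no matter
how big the chunk passed to the notifier is.
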
>
>
>> +
>> + if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> + break;
>> + }
>> +
>> + ret = cb(&tmp, arg);
>> + if (ret) {
>> + break;
>> + }
>> +
>> + first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>> + last_one_bit + 2);
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +static int guest_memfd_for_each_discarded_section(const
>> GuestMemfdManager *gmm,
>> + MemoryRegionSection
>> *section,
>> + void *arg,
>> +
>> guest_memfd_section_cb cb)
>> +{
>> + unsigned long first_zero_bit, last_zero_bit;
>> + uint64_t offset, size;
>> + int ret = 0;
>> +
>> + first_zero_bit = section->offset_within_region / gmm->block_size;
>> + first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>> + first_zero_bit);
>> +
>> + while (first_zero_bit < gmm->bitmap_size) {
>> + MemoryRegionSection tmp = *section;
>> +
>> + offset = first_zero_bit * gmm->block_size;
>> + last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>> + first_zero_bit + 1) - 1;
>> + size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
>> +
>> + if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> + break;
>> + }
>> +
>> + ret = cb(&tmp, arg);
>> + if (ret) {
>> + break;
>> + }
>> +
>> + first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm-
>> >bitmap_size,
>> + last_zero_bit + 2);
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +static uint64_t guest_memfd_rdm_get_min_granularity(const
>> RamDiscardManager *rdm,
>> + const
>> MemoryRegion *mr)
>> +{
>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>> +
>> + g_assert(mr == gmm->mr);
>> + return gmm->block_size;
>> +}
>> +
>> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
>> + RamDiscardListener *rdl,
>> + MemoryRegionSection
>> *section)
>> +{
>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>> + int ret;
>> +
>> + g_assert(section->mr == gmm->mr);
>> + rdl->section = memory_region_section_new_copy(section);
>> +
>> + QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
>> +
>> + ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
>> +
>> guest_memfd_notify_populate_cb);
>> + if (ret) {
>> + error_report("%s: Failed to register RAM discard listener:
>> %s", __func__,
>> + strerror(-ret));
>> + }
>> +}
>> +
>> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm,
>> + RamDiscardListener *rdl)
>> +{
>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>> + int ret;
>> +
>> + g_assert(rdl->section);
>> + g_assert(rdl->section->mr == gmm->mr);
>> +
>> + ret = guest_memfd_for_each_populated_section(gmm, rdl->section, rdl,
>> +
>> guest_memfd_notify_discard_cb);
>> + if (ret) {
>> + error_report("%s: Failed to unregister RAM discard listener:
>> %s", __func__,
>> + strerror(-ret));
>> + }
>> +
>> + memory_region_section_free_copy(rdl->section);
>> + rdl->section = NULL;
>> + QLIST_REMOVE(rdl, next);
>> +
>> +}
>> +
>> +typedef struct GuestMemfdReplayData {
>> + void *fn;
>
> s/void */ReplayRamPopulate/
[...]
>
>> + void *opaque;
>> +} GuestMemfdReplayData;
>> +
>> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection
>> *section, void *arg)
>> +{
>> + struct GuestMemfdReplayData *data = arg;
>
> Drop "struct" here and below.
Fixed. Thanks!
>
>> + ReplayRamPopulate replay_fn = data->fn;
>> +
>> + return replay_fn(section, data->opaque);
>> +}
>> +
>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>> *rdm,
>> + MemoryRegionSection
>> *section,
>> + ReplayRamPopulate replay_fn,
>> + void *opaque)
>> +{
>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>> opaque };
>> +
>> + g_assert(section->mr == gmm->mr);
>> + return guest_memfd_for_each_populated_section(gmm, section, &data,
>> +
>> guest_memfd_rdm_replay_populated_cb);
>> +}
>> +
>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>> *section, void *arg)
>> +{
>> + struct GuestMemfdReplayData *data = arg;
>> + ReplayRamDiscard replay_fn = data->fn;
>> +
>> + replay_fn(section, data->opaque);
>
>
> guest_memfd_rdm_replay_populated_cb() checks for errors though.
It follows the current definition of ReplayRamDiscard() and
ReplayRamPopulate(), where replay_discard() doesn't return errors and
replay_populate() does.
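i.e. the prototypes in include/exec/memory.h are (IIRC):

typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);

so there is simply no error to propagate on the discard side.
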
>
>> +
>> + return 0;
>> +}
>> +
>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>> *rdm,
>> + MemoryRegionSection
>> *section,
>> + ReplayRamDiscard replay_fn,
>> + void *opaque)
>> +{
>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>> opaque };
>> +
>> + g_assert(section->mr == gmm->mr);
>> + guest_memfd_for_each_discarded_section(gmm, section, &data,
>> +
>> guest_memfd_rdm_replay_discarded_cb);
>> +}
>> +
>> +static void guest_memfd_manager_init(Object *obj)
>> +{
>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>> +
>> + QLIST_INIT(&gmm->rdl_list);
>> +}
>> +
>> +static void guest_memfd_manager_finalize(Object *obj)
>> +{
>> + g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>
>
> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
Will remove it. Thanks.
>
>
>> +}
>> +
>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
>> +{
>> + RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>> +
>> + rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>> + rdmc->register_listener = guest_memfd_rdm_register_listener;
>> + rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>> + rdmc->is_populated = guest_memfd_rdm_is_populated;
>> + rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>> + rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>> +}
>> diff --git a/system/meson.build b/system/meson.build
>> index 4952f4b2c7..ed4e1137bd 100644
>> --- a/system/meson.build
>> +++ b/system/meson.build
>> @@ -15,6 +15,7 @@ system_ss.add(files(
>> 'dirtylimit.c',
>> 'dma-helpers.c',
>> 'globals.c',
>> + 'guest-memfd-manager.c',
>> 'memory_mapping.c',
>> 'qdev-monitor.c',
>> 'qtest.c',
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-08 10:56 ` Chenyi Qiang
@ 2025-01-08 11:20 ` Alexey Kardashevskiy
2025-01-09 2:11 ` Chenyi Qiang
2025-01-13 10:54 ` David Hildenbrand
1 sibling, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-08 11:20 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 8/1/25 21:56, Chenyi Qiang wrote:
>
>
> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>> disable ram block discard. However, guest_memfd relies on the discard
>>> operation to perform page conversion between private and shared memory.
>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>> device to a confidential VM via shared memory (unprotected memory
>>> pages). Blocking shared page discard can solve this problem, but it
>>> could cause guests to consume twice the memory with VFIO, which is not
>>> acceptable in some cases. An alternative solution is to convey other
>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>
>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>> VFIO mappings in relation to VM page assignment. Effectively page
>>> conversion is similar to hot-removing a page in one mode and adding it
>>> back in the other, so the similar work that needs to happen in response
>>> to virtio-mem changes needs to happen for page conversion events.
>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>
>>> However, guest_memfd is not an object so it cannot directly implement
>>> the RamDiscardManager interface.
>>>
>>> One solution is to implement the interface in HostMemoryBackend. Any
>>
>> This sounds about right.
>>
>>> guest_memfd-backed host memory backend can register itself in the target
>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>> the virtual BIOS MemoryRegion.
>>
>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>> in "info mtree -f"? Do we really want this memory to be DMAable?
>
> virtual BIOS shows in a separate region:
>
> Root memory region: system
> 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
> ...
> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
Looks like a normal MR which can be backed by guest_memfd.
> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
> @0000000080000000 KVM
Anyway if there is no guest_memfd backing it and
memory_region_has_ram_discard_manager() returns false, then the MR is
just going to be mapped for VFIO as usual which seems... alright, right?
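i.e. vfio_listener_region_add() only diverts a section to the
RamDiscardListener machinery when the check below fires, otherwise it
falls through to the usual vfio_dma_map() path (roughly, from memory):

    if (memory_region_has_ram_discard_manager(section->mr)) {
        /* Map only the populated parts, via a per-section listener */
        vfio_register_ram_discard_listener(container, section);
        return;
    }
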
> We also consider to implement the interface in HostMemoryBackend, but
> maybe implement with guest_memfd region is more general. We don't know
> if any DMAable memory would belong to HostMemoryBackend although at
> present it is.
>
> If it is more appropriate to implement it with HostMemoryBackend, I can
> change to this way.
Seems cleaner imho.
>>
>>
>>> Thus, choose the second option, i.e. define an object type named
>>> guest_memfd_manager with RamDiscardManager interface. Upon creation of
>>> guest_memfd, a new guest_memfd_manager object can be instantiated and
>>> registered to the managed guest_memfd MemoryRegion to handle the page
>>> conversion events.
>>>
>>> In the context of guest_memfd, the discarded state signifies that the
>>> page is private, while the populated state indicated that the page is
>>> shared. The state of the memory is tracked at the granularity of the
>>> host page size (i.e. block_size), as the minimum conversion size can be
>>> one page per request.
>>>
>>> In addition, VFIO expects the DMA mapping for a specific iova to be
>>> mapped and unmapped with the same granularity. However, the confidential
>>> VMs may do partial conversion, e.g. conversion happens on a small region
>>> within a large region. To prevent such invalid cases and before any
>>> potential optimization comes out, all operations are performed with 4K
>>> granularity.
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> ---
>>> include/sysemu/guest-memfd-manager.h | 46 +++++
>>> system/guest-memfd-manager.c | 250 +++++++++++++++++++++++++++
>>> system/meson.build | 1 +
>>> 3 files changed, 297 insertions(+)
>>> create mode 100644 include/sysemu/guest-memfd-manager.h
>>> create mode 100644 system/guest-memfd-manager.c
>>>
>>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>>> guest-memfd-manager.h
>>> new file mode 100644
>>> index 0000000000..ba4a99b614
>>> --- /dev/null
>>> +++ b/include/sysemu/guest-memfd-manager.h
>>> @@ -0,0 +1,46 @@
>>> +/*
>>> + * QEMU guest memfd manager
>>> + *
>>> + * Copyright Intel
>>> + *
>>> + * Author:
>>> + * Chenyi Qiang <chenyi.qiang@intel.com>
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>> later.
>>> + * See the COPYING file in the top-level directory
>>> + *
>>> + */
>>> +
>>> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
>>> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
>>> +
>>> +#include "sysemu/hostmem.h"
>>> +
>>> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
>>> +
>>> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass,
>>> GUEST_MEMFD_MANAGER)
>>> +
>>> +struct GuestMemfdManager {
>>> + Object parent;
>>> +
>>> + /* Managed memory region. */
>>
>> Do not need this comment. And the period.
>
> [...]
>
>>
>>> + MemoryRegion *mr;
>>> +
>>> + /*
>>> + * 1-setting of the bit represents the memory is populated (shared).
>>> + */
>
> Will fix it.
>
>>
>> Could be 1 line comment.
>>
>>> + int32_t bitmap_size;
>>
>> int or unsigned
>>
>>> + unsigned long *bitmap;
>>> +
>>> + /* block size and alignment */
>>> + uint64_t block_size;
>>
>> unsigned?
>>
>> (u)int(32|64)_t make sense for migrations which is not the case (yet?).
>> Thanks,
>
> I think these fields would be helpful for future migration support.
> Maybe defining as this way is more straightforward.
>
>>
>>> +
>>> + /* listeners to notify on populate/discard activity. */
>>
>> Do not really need this comment either imho.
>>
>
> I prefer to provide the comment for each field as virtio-mem do. If it
> is not necessary, I would remove those obvious ones.
[bikeshedding on] But the "RamDiscardListener" type name says that already,
why repeat it? :) A comment should add information, not duplicate it. Like
the block_size comment, which mentions "alignment". [bikeshedding off]
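e.g. something like this would at least carry the conversion semantics
on top of what the type name already says:

    /* notified on shared (populate) <-> private (discard) conversions */
    QLIST_HEAD(, RamDiscardListener) rdl_list;
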
>>> + QLIST_HEAD(, RamDiscardListener) rdl_list;
>>> +};
>>> +
>>> +struct GuestMemfdManagerClass {
>>> + ObjectClass parent_class;
>>> +};
>>> +
>>> +#endif
>
> [...]
>
> void *arg,
>>> +
>>> guest_memfd_section_cb cb)
>>> +{
>>> + unsigned long first_one_bit, last_one_bit;
>>> + uint64_t offset, size;
>>> + int ret = 0;
>>> +
>>> + first_one_bit = section->offset_within_region / gmm->block_size;
>>> + first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>> first_one_bit);
>>> +
>>> + while (first_one_bit < gmm->bitmap_size) {
>>> + MemoryRegionSection tmp = *section;
>>> +
>>> + offset = first_one_bit * gmm->block_size;
>>> + last_one_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>>> + first_one_bit + 1) - 1;
>>> + size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
>>
>> This tries calling cb() on bigger chunks even though we say from the
>> beginning that only page size is supported?
>>
>> May be simplify this for now and extend if/when VFIO learns to split
>> mappings, or just drop it when we get in-place page state convertion
>> (which will make this all irrelevant)?
>
> The cb() will call with big chunks but actually it do the split with the
> granularity of block_size in the cb(). See the
> vfio_ram_discard_notify_populate(), which do the DMA_MAP with
> granularity size.
Right, and this all happens inside QEMU - first the code finds bigger
chunks and then it splits them anyway to call the VFIO driver. Seems
pointless to bother about bigger chunks here.
>
>>
>>
>>> +
>>> + if (!memory_region_section_intersect_range(&tmp, offset,
>>> size)) {
>>> + break;
>>> + }
>>> +
>>> + ret = cb(&tmp, arg);
>>> + if (ret) {
>>> + break;
>>> + }
>>> +
>>> + first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>> + last_one_bit + 2);
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +static int guest_memfd_for_each_discarded_section(const
>>> GuestMemfdManager *gmm,
>>> + MemoryRegionSection
>>> *section,
>>> + void *arg,
>>> +
>>> guest_memfd_section_cb cb)
>>> +{
>>> + unsigned long first_zero_bit, last_zero_bit;
>>> + uint64_t offset, size;
>>> + int ret = 0;
>>> +
>>> + first_zero_bit = section->offset_within_region / gmm->block_size;
>>> + first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>>> + first_zero_bit);
>>> +
>>> + while (first_zero_bit < gmm->bitmap_size) {
>>> + MemoryRegionSection tmp = *section;
>>> +
>>> + offset = first_zero_bit * gmm->block_size;
>>> + last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>> + first_zero_bit + 1) - 1;
>>> + size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
>>> +
>>> + if (!memory_region_section_intersect_range(&tmp, offset,
>>> size)) {
>>> + break;
>>> + }
>>> +
>>> + ret = cb(&tmp, arg);
>>> + if (ret) {
>>> + break;
>>> + }
>>> +
>>> + first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm-
>>>> bitmap_size,
>>> + last_zero_bit + 2);
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +static uint64_t guest_memfd_rdm_get_min_granularity(const
>>> RamDiscardManager *rdm,
>>> + const
>>> MemoryRegion *mr)
>>> +{
>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>> +
>>> + g_assert(mr == gmm->mr);
>>> + return gmm->block_size;
>>> +}
>>> +
>>> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
>>> + RamDiscardListener *rdl,
>>> + MemoryRegionSection
>>> *section)
>>> +{
>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>> + int ret;
>>> +
>>> + g_assert(section->mr == gmm->mr);
>>> + rdl->section = memory_region_section_new_copy(section);
>>> +
>>> + QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
>>> +
>>> + ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
>>> +
>>> guest_memfd_notify_populate_cb);
>>> + if (ret) {
>>> + error_report("%s: Failed to register RAM discard listener:
>>> %s", __func__,
>>> + strerror(-ret));
>>> + }
>>> +}
>>> +
>>> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm,
>>> + RamDiscardListener *rdl)
>>> +{
>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>> + int ret;
>>> +
>>> + g_assert(rdl->section);
>>> + g_assert(rdl->section->mr == gmm->mr);
>>> +
>>> + ret = guest_memfd_for_each_populated_section(gmm, rdl->section, rdl,
>>> +
>>> guest_memfd_notify_discard_cb);
>>> + if (ret) {
>>> + error_report("%s: Failed to unregister RAM discard listener:
>>> %s", __func__,
>>> + strerror(-ret));
>>> + }
>>> +
>>> + memory_region_section_free_copy(rdl->section);
>>> + rdl->section = NULL;
>>> + QLIST_REMOVE(rdl, next);
>>> +
>>> +}
>>> +
>>> +typedef struct GuestMemfdReplayData {
>>> + void *fn;
>>
>> s/void */ReplayRamPopulate/
>
> [...]
>
>>
>>> + void *opaque;
>>> +} GuestMemfdReplayData;
>>> +
>>> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection
>>> *section, void *arg)
>>> +{
>>> + struct GuestMemfdReplayData *data = arg;
>>
>> Drop "struct" here and below.
>
> Fixed. Thanks!
>
>>
>>> + ReplayRamPopulate replay_fn = data->fn;
>>> +
>>> + return replay_fn(section, data->opaque);
>>> +}
>>> +
>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>>> *rdm,
>>> + MemoryRegionSection
>>> *section,
>>> + ReplayRamPopulate replay_fn,
>>> + void *opaque)
>>> +{
>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>> opaque };
>>> +
>>> + g_assert(section->mr == gmm->mr);
>>> + return guest_memfd_for_each_populated_section(gmm, section, &data,
>>> +
>>> guest_memfd_rdm_replay_populated_cb);
>>> +}
>>> +
>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>> *section, void *arg)
>>> +{
>>> + struct GuestMemfdReplayData *data = arg;
>>> + ReplayRamDiscard replay_fn = data->fn;
>>> +
>>> + replay_fn(section, data->opaque);
>>
>>
>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>
> It follows current definiton of ReplayRamDiscard() and
> ReplayRamPopulate() where replay_discard() doesn't return errors and
> replay_populate() returns errors.
A trace would be appropriate imho. Thanks,
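Something like the below would do (trace name made up; it would also need
a matching line in trace-events):

    trace_guest_memfd_rdm_replay_discarded(section->offset_within_address_space,
                                           int128_get64(section->size));
    replay_fn(section, data->opaque);
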
>>
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>>> *rdm,
>>> + MemoryRegionSection
>>> *section,
>>> + ReplayRamDiscard replay_fn,
>>> + void *opaque)
>>> +{
>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>> opaque };
>>> +
>>> + g_assert(section->mr == gmm->mr);
>>> + guest_memfd_for_each_discarded_section(gmm, section, &data,
>>> +
>>> guest_memfd_rdm_replay_discarded_cb);
>>> +}
>>> +
>>> +static void guest_memfd_manager_init(Object *obj)
>>> +{
>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>> +
>>> + QLIST_INIT(&gmm->rdl_list);
>>> +}
>>> +
>>> +static void guest_memfd_manager_finalize(Object *obj)
>>> +{
>>> + g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>
>>
>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
>
> Will remove it. Thanks.
>
>>
>>
>>> +}
>>> +
>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
>>> +{
>>> + RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>> +
>>> + rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>> + rdmc->register_listener = guest_memfd_rdm_register_listener;
>>> + rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>> + rdmc->is_populated = guest_memfd_rdm_is_populated;
>>> + rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>> + rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>> +}
>>> diff --git a/system/meson.build b/system/meson.build
>>> index 4952f4b2c7..ed4e1137bd 100644
>>> --- a/system/meson.build
>>> +++ b/system/meson.build
>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>> 'dirtylimit.c',
>>> 'dma-helpers.c',
>>> 'globals.c',
>>> + 'guest-memfd-manager.c',
>>> 'memory_mapping.c',
>>> 'qdev-monitor.c',
>>> 'qtest.c',
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-08 6:28 ` Chenyi Qiang
@ 2025-01-08 11:38 ` Alexey Kardashevskiy
2025-01-09 7:52 ` Chenyi Qiang
0 siblings, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-08 11:38 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 8/1/25 17:28, Chenyi Qiang wrote:
> Thanks Alexey for your review!
>
> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>> discard") effectively disables device assignment when using guest_memfd.
>>> This poses a significant challenge as guest_memfd is essential for
>>> confidential guests, thereby blocking device assignment to these VMs.
>>> The initial rationale for disabling device assignment was due to stale
>>> IOMMU mappings (see Problem section) and the assumption that TEE I/O
>>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
>>> problem for confidential guests [1]. However, this assumption has proven
>>> to be incorrect. TEE I/O relies on the ability to operate devices against
>>> "shared" or untrusted memory, which is crucial for device initialization
>>> and error recovery scenarios. As a result, the current implementation
>>> does
>>> not adequately support device assignment for confidential guests,
>>> necessitating
>>> a reevaluation of the approach to ensure compatibility and functionality.
>>>
>>> This series enables shared device assignment by notifying VFIO of page
>>> conversions using an existing framework named RamDiscardListener.
>>> Additionally, there is an ongoing patch set [2] that aims to add 1G page
>>> support for guest_memfd. This patch set introduces in-place page
>>> conversion,
>>> where private and shared memory share the same physical pages as the
>>> backend.
>>> This development may impact our solution.
>>>
>>> We presented our solution in the guest_memfd meeting to discuss its
>>> compatibility with the new changes and potential future directions
>>> (see [3]
>>> for more details). The conclusion was that, although our solution may
>>> not be
>>> the most elegant (see the Limitation section), it is sufficient for
>>> now and
>>> can be easily adapted to future changes.
>>>
>>> We are re-posting the patch series with some cleanup and have removed
>>> the RFC
>>> label for the main enabling patches (1-6). The newly-added patch 7 is
>>> still
>>> marked as RFC as it tries to resolve some extension concerns related to
>>> RamDiscardManager for future usage.
>>>
>>> The overview of the patches:
>>> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
>>> with a given range.
>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>> RamDiscardManager, and notify the shared/private state change during
>>> conversion.
>>> - Patch 7: Try to resolve a semantics concern related to
>>> RamDiscardManager
>>> i.e. RamDiscardManager is used to manage memory plug/unplug state
>>> instead of shared/private state. It would affect future users of
>>> RamDiscardManger in confidential VMs. Attach it behind as a RFC
>>> patch[4].
>>>
>>> Changes since last version:
>>> - Add a patch to export some generic helper functions from virtio-mem
>>> code.
>>> - Change the bitmap in guest_memfd_manager from default shared to default
>>> private. This keeps alignment with virtio-mem that 1-setting in bitmap
>>> represents the populated state and may help to export more generic
>>> code
>>> if necessary.
>>> - Add the helpers to initialize/uninitialize the guest_memfd_manager
>>> instance
>>> to make it more clear.
>>> - Add a patch to distinguish between the shared/private state change and
>>> the memory plug/unplug state change in RamDiscardManager.
>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>>> chenyi.qiang@intel.com/
>>>
>>> ---
>>>
>>> Background
>>> ==========
>>> Confidential VMs have two classes of memory: shared and private memory.
>>> Shared memory is accessible from the host/VMM while private memory is
>>> not. Confidential VMs can decide which memory is shared/private and
>>> convert memory between shared/private at runtime.
>>>
>>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>>> private memory. The key differences between guest_memfd and normal memfd
>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
>>> cannot be mapped, read or written by userspace.
>>
>> The "cannot be mapped" seems to be not true soon anymore (if not already).
>>
>> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/
>
> Exactly, allowing guest_memfd to do mmap is the direction. I mentioned
> it below with in-place page conversion. Maybe I would move it here to
> make it more clear.
>
>>
>>
>>>
>>> In QEMU's implementation, shared memory is allocated with normal methods
>>> (e.g. mmap or fallocate) while private memory is allocated from
>>> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
>>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>>> allocates new pages from the other side.
>>>
>
> [...]
>
>>>
>>> One limitation (also discussed in the guest_memfd meeting) is that VFIO
>>> expects the DMA mapping for a specific IOVA to be mapped and unmapped
>>> with
>>> the same granularity. The guest may perform partial conversions, such as
>>> converting a small region within a larger region. To prevent such invalid
>>> cases, all operations are performed with 4K granularity. The possible
>>> solutions we can think of are either to enable VFIO to support partial
>>> unmap
BTW the old VFIO type1 does not split mappings, but iommufd seems to be
capable of it - there is iopt_area_split(). What happens if you try
unmapping a smaller chunk that does not exactly match any mapped chunk?
Thanks,
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-08 11:20 ` Alexey Kardashevskiy
@ 2025-01-09 2:11 ` Chenyi Qiang
2025-01-09 2:55 ` Alexey Kardashevskiy
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-09 2:11 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>
>
> On 8/1/25 21:56, Chenyi Qiang wrote:
>>
>>
>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>> operation to perform page conversion between private and shared memory.
>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>> device to a confidential VM via shared memory (unprotected memory
>>>> pages). Blocking shared page discard can solve this problem, but it
>>>> could cause guests to consume twice the memory with VFIO, which is not
>>>> acceptable in some cases. An alternative solution is to convey other
>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>
>>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>> back in the other, so the similar work that needs to happen in response
>>>> to virtio-mem changes needs to happen for page conversion events.
>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>
>>>> However, guest_memfd is not an object so it cannot directly implement
>>>> the RamDiscardManager interface.
>>>>
>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>
>>> This sounds about right.
>>>
>>>> guest_memfd-backed host memory backend can register itself in the
>>>> target
>>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>>> the virtual BIOS MemoryRegion.
>>>
>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>
>> virtual BIOS shows in a separate region:
>>
>> Root memory region: system
>> 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>> ...
>> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>
> Looks like a normal MR which can be backed by guest_memfd.
Yes, the virtual BIOS memory region is initialized by
memory_region_init_ram_guest_memfd(), which will be backed by a guest_memfd.
The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
BIOS image will be loaded and then copied to the private region. After that,
the loaded image will be discarded and this region becomes useless. So I
feel like this virtual BIOS should not be backed by guest_memfd?
>
>> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>> @0000000080000000 KVM
>
> Anyway if there is no guest_memfd backing it and
> memory_region_has_ram_discard_manager() returns false, then the MR is
> just going to be mapped for VFIO as usual which seems... alright, right?
Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
in guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
If we go with the HostMemoryBackend instead of guest_memfd_manager, this
MR would be mapped by VFIO. Maybe we need to avoid such a vBIOS mapping,
or just ignore it since the MR is useless (though that doesn't look great).
>
>
>> We also consider to implement the interface in HostMemoryBackend, but
>> maybe implement with guest_memfd region is more general. We don't know
>> if any DMAable memory would belong to HostMemoryBackend although at
>> present it is.
>>
>> If it is more appropriate to implement it with HostMemoryBackend, I can
>> change to this way.
>
> Seems cleaner imho.
I can go this way.
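(Concretely, the wiring could look something like this -- just a sketch,
with the other TypeInfo fields staying as they are today:)

static const TypeInfo host_memory_backend_info = {
    .name = TYPE_MEMORY_BACKEND,
    .parent = TYPE_OBJECT,
    .abstract = true,
    .instance_size = sizeof(HostMemoryBackend),
    .class_size = sizeof(HostMemoryBackendClass),
    /* ... existing init/finalize hooks unchanged ... */
    .interfaces = (InterfaceInfo[]) {
        { TYPE_USER_CREATABLE },
        { TYPE_RAM_DISCARD_MANAGER },   /* new: backend acts as the RDM */
        { }
    },
};

plus moving the bitmap, block_size and rdl_list into HostMemoryBackend and
having the backend provide the RamDiscardManagerClass callbacks.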
>
>>>
>>>
>>>> Thus, choose the second option, i.e. define an object type named
>>>> guest_memfd_manager with RamDiscardManager interface. Upon creation of
>>>> guest_memfd, a new guest_memfd_manager object can be instantiated and
>>>> registered to the managed guest_memfd MemoryRegion to handle the page
>>>> conversion events.
>>>>
>>>> In the context of guest_memfd, the discarded state signifies that the
>>>> page is private, while the populated state indicated that the page is
>>>> shared. The state of the memory is tracked at the granularity of the
>>>> host page size (i.e. block_size), as the minimum conversion size can be
>>>> one page per request.
>>>>
>>>> In addition, VFIO expects the DMA mapping for a specific iova to be
>>>> mapped and unmapped with the same granularity. However, the
>>>> confidential
>>>> VMs may do partial conversion, e.g. conversion happens on a small
>>>> region
>>>> within a large region. To prevent such invalid cases and before any
>>>> potential optimization comes out, all operations are performed with 4K
>>>> granularity.
>>>>
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>> ---
>>>> include/sysemu/guest-memfd-manager.h | 46 +++++
>>>> system/guest-memfd-manager.c | 250 ++++++++++++++++++++++
>>>> +++++
>>>> system/meson.build | 1 +
>>>> 3 files changed, 297 insertions(+)
>>>> create mode 100644 include/sysemu/guest-memfd-manager.h
>>>> create mode 100644 system/guest-memfd-manager.c
>>>>
>>>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>>>> guest-memfd-manager.h
>>>> new file mode 100644
>>>> index 0000000000..ba4a99b614
>>>> --- /dev/null
>>>> +++ b/include/sysemu/guest-memfd-manager.h
>>>> @@ -0,0 +1,46 @@
>>>> +/*
>>>> + * QEMU guest memfd manager
>>>> + *
>>>> + * Copyright Intel
>>>> + *
>>>> + * Author:
>>>> + * Chenyi Qiang <chenyi.qiang@intel.com>
>>>> + *
>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>>> later.
>>>> + * See the COPYING file in the top-level directory
>>>> + *
>>>> + */
>>>> +
>>>> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
>>>> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
>>>> +
>>>> +#include "sysemu/hostmem.h"
>>>> +
>>>> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
>>>> +
>>>> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass,
>>>> GUEST_MEMFD_MANAGER)
>>>> +
>>>> +struct GuestMemfdManager {
>>>> + Object parent;
>>>> +
>>>> + /* Managed memory region. */
>>>
>>> Do not need this comment. And the period.
>>
>> [...]
>>
>>>
>>>> + MemoryRegion *mr;
>>>> +
>>>> + /*
>>>> + * 1-setting of the bit represents the memory is populated
>>>> (shared).
>>>> + */
>>
>> Will fix it.
>>
>>>
>>> Could be 1 line comment.
>>>
>>>> + int32_t bitmap_size;
>>>
>>> int or unsigned
>>>
>>>> + unsigned long *bitmap;
>>>> +
>>>> + /* block size and alignment */
>>>> + uint64_t block_size;
>>>
>>> unsigned?
>>>
>>> (u)int(32|64)_t make sense for migrations which is not the case (yet?).
>>> Thanks,
>>
>> I think these fields would be helpful for future migration support.
>> Maybe defining as this way is more straightforward.
>>
>>>
>>>> +
>>>> + /* listeners to notify on populate/discard activity. */
>>>
>>> Do not really need this comment either imho.
>>>
>>
>> I prefer to provide the comment for each field as virtio-mem do. If it
>> is not necessary, I would remove those obvious ones.
>
> [bikeshedding on] But the "RamDiscardListener" word says that already,
> why repeating? :) It should add information, not duplicate. Like the
> block_size comment which mentions "alignment" [bikeshedding off]
Got it. Thanks!
>
>>>> + QLIST_HEAD(, RamDiscardListener) rdl_list;
>>>> +};
>>>> +
>>>> +struct GuestMemfdManagerClass {
>>>> + ObjectClass parent_class;
>>>> +};
>>>> +
>>>> +#endif
>>
>> [...]
>>
>> void *arg,
>>>> +
>>>> guest_memfd_section_cb cb)
>>>> +{
>>>> + unsigned long first_one_bit, last_one_bit;
>>>> + uint64_t offset, size;
>>>> + int ret = 0;
>>>> +
>>>> + first_one_bit = section->offset_within_region / gmm->block_size;
>>>> + first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>> first_one_bit);
>>>> +
>>>> + while (first_one_bit < gmm->bitmap_size) {
>>>> + MemoryRegionSection tmp = *section;
>>>> +
>>>> + offset = first_one_bit * gmm->block_size;
>>>> + last_one_bit = find_next_zero_bit(gmm->bitmap, gmm-
>>>> >bitmap_size,
>>>> + first_one_bit + 1) - 1;
>>>> + size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
>>>
>>> This tries calling cb() on bigger chunks even though we say from the
>>> beginning that only page size is supported?
>>>
>>> May be simplify this for now and extend if/when VFIO learns to split
>>> mappings, or just drop it when we get in-place page state convertion
>>> (which will make this all irrelevant)?
>>
>> The cb() will call with big chunks but actually it do the split with the
>> granularity of block_size in the cb(). See the
>> vfio_ram_discard_notify_populate(), which do the DMA_MAP with
>> granularity size.
>
>
> Right, and this all happens inside QEMU - first the code finds bigger
> chunks and then it splits them anyway to call the VFIO driver. Seems
> pointless to bother about bigger chunks here.
>
>>
>>>
>>>
>>>> +
>>>> + if (!memory_region_section_intersect_range(&tmp, offset,
>>>> size)) {
>>>> + break;
>>>> + }
>>>> +
>>>> + ret = cb(&tmp, arg);
>>>> + if (ret) {
>>>> + break;
>>>> + }
>>>> +
>>>> + first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>> + last_one_bit + 2);
>>>> + }
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static int guest_memfd_for_each_discarded_section(const
>>>> GuestMemfdManager *gmm,
>>>> + MemoryRegionSection
>>>> *section,
>>>> + void *arg,
>>>> +
>>>> guest_memfd_section_cb cb)
>>>> +{
>>>> + unsigned long first_zero_bit, last_zero_bit;
>>>> + uint64_t offset, size;
>>>> + int ret = 0;
>>>> +
>>>> + first_zero_bit = section->offset_within_region / gmm->block_size;
>>>> + first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>>>> + first_zero_bit);
>>>> +
>>>> + while (first_zero_bit < gmm->bitmap_size) {
>>>> + MemoryRegionSection tmp = *section;
>>>> +
>>>> + offset = first_zero_bit * gmm->block_size;
>>>> + last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>> + first_zero_bit + 1) - 1;
>>>> + size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
>>>> +
>>>> + if (!memory_region_section_intersect_range(&tmp, offset,
>>>> size)) {
>>>> + break;
>>>> + }
>>>> +
>>>> + ret = cb(&tmp, arg);
>>>> + if (ret) {
>>>> + break;
>>>> + }
>>>> +
>>>> + first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm-
>>>>> bitmap_size,
>>>> + last_zero_bit + 2);
>>>> + }
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static uint64_t guest_memfd_rdm_get_min_granularity(const
>>>> RamDiscardManager *rdm,
>>>> + const
>>>> MemoryRegion *mr)
>>>> +{
>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> +
>>>> + g_assert(mr == gmm->mr);
>>>> + return gmm->block_size;
>>>> +}
>>>> +
>>>> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
>>>> + RamDiscardListener *rdl,
>>>> + MemoryRegionSection
>>>> *section)
>>>> +{
>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> + int ret;
>>>> +
>>>> + g_assert(section->mr == gmm->mr);
>>>> + rdl->section = memory_region_section_new_copy(section);
>>>> +
>>>> + QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
>>>> +
>>>> + ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
>>>> +
>>>> guest_memfd_notify_populate_cb);
>>>> + if (ret) {
>>>> + error_report("%s: Failed to register RAM discard listener:
>>>> %s", __func__,
>>>> + strerror(-ret));
>>>> + }
>>>> +}
>>>> +
>>>> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager
>>>> *rdm,
>>>> + RamDiscardListener
>>>> *rdl)
>>>> +{
>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> + int ret;
>>>> +
>>>> + g_assert(rdl->section);
>>>> + g_assert(rdl->section->mr == gmm->mr);
>>>> +
>>>> + ret = guest_memfd_for_each_populated_section(gmm, rdl->section,
>>>> rdl,
>>>> +
>>>> guest_memfd_notify_discard_cb);
>>>> + if (ret) {
>>>> + error_report("%s: Failed to unregister RAM discard listener:
>>>> %s", __func__,
>>>> + strerror(-ret));
>>>> + }
>>>> +
>>>> + memory_region_section_free_copy(rdl->section);
>>>> + rdl->section = NULL;
>>>> + QLIST_REMOVE(rdl, next);
>>>> +
>>>> +}
>>>> +
>>>> +typedef struct GuestMemfdReplayData {
>>>> + void *fn;
>>>
>>> s/void */ReplayRamPopulate/
>>
>> [...]
>>
>>>
>>>> + void *opaque;
>>>> +} GuestMemfdReplayData;
>>>> +
>>>> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection
>>>> *section, void *arg)
>>>> +{
>>>> + struct GuestMemfdReplayData *data = arg;
>>>
>>> Drop "struct" here and below.
>>
>> Fixed. Thanks!
>>
>>>
>>>> + ReplayRamPopulate replay_fn = data->fn;
>>>> +
>>>> + return replay_fn(section, data->opaque);
>>>> +}
>>>> +
>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>>>> *rdm,
>>>> + MemoryRegionSection
>>>> *section,
>>>> + ReplayRamPopulate
>>>> replay_fn,
>>>> + void *opaque)
>>>> +{
>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>> opaque };
>>>> +
>>>> + g_assert(section->mr == gmm->mr);
>>>> + return guest_memfd_for_each_populated_section(gmm, section, &data,
>>>> +
>>>> guest_memfd_rdm_replay_populated_cb);
>>>> +}
>>>> +
>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>> *section, void *arg)
>>>> +{
>>>> + struct GuestMemfdReplayData *data = arg;
>>>> + ReplayRamDiscard replay_fn = data->fn;
>>>> +
>>>> + replay_fn(section, data->opaque);
>>>
>>>
>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>
>> It follows current definiton of ReplayRamDiscard() and
>> ReplayRamPopulate() where replay_discard() doesn't return errors and
>> replay_populate() returns errors.
>
> A trace would be appropriate imho. Thanks,
Sorry, I can't quite follow. What kind of info should be traced? The
errors returned by replay_populate()?
>
>>>
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>>>> *rdm,
>>>> + MemoryRegionSection
>>>> *section,
>>>> + ReplayRamDiscard
>>>> replay_fn,
>>>> + void *opaque)
>>>> +{
>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>> opaque };
>>>> +
>>>> + g_assert(section->mr == gmm->mr);
>>>> + guest_memfd_for_each_discarded_section(gmm, section, &data,
>>>> +
>>>> guest_memfd_rdm_replay_discarded_cb);
>>>> +}
>>>> +
>>>> +static void guest_memfd_manager_init(Object *obj)
>>>> +{
>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>> +
>>>> + QLIST_INIT(&gmm->rdl_list);
>>>> +}
>>>> +
>>>> +static void guest_memfd_manager_finalize(Object *obj)
>>>> +{
>>>> + g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>
>>>
>>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
>>
>> Will remove it. Thanks.
>>
>>>
>>>
>>>> +}
>>>> +
>>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void
>>>> *data)
>>>> +{
>>>> + RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>> +
>>>> + rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>>> + rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>> + rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>>> + rdmc->is_populated = guest_memfd_rdm_is_populated;
>>>> + rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>>> + rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>>> +}
>>>> diff --git a/system/meson.build b/system/meson.build
>>>> index 4952f4b2c7..ed4e1137bd 100644
>>>> --- a/system/meson.build
>>>> +++ b/system/meson.build
>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>> 'dirtylimit.c',
>>>> 'dma-helpers.c',
>>>> 'globals.c',
>>>> + 'guest-memfd-manager.c',
>>>> 'memory_mapping.c',
>>>> 'qdev-monitor.c',
>>>> 'qtest.c',
>>>
>>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-09 2:11 ` Chenyi Qiang
@ 2025-01-09 2:55 ` Alexey Kardashevskiy
2025-01-09 4:29 ` Chenyi Qiang
0 siblings, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-09 2:55 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 9/1/25 13:11, Chenyi Qiang wrote:
>
>
> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>
>>
>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>> operation to perform page conversion between private and shared memory.
>>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>> pages). Blocking shared page discard can solve this problem, but it
>>>>> could cause guests to consume twice the memory with VFIO, which is not
>>>>> acceptable in some cases. An alternative solution is to convey other
>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>
>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>>> back in the other, so the similar work that needs to happen in response
>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>
>>>>> However, guest_memfd is not an object so it cannot directly implement
>>>>> the RamDiscardManager interface.
>>>>>
>>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>>
>>>> This sounds about right.
>>>>
>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>> target
>>>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>>>> the virtual BIOS MemoryRegion.
>>>>
>>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>
>>> virtual BIOS shows in a separate region:
>>>
>>> Root memory region: system
>>> 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>> ...
>>> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>
>> Looks like a normal MR which can be backed by guest_memfd.
>
> Yes, virtual BIOS memory region is initialized by
> memory_region_init_ram_guest_memfd() which will be backed by a guest_memfd.
>
> The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
> BIOS image will be loaded and then copied to private region.
> After that,
> the loaded image will be discarded and this region become useless.
I'd think it is loaded as "struct Rom" and then copied to the
MR-ram_guest_memfd() which does not leave MR useless - we still see
"pc.bios" in the list so it is not discarded. What piece of code are you
referring to exactly?
> So I
> feel like this virtual BIOS should not be backed by guest_memfd?
From the above it sounds like the opposite, i.e. it should :)
>>
>>> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>> @0000000080000000 KVM
>>
>> Anyway if there is no guest_memfd backing it and
>> memory_region_has_ram_discard_manager() returns false, then the MR is
>> just going to be mapped for VFIO as usual which seems... alright, right?
>
> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>
> If we go with the HostMemoryBackend instead of guest_memfd_manager, this
> MR would be mapped by VFIO. Maybe need to avoid such vBIOS mapping, or
> just ignore it since the MR is useless (but looks not so good).
Sorry I am missing necessary details here, let's figure out the above.
>
>>
>>
>>> We also consider to implement the interface in HostMemoryBackend, but
>>> maybe implement with guest_memfd region is more general. We don't know
>>> if any DMAable memory would belong to HostMemoryBackend although at
>>> present it is.
>>>
>>> If it is more appropriate to implement it with HostMemoryBackend, I can
>>> change to this way.
>>
>> Seems cleaner imho.
>
> I can go this way.
>
>>
>>>>
>>>>
>>>>> Thus, choose the second option, i.e. define an object type named
>>>>> guest_memfd_manager with RamDiscardManager interface. Upon creation of
>>>>> guest_memfd, a new guest_memfd_manager object can be instantiated and
>>>>> registered to the managed guest_memfd MemoryRegion to handle the page
>>>>> conversion events.
>>>>>
>>>>> In the context of guest_memfd, the discarded state signifies that the
>>>>> page is private, while the populated state indicated that the page is
>>>>> shared. The state of the memory is tracked at the granularity of the
>>>>> host page size (i.e. block_size), as the minimum conversion size can be
>>>>> one page per request.
>>>>>
>>>>> In addition, VFIO expects the DMA mapping for a specific iova to be
>>>>> mapped and unmapped with the same granularity. However, the
>>>>> confidential
>>>>> VMs may do partial conversion, e.g. conversion happens on a small
>>>>> region
>>>>> within a large region. To prevent such invalid cases and before any
>>>>> potential optimization comes out, all operations are performed with 4K
>>>>> granularity.
>>>>>
>>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>>> ---
>>>>> include/sysemu/guest-memfd-manager.h | 46 +++++
>>>>> system/guest-memfd-manager.c | 250 ++++++++++++++++++++++
>>>>> +++++
>>>>> system/meson.build | 1 +
>>>>> 3 files changed, 297 insertions(+)
>>>>> create mode 100644 include/sysemu/guest-memfd-manager.h
>>>>> create mode 100644 system/guest-memfd-manager.c
>>>>>
>>>>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>>>>> guest-memfd-manager.h
>>>>> new file mode 100644
>>>>> index 0000000000..ba4a99b614
>>>>> --- /dev/null
>>>>> +++ b/include/sysemu/guest-memfd-manager.h
>>>>> @@ -0,0 +1,46 @@
>>>>> +/*
>>>>> + * QEMU guest memfd manager
>>>>> + *
>>>>> + * Copyright Intel
>>>>> + *
>>>>> + * Author:
>>>>> + * Chenyi Qiang <chenyi.qiang@intel.com>
>>>>> + *
>>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>>>> later.
>>>>> + * See the COPYING file in the top-level directory
>>>>> + *
>>>>> + */
>>>>> +
>>>>> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
>>>>> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
>>>>> +
>>>>> +#include "sysemu/hostmem.h"
>>>>> +
>>>>> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
>>>>> +
>>>>> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass,
>>>>> GUEST_MEMFD_MANAGER)
>>>>> +
>>>>> +struct GuestMemfdManager {
>>>>> + Object parent;
>>>>> +
>>>>> + /* Managed memory region. */
>>>>
>>>> Do not need this comment. And the period.
>>>
>>> [...]
>>>
>>>>
>>>>> + MemoryRegion *mr;
>>>>> +
>>>>> + /*
>>>>> + * 1-setting of the bit represents the memory is populated
>>>>> (shared).
>>>>> + */
>>>
>>> Will fix it.
>>>
>>>>
>>>> Could be 1 line comment.
>>>>
>>>>> + int32_t bitmap_size;
>>>>
>>>> int or unsigned
>>>>
>>>>> + unsigned long *bitmap;
>>>>> +
>>>>> + /* block size and alignment */
>>>>> + uint64_t block_size;
>>>>
>>>> unsigned?
>>>>
>>>> (u)int(32|64)_t make sense for migrations which is not the case (yet?).
>>>> Thanks,
>>>
>>> I think these fields would be helpful for future migration support.
>>> Maybe defining as this way is more straightforward.
>>>
>>>>
>>>>> +
>>>>> + /* listeners to notify on populate/discard activity. */
>>>>
>>>> Do not really need this comment either imho.
>>>>
>>>
>>> I prefer to provide the comment for each field as virtio-mem do. If it
>>> is not necessary, I would remove those obvious ones.
>>
>> [bikeshedding on] But the "RamDiscardListener" word says that already,
>> why repeating? :) It should add information, not duplicate. Like the
>> block_size comment which mentions "alignment" [bikeshedding off]
>
> Got it. Thanks!
>
>>
>>>>> + QLIST_HEAD(, RamDiscardListener) rdl_list;
>>>>> +};
>>>>> +
>>>>> +struct GuestMemfdManagerClass {
>>>>> + ObjectClass parent_class;
>>>>> +};
>>>>> +
>>>>> +#endif
>>>
>>> [...]
>>>
>>> void *arg,
>>>>> +
>>>>> guest_memfd_section_cb cb)
>>>>> +{
>>>>> + unsigned long first_one_bit, last_one_bit;
>>>>> + uint64_t offset, size;
>>>>> + int ret = 0;
>>>>> +
>>>>> + first_one_bit = section->offset_within_region / gmm->block_size;
>>>>> + first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>>> first_one_bit);
>>>>> +
>>>>> + while (first_one_bit < gmm->bitmap_size) {
>>>>> + MemoryRegionSection tmp = *section;
>>>>> +
>>>>> + offset = first_one_bit * gmm->block_size;
>>>>> + last_one_bit = find_next_zero_bit(gmm->bitmap, gmm-
>>>>>> bitmap_size,
>>>>> + first_one_bit + 1) - 1;
>>>>> + size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
>>>>
>>>> This tries calling cb() on bigger chunks even though we say from the
>>>> beginning that only page size is supported?
>>>>
>>>> May be simplify this for now and extend if/when VFIO learns to split
>>>> mappings, or just drop it when we get in-place page state convertion
>>>> (which will make this all irrelevant)?
>>>
>>> The cb() will call with big chunks but actually it do the split with the
>>> granularity of block_size in the cb(). See the
>>> vfio_ram_discard_notify_populate(), which do the DMA_MAP with
>>> granularity size.
>>
>>
>> Right, and this all happens inside QEMU - first the code finds bigger
>> chunks and then it splits them anyway to call the VFIO driver. Seems
>> pointless to bother about bigger chunks here.
>>
>>>
>>>>
>>>>
>>>>> +
>>>>> + if (!memory_region_section_intersect_range(&tmp, offset,
>>>>> size)) {
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> + ret = cb(&tmp, arg);
>>>>> + if (ret) {
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> + first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>>> + last_one_bit + 2);
>>>>> + }
>>>>> +
>>>>> + return ret;
>>>>> +}
>>>>> +
>>>>> +static int guest_memfd_for_each_discarded_section(const
>>>>> GuestMemfdManager *gmm,
>>>>> + MemoryRegionSection
>>>>> *section,
>>>>> + void *arg,
>>>>> +
>>>>> guest_memfd_section_cb cb)
>>>>> +{
>>>>> + unsigned long first_zero_bit, last_zero_bit;
>>>>> + uint64_t offset, size;
>>>>> + int ret = 0;
>>>>> +
>>>>> + first_zero_bit = section->offset_within_region / gmm->block_size;
>>>>> + first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>>>>> + first_zero_bit);
>>>>> +
>>>>> + while (first_zero_bit < gmm->bitmap_size) {
>>>>> + MemoryRegionSection tmp = *section;
>>>>> +
>>>>> + offset = first_zero_bit * gmm->block_size;
>>>>> + last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>>> + first_zero_bit + 1) - 1;
>>>>> + size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
>>>>> +
>>>>> + if (!memory_region_section_intersect_range(&tmp, offset,
>>>>> size)) {
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> + ret = cb(&tmp, arg);
>>>>> + if (ret) {
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> + first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm-
>>>>>> bitmap_size,
>>>>> + last_zero_bit + 2);
>>>>> + }
>>>>> +
>>>>> + return ret;
>>>>> +}
>>>>> +
>>>>> +static uint64_t guest_memfd_rdm_get_min_granularity(const
>>>>> RamDiscardManager *rdm,
>>>>> + const
>>>>> MemoryRegion *mr)
>>>>> +{
>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>> +
>>>>> + g_assert(mr == gmm->mr);
>>>>> + return gmm->block_size;
>>>>> +}
>>>>> +
>>>>> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
>>>>> + RamDiscardListener *rdl,
>>>>> + MemoryRegionSection
>>>>> *section)
>>>>> +{
>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>> + int ret;
>>>>> +
>>>>> + g_assert(section->mr == gmm->mr);
>>>>> + rdl->section = memory_region_section_new_copy(section);
>>>>> +
>>>>> + QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
>>>>> +
>>>>> + ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
>>>>> +
>>>>> guest_memfd_notify_populate_cb);
>>>>> + if (ret) {
>>>>> + error_report("%s: Failed to register RAM discard listener:
>>>>> %s", __func__,
>>>>> + strerror(-ret));
>>>>> + }
>>>>> +}
>>>>> +
>>>>> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager
>>>>> *rdm,
>>>>> + RamDiscardListener
>>>>> *rdl)
>>>>> +{
>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>> + int ret;
>>>>> +
>>>>> + g_assert(rdl->section);
>>>>> + g_assert(rdl->section->mr == gmm->mr);
>>>>> +
>>>>> + ret = guest_memfd_for_each_populated_section(gmm, rdl->section,
>>>>> rdl,
>>>>> +
>>>>> guest_memfd_notify_discard_cb);
>>>>> + if (ret) {
>>>>> + error_report("%s: Failed to unregister RAM discard listener:
>>>>> %s", __func__,
>>>>> + strerror(-ret));
>>>>> + }
>>>>> +
>>>>> + memory_region_section_free_copy(rdl->section);
>>>>> + rdl->section = NULL;
>>>>> + QLIST_REMOVE(rdl, next);
>>>>> +
>>>>> +}
>>>>> +
>>>>> +typedef struct GuestMemfdReplayData {
>>>>> + void *fn;
>>>>
>>>> s/void */ReplayRamPopulate/
>>>
>>> [...]
>>>
>>>>
>>>>> + void *opaque;
>>>>> +} GuestMemfdReplayData;
>>>>> +
>>>>> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection
>>>>> *section, void *arg)
>>>>> +{
>>>>> + struct GuestMemfdReplayData *data = arg;
>>>>
>>>> Drop "struct" here and below.
>>>
>>> Fixed. Thanks!
>>>
>>>>
>>>>> + ReplayRamPopulate replay_fn = data->fn;
>>>>> +
>>>>> + return replay_fn(section, data->opaque);
>>>>> +}
>>>>> +
>>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>>>>> *rdm,
>>>>> + MemoryRegionSection
>>>>> *section,
>>>>> + ReplayRamPopulate
>>>>> replay_fn,
>>>>> + void *opaque)
>>>>> +{
>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>>> opaque };
>>>>> +
>>>>> + g_assert(section->mr == gmm->mr);
>>>>> + return guest_memfd_for_each_populated_section(gmm, section, &data,
>>>>> +
>>>>> guest_memfd_rdm_replay_populated_cb);
>>>>> +}
>>>>> +
>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>>> *section, void *arg)
>>>>> +{
>>>>> + struct GuestMemfdReplayData *data = arg;
>>>>> + ReplayRamDiscard replay_fn = data->fn;
>>>>> +
>>>>> + replay_fn(section, data->opaque);
>>>>
>>>>
>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>
>>> It follows current definiton of ReplayRamDiscard() and
>>> ReplayRamPopulate() where replay_discard() doesn't return errors and
>>> replay_populate() returns errors.
>>
>> A trace would be appropriate imho. Thanks,
>
> Sorry, can't catch you. What kind of info to be traced? The errors
> returned by replay_populate()?
Yeah. imho these are useful as we expect this part to work in general
too, right? Thanks,
>
>>
>>>>
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>>>>> *rdm,
>>>>> + MemoryRegionSection
>>>>> *section,
>>>>> + ReplayRamDiscard
>>>>> replay_fn,
>>>>> + void *opaque)
>>>>> +{
>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>>> opaque };
>>>>> +
>>>>> + g_assert(section->mr == gmm->mr);
>>>>> + guest_memfd_for_each_discarded_section(gmm, section, &data,
>>>>> +
>>>>> guest_memfd_rdm_replay_discarded_cb);
>>>>> +}
>>>>> +
>>>>> +static void guest_memfd_manager_init(Object *obj)
>>>>> +{
>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>>> +
>>>>> + QLIST_INIT(&gmm->rdl_list);
>>>>> +}
>>>>> +
>>>>> +static void guest_memfd_manager_finalize(Object *obj)
>>>>> +{
>>>>> + g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>>
>>>>
>>>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
>>>
>>> Will remove it. Thanks.
>>>
>>>>
>>>>
>>>>> +}
>>>>> +
>>>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void
>>>>> *data)
>>>>> +{
>>>>> + RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>>> +
>>>>> + rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>>>> + rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>>> + rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>>>> + rdmc->is_populated = guest_memfd_rdm_is_populated;
>>>>> + rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>>>> + rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>>>> +}
>>>>> diff --git a/system/meson.build b/system/meson.build
>>>>> index 4952f4b2c7..ed4e1137bd 100644
>>>>> --- a/system/meson.build
>>>>> +++ b/system/meson.build
>>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>>> 'dirtylimit.c',
>>>>> 'dma-helpers.c',
>>>>> 'globals.c',
>>>>> + 'guest-memfd-manager.c',
>>>>> 'memory_mapping.c',
>>>>> 'qdev-monitor.c',
>>>>> 'qtest.c',
>>>>
>>>
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-09 2:55 ` Alexey Kardashevskiy
@ 2025-01-09 4:29 ` Chenyi Qiang
2025-01-10 0:58 ` Alexey Kardashevskiy
2025-01-14 6:45 ` Chenyi Qiang
0 siblings, 2 replies; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-09 4:29 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>
>
> On 9/1/25 13:11, Chenyi Qiang wrote:
>>
>>
>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>>> operation to perform page conversion between private and shared
>>>>>> memory.
>>>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>>> pages). Blocking shared page discard can solve this problem, but it
>>>>>> could cause guests to consume twice the memory with VFIO, which is
>>>>>> not
>>>>>> acceptable in some cases. An alternative solution is to convey other
>>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>>
>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>> adjust
>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>> adding it
>>>>>> back in the other, so the similar work that needs to happen in
>>>>>> response
>>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>
>>>>>> However, guest_memfd is not an object so it cannot directly implement
>>>>>> the RamDiscardManager interface.
>>>>>>
>>>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>>>
>>>>> This sounds about right.
>>>>>
>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>> target
>>>>>> MemoryRegion. However, this solution doesn't cover the scenario
>>>>>> where a
>>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
>>>>>> e.g.
>>>>>> the virtual BIOS MemoryRegion.
>>>>>
>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>
>>>> virtual BIOS shows in a separate region:
>>>>
>>>> Root memory region: system
>>>> 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>> ...
>>>> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>
>>> Looks like a normal MR which can be backed by guest_memfd.
>>
>> Yes, virtual BIOS memory region is initialized by
>> memory_region_init_ram_guest_memfd() which will be backed by a
>> guest_memfd.
>>
>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
>> BIOS image will be loaded and then copied to private region.
>> After that,
>> the loaded image will be discarded and this region become useless.
>
> I'd think it is loaded as "struct Rom" and then copied to the MR-
> ram_guest_memfd() which does not leave MR useless - we still see
> "pc.bios" in the list so it is not discarded. What piece of code are you
> referring to exactly?
Sorry for the confusion; the vBIOS handling may differ between TDX and
SEV-SNP. In x86_bios_rom_init(), QEMU initializes a guest_memfd-backed MR
and loads the vBIOS image into the shared part of the guest_memfd MR. For
TDX, it then copies the image into a private region (not the private part
of the vBIOS guest_memfd MR) and discards the shared part. So, although
the memory region still exists, it seems useless.
Is it different for SEV-SNP? Does SEV-SNP keep the vBIOS in the private
memory of the vBIOS guest_memfd?
>
>
>> So I
>> feel like this virtual BIOS should not be backed by guest_memfd?
>
> From the above it sounds like the opposite, i.e. it should :)
>
>>>
>>>> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>>> @0000000080000000 KVM
>>>
>>> Anyway if there is no guest_memfd backing it and
>>> memory_region_has_ram_discard_manager() returns false, then the MR is
>>> just going to be mapped for VFIO as usual which seems... alright, right?
>>
>> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
>> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>>
>> If we go with the HostMemoryBackend instead of guest_memfd_manager, this
>> MR would be mapped by VFIO. Maybe need to avoid such vBIOS mapping, or
>> just ignore it since the MR is useless (but looks not so good).
>
> Sorry I am missing necessary details here, let's figure out the above.
>
>>
>>>
>>>
>>>> We also consider to implement the interface in HostMemoryBackend, but
>>>> maybe implement with guest_memfd region is more general. We don't know
>>>> if any DMAable memory would belong to HostMemoryBackend although at
>>>> present it is.
>>>>
>>>> If it is more appropriate to implement it with HostMemoryBackend, I can
>>>> change to this way.
>>>
>>> Seems cleaner imho.
>>
>> I can go this way.
[...]
>>>>>> +
>>>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>>>>>> *rdm,
>>>>>> + MemoryRegionSection
>>>>>> *section,
>>>>>> + ReplayRamPopulate
>>>>>> replay_fn,
>>>>>> + void *opaque)
>>>>>> +{
>>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>>>> opaque };
>>>>>> +
>>>>>> + g_assert(section->mr == gmm->mr);
>>>>>> + return guest_memfd_for_each_populated_section(gmm, section,
>>>>>> &data,
>>>>>> +
>>>>>> guest_memfd_rdm_replay_populated_cb);
>>>>>> +}
>>>>>> +
>>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>>>> *section, void *arg)
>>>>>> +{
>>>>>> + struct GuestMemfdReplayData *data = arg;
>>>>>> + ReplayRamDiscard replay_fn = data->fn;
>>>>>> +
>>>>>> + replay_fn(section, data->opaque);
>>>>>
>>>>>
>>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>>
>>>> It follows current definiton of ReplayRamDiscard() and
>>>> ReplayRamPopulate() where replay_discard() doesn't return errors and
>>>> replay_populate() returns errors.
>>>
>>> A trace would be appropriate imho. Thanks,
>>
>> Sorry, can't catch you. What kind of info to be traced? The errors
>> returned by replay_populate()?
>
> Yeah. imho these are useful as we expect this part to work in general
> too, right? Thanks,
Something like?
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index 6b3e1ee9d6..4440ac9e59 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -185,8 +185,14 @@ static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
 {
     struct GuestMemfdReplayData *data = arg;
     ReplayRamPopulate replay_fn = data->fn;
+    int ret;

-    return replay_fn(section, data->opaque);
+    ret = replay_fn(section, data->opaque);
+    if (ret) {
+        trace_guest_memfd_rdm_replay_populated_cb(ret);
+    }
+
+    return ret;
 }
How about just adding some error output in
guest_memfd_for_each_populated_section() /
guest_memfd_for_each_discarded_section() if the cb() (i.e.
replay_populate()) returns an error?
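E.g. in the loop body of both helpers (a sketch; error_report() as the
simplest option, a tracepoint would also work):

        ret = cb(&tmp, arg);
        if (ret) {
            error_report("%s: cb() failed: offset=0x%" PRIx64
                         " size=0x%" PRIx64 ": %s", __func__,
                         offset, size, strerror(-ret));
            break;
        }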
>
>>
>>>
>>>>>
>>>>>> +
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>>>>>> *rdm,
>>>>>> + MemoryRegionSection
>>>>>> *section,
>>>>>> + ReplayRamDiscard
>>>>>> replay_fn,
>>>>>> + void *opaque)
>>>>>> +{
>>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>>>> opaque };
>>>>>> +
>>>>>> + g_assert(section->mr == gmm->mr);
>>>>>> + guest_memfd_for_each_discarded_section(gmm, section, &data,
>>>>>> +
>>>>>> guest_memfd_rdm_replay_discarded_cb);
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_manager_init(Object *obj)
>>>>>> +{
>>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>>>> +
>>>>>> + QLIST_INIT(&gmm->rdl_list);
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_manager_finalize(Object *obj)
>>>>>> +{
>>>>>> + g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>>>
>>>>>
>>>>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
>>>>
>>>> Will remove it. Thanks.
>>>>
>>>>>
>>>>>
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void
>>>>>> *data)
>>>>>> +{
>>>>>> + RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>>>> +
>>>>>> + rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>>>>> + rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>>>> + rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>>>>> + rdmc->is_populated = guest_memfd_rdm_is_populated;
>>>>>> + rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>>>>> + rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>>>>> +}
>>>>>> diff --git a/system/meson.build b/system/meson.build
>>>>>> index 4952f4b2c7..ed4e1137bd 100644
>>>>>> --- a/system/meson.build
>>>>>> +++ b/system/meson.build
>>>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>>>> 'dirtylimit.c',
>>>>>> 'dma-helpers.c',
>>>>>> 'globals.c',
>>>>>> + 'guest-memfd-manager.c',
>>>>>> 'memory_mapping.c',
>>>>>> 'qdev-monitor.c',
>>>>>> 'qtest.c',
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply related [flat|nested] 98+ messages in thread
* Re: [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation
2025-01-08 4:47 ` Alexey Kardashevskiy
@ 2025-01-09 5:34 ` Chenyi Qiang
2025-01-09 9:32 ` Alexey Kardashevskiy
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-09 5:34 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
> On 13/12/24 18:08, Chenyi Qiang wrote:
>> Introduce the realize()/unrealize() callbacks to initialize/uninitialize
>> the new guest_memfd_manager object and register/unregister it in the
>> target MemoryRegion.
>>
>> Guest_memfd was initially set to shared until the commit bd3bcf6962
>> ("kvm/memory: Make memory type private by default if it has guest memfd
>> backend"). To align with this change, the default state in
>> guest_memfd_manager is set to private. (The bitmap is cleared to 0).
>> Additionally, setting the default to private can also reduce the
>> overhead of mapping shared pages into IOMMU by VFIO during the bootup
>> stage.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> include/sysemu/guest-memfd-manager.h | 27 +++++++++++++++++++++++++++
>> system/guest-memfd-manager.c | 28 +++++++++++++++++++++++++++-
>> system/physmem.c | 7 +++++++
>> 3 files changed, 61 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>> guest-memfd-manager.h
>> index 9dc4e0346d..d1e7f698e8 100644
>> --- a/include/sysemu/guest-memfd-manager.h
>> +++ b/include/sysemu/guest-memfd-manager.h
>> @@ -42,6 +42,8 @@ struct GuestMemfdManager {
>> struct GuestMemfdManagerClass {
>> ObjectClass parent_class;
>> + void (*realize)(GuestMemfdManager *gmm, MemoryRegion *mr,
>> uint64_t region_size);
>> + void (*unrealize)(GuestMemfdManager *gmm);
>> int (*state_change)(GuestMemfdManager *gmm, uint64_t offset,
>> uint64_t size,
>> bool shared_to_private);
>> };
>> @@ -61,4 +63,29 @@ static inline int
>> guest_memfd_manager_state_change(GuestMemfdManager *gmm, uint6
>> return 0;
>> }
>> +static inline void guest_memfd_manager_realize(GuestMemfdManager *gmm,
>> + MemoryRegion *mr,
>> uint64_t region_size)
>> +{
>> + GuestMemfdManagerClass *klass;
>> +
>> + g_assert(gmm);
>> + klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
>> +
>> + if (klass->realize) {
>> + klass->realize(gmm, mr, region_size);
>
> Ditch realize() hook and call guest_memfd_manager_realizefn() directly?
> Not clear why these new hooks are needed.
>
>> + }
>> +}
>> +
>> +static inline void guest_memfd_manager_unrealize(GuestMemfdManager *gmm)
>> +{
>> + GuestMemfdManagerClass *klass;
>> +
>> + g_assert(gmm);
>> + klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
>> +
>> + if (klass->unrealize) {
>> + klass->unrealize(gmm);
>> + }
>> +}
>
> guest_memfd_manager_unrealizefn()?
Agreed. Adding these wrappers seems unnecessary.
>
>
>> +
>> #endif
>> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
>> index 6601df5f3f..b6a32f0bfb 100644
>> --- a/system/guest-memfd-manager.c
>> +++ b/system/guest-memfd-manager.c
>> @@ -366,6 +366,31 @@ static int
>> guest_memfd_state_change(GuestMemfdManager *gmm, uint64_t offset,
>> return ret;
>> }
>> +static void guest_memfd_manager_realizefn(GuestMemfdManager *gmm,
>> MemoryRegion *mr,
>> + uint64_t region_size)
>> +{
>> + uint64_t bitmap_size;
>> +
>> + gmm->block_size = qemu_real_host_page_size();
>> + bitmap_size = ROUND_UP(region_size, gmm->block_size) / gmm-
>> >block_size;
>
> imho unaligned region_size should be an assert.
There's no guarantee that the region_size of the MemoryRegion is aligned
to the host page size, so the ROUND_UP() is more appropriate. E.g. for a
2M+1K region with a 4K block_size, ROUND_UP() yields 513 bits, whereas
plain division would silently drop the tail page.
>
>> +
>> + gmm->mr = mr;
>> + gmm->bitmap_size = bitmap_size;
>> + gmm->bitmap = bitmap_new(bitmap_size);
>> +
>> + memory_region_set_ram_discard_manager(gmm->mr,
>> RAM_DISCARD_MANAGER(gmm));
>> +}
>
> This belongs to 2/7.
>
>> +
>> +static void guest_memfd_manager_unrealizefn(GuestMemfdManager *gmm)
>> +{
>> + memory_region_set_ram_discard_manager(gmm->mr, NULL);
>> +
>> + g_free(gmm->bitmap);
>> + gmm->bitmap = NULL;
>> + gmm->bitmap_size = 0;
>> + gmm->mr = NULL;
>
> @gmm is being destroyed here, why bother zeroing?
OK, will remove it.
>
>> +}
>> +
>
> This function belongs to 2/7.
Will move both realizefn() and unrealizefn().
>
>> static void guest_memfd_manager_init(Object *obj)
>> {
>> GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>> @@ -375,7 +400,6 @@ static void guest_memfd_manager_init(Object *obj)
>> static void guest_memfd_manager_finalize(Object *obj)
>> {
>> - g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>> }
>> static void guest_memfd_manager_class_init(ObjectClass *oc, void
>> *data)
>> @@ -384,6 +408,8 @@ static void
>> guest_memfd_manager_class_init(ObjectClass *oc, void *data)
>> RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>> gmmc->state_change = guest_memfd_state_change;
>> + gmmc->realize = guest_memfd_manager_realizefn;
>> + gmmc->unrealize = guest_memfd_manager_unrealizefn;
>> rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>> rdmc->register_listener = guest_memfd_rdm_register_listener;
>> diff --git a/system/physmem.c b/system/physmem.c
>> index dc1db3a384..532182a6dd 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -53,6 +53,7 @@
>> #include "sysemu/hostmem.h"
>> #include "sysemu/hw_accel.h"
>> #include "sysemu/xen-mapcache.h"
>> +#include "sysemu/guest-memfd-manager.h"
>> #include "trace.h"
>> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>> @@ -1885,6 +1886,9 @@ static void ram_block_add(RAMBlock *new_block,
>> Error **errp)
>> qemu_mutex_unlock_ramlist();
>> goto out_free;
>> }
>> +
>> + GuestMemfdManager *gmm =
>> GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
>> + guest_memfd_manager_realize(gmm, new_block->mr, new_block-
>> >mr->size);
>
> Wow. Quite invasive.
Yeah... It creates a manager object no matter whether the user wants to
use shared passthrough or not. We assume some fields like the
private/shared bitmap may also be helpful in other scenarios in the
future, and if there is no passthrough device, the listener would simply
return, so it is acceptable.
>
>> }
>> ram_size = (new_block->offset + new_block->max_length) >>
>> TARGET_PAGE_BITS;
>> @@ -2139,6 +2143,9 @@ static void reclaim_ramblock(RAMBlock *block)
>> if (block->guest_memfd >= 0) {
>> close(block->guest_memfd);
>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(block->mr->rdm);
>> + guest_memfd_manager_unrealize(gmm);
>> + object_unref(OBJECT(gmm));
>
> Likely don't matter but I'd do the cleanup before close() or do block-
>>guest_memfd=-1 before the cleanup. Thanks,
>
>
>> ram_block_discard_require(false);
>> }
>>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-08 11:38 ` Alexey Kardashevskiy
@ 2025-01-09 7:52 ` Chenyi Qiang
2025-01-09 8:18 ` Alexey Kardashevskiy
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-09 7:52 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:
>
>
> On 8/1/25 17:28, Chenyi Qiang wrote:
>> Thanks Alexey for your review!
>>
>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>>> discard") effectively disables device assignment when using
>>>> guest_memfd.
>>>> This poses a significant challenge as guest_memfd is essential for
>>>> confidential guests, thereby blocking device assignment to these VMs.
>>>> The initial rationale for disabling device assignment was due to stale
>>>> IOMMU mappings (see Problem section) and the assumption that TEE I/O
>>>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
>>>> problem for confidential guests [1]. However, this assumption has
>>>> proven
>>>> to be incorrect. TEE I/O relies on the ability to operate devices
>>>> against
>>>> "shared" or untrusted memory, which is crucial for device
>>>> initialization
>>>> and error recovery scenarios. As a result, the current implementation
>>>> does
>>>> not adequately support device assignment for confidential guests,
>>>> necessitating
>>>> a reevaluation of the approach to ensure compatibility and
>>>> functionality.
>>>>
>>>> This series enables shared device assignment by notifying VFIO of page
>>>> conversions using an existing framework named RamDiscardListener.
>>>> Additionally, there is an ongoing patch set [2] that aims to add 1G
>>>> page
>>>> support for guest_memfd. This patch set introduces in-place page
>>>> conversion,
>>>> where private and shared memory share the same physical pages as the
>>>> backend.
>>>> This development may impact our solution.
>>>>
>>>> We presented our solution in the guest_memfd meeting to discuss its
>>>> compatibility with the new changes and potential future directions
>>>> (see [3]
>>>> for more details). The conclusion was that, although our solution may
>>>> not be
>>>> the most elegant (see the Limitation section), it is sufficient for
>>>> now and
>>>> can be easily adapted to future changes.
>>>>
>>>> We are re-posting the patch series with some cleanup and have removed
>>>> the RFC
>>>> label for the main enabling patches (1-6). The newly-added patch 7 is
>>>> still
>>>> marked as RFC as it tries to resolve some extension concerns related to
>>>> RamDiscardManager for future usage.
>>>>
>>>> The overview of the patches:
>>>> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
>>>> with a given range.
>>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>>> RamDiscardManager, and notify the shared/private state change
>>>> during
>>>> conversion.
>>>> - Patch 7: Try to resolve a semantics concern related to
>>>> RamDiscardManager
>>>> i.e. RamDiscardManager is used to manage memory plug/unplug state
>>>> instead of shared/private state. It would affect future users of
>>>> RamDiscardManger in confidential VMs. Attach it behind as a RFC
>>>> patch[4].
>>>>
>>>> Changes since last version:
>>>> - Add a patch to export some generic helper functions from virtio-mem
>>>> code.
>>>> - Change the bitmap in guest_memfd_manager from default shared to
>>>> default
>>>> private. This keeps alignment with virtio-mem that 1-setting in
>>>> bitmap
>>>> represents the populated state and may help to export more generic
>>>> code
>>>> if necessary.
>>>> - Add the helpers to initialize/uninitialize the guest_memfd_manager
>>>> instance
>>>> to make it more clear.
>>>> - Add a patch to distinguish between the shared/private state change
>>>> and
>>>> the memory plug/unplug state change in RamDiscardManager.
>>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>>>> chenyi.qiang@intel.com/
>>>>
>>>> ---
>>>>
>>>> Background
>>>> ==========
>>>> Confidential VMs have two classes of memory: shared and private memory.
>>>> Shared memory is accessible from the host/VMM while private memory is
>>>> not. Confidential VMs can decide which memory is shared/private and
>>>> convert memory between shared/private at runtime.
>>>>
>>>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>>>> private memory. The key differences between guest_memfd and normal
>>>> memfd
>>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner
>>>> VM and
>>>> cannot be mapped, read or written by userspace.
>>>
>>> The "cannot be mapped" seems to be not true soon anymore (if not
>>> already).
>>>
>>> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/
>>
>> Exactly, allowing guest_memfd to do mmap is the direction. I mentioned
>> it below with in-place page conversion. Maybe I would move it here to
>> make it more clear.
>>
>>>
>>>
>>>>
>>>> In QEMU's implementation, shared memory is allocated with normal
>>>> methods
>>>> (e.g. mmap or fallocate) while private memory is allocated from
>>>> guest_memfd. When a VM performs memory conversions, QEMU frees pages
>>>> via
>>>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>>>> allocates new pages from the other side.
>>>>
>>
>> [...]
>>
>>>>
>>>> One limitation (also discussed in the guest_memfd meeting) is that VFIO
>>>> expects the DMA mapping for a specific IOVA to be mapped and unmapped
>>>> with
>>>> the same granularity. The guest may perform partial conversions,
>>>> such as
>>>> converting a small region within a larger region. To prevent such
>>>> invalid
>>>> cases, all operations are performed with 4K granularity. The possible
>>>> solutions we can think of are either to enable VFIO to support partial
>>>> unmap
>
> btw the old VFIO does not split mappings but iommufd seems to be capable
> of it - there is iopt_area_split(). What happens if you try unmapping a
> smaller chunk that does not exactly match any mapped chunk? thanks,
iopt_cut_iova() lives in iommufd's vfio_compat.c, whose purpose is to keep
iommufd compatible with the old VFIO_TYPE1 interface. IIUC, it only takes
effect with disable_large_page=true, which means large IOPTEs are also
disabled in the IOMMU, so the split can be done easily. See the comment in
iommufd_vfio_set_iommu().
The iommufd VFIO-compatible mode is a transition path from legacy VFIO to
iommufd. Native iommufd requires the unmapped iova/length to be a superset
of a previously mapped range; if they don't match, it returns an error.
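(For illustration, the legacy type1 constraint looks like this from
userspace -- a sketch assuming <linux/vfio.h> and a container fd that is
already set up with the type1 IOMMU:)

    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)buf,
        .iova  = 0x100000000ULL,
        .size  = 0x200000,          /* mapped as one 2M chunk */
    };
    ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);

    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = 0x100000000ULL,
        .size  = 0x1000,            /* partial unmap of that chunk */
    };
    /* type1 refuses to bisect an existing mapping, so this fails;
     * that's why the series maps/unmaps at 4K granularity. */
    ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);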
>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation
2024-12-13 7:08 ` [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation Chenyi Qiang
2025-01-08 4:47 ` Alexey Kardashevskiy
@ 2025-01-09 8:14 ` Zhao Liu
2025-01-09 8:17 ` Chenyi Qiang
1 sibling, 1 reply; 98+ messages in thread
From: Zhao Liu @ 2025-01-09 8:14 UTC (permalink / raw)
To: Chenyi Qiang
Cc: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> @@ -1885,6 +1886,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> qemu_mutex_unlock_ramlist();
> goto out_free;
> }
> +
> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
> + guest_memfd_manager_realize(gmm, new_block->mr, new_block->mr->size);
realize & unrealize are usually used for QDev. I think it's not good to
use *realize and *unrealize here.
What about "guest_memfd_manager_attach_ram"?
In addition, it seems the third parameter is unnecessary: we can access
MemoryRegion.size directly in guest_memfd_manager_realize().
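E.g. something like (a sketch, reusing the suggested name):

static void guest_memfd_manager_attach_ram(GuestMemfdManager *gmm,
                                           MemoryRegion *mr)
{
    uint64_t region_size = memory_region_size(mr);

    gmm->block_size = qemu_real_host_page_size();
    gmm->bitmap_size = ROUND_UP(region_size, gmm->block_size) /
                       gmm->block_size;
    gmm->bitmap = bitmap_new(gmm->bitmap_size);
    gmm->mr = mr;

    memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(gmm));
}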
> }
>
> ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
> @@ -2139,6 +2143,9 @@ static void reclaim_ramblock(RAMBlock *block)
>
> if (block->guest_memfd >= 0) {
> close(block->guest_memfd);
> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(block->mr->rdm);
> + guest_memfd_manager_unrealize(gmm);
Similarly, what about "guest_memfd_manager_unattach_ram"?
> + object_unref(OBJECT(gmm));
> ram_block_discard_require(false);
> }
>
Regards,
Zhao
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation
2025-01-09 8:14 ` Zhao Liu
@ 2025-01-09 8:17 ` Chenyi Qiang
0 siblings, 0 replies; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-09 8:17 UTC (permalink / raw)
To: Zhao Liu
Cc: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
Thanks Zhao for your review!
On 1/9/2025 4:14 PM, Zhao Liu wrote:
>> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>> @@ -1885,6 +1886,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>> qemu_mutex_unlock_ramlist();
>> goto out_free;
>> }
>> +
>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
>> + guest_memfd_manager_realize(gmm, new_block->mr, new_block->mr->size);
>
> realize & unrealize are usually used for QDev. I think it's not good to use
> *realize and *unrealize here.
>
> Why about "guest_memfd_manager_attach_ram"?
>
> In addition, it seems the third parameter is unnecessary and we can access
> MemoryRegion.size directly in guest_memfd_manager_realize().
LGTM. Will follow your suggestion if we still wrap the operations in one
function. (We may change to the HostMemoryBackend RDM and then unpack the
operations.)
>
>> }
>>
>> ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
>> @@ -2139,6 +2143,9 @@ static void reclaim_ramblock(RAMBlock *block)
>>
>> if (block->guest_memfd >= 0) {
>> close(block->guest_memfd);
>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(block->mr->rdm);
>> + guest_memfd_manager_unrealize(gmm);
>
> Similiarly, what about "guest_memfd_manager_unattach_ram"?
Ditto. thanks.
>
>> + object_unref(OBJECT(gmm));
>> ram_block_discard_require(false);
>> }
>>
>
> Regards,
> Zhao
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-09 7:52 ` Chenyi Qiang
@ 2025-01-09 8:18 ` Alexey Kardashevskiy
2025-01-09 8:49 ` Chenyi Qiang
0 siblings, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-09 8:18 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 9/1/25 18:52, Chenyi Qiang wrote:
>
>
> On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:
>>
>>
>> On 8/1/25 17:28, Chenyi Qiang wrote:
>>> Thanks Alexey for your review!
>>>
>>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>>>> discard") effectively disables device assignment when using
>>>>> guest_memfd.
>>>>> This poses a significant challenge as guest_memfd is essential for
>>>>> confidential guests, thereby blocking device assignment to these VMs.
>>>>> The initial rationale for disabling device assignment was due to stale
>>>>> IOMMU mappings (see Problem section) and the assumption that TEE I/O
>>>>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
>>>>> problem for confidential guests [1]. However, this assumption has
>>>>> proven
>>>>> to be incorrect. TEE I/O relies on the ability to operate devices
>>>>> against
>>>>> "shared" or untrusted memory, which is crucial for device
>>>>> initialization
>>>>> and error recovery scenarios. As a result, the current implementation
>>>>> does
>>>>> not adequately support device assignment for confidential guests,
>>>>> necessitating
>>>>> a reevaluation of the approach to ensure compatibility and
>>>>> functionality.
>>>>>
>>>>> This series enables shared device assignment by notifying VFIO of page
>>>>> conversions using an existing framework named RamDiscardListener.
>>>>> Additionally, there is an ongoing patch set [2] that aims to add 1G
>>>>> page
>>>>> support for guest_memfd. This patch set introduces in-place page
>>>>> conversion,
>>>>> where private and shared memory share the same physical pages as the
>>>>> backend.
>>>>> This development may impact our solution.
>>>>>
>>>>> We presented our solution in the guest_memfd meeting to discuss its
>>>>> compatibility with the new changes and potential future directions
>>>>> (see [3]
>>>>> for more details). The conclusion was that, although our solution may
>>>>> not be
>>>>> the most elegant (see the Limitation section), it is sufficient for
>>>>> now and
>>>>> can be easily adapted to future changes.
>>>>>
>>>>> We are re-posting the patch series with some cleanup and have removed
>>>>> the RFC
>>>>> label for the main enabling patches (1-6). The newly-added patch 7 is
>>>>> still
>>>>> marked as RFC as it tries to resolve some extension concerns related to
>>>>> RamDiscardManager for future usage.
>>>>>
>>>>> The overview of the patches:
>>>>> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
>>>>> with a given range.
>>>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>>>> RamDiscardManager, and notify the shared/private state change
>>>>> during
>>>>> conversion.
>>>>> - Patch 7: Try to resolve a semantics concern related to
>>>>> RamDiscardManager
>>>>> i.e. RamDiscardManager is used to manage memory plug/unplug state
>>>>> instead of shared/private state. It would affect future users of
>>>>> RamDiscardManger in confidential VMs. Attach it behind as a RFC
>>>>> patch[4].
>>>>>
>>>>> Changes since last version:
>>>>> - Add a patch to export some generic helper functions from virtio-mem
>>>>> code.
>>>>> - Change the bitmap in guest_memfd_manager from default shared to
>>>>> default
>>>>> private. This keeps alignment with virtio-mem that 1-setting in
>>>>> bitmap
>>>>> represents the populated state and may help to export more generic
>>>>> code
>>>>> if necessary.
>>>>> - Add the helpers to initialize/uninitialize the guest_memfd_manager
>>>>> instance
>>>>> to make it more clear.
>>>>> - Add a patch to distinguish between the shared/private state change
>>>>> and
>>>>> the memory plug/unplug state change in RamDiscardManager.
>>>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>>>>> chenyi.qiang@intel.com/
>>>>>
>>>>> ---
>>>>>
>>>>> Background
>>>>> ==========
>>>>> Confidential VMs have two classes of memory: shared and private memory.
>>>>> Shared memory is accessible from the host/VMM while private memory is
>>>>> not. Confidential VMs can decide which memory is shared/private and
>>>>> convert memory between shared/private at runtime.
>>>>>
>>>>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>>>>> private memory. The key differences between guest_memfd and normal
>>>>> memfd
>>>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner
>>>>> VM and
>>>>> cannot be mapped, read or written by userspace.
>>>>
>>>> The "cannot be mapped" seems to be not true soon anymore (if not
>>>> already).
>>>>
>>>> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/
>>>
>>> Exactly, allowing guest_memfd to do mmap is the direction. I mentioned
>>> it below with in-place page conversion. Maybe I would move it here to
>>> make it more clear.
>>>
>>>>
>>>>
>>>>>
>>>>> In QEMU's implementation, shared memory is allocated with normal
>>>>> methods
>>>>> (e.g. mmap or fallocate) while private memory is allocated from
>>>>> guest_memfd. When a VM performs memory conversions, QEMU frees pages
>>>>> via
>>>>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>>>>> allocates new pages from the other side.
>>>>>
>>>
>>> [...]
>>>
>>>>>
>>>>> One limitation (also discussed in the guest_memfd meeting) is that VFIO
>>>>> expects the DMA mapping for a specific IOVA to be mapped and unmapped
>>>>> with
>>>>> the same granularity. The guest may perform partial conversions,
>>>>> such as
>>>>> converting a small region within a larger region. To prevent such
>>>>> invalid
>>>>> cases, all operations are performed with 4K granularity. The possible
>>>>> solutions we can think of are either to enable VFIO to support partial
>>>>> unmap
>>
>> btw the old VFIO does not split mappings but iommufd seems to be capable
>> of it - there is iopt_area_split(). What happens if you try unmapping a
>> smaller chunk that does not exactly match any mapped chunk? thanks,
>
> iopt_cut_iova() happens in iommufd vfio_compat.c, which is to make
> iommufd be compatible with old VFIO_TYPE1. IIUC, it happens with
> disable_large_page=true. That means the large IOPTE is also disabled in
> IOMMU. So it can do the split easily. See the comment in
> iommufd_vfio_set_iommu().
>
> iommufd VFIO compatible mode is a transition from legacy VFIO to
> iommufd. For the normal iommufd, it requires the iova/length must be a
> superset of a previously mapped range. If not match, will return error.
This is all true but this also means that "The former requires complex
changes in VFIO" is not entirely true - some code is already there. Thanks,
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-09 8:18 ` Alexey Kardashevskiy
@ 2025-01-09 8:49 ` Chenyi Qiang
2025-01-10 1:42 ` Alexey Kardashevskiy
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-09 8:49 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/9/2025 4:18 PM, Alexey Kardashevskiy wrote:
>
>
> On 9/1/25 18:52, Chenyi Qiang wrote:
>>
>>
>> On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 8/1/25 17:28, Chenyi Qiang wrote:
>>>> Thanks Alexey for your review!
>>>>
>>>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>>>>> discard") effectively disables device assignment when using
>>>>>> guest_memfd.
>>>>>> This poses a significant challenge as guest_memfd is essential for
>>>>>> confidential guests, thereby blocking device assignment to these VMs.
>>>>>> The initial rationale for disabling device assignment was due to
>>>>>> stale
>>>>>> IOMMU mappings (see Problem section) and the assumption that TEE I/O
>>>>>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-
>>>>>> assignment
>>>>>> problem for confidential guests [1]. However, this assumption has
>>>>>> proven
>>>>>> to be incorrect. TEE I/O relies on the ability to operate devices
>>>>>> against
>>>>>> "shared" or untrusted memory, which is crucial for device
>>>>>> initialization
>>>>>> and error recovery scenarios. As a result, the current implementation
>>>>>> does
>>>>>> not adequately support device assignment for confidential guests,
>>>>>> necessitating
>>>>>> a reevaluation of the approach to ensure compatibility and
>>>>>> functionality.
>>>>>>
>>>>>> This series enables shared device assignment by notifying VFIO of
>>>>>> page
>>>>>> conversions using an existing framework named RamDiscardListener.
>>>>>> Additionally, there is an ongoing patch set [2] that aims to add 1G
>>>>>> page
>>>>>> support for guest_memfd. This patch set introduces in-place page
>>>>>> conversion,
>>>>>> where private and shared memory share the same physical pages as the
>>>>>> backend.
>>>>>> This development may impact our solution.
>>>>>>
>>>>>> We presented our solution in the guest_memfd meeting to discuss its
>>>>>> compatibility with the new changes and potential future directions
>>>>>> (see [3]
>>>>>> for more details). The conclusion was that, although our solution may
>>>>>> not be
>>>>>> the most elegant (see the Limitation section), it is sufficient for
>>>>>> now and
>>>>>> can be easily adapted to future changes.
>>>>>>
>>>>>> We are re-posting the patch series with some cleanup and have removed
>>>>>> the RFC
>>>>>> label for the main enabling patches (1-6). The newly-added patch 7 is
>>>>>> still
>>>>>> marked as RFC as it tries to resolve some extension concerns
>>>>>> related to
>>>>>> RamDiscardManager for future usage.
>>>>>>
>>>>>> The overview of the patches:
>>>>>> - Patch 1: Export a helper to get intersection of a
>>>>>> MemoryRegionSection
>>>>>> with a given range.
>>>>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>>>>> RamDiscardManager, and notify the shared/private state change
>>>>>> during
>>>>>> conversion.
>>>>>> - Patch 7: Try to resolve a semantics concern related to
>>>>>> RamDiscardManager
>>>>>> i.e. RamDiscardManager is used to manage memory plug/unplug
>>>>>> state
>>>>>> instead of shared/private state. It would affect future users of
>>>>>> RamDiscardManger in confidential VMs. Attach it behind as a RFC
>>>>>> patch[4].
>>>>>>
>>>>>> Changes since last version:
>>>>>> - Add a patch to export some generic helper functions from virtio-mem
>>>>>> code.
>>>>>> - Change the bitmap in guest_memfd_manager from default shared to
>>>>>> default
>>>>>> private. This keeps alignment with virtio-mem that 1-setting in
>>>>>> bitmap
>>>>>> represents the populated state and may help to export more
>>>>>> generic
>>>>>> code
>>>>>> if necessary.
>>>>>> - Add the helpers to initialize/uninitialize the guest_memfd_manager
>>>>>> instance
>>>>>> to make it more clear.
>>>>>> - Add a patch to distinguish between the shared/private state change
>>>>>> and
>>>>>> the memory plug/unplug state change in RamDiscardManager.
>>>>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>>>>>> chenyi.qiang@intel.com/
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Background
>>>>>> ==========
>>>>>> Confidential VMs have two classes of memory: shared and private
>>>>>> memory.
>>>>>> Shared memory is accessible from the host/VMM while private memory is
>>>>>> not. Confidential VMs can decide which memory is shared/private and
>>>>>> convert memory between shared/private at runtime.
>>>>>>
>>>>>> "guest_memfd" is a new kind of fd whose primary goal is to serve
>>>>>> guest
>>>>>> private memory. The key differences between guest_memfd and normal
>>>>>> memfd
>>>>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner
>>>>>> VM and
>>>>>> cannot be mapped, read or written by userspace.
>>>>>
>>>>> The "cannot be mapped" seems to be not true soon anymore (if not
>>>>> already).
>>>>>
>>>>> https://lore.kernel.org/all/20240801090117.3841080-1-
>>>>> tabba@google.com/T/
>>>>
>>>> Exactly, allowing guest_memfd to do mmap is the direction. I mentioned
>>>> it below with in-place page conversion. Maybe I would move it here to
>>>> make it more clear.
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> In QEMU's implementation, shared memory is allocated with normal
>>>>>> methods
>>>>>> (e.g. mmap or fallocate) while private memory is allocated from
>>>>>> guest_memfd. When a VM performs memory conversions, QEMU frees pages
>>>>>> via
>>>>>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>>>>>> allocates new pages from the other side.
>>>>>>
>>>>
>>>> [...]
>>>>
>>>>>>
>>>>>> One limitation (also discussed in the guest_memfd meeting) is that
>>>>>> VFIO
>>>>>> expects the DMA mapping for a specific IOVA to be mapped and unmapped
>>>>>> with
>>>>>> the same granularity. The guest may perform partial conversions,
>>>>>> such as
>>>>>> converting a small region within a larger region. To prevent such
>>>>>> invalid
>>>>>> cases, all operations are performed with 4K granularity. The possible
>>>>>> solutions we can think of are either to enable VFIO to support
>>>>>> partial
>>>>>> unmap
>>>
>>> btw the old VFIO does not split mappings but iommufd seems to be capable
>>> of it - there is iopt_area_split(). What happens if you try unmapping a
>>> smaller chunk that does not exactly match any mapped chunk? thanks,
>>
>> iopt_cut_iova() happens in iommufd vfio_compat.c, which is to make
>> iommufd be compatible with old VFIO_TYPE1. IIUC, it happens with
>> disable_large_page=true. That means the large IOPTE is also disabled in
>> IOMMU. So it can do the split easily. See the comment in
>> iommufd_vfio_set_iommu().
>>
>> iommufd VFIO compatible mode is a transition from legacy VFIO to
>> iommufd. For the normal iommufd, it requires the iova/length must be a
>> superset of a previously mapped range. If not match, will return error.
>
>
> This is all true but this also means that "The former requires complex
> changes in VFIO" is not entirely true - some code is already there. Thanks,
Hmm, my statement is a little confusing. The bottleneck is that the
IOMMU driver doesn't support the large page split. So if we want to
enable large pages and want to do partial unmap, it requires complex changes.
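At the iommufd uAPI level the failing partial unmap would look roughly like
this (illustrative sketch only, not code from this series; the fd, ioas_id and
iova values are made up):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* a larger range (e.g. 2M) was mapped at map_iova with one IOMMU_IOAS_MAP;
 * unmapping a single 4K chunk inside it is expected to fail because the
 * iova/length does not cover a whole previously mapped range */
static int try_partial_unmap(int iommufd, uint32_t ioas_id, uint64_t map_iova)
{
    struct iommu_ioas_unmap unmap = {
        .size = sizeof(unmap),
        .ioas_id = ioas_id,
        .iova = map_iova + 4096,
        .length = 4096,
    };

    return ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);
}

That is why the series sticks to 4K-granularity mappings for now.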
>
>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation
2025-01-09 5:34 ` Chenyi Qiang
@ 2025-01-09 9:32 ` Alexey Kardashevskiy
2025-01-10 5:13 ` Chenyi Qiang
0 siblings, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-09 9:32 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 9/1/25 16:34, Chenyi Qiang wrote:
>
>
> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>> Introduce the realize()/unrealize() callbacks to initialize/uninitialize
>>> the new guest_memfd_manager object and register/unregister it in the
>>> target MemoryRegion.
>>>
>>> Guest_memfd was initially set to shared until the commit bd3bcf6962
>>> ("kvm/memory: Make memory type private by default if it has guest memfd
>>> backend"). To align with this change, the default state in
>>> guest_memfd_manager is set to private. (The bitmap is cleared to 0).
>>> Additionally, setting the default to private can also reduce the
>>> overhead of mapping shared pages into IOMMU by VFIO during the bootup
>>> stage.
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> ---
>>> include/sysemu/guest-memfd-manager.h | 27 +++++++++++++++++++++++++++
>>> system/guest-memfd-manager.c | 28 +++++++++++++++++++++++++++-
>>> system/physmem.c | 7 +++++++
>>> 3 files changed, 61 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>>> guest-memfd-manager.h
>>> index 9dc4e0346d..d1e7f698e8 100644
>>> --- a/include/sysemu/guest-memfd-manager.h
>>> +++ b/include/sysemu/guest-memfd-manager.h
>>> @@ -42,6 +42,8 @@ struct GuestMemfdManager {
>>> struct GuestMemfdManagerClass {
>>> ObjectClass parent_class;
>>> + void (*realize)(GuestMemfdManager *gmm, MemoryRegion *mr,
>>> uint64_t region_size);
>>> + void (*unrealize)(GuestMemfdManager *gmm);
>>> int (*state_change)(GuestMemfdManager *gmm, uint64_t offset,
>>> uint64_t size,
>>> bool shared_to_private);
>>> };
>>> @@ -61,4 +63,29 @@ static inline int
>>> guest_memfd_manager_state_change(GuestMemfdManager *gmm, uint6
>>> return 0;
>>> }
>>> +static inline void guest_memfd_manager_realize(GuestMemfdManager *gmm,
>>> + MemoryRegion *mr,
>>> uint64_t region_size)
>>> +{
>>> + GuestMemfdManagerClass *klass;
>>> +
>>> + g_assert(gmm);
>>> + klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
>>> +
>>> + if (klass->realize) {
>>> + klass->realize(gmm, mr, region_size);
>>
>> Ditch realize() hook and call guest_memfd_manager_realizefn() directly?
>> Not clear why these new hooks are needed.
>
>>
>>> + }
>>> +}
>>> +
>>> +static inline void guest_memfd_manager_unrealize(GuestMemfdManager *gmm)
>>> +{
>>> + GuestMemfdManagerClass *klass;
>>> +
>>> + g_assert(gmm);
>>> + klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
>>> +
>>> + if (klass->unrealize) {
>>> + klass->unrealize(gmm);
>>> + }
>>> +}
>>
>> guest_memfd_manager_unrealizefn()?
>
> Agree. Adding these wrappers seem unnecessary.
>
>>
>>
>>> +
>>> #endif
>>> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
>>> index 6601df5f3f..b6a32f0bfb 100644
>>> --- a/system/guest-memfd-manager.c
>>> +++ b/system/guest-memfd-manager.c
>>> @@ -366,6 +366,31 @@ static int
>>> guest_memfd_state_change(GuestMemfdManager *gmm, uint64_t offset,
>>> return ret;
>>> }
>>> +static void guest_memfd_manager_realizefn(GuestMemfdManager *gmm,
>>> MemoryRegion *mr,
>>> + uint64_t region_size)
>>> +{
>>> + uint64_t bitmap_size;
>>> +
>>> + gmm->block_size = qemu_real_host_page_size();
>>> + bitmap_size = ROUND_UP(region_size, gmm->block_size) / gmm-
>>>> block_size;
>>
>> imho unaligned region_size should be an assert.
>
> There's no guarantee the region_size of the MemoryRegion is PAGE_SIZE
> aligned. So the ROUND_UP() is more appropriate.
It is all about DMA so the smallest you can map is PAGE_SIZE so even if
you round up here, it is likely going to fail to DMA-map later anyway
(or not?).
>>> +
>>> + gmm->mr = mr;
>>> + gmm->bitmap_size = bitmap_size;
>>> + gmm->bitmap = bitmap_new(bitmap_size);
>>> +
>>> + memory_region_set_ram_discard_manager(gmm->mr,
>>> RAM_DISCARD_MANAGER(gmm));
>>> +}
>>
>> This belongs to 2/7.
>>
>>> +
>>> +static void guest_memfd_manager_unrealizefn(GuestMemfdManager *gmm)
>>> +{
>>> + memory_region_set_ram_discard_manager(gmm->mr, NULL);
>>> +
>>> + g_free(gmm->bitmap);
>>> + gmm->bitmap = NULL;
>>> + gmm->bitmap_size = 0;
>>> + gmm->mr = NULL;
>>
>> @gmm is being destroyed here, why bother zeroing?
>
> OK, will remove it.
>
>>
>>> +}
>>> +
>>
>> This function belongs to 2/7.
>
> Will move both realizefn() and unrealizefn().
Yes.
>>
>>> static void guest_memfd_manager_init(Object *obj)
>>> {
>>> GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>> @@ -375,7 +400,6 @@ static void guest_memfd_manager_init(Object *obj)
>>> static void guest_memfd_manager_finalize(Object *obj)
>>> {
>>> - g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>> }
>>> static void guest_memfd_manager_class_init(ObjectClass *oc, void
>>> *data)
>>> @@ -384,6 +408,8 @@ static void
>>> guest_memfd_manager_class_init(ObjectClass *oc, void *data)
>>> RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>> gmmc->state_change = guest_memfd_state_change;
>>> + gmmc->realize = guest_memfd_manager_realizefn;
>>> + gmmc->unrealize = guest_memfd_manager_unrealizefn;
>>> rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>> rdmc->register_listener = guest_memfd_rdm_register_listener;
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index dc1db3a384..532182a6dd 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -53,6 +53,7 @@
>>> #include "sysemu/hostmem.h"
>>> #include "sysemu/hw_accel.h"
>>> #include "sysemu/xen-mapcache.h"
>>> +#include "sysemu/guest-memfd-manager.h"
>>> #include "trace.h"
>>> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>> @@ -1885,6 +1886,9 @@ static void ram_block_add(RAMBlock *new_block,
>>> Error **errp)
>>> qemu_mutex_unlock_ramlist();
>>> goto out_free;
>>> }
>>> +
>>> + GuestMemfdManager *gmm =
>>> GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
>>> + guest_memfd_manager_realize(gmm, new_block->mr, new_block-
>>>> mr->size);
>>
>> Wow. Quite invasive.
>
> Yeah... It creates a manager object no matter whether the user wants to
> use shared passthru or not. We assume some fields like private/shared
> bitmap may also be helpful in other scenario for future usage, and if no
> passthru device, the listener would just return, so it is acceptable.
Explain these other scenarios in the commit log please as otherwise
making this an interface of HostMemoryBackendMemfd looks way cleaner.
Thanks,
>>
>>> }
>>> ram_size = (new_block->offset + new_block->max_length) >>
>>> TARGET_PAGE_BITS;
>>> @@ -2139,6 +2143,9 @@ static void reclaim_ramblock(RAMBlock *block)
>>> if (block->guest_memfd >= 0) {
>>> close(block->guest_memfd);
>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(block->mr->rdm);
>>> + guest_memfd_manager_unrealize(gmm);
>>> + object_unref(OBJECT(gmm));
>>
>> Likely don't matter but I'd do the cleanup before close() or do block-
>>> guest_memfd=-1 before the cleanup. Thanks,
>>
>>
>>> ram_block_discard_require(false);
>>> }
>>>
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-10 6:38 ` Chenyi Qiang
@ 2025-01-09 21:00 ` Xu Yilun
2025-01-09 21:50 ` Xu Yilun
2025-01-15 4:06 ` Alexey Kardashevskiy
1 sibling, 1 reply; 98+ messages in thread
From: Xu Yilun @ 2025-01-09 21:00 UTC (permalink / raw)
To: Chenyi Qiang
Cc: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
> >
> > https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
> >
> > but I am not sure if this ever saw the light of the day, did not it?
> > (ironically I am using it as a base for encrypted DMA :) )
>
> Yeah, we are doing the same work. I saw a solution from Michael long
> time ago (when there was still
> a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
> (https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
>
> For your patch, it only implements the interface for
> HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
> the parent object HostMemoryBackend, because besides the
> MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
> MEMORY_BACKEND_FILE can also be guest_memfd-backed.
>
> Think more about where to implement this interface. It is still
> uncertain to me. As I mentioned in another mail, maybe ram device memory
> region would be backed by guest_memfd if we support TEE IO iommufd MMIO
It is unlikely an assigned MMIO region would be backed by guest_memfd or be
implemented as part of HostMemoryBackend. Nowadays the assigned MMIO resource
is owned by VFIO types, and I assume that is still true for private MMIO.
But I think with TIO, MMIO regions also need conversion. So I support an
object, but maybe not guest_memfd_manager.
Thanks,
Yilun
> in future. Then a specific object is more appropriate. What's your opinion?
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-09 21:00 ` Xu Yilun
@ 2025-01-09 21:50 ` Xu Yilun
2025-01-13 3:34 ` Chenyi Qiang
0 siblings, 1 reply; 98+ messages in thread
From: Xu Yilun @ 2025-01-09 21:50 UTC (permalink / raw)
To: Chenyi Qiang
Cc: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Fri, Jan 10, 2025 at 05:00:22AM +0800, Xu Yilun wrote:
> > >
> > > https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
> > >
> > > but I am not sure if this ever saw the light of the day, did not it?
> > > (ironically I am using it as a base for encrypted DMA :) )
> >
> > Yeah, we are doing the same work. I saw a solution from Michael long
> > time ago (when there was still
> > a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
> > (https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
> >
> > For your patch, it only implements the interface for
> > HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
> > the parent object HostMemoryBackend, because besides the
> > MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
> > MEMORY_BACKEND_FILE can also be guest_memfd-backed.
> >
> > Think more about where to implement this interface. It is still
> > uncertain to me. As I mentioned in another mail, maybe ram device memory
> > region would be backed by guest_memfd if we support TEE IO iommufd MMIO
>
> It is unlikely an assigned MMIO region would be backed by guest_memfd or be
> implemented as part of HostMemoryBackend. Nowadays assigned MMIO resource is
> owned by VFIO types, and I assume it is still true for private MMIO.
>
> But I think with TIO, MMIO regions also need conversion. So I support an
> object, but maybe not guest_memfd_manager.
Sorry, I mean the name only covers private memory, but not private MMIO.
>
> Thanks,
> Yilun
>
> > in future. Then a specific object is more appropriate. What's your opinion?
> >
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-09 4:29 ` Chenyi Qiang
@ 2025-01-10 0:58 ` Alexey Kardashevskiy
2025-01-10 6:38 ` Chenyi Qiang
2025-01-14 6:45 ` Chenyi Qiang
1 sibling, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-10 0:58 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 9/1/25 15:29, Chenyi Qiang wrote:
>
>
> On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 9/1/25 13:11, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>>>> operation to perform page conversion between private and shared
>>>>>>> memory.
>>>>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>>>> pages). Blocking shared page discard can solve this problem, but it
>>>>>>> could cause guests to consume twice the memory with VFIO, which is
>>>>>>> not
>>>>>>> acceptable in some cases. An alternative solution is to convey other
>>>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>>>
>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>> adjust
>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>> adding it
>>>>>>> back in the other, so the similar work that needs to happen in
>>>>>>> response
>>>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>>
>>>>>>> However, guest_memfd is not an object so it cannot directly implement
>>>>>>> the RamDiscardManager interface.
>>>>>>>
>>>>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>>>>
>>>>>> This sounds about right.
btw I have been using this for ages:
https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
but I am not sure if this ever saw the light of day, did it?
(ironically I am using it as a base for encrypted DMA :) )
>>>>>>
>>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>>> target
>>>>>>> MemoryRegion. However, this solution doesn't cover the scenario
>>>>>>> where a
>>>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
>>>>>>> e.g.
>>>>>>> the virtual BIOS MemoryRegion.
>>>>>>
>>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>>
>>>>> virtual BIOS shows in a separate region:
>>>>>
>>>>> Root memory region: system
>>>>> 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>> ...
>>>>> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>>
>>>> Looks like a normal MR which can be backed by guest_memfd.
>>>
>>> Yes, virtual BIOS memory region is initialized by
>>> memory_region_init_ram_guest_memfd() which will be backed by a
>>> guest_memfd.
>>>
>>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
>>> BIOS image will be loaded and then copied to private region.
>>> After that,
>>> the loaded image will be discarded and this region become useless.
>>
>> I'd think it is loaded as "struct Rom" and then copied to the MR-
>> ram_guest_memfd() which does not leave MR useless - we still see
>> "pc.bios" in the list so it is not discarded. What piece of code are you
>> referring to exactly?
>
> Sorry for confusion, maybe it is different between TDX and SEV-SNP for
> the vBIOS handling.
>
> In x86_bios_rom_init(), it initializes a guest_memfd-backed MR and loads
> the vBIOS image to the shared part of the guest_memfd MR.
> For TDX, it
> will copy the image to private region (not the vBIOS guest_memfd MR
> private part) and discard the shared part. So, although the memory
> region still exists, it seems useless.
> It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
> vBIOS guest_memfd private memory?
This is what it looks like on my SNP VM (which, I suspect, is the same
as yours as hw/i386/pc.c does not distinguish Intel/AMD for this matter):
Root memory region: system
0000000000000000-00000000000bffff (prio 0, ram): ram1 KVM gmemfd=20
00000000000c0000-00000000000dffff (prio 1, ram): pc.rom KVM gmemfd=27
00000000000e0000-000000001fffffff (prio 0, ram): ram1
@00000000000e0000 KVM gmemfd=20
...
00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM gmemfd=26
So the pc.bios MR exists and is in use (hence its appearance in "info mtree
-f").
I added the gmemfd dumping:
--- a/system/memory.c
+++ b/system/memory.c
@@ -3446,6 +3446,9 @@ static void mtree_print_flatview(gpointer key,
gpointer value,
}
}
}
+ if (mr->ram_block && mr->ram_block->guest_memfd >= 0) {
+ qemu_printf(" gmemfd=%d", mr->ram_block->guest_memfd);
+ }
>>
>>
>>> So I
>>> feel like this virtual BIOS should not be backed by guest_memfd?
>>
>> From the above it sounds like the opposite, i.e. it should :)
>>
>>>>
>>>>> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>>>> @0000000080000000 KVM
>>>>
>>>> Anyway if there is no guest_memfd backing it and
>>>> memory_region_has_ram_discard_manager() returns false, then the MR is
>>>> just going to be mapped for VFIO as usual which seems... alright, right?
>>>
>>> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
>>> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>>>
>>> If we go with the HostMemoryBackend instead of guest_memfd_manager, this
>>> MR would be mapped by VFIO. Maybe need to avoid such vBIOS mapping, or
>>> just ignore it since the MR is useless (but looks not so good).
>>
>> Sorry I am missing necessary details here, let's figure out the above.
>>
>>>
>>>>
>>>>
>>>>> We also consider to implement the interface in HostMemoryBackend, but
>>>>> maybe implement with guest_memfd region is more general. We don't know
>>>>> if any DMAable memory would belong to HostMemoryBackend although at
>>>>> present it is.
>>>>>
>>>>> If it is more appropriate to implement it with HostMemoryBackend, I can
>>>>> change to this way.
>>>>
>>>> Seems cleaner imho.
>>>
>>> I can go this way.
>
> [...]
>
>>>>>>> +
>>>>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>>>>>>> *rdm,
>>>>>>> + MemoryRegionSection
>>>>>>> *section,
>>>>>>> + ReplayRamPopulate
>>>>>>> replay_fn,
>>>>>>> + void *opaque)
>>>>>>> +{
>>>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>>> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>>>>> opaque };
>>>>>>> +
>>>>>>> + g_assert(section->mr == gmm->mr);
>>>>>>> + return guest_memfd_for_each_populated_section(gmm, section,
>>>>>>> &data,
>>>>>>> +
>>>>>>> guest_memfd_rdm_replay_populated_cb);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>>>>> *section, void *arg)
>>>>>>> +{
>>>>>>> + struct GuestMemfdReplayData *data = arg;
>>>>>>> + ReplayRamDiscard replay_fn = data->fn;
>>>>>>> +
>>>>>>> + replay_fn(section, data->opaque);
>>>>>>
>>>>>>
>>>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>>>
>>>>> It follows current definiton of ReplayRamDiscard() and
>>>>> ReplayRamPopulate() where replay_discard() doesn't return errors and
>>>>> replay_populate() returns errors.
>>>>
>>>> A trace would be appropriate imho. Thanks,
>>>
>>> Sorry, can't catch you. What kind of info to be traced? The errors
>>> returned by replay_populate()?
>>
>> Yeah. imho these are useful as we expect this part to work in general
>> too, right? Thanks,
>
> Something like?
>
> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
> index 6b3e1ee9d6..4440ac9e59 100644
> --- a/system/guest-memfd-manager.c
> +++ b/system/guest-memfd-manager.c
> @@ -185,8 +185,14 @@ static int
> guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, voi
> {
> struct GuestMemfdReplayData *data = arg;
> ReplayRamPopulate replay_fn = data->fn;
> + int ret;
>
> - return replay_fn(section, data->opaque);
> + ret = replay_fn(section, data->opaque);
> + if (ret) {
> + trace_guest_memfd_rdm_replay_populated_cb(ret);
> + }
> +
> + return ret;
> }
>
> How about just adding some error output in
> guest_memfd_for_each_populated_section()/guest_memfd_for_each_discarded_section()
> if the cb() (i.e. replay_populate()) returns error?
this will do too, yes. Thanks,
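E.g. (just a sketch of the loop body in those helpers; the local names "tmp"
and "arg" are made up here, and error_report() vs. a trace event is up to you):

    ret = cb(&tmp, arg);    /* tmp is the intersected MemoryRegionSection */
    if (ret) {
        error_report("%s: callback failed at offset 0x%" PRIx64 " size 0x%" PRIx64 ": %d",
                     __func__, (uint64_t)tmp.offset_within_region,
                     int128_get64(tmp.size), ret);
        break;
    }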
>
>>
>>>
>>>>
>>>>>>
>>>>>>> +
>>>>>>> + return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>>>>>>> *rdm,
>>>>>>> + MemoryRegionSection
>>>>>>> *section,
>>>>>>> + ReplayRamDiscard
>>>>>>> replay_fn,
>>>>>>> + void *opaque)
>>>>>>> +{
>>>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>>> + struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>>>>> opaque };
>>>>>>> +
>>>>>>> + g_assert(section->mr == gmm->mr);
>>>>>>> + guest_memfd_for_each_discarded_section(gmm, section, &data,
>>>>>>> +
>>>>>>> guest_memfd_rdm_replay_discarded_cb);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void guest_memfd_manager_init(Object *obj)
>>>>>>> +{
>>>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>>>>> +
>>>>>>> + QLIST_INIT(&gmm->rdl_list);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void guest_memfd_manager_finalize(Object *obj)
>>>>>>> +{
>>>>>>> + g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>>>>
>>>>>>
>>>>>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
>>>>>
>>>>> Will remove it. Thanks.
>>>>>
>>>>>>
>>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void
>>>>>>> *data)
>>>>>>> +{
>>>>>>> + RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>>>>> +
>>>>>>> + rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>>>>>> + rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>>>>> + rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>>>>>> + rdmc->is_populated = guest_memfd_rdm_is_populated;
>>>>>>> + rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>>>>>> + rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>>>>>> +}
>>>>>>> diff --git a/system/meson.build b/system/meson.build
>>>>>>> index 4952f4b2c7..ed4e1137bd 100644
>>>>>>> --- a/system/meson.build
>>>>>>> +++ b/system/meson.build
>>>>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>>>>> 'dirtylimit.c',
>>>>>>> 'dma-helpers.c',
>>>>>>> 'globals.c',
>>>>>>> + 'guest-memfd-manager.c',
>>>>>>> 'memory_mapping.c',
>>>>>>> 'qdev-monitor.c',
>>>>>>> 'qtest.c',
>>>>>>
>>>>>
>>>>
>>>
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-09 8:49 ` Chenyi Qiang
@ 2025-01-10 1:42 ` Alexey Kardashevskiy
2025-01-10 7:06 ` Chenyi Qiang
0 siblings, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-10 1:42 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 9/1/25 19:49, Chenyi Qiang wrote:
>
>
> On 1/9/2025 4:18 PM, Alexey Kardashevskiy wrote:
>>
>>
>> On 9/1/25 18:52, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 8/1/25 17:28, Chenyi Qiang wrote:
>>>>> Thanks Alexey for your review!
>>>>>
>>>>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>>>>>> discard") effectively disables device assignment when using
>>>>>>> guest_memfd.
>>>>>>> This poses a significant challenge as guest_memfd is essential for
>>>>>>> confidential guests, thereby blocking device assignment to these VMs.
>>>>>>> The initial rationale for disabling device assignment was due to
>>>>>>> stale
>>>>>>> IOMMU mappings (see Problem section) and the assumption that TEE I/O
>>>>>>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-
>>>>>>> assignment
>>>>>>> problem for confidential guests [1]. However, this assumption has
>>>>>>> proven
>>>>>>> to be incorrect. TEE I/O relies on the ability to operate devices
>>>>>>> against
>>>>>>> "shared" or untrusted memory, which is crucial for device
>>>>>>> initialization
>>>>>>> and error recovery scenarios. As a result, the current implementation
>>>>>>> does
>>>>>>> not adequately support device assignment for confidential guests,
>>>>>>> necessitating
>>>>>>> a reevaluation of the approach to ensure compatibility and
>>>>>>> functionality.
>>>>>>>
>>>>>>> This series enables shared device assignment by notifying VFIO of
>>>>>>> page
>>>>>>> conversions using an existing framework named RamDiscardListener.
>>>>>>> Additionally, there is an ongoing patch set [2] that aims to add 1G
>>>>>>> page
>>>>>>> support for guest_memfd. This patch set introduces in-place page
>>>>>>> conversion,
>>>>>>> where private and shared memory share the same physical pages as the
>>>>>>> backend.
>>>>>>> This development may impact our solution.
>>>>>>>
>>>>>>> We presented our solution in the guest_memfd meeting to discuss its
>>>>>>> compatibility with the new changes and potential future directions
>>>>>>> (see [3]
>>>>>>> for more details). The conclusion was that, although our solution may
>>>>>>> not be
>>>>>>> the most elegant (see the Limitation section), it is sufficient for
>>>>>>> now and
>>>>>>> can be easily adapted to future changes.
>>>>>>>
>>>>>>> We are re-posting the patch series with some cleanup and have removed
>>>>>>> the RFC
>>>>>>> label for the main enabling patches (1-6). The newly-added patch 7 is
>>>>>>> still
>>>>>>> marked as RFC as it tries to resolve some extension concerns
>>>>>>> related to
>>>>>>> RamDiscardManager for future usage.
>>>>>>>
>>>>>>> The overview of the patches:
>>>>>>> - Patch 1: Export a helper to get intersection of a
>>>>>>> MemoryRegionSection
>>>>>>> with a given range.
>>>>>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>>>>>> RamDiscardManager, and notify the shared/private state change
>>>>>>> during
>>>>>>> conversion.
>>>>>>> - Patch 7: Try to resolve a semantics concern related to
>>>>>>> RamDiscardManager
>>>>>>> i.e. RamDiscardManager is used to manage memory plug/unplug
>>>>>>> state
>>>>>>> instead of shared/private state. It would affect future users of
>>>>>>> RamDiscardManger in confidential VMs. Attach it behind as a RFC
>>>>>>> patch[4].
>>>>>>>
>>>>>>> Changes since last version:
>>>>>>> - Add a patch to export some generic helper functions from virtio-mem
>>>>>>> code.
>>>>>>> - Change the bitmap in guest_memfd_manager from default shared to
>>>>>>> default
>>>>>>> private. This keeps alignment with virtio-mem that 1-setting in
>>>>>>> bitmap
>>>>>>> represents the populated state and may help to export more
>>>>>>> generic
>>>>>>> code
>>>>>>> if necessary.
>>>>>>> - Add the helpers to initialize/uninitialize the guest_memfd_manager
>>>>>>> instance
>>>>>>> to make it more clear.
>>>>>>> - Add a patch to distinguish between the shared/private state change
>>>>>>> and
>>>>>>> the memory plug/unplug state change in RamDiscardManager.
>>>>>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>>>>>>> chenyi.qiang@intel.com/
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> Background
>>>>>>> ==========
>>>>>>> Confidential VMs have two classes of memory: shared and private
>>>>>>> memory.
>>>>>>> Shared memory is accessible from the host/VMM while private memory is
>>>>>>> not. Confidential VMs can decide which memory is shared/private and
>>>>>>> convert memory between shared/private at runtime.
>>>>>>>
>>>>>>> "guest_memfd" is a new kind of fd whose primary goal is to serve
>>>>>>> guest
>>>>>>> private memory. The key differences between guest_memfd and normal
>>>>>>> memfd
>>>>>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner
>>>>>>> VM and
>>>>>>> cannot be mapped, read or written by userspace.
>>>>>>
>>>>>> The "cannot be mapped" seems to be not true soon anymore (if not
>>>>>> already).
>>>>>>
>>>>>> https://lore.kernel.org/all/20240801090117.3841080-1-
>>>>>> tabba@google.com/T/
>>>>>
>>>>> Exactly, allowing guest_memfd to do mmap is the direction. I mentioned
>>>>> it below with in-place page conversion. Maybe I would move it here to
>>>>> make it more clear.
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> In QEMU's implementation, shared memory is allocated with normal
>>>>>>> methods
>>>>>>> (e.g. mmap or fallocate) while private memory is allocated from
>>>>>>> guest_memfd. When a VM performs memory conversions, QEMU frees pages
>>>>>>> via
>>>>>>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>>>>>>> allocates new pages from the other side.
>>>>>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>>>
>>>>>>> One limitation (also discussed in the guest_memfd meeting) is that
>>>>>>> VFIO
>>>>>>> expects the DMA mapping for a specific IOVA to be mapped and unmapped
>>>>>>> with
>>>>>>> the same granularity. The guest may perform partial conversions,
>>>>>>> such as
>>>>>>> converting a small region within a larger region. To prevent such
>>>>>>> invalid
>>>>>>> cases, all operations are performed with 4K granularity. The possible
>>>>>>> solutions we can think of are either to enable VFIO to support
>>>>>>> partial
>>>>>>> unmap
>>>>
>>>> btw the old VFIO does not split mappings but iommufd seems to be capable
>>>> of it - there is iopt_area_split(). What happens if you try unmapping a
>>>> smaller chunk that does not exactly match any mapped chunk? thanks,
>>>
>>> iopt_cut_iova() happens in iommufd vfio_compat.c, which is to make
>>> iommufd be compatible with old VFIO_TYPE1. IIUC, it happens with
>>> disable_large_page=true. That means the large IOPTE is also disabled in
>>> IOMMU. So it can do the split easily. See the comment in
>>> iommufd_vfio_set_iommu().
>>>
>>> iommufd VFIO compatible mode is a transition from legacy VFIO to
>>> iommufd. For the normal iommufd, it requires the iova/length must be a
>>> superset of a previously mapped range. If not match, will return error.
>>
>>
>> This is all true but this also means that "The former requires complex
>> changes in VFIO" is not entirely true - some code is already there. Thanks,
>
> Hmm, my statement is a little confusing. The bottleneck is that the
> IOMMU driver doesn't support the large page split. So if we want to
> enable large pages and want to do partial unmap, it requires complex changes.
We won't need to split large pages (if we stick to 4K for now); we need
to split large mappings (not large pages) to allow partial unmapping, and
iopt_area_split() seems to be doing this. Thanks,
>
>>
>>
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation
2025-01-09 9:32 ` Alexey Kardashevskiy
@ 2025-01-10 5:13 ` Chenyi Qiang
[not found] ` <59bd0e82-f269-4567-8f75-a32c9c997ca9@redhat.com>
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-10 5:13 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/9/2025 5:32 PM, Alexey Kardashevskiy wrote:
>
>
> On 9/1/25 16:34, Chenyi Qiang wrote:
>>
>>
>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>> Introduce the realize()/unrealize() callbacks to initialize/
>>>> uninitialize
>>>> the new guest_memfd_manager object and register/unregister it in the
>>>> target MemoryRegion.
>>>>
>>>> Guest_memfd was initially set to shared until the commit bd3bcf6962
>>>> ("kvm/memory: Make memory type private by default if it has guest memfd
>>>> backend"). To align with this change, the default state in
>>>> guest_memfd_manager is set to private. (The bitmap is cleared to 0).
>>>> Additionally, setting the default to private can also reduce the
>>>> overhead of mapping shared pages into IOMMU by VFIO during the bootup
>>>> stage.
>>>>
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>> ---
>>>> include/sysemu/guest-memfd-manager.h | 27 +++++++++++++++++++++++
>>>> ++++
>>>> system/guest-memfd-manager.c | 28 +++++++++++++++++++++++
>>>> ++++-
>>>> system/physmem.c | 7 +++++++
>>>> 3 files changed, 61 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>>>> guest-memfd-manager.h
>>>> index 9dc4e0346d..d1e7f698e8 100644
>>>> --- a/include/sysemu/guest-memfd-manager.h
>>>> +++ b/include/sysemu/guest-memfd-manager.h
>>>> @@ -42,6 +42,8 @@ struct GuestMemfdManager {
>>>> struct GuestMemfdManagerClass {
>>>> ObjectClass parent_class;
>>>> + void (*realize)(GuestMemfdManager *gmm, MemoryRegion *mr,
>>>> uint64_t region_size);
>>>> + void (*unrealize)(GuestMemfdManager *gmm);
>>>> int (*state_change)(GuestMemfdManager *gmm, uint64_t offset,
>>>> uint64_t size,
>>>> bool shared_to_private);
>>>> };
>>>> @@ -61,4 +63,29 @@ static inline int
>>>> guest_memfd_manager_state_change(GuestMemfdManager *gmm, uint6
>>>> return 0;
>>>> }
>>>> +static inline void guest_memfd_manager_realize(GuestMemfdManager
>>>> *gmm,
>>>> + MemoryRegion *mr,
>>>> uint64_t region_size)
>>>> +{
>>>> + GuestMemfdManagerClass *klass;
>>>> +
>>>> + g_assert(gmm);
>>>> + klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
>>>> +
>>>> + if (klass->realize) {
>>>> + klass->realize(gmm, mr, region_size);
>>>
>>> Ditch realize() hook and call guest_memfd_manager_realizefn() directly?
>>> Not clear why these new hooks are needed.
>>
>>>
>>>> + }
>>>> +}
>>>> +
>>>> +static inline void guest_memfd_manager_unrealize(GuestMemfdManager
>>>> *gmm)
>>>> +{
>>>> + GuestMemfdManagerClass *klass;
>>>> +
>>>> + g_assert(gmm);
>>>> + klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
>>>> +
>>>> + if (klass->unrealize) {
>>>> + klass->unrealize(gmm);
>>>> + }
>>>> +}
>>>
>>> guest_memfd_manager_unrealizefn()?
>>
>> Agree. Adding these wrappers seem unnecessary.
>>
>>>
>>>
>>>> +
>>>> #endif
>>>> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-
>>>> manager.c
>>>> index 6601df5f3f..b6a32f0bfb 100644
>>>> --- a/system/guest-memfd-manager.c
>>>> +++ b/system/guest-memfd-manager.c
>>>> @@ -366,6 +366,31 @@ static int
>>>> guest_memfd_state_change(GuestMemfdManager *gmm, uint64_t offset,
>>>> return ret;
>>>> }
>>>> +static void guest_memfd_manager_realizefn(GuestMemfdManager *gmm,
>>>> MemoryRegion *mr,
>>>> + uint64_t region_size)
>>>> +{
>>>> + uint64_t bitmap_size;
>>>> +
>>>> + gmm->block_size = qemu_real_host_page_size();
>>>> + bitmap_size = ROUND_UP(region_size, gmm->block_size) / gmm-
>>>>> block_size;
>>>
>>> imho unaligned region_size should be an assert.
>>
>> There's no guarantee the region_size of the MemoryRegion is PAGE_SIZE
>> aligned. So the ROUND_UP() is more appropriate.
>
> It is all about DMA so the smallest you can map is PAGE_SIZE so even if
> you round up here, it is likely going to fail to DMA-map later anyway
> (or not?).
I checked the handling in VFIO: if the size is less than PAGE_SIZE, it
will just return and won't do the DMA map.
Here it is a different thing: we are calculating the bitmap_size. The
bitmap is used to track the private/shared status of the pages, so if the
size is less than PAGE_SIZE, we still use one bit to track this
small-sized range.
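For example (made-up numbers, with a 4K host page size):

    block_size  = 4096;
    bitmap_size = ROUND_UP(0x24800, block_size) / block_size;   /* = 37 */

i.e. the trailing 0x800 bytes still get their own bit so that their
shared/private state is tracked, even though VFIO would skip DMA-mapping such
a sub-page chunk.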
>
>
>>>> +
>>>> + gmm->mr = mr;
>>>> + gmm->bitmap_size = bitmap_size;
>>>> + gmm->bitmap = bitmap_new(bitmap_size);
>>>> +
>>>> + memory_region_set_ram_discard_manager(gmm->mr,
>>>> RAM_DISCARD_MANAGER(gmm));
>>>> +}
>>>
>>> This belongs to 2/7.
>>>
>>>> +
>>>> +static void guest_memfd_manager_unrealizefn(GuestMemfdManager *gmm)
>>>> +{
>>>> + memory_region_set_ram_discard_manager(gmm->mr, NULL);
>>>> +
>>>> + g_free(gmm->bitmap);
>>>> + gmm->bitmap = NULL;
>>>> + gmm->bitmap_size = 0;
>>>> + gmm->mr = NULL;
>>>
>>> @gmm is being destroyed here, why bother zeroing?
>>
>> OK, will remove it.
>>
>>>
>>>> +}
>>>> +
>>>
>>> This function belongs to 2/7.
>>
>> Will move both realizefn() and unrealizefn().
>
> Yes.
>
>
>>>
>>>> static void guest_memfd_manager_init(Object *obj)
>>>> {
>>>> GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>> @@ -375,7 +400,6 @@ static void guest_memfd_manager_init(Object *obj)
>>>> static void guest_memfd_manager_finalize(Object *obj)
>>>> {
>>>> - g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>> }
>>>> static void guest_memfd_manager_class_init(ObjectClass *oc, void
>>>> *data)
>>>> @@ -384,6 +408,8 @@ static void
>>>> guest_memfd_manager_class_init(ObjectClass *oc, void *data)
>>>> RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>> gmmc->state_change = guest_memfd_state_change;
>>>> + gmmc->realize = guest_memfd_manager_realizefn;
>>>> + gmmc->unrealize = guest_memfd_manager_unrealizefn;
>>>> rdmc->get_min_granularity =
>>>> guest_memfd_rdm_get_min_granularity;
>>>> rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>> index dc1db3a384..532182a6dd 100644
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -53,6 +53,7 @@
>>>> #include "sysemu/hostmem.h"
>>>> #include "sysemu/hw_accel.h"
>>>> #include "sysemu/xen-mapcache.h"
>>>> +#include "sysemu/guest-memfd-manager.h"
>>>> #include "trace.h"
>>>> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>>> @@ -1885,6 +1886,9 @@ static void ram_block_add(RAMBlock *new_block,
>>>> Error **errp)
>>>> qemu_mutex_unlock_ramlist();
>>>> goto out_free;
>>>> }
>>>> +
>>>> + GuestMemfdManager *gmm =
>>>> GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
>>>> + guest_memfd_manager_realize(gmm, new_block->mr, new_block-
>>>>> mr->size);
>>>
>>> Wow. Quite invasive.
>>
>> Yeah... It creates a manager object no matter whether the user wants to
>> use shared passthru or not. We assume some fields like private/shared
>> bitmap may also be helpful in other scenario for future usage, and if no
>> passthru device, the listener would just return, so it is acceptable.
>
> Explain these other scenarios in the commit log please as otherwise
> making this an interface of HostMemoryBackendMemfd looks way cleaner.
> Thanks,
Thanks for the suggestion. For now, I think making this an interface
of HostMemoryBackend is cleaner. The potential future usage for a
non-HostMemoryBackend guest_memfd-backed memory region I can think of is
TEE I/O for iommufd P2P support, where a RAM device memory region would be
initialized with the shared/private attribute. But I think it would be a
long-term story and we are not sure what it will look like in the future.
>
>>>
>>>> }
>>>> ram_size = (new_block->offset + new_block->max_length) >>
>>>> TARGET_PAGE_BITS;
>>>> @@ -2139,6 +2143,9 @@ static void reclaim_ramblock(RAMBlock *block)
>>>> if (block->guest_memfd >= 0) {
>>>> close(block->guest_memfd);
>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(block->mr->rdm);
>>>> + guest_memfd_manager_unrealize(gmm);
>>>> + object_unref(OBJECT(gmm));
>>>
>>> Likely don't matter but I'd do the cleanup before close() or do block-
>>>> guest_memfd=-1 before the cleanup. Thanks,
>>>
>>>
>>>> ram_block_discard_require(false);
>>>> }
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-10 0:58 ` Alexey Kardashevskiy
@ 2025-01-10 6:38 ` Chenyi Qiang
2025-01-09 21:00 ` Xu Yilun
2025-01-15 4:06 ` Alexey Kardashevskiy
0 siblings, 2 replies; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-10 6:38 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/10/2025 8:58 AM, Alexey Kardashevskiy wrote:
>
>
> On 9/1/25 15:29, Chenyi Qiang wrote:
>>
>>
>> On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 9/1/25 13:11, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>>>
>>>>>
>>>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>>>
>>>>>>
>>>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO
>>>>>>>> might
>>>>>>>> disable ram block discard. However, guest_memfd relies on the
>>>>>>>> discard
>>>>>>>> operation to perform page conversion between private and shared
>>>>>>>> memory.
>>>>>>>> This can lead to stale IOMMU mapping issue when assigning a
>>>>>>>> hardware
>>>>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>>>>> pages). Blocking shared page discard can solve this problem, but it
>>>>>>>> could cause guests to consume twice the memory with VFIO, which is
>>>>>>>> not
>>>>>>>> acceptable in some cases. An alternative solution is to convey
>>>>>>>> other
>>>>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>>>>
>>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>>> adjust
>>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>>> adding it
>>>>>>>> back in the other, so the similar work that needs to happen in
>>>>>>>> response
>>>>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>>>
>>>>>>>> However, guest_memfd is not an object so it cannot directly
>>>>>>>> implement
>>>>>>>> the RamDiscardManager interface.
>>>>>>>>
>>>>>>>> One solution is to implement the interface in HostMemoryBackend.
>>>>>>>> Any
>>>>>>>
>>>>>>> This sounds about right.
>
> btw I am using this for ages:
>
> https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
>
> but I am not sure if this ever saw the light of the day, did not it?
> (ironically I am using it as a base for encrypted DMA :) )
Yeah, we are doing the same work. I saw a solution from Michael a long
time ago (back when there was still a dedicated hostmem-memfd-private
backend for restrictedmem/gmem):
(https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
Your patch only implements the interface for HostMemoryBackendMemfd.
Maybe it is more appropriate to implement it for the parent object
HostMemoryBackend, because besides MEMORY_BACKEND_MEMFD, other backend
types like MEMORY_BACKEND_RAM and MEMORY_BACKEND_FILE can also be
guest_memfd-backed.
Thinking more about where to implement this interface, it is still
uncertain to me. As I mentioned in another mail, a ram device memory
region might be backed by guest_memfd if we support TEE I/O iommufd MMIO
in the future. In that case a specific object is more appropriate.
What's your opinion?
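(Just to make the comparison concrete, a minimal sketch of the caller
side of the dedicated-object approach; TYPE_GUEST_MEMFD_MANAGER and
guest_memfd_manager_realize() are names from this series, and the exact
realize signature is assumed here:)
    /* Attach the manager wherever a guest_memfd-backed MR is created,
     * so it also covers MRs that don't come from a HostMemoryBackend,
     * e.g. pc.bios. */
    GuestMemfdManager *gmm;

    gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
    guest_memfd_manager_realize(gmm, mr, memory_region_size(mr));
    memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(gmm));
With the HostMemoryBackend approach, the same registration would instead
live in the backend's user-creatable complete path and would only cover
backend-owned MRs.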
>
>>>>>>>
>>>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>>>> target
>>>>>>>> MemoryRegion. However, this solution doesn't cover the scenario
>>>>>>>> where a
>>>>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
>>>>>>>> e.g.
>>>>>>>> the virtual BIOS MemoryRegion.
>>>>>>>
>>>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look
>>>>>>> like
>>>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>>>
>>>>>> virtual BIOS shows in a separate region:
>>>>>>
>>>>>> Root memory region: system
>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>>> ...
>>>>>> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>>>
>>>>> Looks like a normal MR which can be backed by guest_memfd.
>>>>
>>>> Yes, virtual BIOS memory region is initialized by
>>>> memory_region_init_ram_guest_memfd() which will be backed by a
>>>> guest_memfd.
>>>>
>>>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the
>>>> virtual
>>>> BIOS image will be loaded and then copied to private region.
>>>> After that,
>>>> the loaded image will be discarded and this region become useless.
>>>
>>> I'd think it is loaded as "struct Rom" and then copied to the MR-
>>> ram_guest_memfd() which does not leave MR useless - we still see
>>> "pc.bios" in the list so it is not discarded. What piece of code are you
>>> referring to exactly?
>>
>> Sorry for confusion, maybe it is different between TDX and SEV-SNP for
>> the vBIOS handling.
>>
>> In x86_bios_rom_init(), it initializes a guest_memfd-backed MR and loads
>> the vBIOS image to the shared part of the guest_memfd MR.
>> For TDX, it
>> will copy the image to private region (not the vBIOS guest_memfd MR
>> private part) and discard the shared part. So, although the memory
>> region still exists, it seems useless.
>> It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
>> vBIOS guest_memfd private memory?
>
> This is what it looks like on my SNP VM (which, I suspect, is the same
> as yours as hw/i386/pc.c does not distinguish Intel/AMD for this matter):
Yes, the memory region object is created on both TDX and SEV-SNP.
>
> Root memory region: system
> 0000000000000000-00000000000bffff (prio 0, ram): ram1 KVM gmemfd=20
> 00000000000c0000-00000000000dffff (prio 1, ram): pc.rom KVM gmemfd=27
> 00000000000e0000-000000001fffffff (prio 0, ram): ram1
> @00000000000e0000 KVM gmemfd=20
> ...
> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM gmemfd=26
>
> So the pc.bios MR exists and in use (hence its appearance in "info mtree
> -f").
>
>
> I added the gmemfd dumping:
>
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -3446,6 +3446,9 @@ static void mtree_print_flatview(gpointer key,
> gpointer value,
> }
> }
> }
> + if (mr->ram_block && mr->ram_block->guest_memfd >= 0) {
> + qemu_printf(" gmemfd=%d", mr->ram_block->guest_memfd);
> + }
>
Then I think the virtual BIOS is another case that doesn't belong to a
HostMemoryBackend, which convinces us to implement the interface in a
specific object, no?
>
>>>
>>>
>>>> So I
>>>> feel like this virtual BIOS should not be backed by guest_memfd?
>>>
>>> From the above it sounds like the opposite, i.e. it should :)
>>>
>>>>>
>>>>>> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>>>>> @0000000080000000 KVM
>>>>>
>>>>> Anyway if there is no guest_memfd backing it and
>>>>> memory_region_has_ram_discard_manager() returns false, then the MR is
>>>>> just going to be mapped for VFIO as usual which seems... alright,
>>>>> right?
>>>>
>>>> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
>>>> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>>>>
>>>> If we go with the HostMemoryBackend instead of guest_memfd_manager,
>>>> this
>>>> MR would be mapped by VFIO. Maybe need to avoid such vBIOS mapping, or
>>>> just ignore it since the MR is useless (but looks not so good).
>>>
>>> Sorry I am missing necessary details here, let's figure out the above.
>>>
>>>>
>>>>>
>>>>>
>>>>>> We also consider to implement the interface in HostMemoryBackend, but
>>>>>> maybe implement with guest_memfd region is more general. We don't
>>>>>> know
>>>>>> if any DMAable memory would belong to HostMemoryBackend although at
>>>>>> present it is.
>>>>>>
>>>>>> If it is more appropriate to implement it with HostMemoryBackend,
>>>>>> I can
>>>>>> change to this way.
>>>>>
>>>>> Seems cleaner imho.
>>>>
>>>> I can go this way.
>>
>> [...]
>>
>>>>>>>> +
>>>>>>>> +static int guest_memfd_rdm_replay_populated(const
>>>>>>>> RamDiscardManager
>>>>>>>> *rdm,
>>>>>>>> + MemoryRegionSection
>>>>>>>> *section,
>>>>>>>> + ReplayRamPopulate
>>>>>>>> replay_fn,
>>>>>>>> + void *opaque)
>>>>>>>> +{
>>>>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>>>> + struct GuestMemfdReplayData data = { .fn =
>>>>>>>> replay_fn, .opaque =
>>>>>>>> opaque };
>>>>>>>> +
>>>>>>>> + g_assert(section->mr == gmm->mr);
>>>>>>>> + return guest_memfd_for_each_populated_section(gmm, section,
>>>>>>>> &data,
>>>>>>>> +
>>>>>>>> guest_memfd_rdm_replay_populated_cb);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>>>>>> *section, void *arg)
>>>>>>>> +{
>>>>>>>> + struct GuestMemfdReplayData *data = arg;
>>>>>>>> + ReplayRamDiscard replay_fn = data->fn;
>>>>>>>> +
>>>>>>>> + replay_fn(section, data->opaque);
>>>>>>>
>>>>>>>
>>>>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>>>>
>>>>>> It follows current definiton of ReplayRamDiscard() and
>>>>>> ReplayRamPopulate() where replay_discard() doesn't return errors and
>>>>>> replay_populate() returns errors.
>>>>>
>>>>> A trace would be appropriate imho. Thanks,
>>>>
>>>> Sorry, can't catch you. What kind of info to be traced? The errors
>>>> returned by replay_populate()?
>>>
>>> Yeah. imho these are useful as we expect this part to work in general
>>> too, right? Thanks,
>>
>> Something like?
>>
>> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
>> index 6b3e1ee9d6..4440ac9e59 100644
>> --- a/system/guest-memfd-manager.c
>> +++ b/system/guest-memfd-manager.c
>> @@ -185,8 +185,14 @@ static int
>> guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, voi
>> {
>> struct GuestMemfdReplayData *data = arg;
>> ReplayRamPopulate replay_fn = data->fn;
>> + int ret;
>>
>> - return replay_fn(section, data->opaque);
>> + ret = replay_fn(section, data->opaque);
>> + if (ret) {
>> + trace_guest_memfd_rdm_replay_populated_cb(ret);
>> + }
>> +
>> + return ret;
>> }
>>
>> How about just adding some error output in
>> guest_memfd_for_each_populated_section()/
>> guest_memfd_for_each_discarded_section()
>> if the cb() (i.e. replay_populate()) returns error?
>
> this will do too, yes. Thanks,
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-10 1:42 ` Alexey Kardashevskiy
@ 2025-01-10 7:06 ` Chenyi Qiang
2025-01-10 8:26 ` David Hildenbrand
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-10 7:06 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/10/2025 9:42 AM, Alexey Kardashevskiy wrote:
>
>
> On 9/1/25 19:49, Chenyi Qiang wrote:
>>
>>
>> On 1/9/2025 4:18 PM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 9/1/25 18:52, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:
>>>>>
>>>>>
>>>>> On 8/1/25 17:28, Chenyi Qiang wrote:
>>>>>> Thanks Alexey for your review!
>>>>>>
>>>>>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>> uncoordinated
>>>>>>>> discard") effectively disables device assignment when using
>>>>>>>> guest_memfd.
>>>>>>>> This poses a significant challenge as guest_memfd is essential for
>>>>>>>> confidential guests, thereby blocking device assignment to these
>>>>>>>> VMs.
>>>>>>>> The initial rationale for disabling device assignment was due to
>>>>>>>> stale
>>>>>>>> IOMMU mappings (see Problem section) and the assumption that TEE
>>>>>>>> I/O
>>>>>>>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-
>>>>>>>> assignment
>>>>>>>> problem for confidential guests [1]. However, this assumption has
>>>>>>>> proven
>>>>>>>> to be incorrect. TEE I/O relies on the ability to operate devices
>>>>>>>> against
>>>>>>>> "shared" or untrusted memory, which is crucial for device
>>>>>>>> initialization
>>>>>>>> and error recovery scenarios. As a result, the current
>>>>>>>> implementation
>>>>>>>> does
>>>>>>>> not adequately support device assignment for confidential guests,
>>>>>>>> necessitating
>>>>>>>> a reevaluation of the approach to ensure compatibility and
>>>>>>>> functionality.
>>>>>>>>
>>>>>>>> This series enables shared device assignment by notifying VFIO of
>>>>>>>> page
>>>>>>>> conversions using an existing framework named RamDiscardListener.
>>>>>>>> Additionally, there is an ongoing patch set [2] that aims to add 1G
>>>>>>>> page
>>>>>>>> support for guest_memfd. This patch set introduces in-place page
>>>>>>>> conversion,
>>>>>>>> where private and shared memory share the same physical pages as
>>>>>>>> the
>>>>>>>> backend.
>>>>>>>> This development may impact our solution.
>>>>>>>>
>>>>>>>> We presented our solution in the guest_memfd meeting to discuss its
>>>>>>>> compatibility with the new changes and potential future directions
>>>>>>>> (see [3]
>>>>>>>> for more details). The conclusion was that, although our
>>>>>>>> solution may
>>>>>>>> not be
>>>>>>>> the most elegant (see the Limitation section), it is sufficient for
>>>>>>>> now and
>>>>>>>> can be easily adapted to future changes.
>>>>>>>>
>>>>>>>> We are re-posting the patch series with some cleanup and have
>>>>>>>> removed
>>>>>>>> the RFC
>>>>>>>> label for the main enabling patches (1-6). The newly-added patch
>>>>>>>> 7 is
>>>>>>>> still
>>>>>>>> marked as RFC as it tries to resolve some extension concerns
>>>>>>>> related to
>>>>>>>> RamDiscardManager for future usage.
>>>>>>>>
>>>>>>>> The overview of the patches:
>>>>>>>> - Patch 1: Export a helper to get intersection of a
>>>>>>>> MemoryRegionSection
>>>>>>>> with a given range.
>>>>>>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>>>>>>> RamDiscardManager, and notify the shared/private state change
>>>>>>>> during
>>>>>>>> conversion.
>>>>>>>> - Patch 7: Try to resolve a semantics concern related to
>>>>>>>> RamDiscardManager
>>>>>>>> i.e. RamDiscardManager is used to manage memory plug/unplug
>>>>>>>> state
>>>>>>>> instead of shared/private state. It would affect future
>>>>>>>> users of
>>>>>>>> RamDiscardManger in confidential VMs. Attach it behind as
>>>>>>>> a RFC
>>>>>>>> patch[4].
>>>>>>>>
>>>>>>>> Changes since last version:
>>>>>>>> - Add a patch to export some generic helper functions from
>>>>>>>> virtio-mem
>>>>>>>> code.
>>>>>>>> - Change the bitmap in guest_memfd_manager from default shared to
>>>>>>>> default
>>>>>>>> private. This keeps alignment with virtio-mem that 1-
>>>>>>>> setting in
>>>>>>>> bitmap
>>>>>>>> represents the populated state and may help to export more
>>>>>>>> generic
>>>>>>>> code
>>>>>>>> if necessary.
>>>>>>>> - Add the helpers to initialize/uninitialize the
>>>>>>>> guest_memfd_manager
>>>>>>>> instance
>>>>>>>> to make it more clear.
>>>>>>>> - Add a patch to distinguish between the shared/private state
>>>>>>>> change
>>>>>>>> and
>>>>>>>> the memory plug/unplug state change in RamDiscardManager.
>>>>>>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>>>>>>>> chenyi.qiang@intel.com/
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Background
>>>>>>>> ==========
>>>>>>>> Confidential VMs have two classes of memory: shared and private
>>>>>>>> memory.
>>>>>>>> Shared memory is accessible from the host/VMM while private
>>>>>>>> memory is
>>>>>>>> not. Confidential VMs can decide which memory is shared/private and
>>>>>>>> convert memory between shared/private at runtime.
>>>>>>>>
>>>>>>>> "guest_memfd" is a new kind of fd whose primary goal is to serve
>>>>>>>> guest
>>>>>>>> private memory. The key differences between guest_memfd and normal
>>>>>>>> memfd
>>>>>>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner
>>>>>>>> VM and
>>>>>>>> cannot be mapped, read or written by userspace.
>>>>>>>
>>>>>>> The "cannot be mapped" seems to be not true soon anymore (if not
>>>>>>> already).
>>>>>>>
>>>>>>> https://lore.kernel.org/all/20240801090117.3841080-1-
>>>>>>> tabba@google.com/T/
>>>>>>
>>>>>> Exactly, allowing guest_memfd to do mmap is the direction. I
>>>>>> mentioned
>>>>>> it below with in-place page conversion. Maybe I would move it here to
>>>>>> make it more clear.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> In QEMU's implementation, shared memory is allocated with normal
>>>>>>>> methods
>>>>>>>> (e.g. mmap or fallocate) while private memory is allocated from
>>>>>>>> guest_memfd. When a VM performs memory conversions, QEMU frees
>>>>>>>> pages
>>>>>>>> via
>>>>>>>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one
>>>>>>>> side and
>>>>>>>> allocates new pages from the other side.
>>>>>>>>
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>>>
>>>>>>>> One limitation (also discussed in the guest_memfd meeting) is that
>>>>>>>> VFIO
>>>>>>>> expects the DMA mapping for a specific IOVA to be mapped and
>>>>>>>> unmapped
>>>>>>>> with
>>>>>>>> the same granularity. The guest may perform partial conversions,
>>>>>>>> such as
>>>>>>>> converting a small region within a larger region. To prevent such
>>>>>>>> invalid
>>>>>>>> cases, all operations are performed with 4K granularity. The
>>>>>>>> possible
>>>>>>>> solutions we can think of are either to enable VFIO to support
>>>>>>>> partial
>>>>>>>> unmap
>>>>>
>>>>> btw the old VFIO does not split mappings but iommufd seems to be
>>>>> capable
>>>>> of it - there is iopt_area_split(). What happens if you try
>>>>> unmapping a
>>>>> smaller chunk that does not exactly match any mapped chunk? thanks,
>>>>
>>>> iopt_cut_iova() happens in iommufd vfio_compat.c, which is to make
>>>> iommufd be compatible with old VFIO_TYPE1. IIUC, it happens with
>>>> disable_large_page=true. That means the large IOPTE is also disabled in
>>>> IOMMU. So it can do the split easily. See the comment in
>>>> iommufd_vfio_set_iommu().
>>>>
>>>> iommufd VFIO compatible mode is a transition from legacy VFIO to
>>>> iommufd. For the normal iommufd, it requires the iova/length must be a
>>>> superset of a previously mapped range. If not match, will return error.
>>>
>>>
>>> This is all true but this also means that "The former requires complex
>>> changes in VFIO" is not entirely true - some code is already there.
>>> Thanks,
>>
>> Hmm, my statement is a little confusing. The bottleneck is that the
>> IOMMU driver doesn't support the large page split. So if we want to
>> enable large page and want to do partial unmap, it requires complex
>> change.
>
> We won't need to split large pages (if we stick to 4K for now), we need
> to split large mappings (not large pages) to allow partial unmapping and
> iopt_area_split() seems to be doing this. Thanks,
You mean we can disable large pages in iommufd and then VFIO will be
able to do partial unmaps. Yes, I think it is doable, and we can avoid
the overhead of many ioctl context switches.
>
>
>>
>>>
>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-10 7:06 ` Chenyi Qiang
@ 2025-01-10 8:26 ` David Hildenbrand
2025-01-10 13:20 ` Jason Gunthorpe
0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2025-01-10 8:26 UTC (permalink / raw)
To: Chenyi Qiang, Alexey Kardashevskiy, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, Jason Gunthorpe
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 10.01.25 08:06, Chenyi Qiang wrote:
>
>
> On 1/10/2025 9:42 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 9/1/25 19:49, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/9/2025 4:18 PM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 9/1/25 18:52, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:
>>>>>>
>>>>>>
>>>>>> On 8/1/25 17:28, Chenyi Qiang wrote:
>>>>>>> Thanks Alexey for your review!
>>>>>>>
>>>>>>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>>> uncoordinated
>>>>>>>>> discard") effectively disables device assignment when using
>>>>>>>>> guest_memfd.
>>>>>>>>> This poses a significant challenge as guest_memfd is essential for
>>>>>>>>> confidential guests, thereby blocking device assignment to these
>>>>>>>>> VMs.
>>>>>>>>> The initial rationale for disabling device assignment was due to
>>>>>>>>> stale
>>>>>>>>> IOMMU mappings (see Problem section) and the assumption that TEE
>>>>>>>>> I/O
>>>>>>>>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-
>>>>>>>>> assignment
>>>>>>>>> problem for confidential guests [1]. However, this assumption has
>>>>>>>>> proven
>>>>>>>>> to be incorrect. TEE I/O relies on the ability to operate devices
>>>>>>>>> against
>>>>>>>>> "shared" or untrusted memory, which is crucial for device
>>>>>>>>> initialization
>>>>>>>>> and error recovery scenarios. As a result, the current
>>>>>>>>> implementation
>>>>>>>>> does
>>>>>>>>> not adequately support device assignment for confidential guests,
>>>>>>>>> necessitating
>>>>>>>>> a reevaluation of the approach to ensure compatibility and
>>>>>>>>> functionality.
>>>>>>>>>
>>>>>>>>> This series enables shared device assignment by notifying VFIO of
>>>>>>>>> page
>>>>>>>>> conversions using an existing framework named RamDiscardListener.
>>>>>>>>> Additionally, there is an ongoing patch set [2] that aims to add 1G
>>>>>>>>> page
>>>>>>>>> support for guest_memfd. This patch set introduces in-place page
>>>>>>>>> conversion,
>>>>>>>>> where private and shared memory share the same physical pages as
>>>>>>>>> the
>>>>>>>>> backend.
>>>>>>>>> This development may impact our solution.
>>>>>>>>>
>>>>>>>>> We presented our solution in the guest_memfd meeting to discuss its
>>>>>>>>> compatibility with the new changes and potential future directions
>>>>>>>>> (see [3]
>>>>>>>>> for more details). The conclusion was that, although our
>>>>>>>>> solution may
>>>>>>>>> not be
>>>>>>>>> the most elegant (see the Limitation section), it is sufficient for
>>>>>>>>> now and
>>>>>>>>> can be easily adapted to future changes.
>>>>>>>>>
>>>>>>>>> We are re-posting the patch series with some cleanup and have
>>>>>>>>> removed
>>>>>>>>> the RFC
>>>>>>>>> label for the main enabling patches (1-6). The newly-added patch
>>>>>>>>> 7 is
>>>>>>>>> still
>>>>>>>>> marked as RFC as it tries to resolve some extension concerns
>>>>>>>>> related to
>>>>>>>>> RamDiscardManager for future usage.
>>>>>>>>>
>>>>>>>>> The overview of the patches:
>>>>>>>>> - Patch 1: Export a helper to get intersection of a
>>>>>>>>> MemoryRegionSection
>>>>>>>>> with a given range.
>>>>>>>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>>>>>>>> RamDiscardManager, and notify the shared/private state change
>>>>>>>>> during
>>>>>>>>> conversion.
>>>>>>>>> - Patch 7: Try to resolve a semantics concern related to
>>>>>>>>> RamDiscardManager
>>>>>>>>> i.e. RamDiscardManager is used to manage memory plug/unplug
>>>>>>>>> state
>>>>>>>>> instead of shared/private state. It would affect future
>>>>>>>>> users of
>>>>>>>>> RamDiscardManger in confidential VMs. Attach it behind as
>>>>>>>>> a RFC
>>>>>>>>> patch[4].
>>>>>>>>>
>>>>>>>>> Changes since last version:
>>>>>>>>> - Add a patch to export some generic helper functions from
>>>>>>>>> virtio-mem
>>>>>>>>> code.
>>>>>>>>> - Change the bitmap in guest_memfd_manager from default shared to
>>>>>>>>> default
>>>>>>>>> private. This keeps alignment with virtio-mem that 1-
>>>>>>>>> setting in
>>>>>>>>> bitmap
>>>>>>>>> represents the populated state and may help to export more
>>>>>>>>> generic
>>>>>>>>> code
>>>>>>>>> if necessary.
>>>>>>>>> - Add the helpers to initialize/uninitialize the
>>>>>>>>> guest_memfd_manager
>>>>>>>>> instance
>>>>>>>>> to make it more clear.
>>>>>>>>> - Add a patch to distinguish between the shared/private state
>>>>>>>>> change
>>>>>>>>> and
>>>>>>>>> the memory plug/unplug state change in RamDiscardManager.
>>>>>>>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>>>>>>>>> chenyi.qiang@intel.com/
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> Background
>>>>>>>>> ==========
>>>>>>>>> Confidential VMs have two classes of memory: shared and private
>>>>>>>>> memory.
>>>>>>>>> Shared memory is accessible from the host/VMM while private
>>>>>>>>> memory is
>>>>>>>>> not. Confidential VMs can decide which memory is shared/private and
>>>>>>>>> convert memory between shared/private at runtime.
>>>>>>>>>
>>>>>>>>> "guest_memfd" is a new kind of fd whose primary goal is to serve
>>>>>>>>> guest
>>>>>>>>> private memory. The key differences between guest_memfd and normal
>>>>>>>>> memfd
>>>>>>>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner
>>>>>>>>> VM and
>>>>>>>>> cannot be mapped, read or written by userspace.
>>>>>>>>
>>>>>>>> The "cannot be mapped" seems to be not true soon anymore (if not
>>>>>>>> already).
>>>>>>>>
>>>>>>>> https://lore.kernel.org/all/20240801090117.3841080-1-
>>>>>>>> tabba@google.com/T/
>>>>>>>
>>>>>>> Exactly, allowing guest_memfd to do mmap is the direction. I
>>>>>>> mentioned
>>>>>>> it below with in-place page conversion. Maybe I would move it here to
>>>>>>> make it more clear.
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> In QEMU's implementation, shared memory is allocated with normal
>>>>>>>>> methods
>>>>>>>>> (e.g. mmap or fallocate) while private memory is allocated from
>>>>>>>>> guest_memfd. When a VM performs memory conversions, QEMU frees
>>>>>>>>> pages
>>>>>>>>> via
>>>>>>>>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one
>>>>>>>>> side and
>>>>>>>>> allocates new pages from the other side.
>>>>>>>>>
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>>>
>>>>>>>>> One limitation (also discussed in the guest_memfd meeting) is that
>>>>>>>>> VFIO
>>>>>>>>> expects the DMA mapping for a specific IOVA to be mapped and
>>>>>>>>> unmapped
>>>>>>>>> with
>>>>>>>>> the same granularity. The guest may perform partial conversions,
>>>>>>>>> such as
>>>>>>>>> converting a small region within a larger region. To prevent such
>>>>>>>>> invalid
>>>>>>>>> cases, all operations are performed with 4K granularity. The
>>>>>>>>> possible
>>>>>>>>> solutions we can think of are either to enable VFIO to support
>>>>>>>>> partial
>>>>>>>>> unmap
>>>>>>
>>>>>> btw the old VFIO does not split mappings but iommufd seems to be
>>>>>> capable
>>>>>> of it - there is iopt_area_split(). What happens if you try
>>>>>> unmapping a
>>>>>> smaller chunk that does not exactly match any mapped chunk? thanks,
>>>>>
>>>>> iopt_cut_iova() happens in iommufd vfio_compat.c, which is to make
>>>>> iommufd be compatible with old VFIO_TYPE1. IIUC, it happens with
>>>>> disable_large_page=true. That means the large IOPTE is also disabled in
>>>>> IOMMU. So it can do the split easily. See the comment in
>>>>> iommufd_vfio_set_iommu().
>>>>>
>>>>> iommufd VFIO compatible mode is a transition from legacy VFIO to
>>>>> iommufd. For the normal iommufd, it requires the iova/length must be a
>>>>> superset of a previously mapped range. If not match, will return error.
>>>>
>>>>
>>>> This is all true but this also means that "The former requires complex
>>>> changes in VFIO" is not entirely true - some code is already there.
>>>> Thanks,
>>>
>>> Hmm, my statement is a little confusing. The bottleneck is that the
>>> IOMMU driver doesn't support the large page split. So if we want to
>>> enable large page and want to do partial unmap, it requires complex
>>> change.
>>
>> We won't need to split large pages (if we stick to 4K for now), we need
>> to split large mappings (not large pages) to allow partial unmapping and
>> iopt_area_split() seems to be doing this. Thanks,
>
> You mean we can disable large page in iommufd and then VFIO will be able
> to do partial unmap. Yes, I think it is doable and we can avoid many
> ioctl context switches overhead.
So, to check I understand this correctly: disable_large_pages=true
implies that we never have PMD mappings, such that we can atomically
poke a hole in a mapping without temporarily having to remove a PMD
mapping from the iommu table to insert a PTE table?
batch_iommu_map_small() seems to document that behavior.
It's interesting that that comment points out that this is purely "VFIO
compatibility", and that it otherwise violates the iommufd invariant:
"pairing map/unmap". So, it is against the real iommufd design ...
Back when working on virtio-mem support (RamDiscardManager), I thought
there was no way to reliably do atomic partial unmappings.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-10 8:26 ` David Hildenbrand
@ 2025-01-10 13:20 ` Jason Gunthorpe
2025-01-10 13:45 ` David Hildenbrand
0 siblings, 1 reply; 98+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 13:20 UTC (permalink / raw)
To: David Hildenbrand
Cc: Chenyi Qiang, Alexey Kardashevskiy, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Fri, Jan 10, 2025 at 09:26:02AM +0100, David Hildenbrand wrote:
> > > > > > > > > > One limitation (also discussed in the guest_memfd
> > > > > > > > > > meeting) is that VFIO expects the DMA mapping for
> > > > > > > > > > a specific IOVA to be mapped and unmapped with the
> > > > > > > > > > same granularity.
Not just same granularity, whatever you map you have to unmap in
whole. map/unmap must be perfectly paired by userspace.
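(A minimal sketch of that rule against the legacy type1 uAPI, assuming
container_fd and buf are already set up elsewhere; the partial unmap
below is rejected because it would bisect the original mapping:)
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)buf,
        .iova  = 1ULL << 32,
        .size  = 1ULL << 30,                 /* one 1G mapping */
    };
    ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);

    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = (1ULL << 32) + (2ULL << 20),
        .size  = 2ULL << 20,                 /* 2M hole in the middle */
    };
    ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);   /* -> fails */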
> > > > > > > > > > such as converting a small region within a larger
> > > > > > > > > > region. To prevent such invalid cases, all
> > > > > > > > > > operations are performed with 4K granularity. The
> > > > > > > > > > possible solutions we can think of are either to
> > > > > > > > > > enable VFIO to support partial unmap
Yes, you can do that, but it is awful for performance everywhere
> > > > > > iopt_cut_iova() happens in iommufd vfio_compat.c, which is to make
> > > > > > iommufd be compatible with old VFIO_TYPE1. IIUC, it happens with
> > > > > > disable_large_page=true. That means the large IOPTE is also disabled in
> > > > > > IOMMU. So it can do the split easily. See the comment in
> > > > > > iommufd_vfio_set_iommu().
Yes. But I am working on a project to make this more general purpose
and not have the 4k limitation. There are now several use cases for
this kind of cut feature.
https://lore.kernel.org/linux-iommu/7-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/
> > > > > This is all true but this also means that "The former requires complex
> > > > > changes in VFIO" is not entirely true - some code is already there.
Well, to do it without forcing 4k requires complex changes.
> > > > Hmm, my statement is a little confusing. The bottleneck is that the
> > > > IOMMU driver doesn't support the large page split. So if we want to
> > > > enable large page and want to do partial unmap, it requires complex
> > > > change.
Yes, this is what I'm working on.
> > > We won't need to split large pages (if we stick to 4K for now), we need
> > > to split large mappings (not large pages) to allow partial unmapping and
> > > iopt_area_split() seems to be doing this. Thanks,
Correct
> > You mean we can disable large page in iommufd and then VFIO will be able
> > to do partial unmap. Yes, I think it is doable and we can avoid many
> > ioctl context switches overhead.
Right
> So I understand this correctly: the disable_large_pages=true will imply that
> we never have PMD mappings such that we can atomically poke a hole in a
> mapping, without temporarily having to remove a PMD mapping in the iommu
> table to insert a PTE table?
Yes
> batch_iommu_map_small() seems to document that behavior.
Yes
> It's interesting that that comment points out that this is purely "VFIO
> compatibility", and that it otherwise violates the iommufd invariant:
> "pairing map/unmap". So, it is against the real iommufd design ...
IIRC you can only trigger split using the VFIO type 1 legacy API. We
would need to formalize split as an IOMMUFD native ioctl.
Nobody should use this stuff through the legacy type 1 API!!!!
> Back when working on virtio-mem support (RAMDiscardManager), thought there
> was not way to reliably do atomic partial unmappings.
Correct
Jason
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-10 13:20 ` Jason Gunthorpe
@ 2025-01-10 13:45 ` David Hildenbrand
2025-01-10 14:14 ` Jason Gunthorpe
0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2025-01-10 13:45 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Chenyi Qiang, Alexey Kardashevskiy, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 10.01.25 14:20, Jason Gunthorpe wrote:
Thanks for your reply, I knew CCing you would be very helpful :)
> On Fri, Jan 10, 2025 at 09:26:02AM +0100, David Hildenbrand wrote:
>>>>>>>>>>> One limitation (also discussed in the guest_memfd
>>>>>>>>>>> meeting) is that VFIO expects the DMA mapping for
>>>>>>>>>>> a specific IOVA to be mapped and unmapped with the
>>>>>>>>>>> same granularity.
>
> Not just same granularity, whatever you map you have to unmap in
> whole. map/unmap must be perfectly paired by userspace.
Right, that's what virtio-mem ends up doing by mapping each memory block
(e.g., 2 MiB) separately that could be unmapped separately.
It adds "overhead", but at least you don't run into "no, you cannot
split this region because you would be out of memory/slots", or, as in
the past, issues with concurrent ongoing DMA.
>
>>>>>>>>>>> such as converting a small region within a larger
>>>>>>>>>>> region. To prevent such invalid cases, all
>>>>>>>>>>> operations are performed with 4K granularity. The
>>>>>>>>>>> possible solutions we can think of are either to
>>>>>>>>>>> enable VFIO to support partial unmap
>
> Yes, you can do that, but it is aweful for performance everywhere
Absolutely.
In your commit I read:
"Implement the cut operation to be hitless, changes to the page table
during cutting must cause zero disruption to any ongoing DMA. This is
the expectation of the VFIO type 1 uAPI. Hitless requires HW support, it
is incompatible with HW requiring break-before-make."
So I guess that would mean that, depending on HW support, one could
avoid disabling large pages to still allow for atomic cuts / partial
unmaps that don't affect concurrent DMA.
What would be your suggestion here to avoid the "map each 4k page
individually so we can unmap it individually" ? I didn't completely
grasp that, sorry.
From "IIRC you can only trigger split using the VFIO type 1 legacy API.
We would need to formalize split as an IOMMUFD native ioctl.
Nobody should use this stuf through the legacy type 1 API!!!!"
I assume you mean that we can only avoid the 4k map/unmap if we add
proper support via a native IOMMUFD ioctl, and not try making it fly
somehow with the legacy type 1 API?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-10 13:45 ` David Hildenbrand
@ 2025-01-10 14:14 ` Jason Gunthorpe
2025-01-10 14:50 ` David Hildenbrand
2025-01-15 3:39 ` Alexey Kardashevskiy
0 siblings, 2 replies; 98+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 14:14 UTC (permalink / raw)
To: David Hildenbrand
Cc: Chenyi Qiang, Alexey Kardashevskiy, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Fri, Jan 10, 2025 at 02:45:39PM +0100, David Hildenbrand wrote:
>
> In your commit I read:
>
> "Implement the cut operation to be hitless, changes to the page table
> during cutting must cause zero disruption to any ongoing DMA. This is the
> expectation of the VFIO type 1 uAPI. Hitless requires HW support, it is
> incompatible with HW requiring break-before-make."
>
> So I guess that would mean that, depending on HW support, one could avoid
> disabling large pages to still allow for atomic cuts / partial unmaps that
> don't affect concurrent DMA.
Yes. Most x86 server HW will do this, though ARM support is a bit newish.
> What would be your suggestion here to avoid the "map each 4k page
> individually so we can unmap it individually" ? I didn't completely grasp
> that, sorry.
Map in large ranges in the VMM, lets say 1G of shared memory as a
single mapping (called an iommufd area)
When the guest makes a 2M chunk of it private, you do an ioctl to
iommufd to split the area into three, leaving the 2M chunk as a
separate area.
The new iommufd ioctl to split areas will go down into the iommu driver
and atomically cut the 1G PTEs into smaller PTEs as necessary so that
no PTE spans the edges of the 2M area.
Then userspace can unmap the 2M area and leave the remainder of the 1G
area mapped.
All of this would be fully hitless to ongoing DMA.
The iommufd code is there to do this assuming the areas are mapped at
4k; what is missing is the iommu driver side to atomically resize
large PTEs.
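(A rough userspace sketch of that flow; IOMMU_IOAS_MAP and
IOMMU_IOAS_UNMAP exist today, the split step is only a comment because
there is no native uAPI for it yet, and iommufd, ioas_id, shared_buf,
iova_base and off are assumed to be set up elsewhere:)
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    struct iommu_ioas_map map = {
        .size = sizeof(map),
        .flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
                 IOMMU_IOAS_MAP_WRITEABLE,
        .ioas_id = ioas_id,
        .user_va = (uintptr_t)shared_buf,
        .length = 1ULL << 30,                /* one 1G area */
        .iova = iova_base,
    };
    ioctl(iommufd, IOMMU_IOAS_MAP, &map);

    /* guest converts the 2M chunk at iova_base + off to private:
     * 1) hypothetical "split" ioctl cuts the 1G area into three so the
     *    2M chunk becomes its own area (no such ioctl exists yet)
     * 2) unmap just that area; the rest of the 1G stays mapped */
    struct iommu_ioas_unmap unmap = {
        .size = sizeof(unmap),
        .ioas_id = ioas_id,
        .iova = iova_base + off,
        .length = 2ULL << 20,
    };
    ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);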
> From "IIRC you can only trigger split using the VFIO type 1 legacy API. We
> would need to formalize split as an IOMMUFD native ioctl.
> Nobody should use this stuf through the legacy type 1 API!!!!"
>
> I assume you mean that we can only avoid the 4k map/unmap if we add proper
> support to IOMMUFD native ioctl, and not try making it fly somehow with the
> legacy type 1 API?
The thread was talking about the built-in support in iommufd to split
mappings. That built-in support is only accessible through legacy APIs
and should never be used in new qemu code. To use that built-in
support in new code we need to build new APIs. The advantage of the
built-in support is qemu can map in large regions (which is more
efficient) and the kernel will break it down to 4k for the iommu
driver.
Mapping 4k at a time through the uAPI would be outrageously
inefficient.
Jason
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-10 14:14 ` Jason Gunthorpe
@ 2025-01-10 14:50 ` David Hildenbrand
2025-01-15 3:39 ` Alexey Kardashevskiy
1 sibling, 0 replies; 98+ messages in thread
From: David Hildenbrand @ 2025-01-10 14:50 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Chenyi Qiang, Alexey Kardashevskiy, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 10.01.25 15:14, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 02:45:39PM +0100, David Hildenbrand wrote:
>>
>> In your commit I read:
>>
>> "Implement the cut operation to be hitless, changes to the page table
>> during cutting must cause zero disruption to any ongoing DMA. This is the
>> expectation of the VFIO type 1 uAPI. Hitless requires HW support, it is
>> incompatible with HW requiring break-before-make."
>>
>> So I guess that would mean that, depending on HW support, one could avoid
>> disabling large pages to still allow for atomic cuts / partial unmaps that
>> don't affect concurrent DMA.
>
> Yes. Most x86 server HW will do this, though ARM support is a bit newish.
>
>> What would be your suggestion here to avoid the "map each 4k page
>> individually so we can unmap it individually" ? I didn't completely grasp
>> that, sorry.
>
> Map in large ranges in the VMM, lets say 1G of shared memory as a
> single mapping (called an iommufd area)
>
> When the guest makes a 2M chunk of it private you do a ioctl to
> iommufd to split the area into three, leaving the 2M chunk as a
> seperate area.
>
> The new iommufd ioctl to split areas will go down into the iommu driver
> and atomically cut the 1G PTEs into smaller PTEs as necessary so that
> no PTE spans the edges of the 2M area.
>
> Then userspace can unmap the 2M area and leave the remainder of the 1G
> area mapped.
>
> All of this would be fully hitless to ongoing DMA.
>
> The iommufs code is there to do this assuming the areas are mapped at
> 4k, what is missing is the iommu driver side to atomically resize
> large PTEs.
>
>> From "IIRC you can only trigger split using the VFIO type 1 legacy API. We
>> would need to formalize split as an IOMMUFD native ioctl.
>> Nobody should use this stuf through the legacy type 1 API!!!!"
>>
>> I assume you mean that we can only avoid the 4k map/unmap if we add proper
>> support to IOMMUFD native ioctl, and not try making it fly somehow with the
>> legacy type 1 API?
>
> The thread was talking about the built-in support in iommufd to split
> mappings. That built-in support is only accessible through legacy APIs
> and should never be used in new qemu code. To use that built in
> support in new code we need to build new APIs. The advantage of the
> built-in support is qemu can map in large regions (which is more
> efficient) and the kernel will break it down to 4k for the iommu
> driver.
>
> Mapping 4k at a time through the uAPI would be outrageously
> inefficient.
Got it, makes all sense, thanks!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-13 3:34 ` Chenyi Qiang
@ 2025-01-12 22:23 ` Xu Yilun
2025-01-14 1:14 ` Chenyi Qiang
0 siblings, 1 reply; 98+ messages in thread
From: Xu Yilun @ 2025-01-12 22:23 UTC (permalink / raw)
To: Chenyi Qiang
Cc: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Mon, Jan 13, 2025 at 11:34:44AM +0800, Chenyi Qiang wrote:
>
>
> On 1/10/2025 5:50 AM, Xu Yilun wrote:
> > On Fri, Jan 10, 2025 at 05:00:22AM +0800, Xu Yilun wrote:
> >>>>
> >>>> https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
> >>>>
> >>>> but I am not sure if this ever saw the light of the day, did not it?
> >>>> (ironically I am using it as a base for encrypted DMA :) )
> >>>
> >>> Yeah, we are doing the same work. I saw a solution from Michael long
> >>> time ago (when there was still
> >>> a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
> >>> (https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
> >>>
> >>> For your patch, it only implement the interface for
> >>> HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
> >>> the parent object HostMemoryBackend, because besides the
> >>> MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
> >>> MEMORY_BACKEND_FILE can also be guest_memfd-backed.
> >>>
> >>> Think more about where to implement this interface. It is still
> >>> uncertain to me. As I mentioned in another mail, maybe ram device memory
> >>> region would be backed by guest_memfd if we support TEE IO iommufd MMIO
> >>
> >> It is unlikely an assigned MMIO region would be backed by guest_memfd or be
> >> implemented as part of HostMemoryBackend. Nowadays assigned MMIO resource is
> >> owned by VFIO types, and I assume it is still true for private MMIO.
> >>
> >> But I think with TIO, MMIO regions also need conversion. So I support an
> >> object, but maybe not guest_memfd_manager.
> >
> > Sorry, I mean the name only covers private memory, but not private MMIO.
>
> So you suggest renaming the object to cover the private MMIO. Then how
Yes.
> about page_conversion_manager, or page_attribute_manager?
Maybe memory_attribute_manager? Strictly speaking, MMIO resources are
not backed by pages.
Thanks,
Yilun
>
> >
> >>
> >> Thanks,
> >> Yilun
> >>
> >>> in future. Then a specific object is more appropriate. What's your opinion?
> >>>
> >>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-09 21:50 ` Xu Yilun
@ 2025-01-13 3:34 ` Chenyi Qiang
2025-01-12 22:23 ` Xu Yilun
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-13 3:34 UTC (permalink / raw)
To: Xu Yilun
Cc: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/10/2025 5:50 AM, Xu Yilun wrote:
> On Fri, Jan 10, 2025 at 05:00:22AM +0800, Xu Yilun wrote:
>>>>
>>>> https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
>>>>
>>>> but I am not sure if this ever saw the light of the day, did not it?
>>>> (ironically I am using it as a base for encrypted DMA :) )
>>>
>>> Yeah, we are doing the same work. I saw a solution from Michael long
>>> time ago (when there was still
>>> a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
>>> (https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
>>>
>>> For your patch, it only implement the interface for
>>> HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
>>> the parent object HostMemoryBackend, because besides the
>>> MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
>>> MEMORY_BACKEND_FILE can also be guest_memfd-backed.
>>>
>>> Think more about where to implement this interface. It is still
>>> uncertain to me. As I mentioned in another mail, maybe ram device memory
>>> region would be backed by guest_memfd if we support TEE IO iommufd MMIO
>>
>> It is unlikely an assigned MMIO region would be backed by guest_memfd or be
>> implemented as part of HostMemoryBackend. Nowadays assigned MMIO resource is
>> owned by VFIO types, and I assume it is still true for private MMIO.
>>
>> But I think with TIO, MMIO regions also need conversion. So I support an
>> object, but maybe not guest_memfd_manager.
>
> Sorry, I mean the name only covers private memory, but not private MMIO.
So you suggest renaming the object to cover the private MMIO. Then how
about page_conversion_manager, or page_attribute_manager?
>
>>
>> Thanks,
>> Yilun
>>
>>> in future. Then a specific object is more appropriate. What's your opinion?
>>>
>>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-08 10:56 ` Chenyi Qiang
2025-01-08 11:20 ` Alexey Kardashevskiy
@ 2025-01-13 10:54 ` David Hildenbrand
2025-01-14 1:10 ` Chenyi Qiang
2025-01-15 4:05 ` Alexey Kardashevskiy
1 sibling, 2 replies; 98+ messages in thread
From: David Hildenbrand @ 2025-01-13 10:54 UTC (permalink / raw)
To: Chenyi Qiang, Alexey Kardashevskiy, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 08.01.25 11:56, Chenyi Qiang wrote:
>
>
> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>> disable ram block discard. However, guest_memfd relies on the discard
>>> operation to perform page conversion between private and shared memory.
>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>> device to a confidential VM via shared memory (unprotected memory
>>> pages). Blocking shared page discard can solve this problem, but it
>>> could cause guests to consume twice the memory with VFIO, which is not
>>> acceptable in some cases. An alternative solution is to convey other
>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>
>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>> VFIO mappings in relation to VM page assignment. Effectively page
>>> conversion is similar to hot-removing a page in one mode and adding it
>>> back in the other, so the similar work that needs to happen in response
>>> to virtio-mem changes needs to happen for page conversion events.
>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>
>>> However, guest_memfd is not an object so it cannot directly implement
>>> the RamDiscardManager interface.
>>>
>>> One solution is to implement the interface in HostMemoryBackend. Any
>>
>> This sounds about right.
>>
>>> guest_memfd-backed host memory backend can register itself in the target
>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>> the virtual BIOS MemoryRegion.
>>
>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>> in "info mtree -f"? Do we really want this memory to be DMAable?
>
> virtual BIOS shows in a separate region:
>
> Root memory region: system
> 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
> ...
> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
> @0000000080000000 KVM
>
> We also consider to implement the interface in HostMemoryBackend, but
> maybe implement with guest_memfd region is more general. We don't know
> if any DMAable memory would belong to HostMemoryBackend although at
> present it is.
>
> If it is more appropriate to implement it with HostMemoryBackend, I can
> change to this way.
Not sure that's the right place. Isn't it the (cc) machine that controls
the state?
It's not really the memory backend, that's just the memory provider.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 6/7] RAMBlock: make guest_memfd require coordinate discard
2024-12-13 7:08 ` [PATCH 6/7] RAMBlock: make guest_memfd require coordinate discard Chenyi Qiang
@ 2025-01-13 10:56 ` David Hildenbrand
2025-01-14 1:38 ` Chenyi Qiang
0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2025-01-13 10:56 UTC (permalink / raw)
To: Chenyi Qiang, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 13.12.24 08:08, Chenyi Qiang wrote:
> As guest_memfd is now managed by guest_memfd_manager with
> RamDiscardManager, only block uncoordinated discard.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> system/physmem.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/system/physmem.c b/system/physmem.c
> index 532182a6dd..585090b063 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1872,7 +1872,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> assert(kvm_enabled());
> assert(new_block->guest_memfd < 0);
>
> - ret = ram_block_discard_require(true);
> + ret = ram_block_coordinated_discard_require(true);
> if (ret < 0) {
> error_setg_errno(errp, -ret,
> "cannot set up private guest memory: discard currently blocked");
Would that also unlock virtio-mem by accident?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-13 10:54 ` David Hildenbrand
@ 2025-01-14 1:10 ` Chenyi Qiang
2025-01-15 4:05 ` Alexey Kardashevskiy
1 sibling, 0 replies; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-14 1:10 UTC (permalink / raw)
To: David Hildenbrand, Alexey Kardashevskiy, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
Thanks David for your review!
On 1/13/2025 6:54 PM, David Hildenbrand wrote:
> On 08.01.25 11:56, Chenyi Qiang wrote:
>>
>>
>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>> operation to perform page conversion between private and shared memory.
>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>> device to a confidential VM via shared memory (unprotected memory
>>>> pages). Blocking shared page discard can solve this problem, but it
>>>> could cause guests to consume twice the memory with VFIO, which is not
>>>> acceptable in some cases. An alternative solution is to convey other
>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>
>>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>> back in the other, so the similar work that needs to happen in response
>>>> to virtio-mem changes needs to happen for page conversion events.
>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>
>>>> However, guest_memfd is not an object so it cannot directly implement
>>>> the RamDiscardManager interface.
>>>>
>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>
>>> This sounds about right.
>>>
>>>> guest_memfd-backed host memory backend can register itself in the
>>>> target
>>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>>> the virtual BIOS MemoryRegion.
>>>
>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>
>> virtual BIOS shows in a separate region:
>>
>> Root memory region: system
>> 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>> ...
>> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>> @0000000080000000 KVM
>>
>> We also consider to implement the interface in HostMemoryBackend, but
>> maybe implement with guest_memfd region is more general. We don't know
>> if any DMAable memory would belong to HostMemoryBackend although at
>> present it is.
>>
>> If it is more appropriate to implement it with HostMemoryBackend, I can
>> change to this way.
>
> Not sure that's the right place. Isn't it the (cc) machine that controls
> the state?
>
> It's not really the memory backend, that's just the memory provider.
Yes, the cc machine defines require_guest_memfd. And besides normal
memory, there are other memory regions that require this state control.
As mentioned in another thread, for example, private MMIO may require
the state change notification but doesn't belong to a memory backend. So
I think it is still better to define a specific object to control the
state.
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-12 22:23 ` Xu Yilun
@ 2025-01-14 1:14 ` Chenyi Qiang
0 siblings, 0 replies; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-14 1:14 UTC (permalink / raw)
To: Xu Yilun
Cc: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/13/2025 6:23 AM, Xu Yilun wrote:
> On Mon, Jan 13, 2025 at 11:34:44AM +0800, Chenyi Qiang wrote:
>>
>>
>> On 1/10/2025 5:50 AM, Xu Yilun wrote:
>>> On Fri, Jan 10, 2025 at 05:00:22AM +0800, Xu Yilun wrote:
>>>>>>
>>>>>> https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
>>>>>>
>>>>>> but I am not sure if this ever saw the light of the day, did not it?
>>>>>> (ironically I am using it as a base for encrypted DMA :) )
>>>>>
>>>>> Yeah, we are doing the same work. I saw a solution from Michael long
>>>>> time ago (when there was still
>>>>> a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
>>>>> (https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
>>>>>
>>>>> For your patch, it only implement the interface for
>>>>> HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
>>>>> the parent object HostMemoryBackend, because besides the
>>>>> MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
>>>>> MEMORY_BACKEND_FILE can also be guest_memfd-backed.
>>>>>
>>>>> Think more about where to implement this interface. It is still
>>>>> uncertain to me. As I mentioned in another mail, maybe ram device memory
>>>>> region would be backed by guest_memfd if we support TEE IO iommufd MMIO
>>>>
>>>> It is unlikely an assigned MMIO region would be backed by guest_memfd or be
>>>> implemented as part of HostMemoryBackend. Nowadays assigned MMIO resource is
>>>> owned by VFIO types, and I assume it is still true for private MMIO.
>>>>
>>>> But I think with TIO, MMIO regions also need conversion. So I support an
>>>> object, but maybe not guest_memfd_manager.
>>>
>>> Sorry, I mean the name only covers private memory, but not private MMIO.
>>
>> So you suggest renaming the object to cover the private MMIO. Then how
>
> Yes.
>
>> about page_conversion_manager, or page_attribute_manager?
>
> Maybe memory_attribute_manager? Strictly speaking MMIO resource is not
> backed by pages.
Looks good to me. Thanks!
>
> Thanks,
> Yilun
>
>>
>>>
>>>>
>>>> Thanks,
>>>> Yilun
>>>>
>>>>> in future. Then a specific object is more appropriate. What's your opinion?
>>>>>
>>>>
>>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 6/7] RAMBlock: make guest_memfd require coordinate discard
2025-01-13 10:56 ` David Hildenbrand
@ 2025-01-14 1:38 ` Chenyi Qiang
[not found] ` <e1141052-1dec-435b-8635-a41881fedd4c@redhat.com>
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-14 1:38 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/13/2025 6:56 PM, David Hildenbrand wrote:
> On 13.12.24 08:08, Chenyi Qiang wrote:
>> As guest_memfd is now managed by guest_memfd_manager with
>> RamDiscardManager, only block uncoordinated discard.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> system/physmem.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 532182a6dd..585090b063 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -1872,7 +1872,7 @@ static void ram_block_add(RAMBlock *new_block,
>> Error **errp)
>> assert(kvm_enabled());
>> assert(new_block->guest_memfd < 0);
>> - ret = ram_block_discard_require(true);
>> + ret = ram_block_coordinated_discard_require(true);
>> if (ret < 0) {
>> error_setg_errno(errp, -ret,
>> "cannot set up private guest memory:
>> discard currently blocked");
>
> Would that also unlock virtio-mem by accident?
Hmm, that's true. At present, the rdm field in a MemoryRegion can only
point to one instance, so if we unlocked virtio-mem and tried to use it
together with guest_memfd, it would trigger the assert in
memory_region_set_ram_discard_manager().
Maybe we need to add an explicit check in virtio-mem to make it mutually
exclusive with guest_memfd for now?
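(Something along these lines, as a sketch only; whether
memory_region_has_guest_memfd() is the right helper to key off, and the
exact field names, are assumptions:)
    /* e.g. early in virtio_mem_device_realize(), after the memdev checks */
    if (memory_region_has_guest_memfd(&vmem->memdev->mr)) {
        error_setg(errp,
                   "virtio-mem is not yet compatible with guest_memfd");
        return;
    }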
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-09 4:29 ` Chenyi Qiang
2025-01-10 0:58 ` Alexey Kardashevskiy
@ 2025-01-14 6:45 ` Chenyi Qiang
1 sibling, 0 replies; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-14 6:45 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/9/2025 12:29 PM, Chenyi Qiang wrote:
>
>
> On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 9/1/25 13:11, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>>>> operation to perform page conversion between private and shared
>>>>>>> memory.
>>>>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>>>> pages). Blocking shared page discard can solve this problem, but it
>>>>>>> could cause guests to consume twice the memory with VFIO, which is
>>>>>>> not
>>>>>>> acceptable in some cases. An alternative solution is to convey other
>>>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>>>
>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>> adjust
>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>> adding it
>>>>>>> back in the other, so the similar work that needs to happen in
>>>>>>> response
>>>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>>
>>>>>>> However, guest_memfd is not an object so it cannot directly implement
>>>>>>> the RamDiscardManager interface.
>>>>>>>
>>>>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>>>>
>>>>>> This sounds about right.
>>>>>>
>>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>>> target
>>>>>>> MemoryRegion. However, this solution doesn't cover the scenario
>>>>>>> where a
>>>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
>>>>>>> e.g.
>>>>>>> the virtual BIOS MemoryRegion.
>>>>>>
>>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>>
>>>>> virtual BIOS shows in a separate region:
>>>>>
>>>>> Root memory region: system
>>>>> 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>> ...
>>>>> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>>
>>>> Looks like a normal MR which can be backed by guest_memfd.
>>>
>>> Yes, virtual BIOS memory region is initialized by
>>> memory_region_init_ram_guest_memfd() which will be backed by a
>>> guest_memfd.
>>>
>>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
>>> BIOS image will be loaded and then copied to private region.
>>> After that,
>>> the loaded image will be discarded and this region become useless.
>>
>> I'd think it is loaded as "struct Rom" and then copied to the MR-
>> ram_guest_memfd() which does not leave MR useless - we still see
>> "pc.bios" in the list so it is not discarded. What piece of code are you
>> referring to exactly?
>
> Sorry for confusion, maybe it is different between TDX and SEV-SNP for
> the vBIOS handling.
>
> In x86_bios_rom_init(), it initializes a guest_memfd-backed MR and loads
> the vBIOS image to the shared part of the guest_memfd MR. For TDX, it
> will copy the image to private region (not the vBIOS guest_memfd MR
> private part) and discard the shared part. So, although the memory
> region still exists, it seems useless.
Let me correct myself. After some internal discussion, I found I had
misunderstood the vBIOS handling in TDX. The memory region is valid: the
vBIOS image is copied into the private region (the private part of the
vBIOS guest_memfd). Sorry for the confusion.
>
> It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
> vBIOS guest_memfd private memory?
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-10 14:14 ` Jason Gunthorpe
2025-01-10 14:50 ` David Hildenbrand
@ 2025-01-15 3:39 ` Alexey Kardashevskiy
2025-01-15 12:49 ` Jason Gunthorpe
1 sibling, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-15 3:39 UTC (permalink / raw)
To: Jason Gunthorpe, David Hildenbrand
Cc: Chenyi Qiang, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 11/1/25 01:14, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 02:45:39PM +0100, David Hildenbrand wrote:
>>
>> In your commit I read:
>>
>> "Implement the cut operation to be hitless, changes to the page table
>> during cutting must cause zero disruption to any ongoing DMA. This is the
>> expectation of the VFIO type 1 uAPI. Hitless requires HW support, it is
>> incompatible with HW requiring break-before-make."
>>
>> So I guess that would mean that, depending on HW support, one could avoid
>> disabling large pages to still allow for atomic cuts / partial unmaps that
>> don't affect concurrent DMA.
>
> Yes. Most x86 server HW will do this, though ARM support is a bit newish.
>
>> What would be your suggestion here to avoid the "map each 4k page
>> individually so we can unmap it individually" ? I didn't completely grasp
>> that, sorry.
>
> Map in large ranges in the VMM, lets say 1G of shared memory as a
> single mapping (called an iommufd area)
>
> When the guest makes a 2M chunk of it private you do a ioctl to
> iommufd to split the area into three, leaving the 2M chunk as a
> seperate area.
>
> The new iommufd ioctl to split areas will go down into the iommu driver
> and atomically cut the 1G PTEs into smaller PTEs as necessary so that
> no PTE spans the edges of the 2M area.
>
> Then userspace can unmap the 2M area and leave the remainder of the 1G
> area mapped.
>
> All of this would be fully hitless to ongoing DMA.
>
> The iommufs code is there to do this assuming the areas are mapped at
> 4k, what is missing is the iommu driver side to atomically resize
> large PTEs.
>
>> From "IIRC you can only trigger split using the VFIO type 1 legacy API. We
>> would need to formalize split as an IOMMUFD native ioctl.
>> Nobody should use this stuf through the legacy type 1 API!!!!"
>>
>> I assume you mean that we can only avoid the 4k map/unmap if we add proper
>> support to IOMMUFD native ioctl, and not try making it fly somehow with the
>> legacy type 1 API?
>
> The thread was talking about the built-in support in iommufd to split
> mappings.
Just to clarify - I am talking about splitting only "iommufd areas", not
large pages. If all IOMMU PTEs are 4k and areas are bigger than 4K =>
the hw support is not needed to allow splitting. The comments above and
below seem to confuse large pages with large areas (well, I am confused,
at least).
> That built-in support is only accessible through legacy APIs
> and should never be used in new qemu code. To use that built in
> support in new code we need to build new APIs.
Why would not IOMMU_IOAS_MAP/UNMAP uAPI work? Thanks,
> The advantage of the
> built-in support is qemu can map in large regions (which is more
> efficient) and the kernel will break it down to 4k for the iommu
> driver.
> Mapping 4k at a time through the uAPI would be outrageously
> inefficient.
>
> Jason
>
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-13 10:54 ` David Hildenbrand
2025-01-14 1:10 ` Chenyi Qiang
@ 2025-01-15 4:05 ` Alexey Kardashevskiy
[not found] ` <f3aaffe7-7045-4288-8675-349115a867ce@redhat.com>
1 sibling, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-15 4:05 UTC (permalink / raw)
To: David Hildenbrand, Chenyi Qiang, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 13/1/25 21:54, David Hildenbrand wrote:
> On 08.01.25 11:56, Chenyi Qiang wrote:
>>
>>
>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>> operation to perform page conversion between private and shared memory.
>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>> device to a confidential VM via shared memory (unprotected memory
>>>> pages). Blocking shared page discard can solve this problem, but it
>>>> could cause guests to consume twice the memory with VFIO, which is not
>>>> acceptable in some cases. An alternative solution is to convey other
>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>
>>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>> back in the other, so the similar work that needs to happen in response
>>>> to virtio-mem changes needs to happen for page conversion events.
>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>
>>>> However, guest_memfd is not an object so it cannot directly implement
>>>> the RamDiscardManager interface.
>>>>
>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>
>>> This sounds about right.
>>>
>>>> guest_memfd-backed host memory backend can register itself in the
>>>> target
>>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>>> the virtual BIOS MemoryRegion.
>>>
>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>
>> virtual BIOS shows in a separate region:
>>
>> Root memory region: system
>> 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>> ...
>> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>> @0000000080000000 KVM
>>
>> We also consider to implement the interface in HostMemoryBackend, but
>> maybe implement with guest_memfd region is more general. We don't know
>> if any DMAable memory would belong to HostMemoryBackend although at
>> present it is.
>>
>> If it is more appropriate to implement it with HostMemoryBackend, I can
>> change to this way.
>
> Not sure that's the right place. Isn't it the (cc) machine that controls
> the state?
KVM does, via MemoryRegion->RAMBlock->guest_memfd.
> It's not really the memory backend, that's just the memory provider.
Sorry but is not "providing memory" the purpose of "memory backend"? :)
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-10 6:38 ` Chenyi Qiang
2025-01-09 21:00 ` Xu Yilun
@ 2025-01-15 4:06 ` Alexey Kardashevskiy
2025-01-15 6:15 ` Chenyi Qiang
1 sibling, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-15 4:06 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 10/1/25 17:38, Chenyi Qiang wrote:
>
>
> On 1/10/2025 8:58 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 9/1/25 15:29, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 9/1/25 13:11, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>>>>
>>>>>>
>>>>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO
>>>>>>>>> might
>>>>>>>>> disable ram block discard. However, guest_memfd relies on the
>>>>>>>>> discard
>>>>>>>>> operation to perform page conversion between private and shared
>>>>>>>>> memory.
>>>>>>>>> This can lead to stale IOMMU mapping issue when assigning a
>>>>>>>>> hardware
>>>>>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>>>>>> pages). Blocking shared page discard can solve this problem, but it
>>>>>>>>> could cause guests to consume twice the memory with VFIO, which is
>>>>>>>>> not
>>>>>>>>> acceptable in some cases. An alternative solution is to convey
>>>>>>>>> other
>>>>>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>>>>>
>>>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>>>> adjust
>>>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>>>> adding it
>>>>>>>>> back in the other, so the similar work that needs to happen in
>>>>>>>>> response
>>>>>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>>>>
>>>>>>>>> However, guest_memfd is not an object so it cannot directly
>>>>>>>>> implement
>>>>>>>>> the RamDiscardManager interface.
>>>>>>>>>
>>>>>>>>> One solution is to implement the interface in HostMemoryBackend.
>>>>>>>>> Any
>>>>>>>>
>>>>>>>> This sounds about right.
>>
>> btw I am using this for ages:
>>
>> https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
>>
>> but I am not sure if this ever saw the light of the day, did not it?
>> (ironically I am using it as a base for encrypted DMA :) )
>
> Yeah, we are doing the same work. I saw a solution from Michael long
> time ago (when there was still
> a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
> (https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
>
> For your patch, it only implement the interface for
> HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
> the parent object HostMemoryBackend, because besides the
> MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
> MEMORY_BACKEND_FILE can also be guest_memfd-backed.
>
> Think more about where to implement this interface. It is still
> uncertain to me. As I mentioned in another mail, maybe ram device memory
> region would be backed by guest_memfd if we support TEE IO iommufd MMIO
> in future. Then a specific object is more appropriate. What's your opinion?
I do not know about this. Unlike RAM, MMIO can only do "in-place
conversion" and the interface to do so is not straight forward and VFIO
owns MMIO anyway so the uAPI will be in iommufd, here is a gist of it:
https://github.com/aik/linux/commit/89e45c0404fa5006b2a4de33a4d582adf1ba9831
"guest request" is a communication channel from the VM to the secure FW
(AMD's "PSP") to make MMIO allow encrypted access.
>>
>>>>>>>>
>>>>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>>>>> target
>>>>>>>>> MemoryRegion. However, this solution doesn't cover the scenario
>>>>>>>>> where a
>>>>>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
>>>>>>>>> e.g.
>>>>>>>>> the virtual BIOS MemoryRegion.
>>>>>>>>
>>>>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look
>>>>>>>> like
>>>>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>>>>
>>>>>>> virtual BIOS shows in a separate region:
>>>>>>>
>>>>>>> Root memory region: system
>>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>>>> ...
>>>>>>> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>>>>
>>>>>> Looks like a normal MR which can be backed by guest_memfd.
>>>>>
>>>>> Yes, virtual BIOS memory region is initialized by
>>>>> memory_region_init_ram_guest_memfd() which will be backed by a
>>>>> guest_memfd.
>>>>>
>>>>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the
>>>>> virtual
>>>>> BIOS image will be loaded and then copied to private region.
>>>>> After that,
>>>>> the loaded image will be discarded and this region become useless.
>>>>
>>>> I'd think it is loaded as "struct Rom" and then copied to the MR-
>>>> ram_guest_memfd() which does not leave MR useless - we still see
>>>> "pc.bios" in the list so it is not discarded. What piece of code are you
>>>> referring to exactly?
>>>
>>> Sorry for confusion, maybe it is different between TDX and SEV-SNP for
>>> the vBIOS handling.
>>>
>>> In x86_bios_rom_init(), it initializes a guest_memfd-backed MR and loads
>>> the vBIOS image to the shared part of the guest_memfd MR.
>>> For TDX, it
>>> will copy the image to private region (not the vBIOS guest_memfd MR
>>> private part) and discard the shared part. So, although the memory
>>> region still exists, it seems useless.
>>> It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
>>> vBIOS guest_memfd private memory?
>>
>> This is what it looks like on my SNP VM (which, I suspect, is the same
>> as yours as hw/i386/pc.c does not distinguish Intel/AMD for this matter):
>
> Yes, the memory region object is created on both TDX and SEV-SNP.
>
>>
>> Root memory region: system
>> 0000000000000000-00000000000bffff (prio 0, ram): ram1 KVM gmemfd=20
>> 00000000000c0000-00000000000dffff (prio 1, ram): pc.rom KVM gmemfd=27
>> 00000000000e0000-000000001fffffff (prio 0, ram): ram1
>> @00000000000e0000 KVM gmemfd=20
>> ...
>> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM gmemfd=26
>>
>> So the pc.bios MR exists and in use (hence its appearance in "info mtree
>> -f").
>>
>>
>> I added the gmemfd dumping:
>>
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -3446,6 +3446,9 @@ static void mtree_print_flatview(gpointer key,
>> gpointer value,
>> }
>> }
>> }
>> + if (mr->ram_block && mr->ram_block->guest_memfd >= 0) {
>> + qemu_printf(" gmemfd=%d", mr->ram_block->guest_memfd);
>> + }
>>
>
> Then I think the virtual BIOS is another case not belonging to
> HostMemoryBackend which convince us to implement the interface in a
> specific object, no?
TBH I have no idea why pc.rom and pc.bios are separate memory regions
but in any case why do these 2 areas need to be treated any different
than the rest of RAM? Thanks,
>>
>>>>
>>>>
>>>>> So I
>>>>> feel like this virtual BIOS should not be backed by guest_memfd?
>>>>
>>>> From the above it sounds like the opposite, i.e. it should :)
>>>>
>>>>>>
>>>>>>> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>>>>>> @0000000080000000 KVM
>>>>>>
>>>>>> Anyway if there is no guest_memfd backing it and
>>>>>> memory_region_has_ram_discard_manager() returns false, then the MR is
>>>>>> just going to be mapped for VFIO as usual which seems... alright,
>>>>>> right?
>>>>>
>>>>> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
>>>>> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>>>>>
>>>>> If we go with the HostMemoryBackend instead of guest_memfd_manager,
>>>>> this
>>>>> MR would be mapped by VFIO. Maybe need to avoid such vBIOS mapping, or
>>>>> just ignore it since the MR is useless (but looks not so good).
>>>>
>>>> Sorry I am missing necessary details here, let's figure out the above.
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>> We also consider to implement the interface in HostMemoryBackend, but
>>>>>>> maybe implement with guest_memfd region is more general. We don't
>>>>>>> know
>>>>>>> if any DMAable memory would belong to HostMemoryBackend although at
>>>>>>> present it is.
>>>>>>>
>>>>>>> If it is more appropriate to implement it with HostMemoryBackend,
>>>>>>> I can
>>>>>>> change to this way.
>>>>>>
>>>>>> Seems cleaner imho.
>>>>>
>>>>> I can go this way.
>>>
>>> [...]
>>>
>>>>>>>>> +
>>>>>>>>> +static int guest_memfd_rdm_replay_populated(const
>>>>>>>>> RamDiscardManager
>>>>>>>>> *rdm,
>>>>>>>>> + MemoryRegionSection
>>>>>>>>> *section,
>>>>>>>>> + ReplayRamPopulate
>>>>>>>>> replay_fn,
>>>>>>>>> + void *opaque)
>>>>>>>>> +{
>>>>>>>>> + GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>>>>> + struct GuestMemfdReplayData data = { .fn =
>>>>>>>>> replay_fn, .opaque =
>>>>>>>>> opaque };
>>>>>>>>> +
>>>>>>>>> + g_assert(section->mr == gmm->mr);
>>>>>>>>> + return guest_memfd_for_each_populated_section(gmm, section,
>>>>>>>>> &data,
>>>>>>>>> +
>>>>>>>>> guest_memfd_rdm_replay_populated_cb);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>>>>>>> *section, void *arg)
>>>>>>>>> +{
>>>>>>>>> + struct GuestMemfdReplayData *data = arg;
>>>>>>>>> + ReplayRamDiscard replay_fn = data->fn;
>>>>>>>>> +
>>>>>>>>> + replay_fn(section, data->opaque);
>>>>>>>>
>>>>>>>>
>>>>>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>>>>>
>>>>>>> It follows current definiton of ReplayRamDiscard() and
>>>>>>> ReplayRamPopulate() where replay_discard() doesn't return errors and
>>>>>>> replay_populate() returns errors.
>>>>>>
>>>>>> A trace would be appropriate imho. Thanks,
>>>>>
>>>>> Sorry, can't catch you. What kind of info to be traced? The errors
>>>>> returned by replay_populate()?
>>>>
>>>> Yeah. imho these are useful as we expect this part to work in general
>>>> too, right? Thanks,
>>>
>>> Something like?
>>>
>>> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
>>> index 6b3e1ee9d6..4440ac9e59 100644
>>> --- a/system/guest-memfd-manager.c
>>> +++ b/system/guest-memfd-manager.c
>>> @@ -185,8 +185,14 @@ static int
>>> guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, voi
>>> {
>>> struct GuestMemfdReplayData *data = arg;
>>> ReplayRamPopulate replay_fn = data->fn;
>>> + int ret;
>>>
>>> - return replay_fn(section, data->opaque);
>>> + ret = replay_fn(section, data->opaque);
>>> + if (ret) {
>>> + trace_guest_memfd_rdm_replay_populated_cb(ret);
>>> + }
>>> +
>>> + return ret;
>>> }
>>>
>>> How about just adding some error output in
>>> guest_memfd_for_each_populated_section()/
>>> guest_memfd_for_each_discarded_section()
>>> if the cb() (i.e. replay_populate()) returns error?
>>
>> this will do too, yes. Thanks,
>>
>
>
>
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-15 4:06 ` Alexey Kardashevskiy
@ 2025-01-15 6:15 ` Chenyi Qiang
[not found] ` <2b2730f3-6e1a-4def-b126-078cf6249759@amd.com>
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-15 6:15 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/15/2025 12:06 PM, Alexey Kardashevskiy wrote:
> On 10/1/25 17:38, Chenyi Qiang wrote:
>>
>>
>> On 1/10/2025 8:58 AM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 9/1/25 15:29, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>>>>>
>>>>>
>>>>> On 9/1/25 13:11, Chenyi Qiang wrote:
>>>>>>
>>>>>>
>>>>>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO
>>>>>>>>>> might
>>>>>>>>>> disable ram block discard. However, guest_memfd relies on the
>>>>>>>>>> discard
>>>>>>>>>> operation to perform page conversion between private and shared
>>>>>>>>>> memory.
>>>>>>>>>> This can lead to stale IOMMU mapping issue when assigning a
>>>>>>>>>> hardware
>>>>>>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>>>>>>> pages). Blocking shared page discard can solve this problem,
>>>>>>>>>> but it
>>>>>>>>>> could cause guests to consume twice the memory with VFIO,
>>>>>>>>>> which is
>>>>>>>>>> not
>>>>>>>>>> acceptable in some cases. An alternative solution is to convey
>>>>>>>>>> other
>>>>>>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>>>>>>
>>>>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>>>>> adjust
>>>>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>>>>> adding it
>>>>>>>>>> back in the other, so the similar work that needs to happen in
>>>>>>>>>> response
>>>>>>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>>>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>>>>>
>>>>>>>>>> However, guest_memfd is not an object so it cannot directly
>>>>>>>>>> implement
>>>>>>>>>> the RamDiscardManager interface.
>>>>>>>>>>
>>>>>>>>>> One solution is to implement the interface in HostMemoryBackend.
>>>>>>>>>> Any
>>>>>>>>>
>>>>>>>>> This sounds about right.
>>>
>>> btw I am using this for ages:
>>>
>>> https://github.com/aik/qemu/
>>> commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
>>>
>>> but I am not sure if this ever saw the light of the day, did not it?
>>> (ironically I am using it as a base for encrypted DMA :) )
>>
>> Yeah, we are doing the same work. I saw a solution from Michael long
>> time ago (when there was still
>> a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
>> (https://github.com/AMDESE/qemu/
>> commit/3bf5255fc48d648724d66410485081ace41d8ee6)
>>
>> For your patch, it only implement the interface for
>> HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
>> the parent object HostMemoryBackend, because besides the
>> MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
>> MEMORY_BACKEND_FILE can also be guest_memfd-backed.
>>
>> Think more about where to implement this interface. It is still
>> uncertain to me. As I mentioned in another mail, maybe ram device memory
>> region would be backed by guest_memfd if we support TEE IO iommufd MMIO
>> in future. Then a specific object is more appropriate. What's your
>> opinion?
>
> I do not know about this. Unlike RAM, MMIO can only do "in-place
> conversion" and the interface to do so is not straight forward and VFIO
> owns MMIO anyway so the uAPI will be in iommufd, here is a gist of it:
>
> https://github.com/aik/linux/
> commit/89e45c0404fa5006b2a4de33a4d582adf1ba9831
>
> "guest request" is a communication channel from the VM to the secure FW
> (AMD's "PSP") to make MMIO allow encrypted access.
It is still uncertain how to implement the private MMIO. Our assumption
is the private MMIO would also create a memory region with
guest_memfd-like backend. Its mr->ram is true and should be managed by
RamDiscardManager which can skip doing DMA_MAP in VFIO's region_add
listener.
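Roughly, the check we are counting on in VFIO would look like this (a
paraphrased sketch of the existing region_add handling, not exact code):

#include "qemu/osdep.h"
#include "exec/memory.h"

/*
 * Sketch: a MemoryListener region_add hook that skips the eager DMA_MAP
 * when the section is governed by a RamDiscardManager and instead relies
 * on a RamDiscardListener, so only populated (shared) parts get mapped.
 */
static void example_listener_region_add(MemoryListener *listener,
                                        MemoryRegionSection *section)
{
    if (memory_region_has_ram_discard_manager(section->mr)) {
        /* VFIO registers its RamDiscardListener here instead of mapping. */
        return;
    }

    /* ... otherwise DMA-map the whole section as usual ... */
}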
>
>
>>>
>>>>>>>>>
>>>>>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>>>>>> target
>>>>>>>>>> MemoryRegion. However, this solution doesn't cover the scenario
>>>>>>>>>> where a
>>>>>>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
>>>>>>>>>> e.g.
>>>>>>>>>> the virtual BIOS MemoryRegion.
>>>>>>>>>
>>>>>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look
>>>>>>>>> like
>>>>>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>>>>>
>>>>>>>> virtual BIOS shows in a separate region:
>>>>>>>>
>>>>>>>> Root memory region: system
>>>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>>>>> ...
>>>>>>>> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>>>>>
>>>>>>> Looks like a normal MR which can be backed by guest_memfd.
>>>>>>
>>>>>> Yes, virtual BIOS memory region is initialized by
>>>>>> memory_region_init_ram_guest_memfd() which will be backed by a
>>>>>> guest_memfd.
>>>>>>
>>>>>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the
>>>>>> virtual
>>>>>> BIOS image will be loaded and then copied to private region.
>>>>>> After that,
>>>>>> the loaded image will be discarded and this region become useless.
>>>>>
>>>>> I'd think it is loaded as "struct Rom" and then copied to the MR-
>>>>> ram_guest_memfd() which does not leave MR useless - we still see
>>>>> "pc.bios" in the list so it is not discarded. What piece of code
>>>>> are you
>>>>> referring to exactly?
>>>>
>>>> Sorry for confusion, maybe it is different between TDX and SEV-SNP for
>>>> the vBIOS handling.
>>>>
>>>> In x86_bios_rom_init(), it initializes a guest_memfd-backed MR and
>>>> loads
>>>> the vBIOS image to the shared part of the guest_memfd MR.
>>>> For TDX, it
>>>> will copy the image to private region (not the vBIOS guest_memfd MR
>>>> private part) and discard the shared part. So, although the memory
>>>> region still exists, it seems useless.
>>>> It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
>>>> vBIOS guest_memfd private memory?
>>>
>>> This is what it looks like on my SNP VM (which, I suspect, is the same
>>> as yours as hw/i386/pc.c does not distinguish Intel/AMD for this
>>> matter):
>>
>> Yes, the memory region object is created on both TDX and SEV-SNP.
>>
>>>
>>> Root memory region: system
>>> 0000000000000000-00000000000bffff (prio 0, ram): ram1 KVM gmemfd=20
>>> 00000000000c0000-00000000000dffff (prio 1, ram): pc.rom KVM gmemfd=27
>>> 00000000000e0000-000000001fffffff (prio 0, ram): ram1
>>> @00000000000e0000 KVM gmemfd=20
>>> ...
>>> 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>> gmemfd=26
>>>
>>> So the pc.bios MR exists and in use (hence its appearance in "info mtree
>>> -f").
>>>
>>>
>>> I added the gmemfd dumping:
>>>
>>> --- a/system/memory.c
>>> +++ b/system/memory.c
>>> @@ -3446,6 +3446,9 @@ static void mtree_print_flatview(gpointer key,
>>> gpointer value,
>>> }
>>> }
>>> }
>>> + if (mr->ram_block && mr->ram_block->guest_memfd >= 0) {
>>> + qemu_printf(" gmemfd=%d", mr->ram_block->guest_memfd);
>>> + }
>>>
>>
>> Then I think the virtual BIOS is another case not belonging to
>> HostMemoryBackend which convince us to implement the interface in a
>> specific object, no?
>
> TBH I have no idea why pc.rom and pc.bios are separate memory regions
> but in any case why do these 2 areas need to be treated any different
> than the rest of RAM? Thanks,
I think there is no difference. That's why I suggest implementing the RDM
interface in a specific object that covers both cases, instead of only in
HostMemoryBackend.
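As a rough sketch of what I mean (the helper name below is an assumption
for illustration, not the exact code in this series), the same object would
be attached to any guest_memfd-backed MemoryRegion, no matter whether it
comes from a HostMemoryBackend or from memory_region_init_ram_guest_memfd():

/* Illustrative only: one manager instance per guest_memfd-backed MR. */
static void guest_memfd_manager_attach(MemoryRegion *mr)
{
    GuestMemfdManager *gmm;

    if (!memory_region_has_guest_memfd(mr)) {
        return;
    }

    gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
    gmm->mr = mr;
    memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(gmm));
}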
>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
2025-01-15 3:39 ` Alexey Kardashevskiy
@ 2025-01-15 12:49 ` Jason Gunthorpe
[not found] ` <cc3428b1-22b7-432a-9c74-12b7e36b6cc6@redhat.com>
0 siblings, 1 reply; 98+ messages in thread
From: Jason Gunthorpe @ 2025-01-15 12:49 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: David Hildenbrand, Chenyi Qiang, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Wed, Jan 15, 2025 at 02:39:55PM +1100, Alexey Kardashevskiy wrote:
> > The thread was talking about the built-in support in iommufd to split
> > mappings.
>
> Just to clarify - I am talking about splitting only "iommufd areas", not
> large pages.
In general it is the same thing, as you cannot guarantee that an area
split doesn't also cross a large page.
> If all IOMMU PTEs are 4k and areas are bigger than 4K => the hw
> support is not needed to allow splitting. The comments above and below seem
> to confuse large pages with large areas (well, I am confused, at least).
Yes, in that special case yes.
> > That built-in support is only accessible through legacy APIs
> > and should never be used in new qemu code. To use that built in
> > support in new code we need to build new APIs.
>
> Why would not IOMMU_IOAS_MAP/UNMAP uAPI work? Thanks,
I don't want to overload those APIs, I prefer to see a new API that is
just about splitting areas. Splitting is a special operation that can
fail depending on driver support.
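To make the intended flow concrete, a rough userspace sketch is below. The
IOMMU_IOAS_MAP/IOMMU_IOAS_UNMAP calls are the existing uAPI; the split
ioctl is purely hypothetical, standing in for the new API discussed here:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/*
 * Assumptions: iommufd_fd and ioas_id are already set up, and 1G of shared
 * guest memory was mapped earlier as a single area at iova_1g via
 * IOMMU_IOAS_MAP. The guest now converts a 2M chunk at priv_off to private.
 */
static int convert_2m_to_private(int iommufd_fd, uint32_t ioas_id,
                                 uint64_t iova_1g, uint64_t priv_off)
{
    /*
     * Step 1 (hypothetical): ask iommufd to split the 1G area so the 2M
     * chunk becomes its own area, e.g.:
     *
     *   struct iommu_ioas_area_split split = {
     *       .size = sizeof(split), .ioas_id = ioas_id,
     *       .iova = iova_1g + priv_off, .length = 2 * 1024 * 1024,
     *   };
     *   ioctl(iommufd_fd, IOMMU_IOAS_AREA_SPLIT, &split);
     *
     * Neither that struct nor that ioctl exists today.
     */

    /* Step 2 (existing uAPI): unmap only the now-standalone 2M area. */
    struct iommu_ioas_unmap unmap = {
        .size = sizeof(unmap),
        .ioas_id = ioas_id,
        .iova = iova_1g + priv_off,
        .length = 2 * 1024 * 1024,
    };
    return ioctl(iommufd_fd, IOMMU_IOAS_UNMAP, &unmap);
}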
Jason
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
[not found] ` <f3aaffe7-7045-4288-8675-349115a867ce@redhat.com>
@ 2025-01-20 17:21 ` Peter Xu
2025-01-20 17:54 ` David Hildenbrand
0 siblings, 1 reply; 98+ messages in thread
From: Peter Xu @ 2025-01-20 17:21 UTC (permalink / raw)
To: David Hildenbrand
Cc: Alexey Kardashevskiy, Chenyi Qiang, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Mon, Jan 20, 2025 at 11:48:39AM +0100, David Hildenbrand wrote:
> Sorry, I was traveling end of last week. I wrote a mail on the train and
> apparently it was swallowed somehow ...
>
> > > Not sure that's the right place. Isn't it the (cc) machine that controls
> > > the state?
> >
> > KVM does, via MemoryRegion->RAMBlock->guest_memfd.
>
> Right; I consider KVM part of the machine.
>
>
> >
> > > It's not really the memory backend, that's just the memory provider.
> >
> > Sorry but is not "providing memory" the purpose of "memory backend"? :)
>
> Hehe, what I wanted to say is that a memory backend is just something to
> create a RAMBlock. There are different ways to create a RAMBlock, even
> guest_memfd ones.
>
> guest_memfd is stored per RAMBlock. I assume the state should be stored per
> RAMBlock as well, maybe as part of a "guest_memfd state" thing.
>
> Now, the question is, who is the manager?
>
> 1) The machine. KVM requests the machine to perform the transition, and the
> machine takes care of updating the guest_memfd state and notifying any
> listeners.
>
> 2) The RAMBlock. Then we need some other Object to trigger that. Maybe
> RAMBlock would have to become an object, or we allocate separate objects.
>
> I'm leaning towards 1), but I might be missing something.
A pure question: how do we process the bios gmemfds? I assume they're
shared when VM starts if QEMU needs to load the bios into it, but are they
always shared, or can they be converted to private later?
I wonder if it's possible (now, or in the future so it can be >2 fds) that
a VM can contain multiple guest_memfds, meanwhile they request different
security levels. Then it could be more future proof that such idea be
managed per-fd / per-ramblock / .. rather than per-VM. For example, always
shared gmemfds can avoid the manager but be treated like normal memories,
while some gmemfds can still be confidential to install the manager.
But I'd confess these are pretty much wild guesses as of now.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-20 17:21 ` Peter Xu
@ 2025-01-20 17:54 ` David Hildenbrand
2025-01-20 18:33 ` Peter Xu
0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2025-01-20 17:54 UTC (permalink / raw)
To: Peter Xu
Cc: Alexey Kardashevskiy, Chenyi Qiang, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 20.01.25 18:21, Peter Xu wrote:
> On Mon, Jan 20, 2025 at 11:48:39AM +0100, David Hildenbrand wrote:
>> Sorry, I was traveling end of last week. I wrote a mail on the train and
>> apparently it was swallowed somehow ...
>>
>>>> Not sure that's the right place. Isn't it the (cc) machine that controls
>>>> the state?
>>>
>>> KVM does, via MemoryRegion->RAMBlock->guest_memfd.
>>
>> Right; I consider KVM part of the machine.
>>
>>
>>>
>>>> It's not really the memory backend, that's just the memory provider.
>>>
>>> Sorry but is not "providing memory" the purpose of "memory backend"? :)
>>
>> Hehe, what I wanted to say is that a memory backend is just something to
>> create a RAMBlock. There are different ways to create a RAMBlock, even
>> guest_memfd ones.
>>
>> guest_memfd is stored per RAMBlock. I assume the state should be stored per
>> RAMBlock as well, maybe as part of a "guest_memfd state" thing.
>>
>> Now, the question is, who is the manager?
>>
>> 1) The machine. KVM requests the machine to perform the transition, and the
>> machine takes care of updating the guest_memfd state and notifying any
>> listeners.
>>
>> 2) The RAMBlock. Then we need some other Object to trigger that. Maybe
>> RAMBlock would have to become an object, or we allocate separate objects.
>>
>> I'm leaning towards 1), but I might be missing something.
>
> A pure question: how do we process the bios gmemfds? I assume they're
> shared when VM starts if QEMU needs to load the bios into it, but are they
> always shared, or can they be converted to private later?
You're probably looking for memory_region_init_ram_guest_memfd().
>
> I wonder if it's possible (now, or in the future so it can be >2 fds) that
> a VM can contain multiple guest_memfds, meanwhile they request different
> security levels. Then it could be more future proof that such idea be
> managed per-fd / per-ramblock / .. rather than per-VM. For example, always
> shared gmemfds can avoid the manager but be treated like normal memories,
> while some gmemfds can still be confidential to install the manager.
I think all of that is possible with whatever design we chose.
The situation is:
* guest_memfd is per RAMBlock (block->guest_memfd set in ram_block_add)
* Some RAMBlocks have a memory backend, others do not. In particular,
the ones calling memory_region_init_ram_guest_memfd() do not.
So the *guest_memfd information* (fd, bitmap) really must be stored per
RAMBlock.
The question is *which object* implements the RamDiscardManager interface
to manage the RAMBlocks that have a guest_memfd.
We either need
1) Something attached to the RAMBlock or the RAMBlock itself. This
series does it via a new object attached to the RAMBlock.
2) A per-VM entity (e.g., machine, distinct management object)
In case of 1) KVM looks up the RAMBlock->object to trigger the state
change. That object will inform all listeners.
In case of 2) KVM calls the per-VM entity (e.g., guest_memfd manager),
which looks up the RAMBlock and triggers the state change. It will
inform all listeners.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2024-12-13 7:08 ` [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Chenyi Qiang
2024-12-18 6:45 ` Chenyi Qiang
2025-01-08 4:48 ` Alexey Kardashevskiy
@ 2025-01-20 18:09 ` Peter Xu
2025-01-21 9:00 ` Chenyi Qiang
2 siblings, 1 reply; 98+ messages in thread
From: Peter Xu @ 2025-01-20 18:09 UTC (permalink / raw)
To: Chenyi Qiang
Cc: David Hildenbrand, Paolo Bonzini, Philippe Mathieu-Daudé,
Michael Roth, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
Two trivial comments I spot:
On Fri, Dec 13, 2024 at 03:08:44PM +0800, Chenyi Qiang wrote:
> +struct GuestMemfdManager {
> + Object parent;
> +
> + /* Managed memory region. */
> + MemoryRegion *mr;
> +
> + /*
> + * 1-setting of the bit represents the memory is populated (shared).
> + */
> + int32_t bitmap_size;
> + unsigned long *bitmap;
Might be clearer to name the bitmap directly as what it represents. E.g.,
shared_bitmap?
> +
> + /* block size and alignment */
> + uint64_t block_size;
Can we always fetch it from the MR/ramblock? If this is needed, better add
some comment explaining why.
> +
> + /* listeners to notify on populate/discard activity. */
> + QLIST_HEAD(, RamDiscardListener) rdl_list;
> +};
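For what it's worth, a sketch of how the struct could look with those two
comments addressed (purely illustrative; the assumption is that the block
size can be derived from the RAMBlock when needed, e.g. via
qemu_ram_pagesize(mr->ram_block)):

struct GuestMemfdManager {
    Object parent;

    /* Managed memory region. */
    MemoryRegion *mr;

    /*
     * 1-setting of a bit means the corresponding block is populated
     * (shared); the name now says what the bitmap tracks.
     */
    int32_t shared_bitmap_size;
    unsigned long *shared_bitmap;

    /* No cached block_size: fetch it from the MR/RAMBlock when needed. */

    /* Listeners to notify on populate/discard activity. */
    QLIST_HEAD(, RamDiscardListener) rdl_list;
};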
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-20 17:54 ` David Hildenbrand
@ 2025-01-20 18:33 ` Peter Xu
2025-01-20 18:47 ` David Hildenbrand
2025-01-21 1:35 ` Chenyi Qiang
0 siblings, 2 replies; 98+ messages in thread
From: Peter Xu @ 2025-01-20 18:33 UTC (permalink / raw)
To: David Hildenbrand
Cc: Alexey Kardashevskiy, Chenyi Qiang, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Mon, Jan 20, 2025 at 06:54:14PM +0100, David Hildenbrand wrote:
> On 20.01.25 18:21, Peter Xu wrote:
> > On Mon, Jan 20, 2025 at 11:48:39AM +0100, David Hildenbrand wrote:
> > > Sorry, I was traveling end of last week. I wrote a mail on the train and
> > > apparently it was swallowed somehow ...
> > >
> > > > > Not sure that's the right place. Isn't it the (cc) machine that controls
> > > > > the state?
> > > >
> > > > KVM does, via MemoryRegion->RAMBlock->guest_memfd.
> > >
> > > Right; I consider KVM part of the machine.
> > >
> > >
> > > >
> > > > > It's not really the memory backend, that's just the memory provider.
> > > >
> > > > Sorry but is not "providing memory" the purpose of "memory backend"? :)
> > >
> > > Hehe, what I wanted to say is that a memory backend is just something to
> > > create a RAMBlock. There are different ways to create a RAMBlock, even
> > > guest_memfd ones.
> > >
> > > guest_memfd is stored per RAMBlock. I assume the state should be stored per
> > > RAMBlock as well, maybe as part of a "guest_memfd state" thing.
> > >
> > > Now, the question is, who is the manager?
> > >
> > > 1) The machine. KVM requests the machine to perform the transition, and the
> > > machine takes care of updating the guest_memfd state and notifying any
> > > listeners.
> > >
> > > 2) The RAMBlock. Then we need some other Object to trigger that. Maybe
> > > RAMBlock would have to become an object, or we allocate separate objects.
> > >
> > > I'm leaning towards 1), but I might be missing something.
> >
> > A pure question: how do we process the bios gmemfds? I assume they're
> > shared when VM starts if QEMU needs to load the bios into it, but are they
> > always shared, or can they be converted to private later?
>
> You're probably looking for memory_region_init_ram_guest_memfd().
Yes, but I didn't see whether such gmemfd needs conversions there. I saw
an answer though from Chenyi in another email:
https://lore.kernel.org/all/fc7194ee-ed21-4f6b-bf87-147a47f5f074@intel.com/
So I suppose the BIOS region must support private / share conversions too,
just like the rest part.
Though in that case, I'm not 100% sure whether that could also be done by
reusing the major guest memfd with some specific offset regions.
>
> >
> > I wonder if it's possible (now, or in the future so it can be >2 fds) that
> > a VM can contain multiple guest_memfds, meanwhile they request different
> > security levels. Then it could be more future proof that such idea be
> > managed per-fd / per-ramblock / .. rather than per-VM. For example, always
> > shared gmemfds can avoid the manager but be treated like normal memories,
> > while some gmemfds can still be confidential to install the manager.
>
> I think all of that is possible with whatever design we chose.
>
> The situation is:
>
> * guest_memfd is per RAMBlock (block->guest_memfd set in ram_block_add)
> * Some RAMBlocks have a memory backend, others do not. In particular,
> the ones calling memory_region_init_ram_guest_memfd() do not.
>
> So the *guest_memfd information* (fd, bitmap) really must be stored per
> RAMBlock.
>
> The question *which object* implements the RamDiscardManager interface to
> manage the RAMBlocks that have a guest_memfd.
>
> We either need
>
> 1) Something attached to the RAMBlock or the RAMBlock itself. This
> series does it via a new object attached to the RAMBlock.
> 2) A per-VM entity (e.g., machine, distinct management object)
>
> In case of 1) KVM looks up the RAMBlock->object to trigger the state change.
> That object will inform all listeners.
>
> In case of 2) KVM calls the per-VM entity (e.g., guest_memfd manager), which
> looks up the RAMBlock and triggers the state change. It will inform all
> listeners.
(after I finished reading the whole discussion..)
Looks like Yilun raised another point, on how to reuse the same object for
device TIO support here (conversions for device MMIOs):
https://lore.kernel.org/all/Z4RA1vMGFECmYNXp@yilunxu-OptiPlex-7050/
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 0/7] Enable shared device assignment
[not found] ` <cc3428b1-22b7-432a-9c74-12b7e36b6cc6@redhat.com>
@ 2025-01-20 18:39 ` Jason Gunthorpe
0 siblings, 0 replies; 98+ messages in thread
From: Jason Gunthorpe @ 2025-01-20 18:39 UTC (permalink / raw)
To: David Hildenbrand
Cc: Alexey Kardashevskiy, Chenyi Qiang, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Mon, Jan 20, 2025 at 01:57:36PM +0100, David Hildenbrand wrote:
> > I don't want to overload those APIs, I prefer to see a new API that is
> > just about splitting areas. Splitting is a special operation that can
> > fail depending on driver support.
>
> So we'd just always perform a split-before-unmap. If split fails, we're in
> trouble, just like we would be when unmap would fail.
>
> If the split succeeded, the unmap will succeed *and* be atomic.
>
> That sounds reasonable to me and virtio-mem could benefit from that as well.
Yeah, we just went through removing implicit split-on-unmap behavior from
the iommu code, and I think it was a mistake that it ever existed.
Very few places even want/expect to do split. Instead we ended up with
this situation where it was hard to tell who was even using it, if at
all. Turned out nobody used it.
So I'd like the very special behavior marked out in code. If performance
is a concern, I'd prefer to see split gain an auto-unmap option.
Jason
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-20 18:33 ` Peter Xu
@ 2025-01-20 18:47 ` David Hildenbrand
2025-01-20 20:19 ` Peter Xu
2025-01-21 1:35 ` Chenyi Qiang
1 sibling, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2025-01-20 18:47 UTC (permalink / raw)
To: Peter Xu
Cc: Alexey Kardashevskiy, Chenyi Qiang, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 20.01.25 19:33, Peter Xu wrote:
> On Mon, Jan 20, 2025 at 06:54:14PM +0100, David Hildenbrand wrote:
>> On 20.01.25 18:21, Peter Xu wrote:
>>> On Mon, Jan 20, 2025 at 11:48:39AM +0100, David Hildenbrand wrote:
>>>> Sorry, I was traveling end of last week. I wrote a mail on the train and
>>>> apparently it was swallowed somehow ...
>>>>
>>>>>> Not sure that's the right place. Isn't it the (cc) machine that controls
>>>>>> the state?
>>>>>
>>>>> KVM does, via MemoryRegion->RAMBlock->guest_memfd.
>>>>
>>>> Right; I consider KVM part of the machine.
>>>>
>>>>
>>>>>
>>>>>> It's not really the memory backend, that's just the memory provider.
>>>>>
>>>>> Sorry but is not "providing memory" the purpose of "memory backend"? :)
>>>>
>>>> Hehe, what I wanted to say is that a memory backend is just something to
>>>> create a RAMBlock. There are different ways to create a RAMBlock, even
>>>> guest_memfd ones.
>>>>
>>>> guest_memfd is stored per RAMBlock. I assume the state should be stored per
>>>> RAMBlock as well, maybe as part of a "guest_memfd state" thing.
>>>>
>>>> Now, the question is, who is the manager?
>>>>
>>>> 1) The machine. KVM requests the machine to perform the transition, and the
>>>> machine takes care of updating the guest_memfd state and notifying any
>>>> listeners.
>>>>
>>>> 2) The RAMBlock. Then we need some other Object to trigger that. Maybe
>>>> RAMBlock would have to become an object, or we allocate separate objects.
>>>>
>>>> I'm leaning towards 1), but I might be missing something.
>>>
>>> A pure question: how do we process the bios gmemfds? I assume they're
>>> shared when VM starts if QEMU needs to load the bios into it, but are they
>>> always shared, or can they be converted to private later?
>>
>> You're probably looking for memory_region_init_ram_guest_memfd().
>
> Yes, but I didn't see whether such gmemfd needs conversions there. I saw
> an answer though from Chenyi in another email:
>
> https://lore.kernel.org/all/fc7194ee-ed21-4f6b-bf87-147a47f5f074@intel.com/
>
> So I suppose the BIOS region must support private / share conversions too,
> just like the rest part.
>
> Though in that case, I'm not 100% sure whether that could also be done by
> reusing the major guest memfd with some specific offset regions.
>
>>
>>>
>>> I wonder if it's possible (now, or in the future so it can be >2 fds) that
>>> a VM can contain multiple guest_memfds, meanwhile they request different
>>> security levels. Then it could be more future proof that such idea be
>>> managed per-fd / per-ramblock / .. rather than per-VM. For example, always
>>> shared gmemfds can avoid the manager but be treated like normal memories,
>>> while some gmemfds can still be confidential to install the manager.
>>
>> I think all of that is possible with whatever design we chose.
>>
>> The situation is:
>>
>> * guest_memfd is per RAMBlock (block->guest_memfd set in ram_block_add)
>> * Some RAMBlocks have a memory backend, others do not. In particular,
>> the ones calling memory_region_init_ram_guest_memfd() do not.
>>
>> So the *guest_memfd information* (fd, bitmap) really must be stored per
>> RAMBlock.
>>
>> The question *which object* implements the RamDiscardManager interface to
>> manage the RAMBlocks that have a guest_memfd.
>>
>> We either need
>>
>> 1) Something attached to the RAMBlock or the RAMBlock itself. This
>> series does it via a new object attached to the RAMBlock.
>> 2) A per-VM entity (e.g., machine, distinct management object)
>>
>> In case of 1) KVM looks up the RAMBlock->object to trigger the state change.
>> That object will inform all listeners.
>>
>> In case of 2) KVM calls the per-VM entity (e.g., guest_memfd manager), which
>> looks up the RAMBlock and triggers the state change. It will inform all
>> listeners.
>
> (after I finished reading the whole discussion..)
>
> Looks like Yilun raised another point, on how to reuse the same object for
> device TIO support here (conversions for device MMIOs):
I don't grasp the full picture, but I suspect it would not be RAM (no
RAMBlock?)?
"memory_attribute_manager" is weird if it is not memory, but
memory-mapped I/O ... :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-20 18:47 ` David Hildenbrand
@ 2025-01-20 20:19 ` Peter Xu
2025-01-20 20:25 ` David Hildenbrand
0 siblings, 1 reply; 98+ messages in thread
From: Peter Xu @ 2025-01-20 20:19 UTC (permalink / raw)
To: David Hildenbrand
Cc: Alexey Kardashevskiy, Chenyi Qiang, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Mon, Jan 20, 2025 at 07:47:18PM +0100, David Hildenbrand wrote:
> "memory_attribute_manager" is weird if it is not memory, but memory-mapped
> I/O ... :)
What you said sounds like a better name already than GuestMemfdManager in
this patch.. :) To me it's ok to call MMIO as part of "memory" too, and
"attribute" can describe the shareable / private (as an attribute). I'm
guessing Yilun and Chenyi will figure that out..
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-20 20:19 ` Peter Xu
@ 2025-01-20 20:25 ` David Hildenbrand
2025-01-20 20:43 ` Peter Xu
0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2025-01-20 20:25 UTC (permalink / raw)
To: Peter Xu
Cc: Alexey Kardashevskiy, Chenyi Qiang, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 20.01.25 21:19, Peter Xu wrote:
> On Mon, Jan 20, 2025 at 07:47:18PM +0100, David Hildenbrand wrote:
>> "memory_attribute_manager" is weird if it is not memory, but memory-mapped
>> I/O ... :)
>
> What you said sounds like a better name already than GuestMemfdManager in
> this patch..
Agreed.
> :) To me it's ok to call MMIO as part of "memory" too, and
> "attribute" can describe the shareable / private (as an attribute). I'm
> guessing Yilun and Chenyi will figure that out..
Yes, calling it "attributes" popped up during the RFC discussion: in theory,
discarded vs. populated and shared vs. private could co-exist (maybe in
the future with virtio-mem or something similar).
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-20 20:25 ` David Hildenbrand
@ 2025-01-20 20:43 ` Peter Xu
0 siblings, 0 replies; 98+ messages in thread
From: Peter Xu @ 2025-01-20 20:43 UTC (permalink / raw)
To: David Hildenbrand
Cc: Alexey Kardashevskiy, Chenyi Qiang, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Mon, Jan 20, 2025 at 09:25:51PM +0100, David Hildenbrand wrote:
> Yes, calling it "attributes" popped up during RFC discussion: in theory,
> disacard vs. populated and shared vs. private could co-exist (maybe in the
> future with virtio-mem or something similar).
Yes, makes sense. The attribute can then be easily converted into something
like "user_accessible" / ... That would ultimately equal "populated (plugged)
&& shared".
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
[not found] ` <2b2730f3-6e1a-4def-b126-078cf6249759@amd.com>
@ 2025-01-20 20:46 ` Peter Xu
2024-06-24 16:31 ` Xu Yilun
0 siblings, 1 reply; 98+ messages in thread
From: Peter Xu @ 2025-01-20 20:46 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Chenyi Qiang, David Hildenbrand, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Mon, Jan 20, 2025 at 09:22:50PM +1100, Alexey Kardashevskiy wrote:
> > It is still uncertain how to implement the private MMIO. Our assumption
> > is the private MMIO would also create a memory region with
> > guest_memfd-like backend. Its mr->ram is true and should be managed by
> > RamDiscardManager which can skip doing DMA_MAP in VFIO's region_add
> > listener.
>
> My current working approach is to leave it as is in QEMU and VFIO.
Agreed. Setting ram=true to even private MMIO sounds hackish, at least
currently QEMU heavily rely on that flag for any possible direct accesses.
E.g., in memory_access_is_direct().
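For reference, memory_access_is_direct() is roughly the following
(paraphrased from include/exec/memory.h; the exact conditions may differ
between QEMU versions):

/* Paraphrased sketch -- the point is that everything hinges on the MR
 * being flagged as RAM, which is why ram=true on private MMIO would let
 * QEMU attempt direct loads/stores into it. */
static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
{
    if (is_write) {
        return memory_region_is_ram(mr) && !mr->readonly &&
               !mr->rom_device && !memory_region_is_ram_device(mr);
    }
    return (memory_region_is_ram(mr) && !memory_region_is_ram_device(mr)) ||
           memory_region_is_romd(mr);
}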
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-20 18:33 ` Peter Xu
2025-01-20 18:47 ` David Hildenbrand
@ 2025-01-21 1:35 ` Chenyi Qiang
2025-01-21 16:35 ` Peter Xu
1 sibling, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-21 1:35 UTC (permalink / raw)
To: Peter Xu, David Hildenbrand
Cc: Alexey Kardashevskiy, Paolo Bonzini, Philippe Mathieu-Daudé,
Michael Roth, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
On 1/21/2025 2:33 AM, Peter Xu wrote:
> On Mon, Jan 20, 2025 at 06:54:14PM +0100, David Hildenbrand wrote:
>> On 20.01.25 18:21, Peter Xu wrote:
>>> On Mon, Jan 20, 2025 at 11:48:39AM +0100, David Hildenbrand wrote:
>>>> Sorry, I was traveling end of last week. I wrote a mail on the train and
>>>> apparently it was swallowed somehow ...
>>>>
>>>>>> Not sure that's the right place. Isn't it the (cc) machine that controls
>>>>>> the state?
>>>>>
>>>>> KVM does, via MemoryRegion->RAMBlock->guest_memfd.
>>>>
>>>> Right; I consider KVM part of the machine.
>>>>
>>>>
>>>>>
>>>>>> It's not really the memory backend, that's just the memory provider.
>>>>>
>>>>> Sorry but is not "providing memory" the purpose of "memory backend"? :)
>>>>
>>>> Hehe, what I wanted to say is that a memory backend is just something to
>>>> create a RAMBlock. There are different ways to create a RAMBlock, even
>>>> guest_memfd ones.
>>>>
>>>> guest_memfd is stored per RAMBlock. I assume the state should be stored per
>>>> RAMBlock as well, maybe as part of a "guest_memfd state" thing.
>>>>
>>>> Now, the question is, who is the manager?
>>>>
>>>> 1) The machine. KVM requests the machine to perform the transition, and the
>>>> machine takes care of updating the guest_memfd state and notifying any
>>>> listeners.
>>>>
>>>> 2) The RAMBlock. Then we need some other Object to trigger that. Maybe
>>>> RAMBlock would have to become an object, or we allocate separate objects.
>>>>
>>>> I'm leaning towards 1), but I might be missing something.
>>>
>>> A pure question: how do we process the bios gmemfds? I assume they're
>>> shared when VM starts if QEMU needs to load the bios into it, but are they
>>> always shared, or can they be converted to private later?
>>
>> You're probably looking for memory_region_init_ram_guest_memfd().
>
> Yes, but I didn't see whether such gmemfd needs conversions there. I saw
> an answer though from Chenyi in another email:
>
> https://lore.kernel.org/all/fc7194ee-ed21-4f6b-bf87-147a47f5f074@intel.com/
>
> So I suppose the BIOS region must support private / share conversions too,
> just like the rest part.
Yes, the BIOS region can support conversion as well. I think guest_memfd-backed
memory regions all follow the same sequence during setup time:
guest_memfd is shared when the guest_memfd fd is created by
kvm_create_guest_memfd() in ram_block_add(), but it will soon be
converted to private just after kvm_set_user_memory_region() in
kvm_set_phys_mem(). So at boot time of a cc VM, the default attribute
is private. During runtime, the vBIOS can also do the conversion if it
wants.
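For reference, at the KVM uAPI level that setup-time flip to private is just
a memory-attribute update on the whole range; a minimal sketch (not the QEMU
code, error handling omitted, vm_fd/gpa/size assumed to come from the caller):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Mark a freshly registered guest_memfd-backed GPA range as private. */
static int set_range_private(int vm_fd, uint64_t gpa, uint64_t size)
{
    struct kvm_memory_attributes attrs = {
        .address    = gpa,
        .size       = size,
        .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
    };

    return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}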
>
> Though in that case, I'm not 100% sure whether that could also be done by
> reusing the major guest memfd with some specific offset regions.
Not sure if I understand you correctly. guest_memfd is per-RAMBlock. It
will have its own slot. So the vBIOS can use its own guest_memfd to get
the specific offset regions.
>
>>
>>>
>>> I wonder if it's possible (now, or in the future so it can be >2 fds) that
>>> a VM can contain multiple guest_memfds, meanwhile they request different
>>> security levels. Then it could be more future proof that such idea be
>>> managed per-fd / per-ramblock / .. rather than per-VM. For example, always
>>> shared gmemfds can avoid the manager but be treated like normal memories,
>>> while some gmemfds can still be confidential to install the manager.
>>
>> I think all of that is possible with whatever design we chose.
>>
>> The situation is:
>>
>> * guest_memfd is per RAMBlock (block->guest_memfd set in ram_block_add)
>> * Some RAMBlocks have a memory backend, others do not. In particular,
>> the ones calling memory_region_init_ram_guest_memfd() do not.
>>
>> So the *guest_memfd information* (fd, bitmap) really must be stored per
>> RAMBlock.
>>
>> The question *which object* implements the RamDiscardManager interface to
>> manage the RAMBlocks that have a guest_memfd.
>>
>> We either need
>>
>> 1) Something attached to the RAMBlock or the RAMBlock itself. This
>> series does it via a new object attached to the RAMBlock.
>> 2) A per-VM entity (e.g., machine, distinct management object)
>>
>> In case of 1) KVM looks up the RAMBlock->object to trigger the state change.
>> That object will inform all listeners.
>>
>> In case of 2) KVM calls the per-VM entity (e.g., guest_memfd manager), which
>> looks up the RAMBlock and triggers the state change. It will inform all
>> listeners.
>
> (after I finished reading the whole discussion..)
>
> Looks like Yilun raised another point, on how to reuse the same object for
> device TIO support here (conversions for device MMIOs):
>
> https://lore.kernel.org/all/Z4RA1vMGFECmYNXp@yilunxu-OptiPlex-7050/
>
> Thanks,
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 6/7] RAMBlock: make guest_memfd require coordinate discard
[not found] ` <e1141052-1dec-435b-8635-a41881fedd4c@redhat.com>
@ 2025-01-21 6:26 ` Chenyi Qiang
2025-01-21 8:05 ` David Hildenbrand
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-21 6:26 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/20/2025 9:11 PM, David Hildenbrand wrote:
> On 14.01.25 02:38, Chenyi Qiang wrote:
>>
>>
>> On 1/13/2025 6:56 PM, David Hildenbrand wrote:
>>> On 13.12.24 08:08, Chenyi Qiang wrote:
>>>> As guest_memfd is now managed by guest_memfd_manager with
>>>> RamDiscardManager, only block uncoordinated discard.
>>>>
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>> ---
>>>> system/physmem.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>> index 532182a6dd..585090b063 100644
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -1872,7 +1872,7 @@ static void ram_block_add(RAMBlock *new_block,
>>>> Error **errp)
>>>> assert(kvm_enabled());
>>>> assert(new_block->guest_memfd < 0);
>>>> - ret = ram_block_discard_require(true);
>>>> + ret = ram_block_coordinated_discard_require(true);
>>>> if (ret < 0) {
>>>> error_setg_errno(errp, -ret,
>>>> "cannot set up private guest memory:
>>>> discard currently blocked");
>>>
>>> Would that also unlock virtio-mem by accident?
>>
>> Hum, that's true. At present, the rdm in MR can only point to one
>> instance, thus if we unlock virtio-mem and try to use it with
>> guest_memfd, it would trigger assert in
>> memory_region_set_ram_discard_manager().
>>
>> Maybe we need to add some explicit check in virtio-mem to exclude it
>> with guest_memfd at present?
>
> Likely we should make memory_region_set_ram_discard_manager() fail if
> there is already something, and handle it in the callers?
>
> In case of virtio-mem, we'd have to undo what we did and fail realize().
>
> In case of CC, we'd have to bail out in a different way.
>
>
> Then, I think if we see new_block->guest_memfd here, that we can assume
> that any coordinated discard corresponds to only the guest_memfd one,
> not to anything else?
LGTM. In case of CC, I think we can also check the
memory_region_set_ram_discard_manager() failure, undo what we did and
make the ram_block_add() fail (set errno).
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 6/7] RAMBlock: make guest_memfd require coordinate discard
2025-01-21 6:26 ` Chenyi Qiang
@ 2025-01-21 8:05 ` David Hildenbrand
0 siblings, 0 replies; 98+ messages in thread
From: David Hildenbrand @ 2025-01-21 8:05 UTC (permalink / raw)
To: Chenyi Qiang, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 21.01.25 07:26, Chenyi Qiang wrote:
>
>
> On 1/20/2025 9:11 PM, David Hildenbrand wrote:
>> On 14.01.25 02:38, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/13/2025 6:56 PM, David Hildenbrand wrote:
>>>> On 13.12.24 08:08, Chenyi Qiang wrote:
>>>>> As guest_memfd is now managed by guest_memfd_manager with
>>>>> RamDiscardManager, only block uncoordinated discard.
>>>>>
>>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>>> ---
>>>>> system/physmem.c | 2 +-
>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>> index 532182a6dd..585090b063 100644
>>>>> --- a/system/physmem.c
>>>>> +++ b/system/physmem.c
>>>>> @@ -1872,7 +1872,7 @@ static void ram_block_add(RAMBlock *new_block,
>>>>> Error **errp)
>>>>> assert(kvm_enabled());
>>>>> assert(new_block->guest_memfd < 0);
>>>>> - ret = ram_block_discard_require(true);
>>>>> + ret = ram_block_coordinated_discard_require(true);
>>>>> if (ret < 0) {
>>>>> error_setg_errno(errp, -ret,
>>>>> "cannot set up private guest memory:
>>>>> discard currently blocked");
>>>>
>>>> Would that also unlock virtio-mem by accident?
>>>
>>> Hum, that's true. At present, the rdm in MR can only point to one
>>> instance, thus if we unlock virtio-mem and try to use it with
>>> guest_memfd, it would trigger assert in
>>> memory_region_set_ram_discard_manager().
>>>
>>> Maybe we need to add some explicit check in virtio-mem to exclude it
>>> with guest_memfd at present?
>>
>> Likely we should make memory_region_set_ram_discard_manager() fail if
>> there is already something, and handle it in the callers?
>>
>> In case of virtio-mem, we'd have to undo what we did and fail realize().
>>
>> In case of CC, we'd have to bail out in a different way.
>>
>>
>> Then, I think if we see new_block->guest_memfd here, that we can assume
>> that any coordinated discard corresponds to only the guest_memfd one,
>> not to anything else?
>
> LGTM. In case of CC, I think we can also check the
> memory_region_set_ram_discard_manager() failure, undo what we did and
> make the ram_block_add() fail (set errno).
As we have memory_region_has_ram_discard_manager(), we could also check
that instead of failing memory_region_set_ram_discard_manager().
But failing memory_region_set_ram_discard_manager() will force everybody
to handle that, so it might be the better choice.
Of course, setting it to "NULL" should be guaranteed to never fail.
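Just to sketch that direction (rough illustration only, signatures
approximated, not an actual patch):

/* Let the setter refuse to replace an already-installed manager. */
int memory_region_set_ram_discard_manager(MemoryRegion *mr,
                                          RamDiscardManager *rdm)
{
    g_assert(memory_region_is_ram(mr));
    if (rdm && mr->rdm) {
        return -EBUSY;      /* some other manager is already installed */
    }
    mr->rdm = rdm;          /* clearing (rdm == NULL) always succeeds */
    return 0;
}

virtio-mem's realize() would then undo its setup and fail on error, and the
guest_memfd/CC path would bail out in ram_block_add() accordingly.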
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-20 18:09 ` Peter Xu
@ 2025-01-21 9:00 ` Chenyi Qiang
2025-01-21 9:26 ` David Hildenbrand
2025-01-21 15:38 ` Peter Xu
0 siblings, 2 replies; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-21 9:00 UTC (permalink / raw)
To: Peter Xu
Cc: David Hildenbrand, Paolo Bonzini, Philippe Mathieu-Daudé,
Michael Roth, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
Thanks Peter for your review!
On 1/21/2025 2:09 AM, Peter Xu wrote:
> Two trivial comments I spot:
>
> On Fri, Dec 13, 2024 at 03:08:44PM +0800, Chenyi Qiang wrote:
>> +struct GuestMemfdManager {
>> + Object parent;
>> +
>> + /* Managed memory region. */
>> + MemoryRegion *mr;
>> +
>> + /*
>> + * 1-setting of the bit represents the memory is populated (shared).
>> + */
>> + int32_t bitmap_size;
>> + unsigned long *bitmap;
>
> Might be clearer to name the bitmap directly as what it represents. E.g.,
> shared_bitmap?
Make sense.
>
>> +
>> + /* block size and alignment */
>> + uint64_t block_size;
>
> Can we always fetch it from the MR/ramblock? If this is needed, better add
> some comment explaining why.
The block_size is the granularity used to track the private/shared
attribute in the bitmap. It is currently hardcoded to 4K as guest_memfd
may manipulate the page conversion in at least 4K size and alignment.
I think it is essentially a variable that caches the size and avoids many
getpagesize() calls.
>
>> +
>> + /* listeners to notify on populate/discard activity. */
>> + QLIST_HEAD(, RamDiscardListener) rdl_list;
>> +};
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-21 9:00 ` Chenyi Qiang
@ 2025-01-21 9:26 ` David Hildenbrand
2025-01-21 10:16 ` Chenyi Qiang
2025-01-21 15:38 ` Peter Xu
1 sibling, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2025-01-21 9:26 UTC (permalink / raw)
To: Chenyi Qiang, Peter Xu
Cc: Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 21.01.25 10:00, Chenyi Qiang wrote:
> Thanks Peter for your review!
>
> On 1/21/2025 2:09 AM, Peter Xu wrote:
>> Two trivial comments I spot:
>>
>> On Fri, Dec 13, 2024 at 03:08:44PM +0800, Chenyi Qiang wrote:
>>> +struct GuestMemfdManager {
>>> + Object parent;
>>> +
>>> + /* Managed memory region. */
>>> + MemoryRegion *mr;
>>> +
>>> + /*
>>> + * 1-setting of the bit represents the memory is populated (shared).
>>> + */
>>> + int32_t bitmap_size;
>>> + unsigned long *bitmap;
>>
>> Might be clearer to name the bitmap directly as what it represents. E.g.,
>> shared_bitmap?
>
> Make sense.
>
BTW, I was wondering if this information should be stored/linked from
the RAMBlock, where we already store the guest_memfd "int guest_memfd;".
For example, having a "struct guest_memfd_state", and either embedding
it in the RAMBlock or dynamically allocating and linking it.
Alternatively, it would be such an object that we would simply link from
the RAMBlock. (depending on which object will implement the manager
interface)
In any case, having all guest_memfd state that belongs to a RAMBlock at
a single location might be cleanest.
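Something along these lines (just a sketch, field names invented here):

/* All per-RAMBlock guest_memfd state kept in one place. */
typedef struct GuestMemfdState {
    int fd;                        /* today: RAMBlock::guest_memfd         */
    unsigned long *shared_bitmap;  /* 1 = shared (populated), 0 = private  */
    int32_t shared_bitmap_size;
} GuestMemfdState;

/*
 * RAMBlock would then either embed this struct or carry a pointer to it
 * (or to whatever object ends up implementing the RamDiscardManager
 * interface), e.g.:
 *
 *     GuestMemfdState *gmfd_state;   // NULL when there is no guest_memfd
 */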
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-21 9:26 ` David Hildenbrand
@ 2025-01-21 10:16 ` Chenyi Qiang
2025-01-21 10:26 ` David Hildenbrand
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-21 10:16 UTC (permalink / raw)
To: David Hildenbrand, Peter Xu
Cc: Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/21/2025 5:26 PM, David Hildenbrand wrote:
> On 21.01.25 10:00, Chenyi Qiang wrote:
>> Thanks Peter for your review!
>>
>> On 1/21/2025 2:09 AM, Peter Xu wrote:
>>> Two trivial comments I spot:
>>>
>>> On Fri, Dec 13, 2024 at 03:08:44PM +0800, Chenyi Qiang wrote:
>>>> +struct GuestMemfdManager {
>>>> + Object parent;
>>>> +
>>>> + /* Managed memory region. */
>>>> + MemoryRegion *mr;
>>>> +
>>>> + /*
>>>> + * 1-setting of the bit represents the memory is populated
>>>> (shared).
>>>> + */
>>>> + int32_t bitmap_size;
>>>> + unsigned long *bitmap;
>>>
>>> Might be clearer to name the bitmap directly as what it represents.
>>> E.g.,
>>> shared_bitmap?
>>
>> Make sense.
>>
>
> BTW, I was wondering if this information should be stored/linked from
> the RAMBlock, where we already store the guest_memdfd "int guest_memfd;".
>
> For example, having a "struct guest_memfd_state", and either embedding
> it in the RAMBlock or dynamically allocating and linking it.
>
> Alternatively, it would be such an object that we would simply link from
> the RAMBlock. (depending on which object will implement the manager
> interface)
>
> In any case, having all guest_memfd state that belongs to a RAMBlock at
> a single location might be cleanest.
Good suggestion. Following the design of this series, we can add a link to
the guest_memfd_manager object in the RAMBlock.
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-21 10:16 ` Chenyi Qiang
@ 2025-01-21 10:26 ` David Hildenbrand
2025-01-22 6:43 ` Chenyi Qiang
0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2025-01-21 10:26 UTC (permalink / raw)
To: Chenyi Qiang, Peter Xu
Cc: Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 21.01.25 11:16, Chenyi Qiang wrote:
>
>
> On 1/21/2025 5:26 PM, David Hildenbrand wrote:
>> On 21.01.25 10:00, Chenyi Qiang wrote:
>>> Thanks Peter for your review!
>>>
>>> On 1/21/2025 2:09 AM, Peter Xu wrote:
>>>> Two trivial comments I spot:
>>>>
>>>> On Fri, Dec 13, 2024 at 03:08:44PM +0800, Chenyi Qiang wrote:
>>>>> +struct GuestMemfdManager {
>>>>> + Object parent;
>>>>> +
>>>>> + /* Managed memory region. */
>>>>> + MemoryRegion *mr;
>>>>> +
>>>>> + /*
>>>>> + * 1-setting of the bit represents the memory is populated
>>>>> (shared).
>>>>> + */
>>>>> + int32_t bitmap_size;
>>>>> + unsigned long *bitmap;
>>>>
>>>> Might be clearer to name the bitmap directly as what it represents.
>>>> E.g.,
>>>> shared_bitmap?
>>>
>>> Make sense.
>>>
>>
>> BTW, I was wondering if this information should be stored/linked from
>> the RAMBlock, where we already store the guest_memdfd "int guest_memfd;".
>>
>> For example, having a "struct guest_memfd_state", and either embedding
>> it in the RAMBlock or dynamically allocating and linking it.
>>
>> Alternatively, it would be such an object that we would simply link from
>> the RAMBlock. (depending on which object will implement the manager
>> interface)
>>
>> In any case, having all guest_memfd state that belongs to a RAMBlock at
>> a single location might be cleanest.
>
> Good suggestion. Follow the design of this series, we can add link to
> the guest_memfd_manager object in RAMBlock.
Or we'll move / link that to the RAM memory region, because that's what
the object actually controls.
It starts getting a bit blurry what should be part of the RAMBlock and
what should be part of the "owning" RAM memory region :(
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2024-06-24 16:31 ` Xu Yilun
@ 2025-01-21 15:18 ` Peter Xu
2025-01-22 4:30 ` Alexey Kardashevskiy
0 siblings, 1 reply; 98+ messages in thread
From: Peter Xu @ 2025-01-21 15:18 UTC (permalink / raw)
To: Xu Yilun
Cc: Alexey Kardashevskiy, Chenyi Qiang, David Hildenbrand,
Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Tue, Jun 25, 2024 at 12:31:13AM +0800, Xu Yilun wrote:
> On Mon, Jan 20, 2025 at 03:46:15PM -0500, Peter Xu wrote:
> > On Mon, Jan 20, 2025 at 09:22:50PM +1100, Alexey Kardashevskiy wrote:
> > > > It is still uncertain how to implement the private MMIO. Our assumption
> > > > is the private MMIO would also create a memory region with
> > > > guest_memfd-like backend. Its mr->ram is true and should be managed by
> > > > RamdDiscardManager which can skip doing DMA_MAP in VFIO's region_add
> > > > listener.
> > >
> > > My current working approach is to leave it as is in QEMU and VFIO.
> >
> > Agreed. Setting ram=true to even private MMIO sounds hackish, at least
>
> The private MMIO refers to assigned MMIO, not emulated MMIO. IIUC,
> normal assigned MMIO is always set ram=true,
>
> void memory_region_init_ram_device_ptr(MemoryRegion *mr,
> Object *owner,
> const char *name,
> uint64_t size,
> void *ptr)
> {
> memory_region_init(mr, owner, name, size);
> mr->ram = true;
>
>
> So I don't think ram=true is a problem here.
I see. If there's always a host pointer then it looks valid. So it means
the device private MMIOs are always mappable since the start?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-21 9:00 ` Chenyi Qiang
2025-01-21 9:26 ` David Hildenbrand
@ 2025-01-21 15:38 ` Peter Xu
2025-01-24 3:40 ` Chenyi Qiang
1 sibling, 1 reply; 98+ messages in thread
From: Peter Xu @ 2025-01-21 15:38 UTC (permalink / raw)
To: Chenyi Qiang
Cc: David Hildenbrand, Paolo Bonzini, Philippe Mathieu-Daudé,
Michael Roth, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
On Tue, Jan 21, 2025 at 05:00:45PM +0800, Chenyi Qiang wrote:
> >> +
> >> + /* block size and alignment */
> >> + uint64_t block_size;
> >
> > Can we always fetch it from the MR/ramblock? If this is needed, better add
> > some comment explaining why.
>
> The block_size is the granularity used to track the private/shared
> attribute in the bitmap. It is currently hardcoded to 4K as guest_memfd
> may manipulate the page conversion in at least 4K size and alignment.
> I think It is somewhat a variable to cache the size and can avoid many
> getpagesize() calls.
Though qemu does it frequently.. e.g. qemu_real_host_page_size() wraps
that. So IIUC that's not a major concern, and if it's a concern maybe we
can cache it globally instead.
OTOH, this is not a per-ramblock limitation either, IIUC. So maybe instead
of caching it per manager, we could have memory_attr_manager_get_psize()
helper (or any better name..):
static uint64_t memory_attr_manager_get_psize(MemoryAttrManager *mgr)
{
    /* Due to limitation of ... always notify with host psize */
    return qemu_real_host_page_size();
}
Then in the future if necessary, switch to:
static uint64_t memory_attr_manager_get_psize(MemoryAttrManager *mgr)
{
    return mgr->mr->ram_block->page_size;
}
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-21 1:35 ` Chenyi Qiang
@ 2025-01-21 16:35 ` Peter Xu
2025-01-22 3:28 ` Chenyi Qiang
0 siblings, 1 reply; 98+ messages in thread
From: Peter Xu @ 2025-01-21 16:35 UTC (permalink / raw)
To: Chenyi Qiang
Cc: David Hildenbrand, Alexey Kardashevskiy, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Tue, Jan 21, 2025 at 09:35:26AM +0800, Chenyi Qiang wrote:
>
>
> On 1/21/2025 2:33 AM, Peter Xu wrote:
> > On Mon, Jan 20, 2025 at 06:54:14PM +0100, David Hildenbrand wrote:
> >> On 20.01.25 18:21, Peter Xu wrote:
> >>> On Mon, Jan 20, 2025 at 11:48:39AM +0100, David Hildenbrand wrote:
> >>>> Sorry, I was traveling end of last week. I wrote a mail on the train and
> >>>> apparently it was swallowed somehow ...
> >>>>
> >>>>>> Not sure that's the right place. Isn't it the (cc) machine that controls
> >>>>>> the state?
> >>>>>
> >>>>> KVM does, via MemoryRegion->RAMBlock->guest_memfd.
> >>>>
> >>>> Right; I consider KVM part of the machine.
> >>>>
> >>>>
> >>>>>
> >>>>>> It's not really the memory backend, that's just the memory provider.
> >>>>>
> >>>>> Sorry but is not "providing memory" the purpose of "memory backend"? :)
> >>>>
> >>>> Hehe, what I wanted to say is that a memory backend is just something to
> >>>> create a RAMBlock. There are different ways to create a RAMBlock, even
> >>>> guest_memfd ones.
> >>>>
> >>>> guest_memfd is stored per RAMBlock. I assume the state should be stored per
> >>>> RAMBlock as well, maybe as part of a "guest_memfd state" thing.
> >>>>
> >>>> Now, the question is, who is the manager?
> >>>>
> >>>> 1) The machine. KVM requests the machine to perform the transition, and the
> >>>> machine takes care of updating the guest_memfd state and notifying any
> >>>> listeners.
> >>>>
> >>>> 2) The RAMBlock. Then we need some other Object to trigger that. Maybe
> >>>> RAMBlock would have to become an object, or we allocate separate objects.
> >>>>
> >>>> I'm leaning towards 1), but I might be missing something.
> >>>
> >>> A pure question: how do we process the bios gmemfds? I assume they're
> >>> shared when VM starts if QEMU needs to load the bios into it, but are they
> >>> always shared, or can they be converted to private later?
> >>
> >> You're probably looking for memory_region_init_ram_guest_memfd().
> >
> > Yes, but I didn't see whether such gmemfd needs conversions there. I saw
> > an answer though from Chenyi in another email:
> >
> > https://lore.kernel.org/all/fc7194ee-ed21-4f6b-bf87-147a47f5f074@intel.com/
> >
> > So I suppose the BIOS region must support private / share conversions too,
> > just like the rest part.
>
> Yes, the BIOS region can support conversion as well. I think guest_memfd
> backed memory regions all follow the same sequence during setup time:
>
> guest_memfd is shared when the guest_memfd fd is created by
> kvm_create_guest_memfd() in ram_block_add(), But it will sooner be
> converted to private just after kvm_set_user_memory_region() in
> kvm_set_phys_mem(). So at the boot time of cc VM, the default attribute
> is private. During runtime, the vBIOS can also do the conversion if it
> wants.
I see.
>
> >
> > Though in that case, I'm not 100% sure whether that could also be done by
> > reusing the major guest memfd with some specific offset regions.
>
> Not sure if I understand you clearly. guest_memfd is per-Ramblock. It
> will have its own slot. So the vBIOS can use its own guest_memfd to get
> the specific offset regions.
Sorry for being confusing, please feel free to ignore my previous comment.
That came from a very limited mindset that maybe one confidential VM should
only have one gmemfd..
Now I see it looks like it's by design open to multiple gmemfds for each
VM, so it's definitely ok that the bios has its own.
Do you know why the bios needs to be convertible? I wonder whether the VM
can copy it over to a private region and do whatever it wants, e.g. attest
that the bios is valid. However this is also more of a pure question.. and
it can be offtopic to this series, so feel free to ignore.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-21 16:35 ` Peter Xu
@ 2025-01-22 3:28 ` Chenyi Qiang
2025-01-22 5:38 ` Xiaoyao Li
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-22 3:28 UTC (permalink / raw)
To: Peter Xu
Cc: David Hildenbrand, Alexey Kardashevskiy, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun, Xiaoyao Li
On 1/22/2025 12:35 AM, Peter Xu wrote:
> On Tue, Jan 21, 2025 at 09:35:26AM +0800, Chenyi Qiang wrote:
>>
>>
>> On 1/21/2025 2:33 AM, Peter Xu wrote:
>>> On Mon, Jan 20, 2025 at 06:54:14PM +0100, David Hildenbrand wrote:
>>>> On 20.01.25 18:21, Peter Xu wrote:
>>>>> On Mon, Jan 20, 2025 at 11:48:39AM +0100, David Hildenbrand wrote:
>>>>>> Sorry, I was traveling end of last week. I wrote a mail on the train and
>>>>>> apparently it was swallowed somehow ...
>>>>>>
>>>>>>>> Not sure that's the right place. Isn't it the (cc) machine that controls
>>>>>>>> the state?
>>>>>>>
>>>>>>> KVM does, via MemoryRegion->RAMBlock->guest_memfd.
>>>>>>
>>>>>> Right; I consider KVM part of the machine.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> It's not really the memory backend, that's just the memory provider.
>>>>>>>
>>>>>>> Sorry but is not "providing memory" the purpose of "memory backend"? :)
>>>>>>
>>>>>> Hehe, what I wanted to say is that a memory backend is just something to
>>>>>> create a RAMBlock. There are different ways to create a RAMBlock, even
>>>>>> guest_memfd ones.
>>>>>>
>>>>>> guest_memfd is stored per RAMBlock. I assume the state should be stored per
>>>>>> RAMBlock as well, maybe as part of a "guest_memfd state" thing.
>>>>>>
>>>>>> Now, the question is, who is the manager?
>>>>>>
>>>>>> 1) The machine. KVM requests the machine to perform the transition, and the
>>>>>> machine takes care of updating the guest_memfd state and notifying any
>>>>>> listeners.
>>>>>>
>>>>>> 2) The RAMBlock. Then we need some other Object to trigger that. Maybe
>>>>>> RAMBlock would have to become an object, or we allocate separate objects.
>>>>>>
>>>>>> I'm leaning towards 1), but I might be missing something.
>>>>>
>>>>> A pure question: how do we process the bios gmemfds? I assume they're
>>>>> shared when VM starts if QEMU needs to load the bios into it, but are they
>>>>> always shared, or can they be converted to private later?
>>>>
>>>> You're probably looking for memory_region_init_ram_guest_memfd().
>>>
>>> Yes, but I didn't see whether such gmemfd needs conversions there. I saw
>>> an answer though from Chenyi in another email:
>>>
>>> https://lore.kernel.org/all/fc7194ee-ed21-4f6b-bf87-147a47f5f074@intel.com/
>>>
>>> So I suppose the BIOS region must support private / share conversions too,
>>> just like the rest part.
>>
>> Yes, the BIOS region can support conversion as well. I think guest_memfd
>> backed memory regions all follow the same sequence during setup time:
>>
>> guest_memfd is shared when the guest_memfd fd is created by
>> kvm_create_guest_memfd() in ram_block_add(), But it will sooner be
>> converted to private just after kvm_set_user_memory_region() in
>> kvm_set_phys_mem(). So at the boot time of cc VM, the default attribute
>> is private. During runtime, the vBIOS can also do the conversion if it
>> wants.
>
> I see.
>
>>
>>>
>>> Though in that case, I'm not 100% sure whether that could also be done by
>>> reusing the major guest memfd with some specific offset regions.
>>
>> Not sure if I understand you clearly. guest_memfd is per-Ramblock. It
>> will have its own slot. So the vBIOS can use its own guest_memfd to get
>> the specific offset regions.
>
> Sorry to be confusing, please feel free to ignore my previous comment.
> That came from a very limited mindset that maybe one confidential VM should
> only have one gmemfd..
>
> Now I see it looks like it's by design open to multiple gmemfds for each
> VM, then it's definitely ok that bios has its own.
>
> Do you know why the bios needs to be convertable? I wonder whether the VM
> can copy it over to a private region and do whatever it wants, e.g. attest
> the bios being valid. However this is also more of a pure question.. and
> it can be offtopic to this series, so feel free to ignore.
AFAIK, the vBIOS won't do conversion after it is set as private at the
beginning. But in theory, the VM can do the conversion at runtime with the
current implementation. As for why the vBIOS is made convertible, I'm also
uncertain. Maybe it is convenient for managing the private/shared
status via guest_memfd, as it's also converted once at the beginning.
>
> Thanks,
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-21 15:18 ` Peter Xu
@ 2025-01-22 4:30 ` Alexey Kardashevskiy
2025-01-22 9:41 ` Xu Yilun
0 siblings, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-22 4:30 UTC (permalink / raw)
To: Peter Xu, Xu Yilun
Cc: Chenyi Qiang, David Hildenbrand, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 22/1/25 02:18, Peter Xu wrote:
> On Tue, Jun 25, 2024 at 12:31:13AM +0800, Xu Yilun wrote:
>> On Mon, Jan 20, 2025 at 03:46:15PM -0500, Peter Xu wrote:
>>> On Mon, Jan 20, 2025 at 09:22:50PM +1100, Alexey Kardashevskiy wrote:
>>>>> It is still uncertain how to implement the private MMIO. Our assumption
>>>>> is the private MMIO would also create a memory region with
>>>>> guest_memfd-like backend. Its mr->ram is true and should be managed by
>>>>> RamdDiscardManager which can skip doing DMA_MAP in VFIO's region_add
>>>>> listener.
>>>>
>>>> My current working approach is to leave it as is in QEMU and VFIO.
>>>
>>> Agreed. Setting ram=true to even private MMIO sounds hackish, at least
>>
>> The private MMIO refers to assigned MMIO, not emulated MMIO. IIUC,
>> normal assigned MMIO is always set ram=true,
>>
>> void memory_region_init_ram_device_ptr(MemoryRegion *mr,
>> Object *owner,
>> const char *name,
>> uint64_t size,
>> void *ptr)
>> {
>> memory_region_init(mr, owner, name, size);
>> mr->ram = true;
>>
>>
>> So I don't think ram=true is a problem here.
>
> I see. If there's always a host pointer then it looks valid. So it means
> the device private MMIOs are always mappable since the start?
Yes. VFIO owns the mapping and does not treat shared/private MMIO any
different at the moment. Thanks,
>
> Thanks,
>
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-22 3:28 ` Chenyi Qiang
@ 2025-01-22 5:38 ` Xiaoyao Li
2025-01-24 0:15 ` Alexey Kardashevskiy
0 siblings, 1 reply; 98+ messages in thread
From: Xiaoyao Li @ 2025-01-22 5:38 UTC (permalink / raw)
To: Chenyi Qiang, Peter Xu
Cc: David Hildenbrand, Alexey Kardashevskiy, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/22/2025 11:28 AM, Chenyi Qiang wrote:
>
>
> On 1/22/2025 12:35 AM, Peter Xu wrote:
>> On Tue, Jan 21, 2025 at 09:35:26AM +0800, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/21/2025 2:33 AM, Peter Xu wrote:
>>>> On Mon, Jan 20, 2025 at 06:54:14PM +0100, David Hildenbrand wrote:
>>>>> On 20.01.25 18:21, Peter Xu wrote:
>>>>>> On Mon, Jan 20, 2025 at 11:48:39AM +0100, David Hildenbrand wrote:
>>>>>>> Sorry, I was traveling end of last week. I wrote a mail on the train and
>>>>>>> apparently it was swallowed somehow ...
>>>>>>>
>>>>>>>>> Not sure that's the right place. Isn't it the (cc) machine that controls
>>>>>>>>> the state?
>>>>>>>>
>>>>>>>> KVM does, via MemoryRegion->RAMBlock->guest_memfd.
>>>>>>>
>>>>>>> Right; I consider KVM part of the machine.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> It's not really the memory backend, that's just the memory provider.
>>>>>>>>
>>>>>>>> Sorry but is not "providing memory" the purpose of "memory backend"? :)
>>>>>>>
>>>>>>> Hehe, what I wanted to say is that a memory backend is just something to
>>>>>>> create a RAMBlock. There are different ways to create a RAMBlock, even
>>>>>>> guest_memfd ones.
>>>>>>>
>>>>>>> guest_memfd is stored per RAMBlock. I assume the state should be stored per
>>>>>>> RAMBlock as well, maybe as part of a "guest_memfd state" thing.
>>>>>>>
>>>>>>> Now, the question is, who is the manager?
>>>>>>>
>>>>>>> 1) The machine. KVM requests the machine to perform the transition, and the
>>>>>>> machine takes care of updating the guest_memfd state and notifying any
>>>>>>> listeners.
>>>>>>>
>>>>>>> 2) The RAMBlock. Then we need some other Object to trigger that. Maybe
>>>>>>> RAMBlock would have to become an object, or we allocate separate objects.
>>>>>>>
>>>>>>> I'm leaning towards 1), but I might be missing something.
>>>>>>
>>>>>> A pure question: how do we process the bios gmemfds? I assume they're
>>>>>> shared when VM starts if QEMU needs to load the bios into it, but are they
>>>>>> always shared, or can they be converted to private later?
>>>>>
>>>>> You're probably looking for memory_region_init_ram_guest_memfd().
>>>>
>>>> Yes, but I didn't see whether such gmemfd needs conversions there. I saw
>>>> an answer though from Chenyi in another email:
>>>>
>>>> https://lore.kernel.org/all/fc7194ee-ed21-4f6b-bf87-147a47f5f074@intel.com/
>>>>
>>>> So I suppose the BIOS region must support private / share conversions too,
>>>> just like the rest part.
>>>
>>> Yes, the BIOS region can support conversion as well. I think guest_memfd
>>> backed memory regions all follow the same sequence during setup time:
>>>
>>> guest_memfd is shared when the guest_memfd fd is created by
>>> kvm_create_guest_memfd() in ram_block_add(), But it will sooner be
>>> converted to private just after kvm_set_user_memory_region() in
>>> kvm_set_phys_mem(). So at the boot time of cc VM, the default attribute
>>> is private. During runtime, the vBIOS can also do the conversion if it
>>> wants.
>>
>> I see.
>>
>>>
>>>>
>>>> Though in that case, I'm not 100% sure whether that could also be done by
>>>> reusing the major guest memfd with some specific offset regions.
>>>
>>> Not sure if I understand you clearly. guest_memfd is per-Ramblock. It
>>> will have its own slot. So the vBIOS can use its own guest_memfd to get
>>> the specific offset regions.
>>
>> Sorry to be confusing, please feel free to ignore my previous comment.
>> That came from a very limited mindset that maybe one confidential VM should
>> only have one gmemfd..
>>
>> Now I see it looks like it's by design open to multiple gmemfds for each
>> VM, then it's definitely ok that bios has its own.
>>
>> Do you know why the bios needs to be convertable? I wonder whether the VM
>> can copy it over to a private region and do whatever it wants, e.g. attest
>> the bios being valid. However this is also more of a pure question.. and
>> it can be offtopic to this series, so feel free to ignore.
>
> AFAIK, the vBIOS won't do conversion after it is set as private at the
> beginning. But in theory, the VM can do the conversion at runtime with
> current implementation. As for why make the vBIOS convertable, I'm also
> uncertain about it. Maybe convenient for managing the private/shared
> status by guest_memfd as it's also converted once at the beginning.
The reason is just that we are too lazy to implement a variant of guest
memfd for vBIOS that is disallowed to be converted from private to shared.
>>
>> Thanks,
>>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-21 10:26 ` David Hildenbrand
@ 2025-01-22 6:43 ` Chenyi Qiang
0 siblings, 0 replies; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-22 6:43 UTC (permalink / raw)
To: David Hildenbrand, Peter Xu
Cc: Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/21/2025 6:26 PM, David Hildenbrand wrote:
> On 21.01.25 11:16, Chenyi Qiang wrote:
>>
>>
>> On 1/21/2025 5:26 PM, David Hildenbrand wrote:
>>> On 21.01.25 10:00, Chenyi Qiang wrote:
>>>> Thanks Peter for your review!
>>>>
>>>> On 1/21/2025 2:09 AM, Peter Xu wrote:
>>>>> Two trivial comments I spot:
>>>>>
>>>>> On Fri, Dec 13, 2024 at 03:08:44PM +0800, Chenyi Qiang wrote:
>>>>>> +struct GuestMemfdManager {
>>>>>> + Object parent;
>>>>>> +
>>>>>> + /* Managed memory region. */
>>>>>> + MemoryRegion *mr;
>>>>>> +
>>>>>> + /*
>>>>>> + * 1-setting of the bit represents the memory is populated
>>>>>> (shared).
>>>>>> + */
>>>>>> + int32_t bitmap_size;
>>>>>> + unsigned long *bitmap;
>>>>>
>>>>> Might be clearer to name the bitmap directly as what it represents.
>>>>> E.g.,
>>>>> shared_bitmap?
>>>>
>>>> Make sense.
>>>>
>>>
>>> BTW, I was wondering if this information should be stored/linked from
>>> the RAMBlock, where we already store the guest_memdfd "int
>>> guest_memfd;".
>>>
>>> For example, having a "struct guest_memfd_state", and either embedding
>>> it in the RAMBlock or dynamically allocating and linking it.
>>>
>>> Alternatively, it would be such an object that we would simply link from
>>> the RAMBlock. (depending on which object will implement the manager
>>> interface)
>>>
>>> In any case, having all guest_memfd state that belongs to a RAMBlock at
>>> a single location might be cleanest.
>>
>> Good suggestion. Follow the design of this series, we can add link to
>> the guest_memfd_manager object in RAMBlock.
>
> Or we'll move / link that to the RAM memory region, because that's what
> the object actually controls.
>
> It starts getting a bit blury what should be part of the RAMBlock and
> what should be part of the "owning" RAM memory region :(
Maybe still part of the RAMBlock. I think the guest_memfd state should go along
with "int guest_memfd" as it is only valid when guest_memfd > 0, and
guest_memfd is only valid for a RAM MemoryRegion.
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-22 4:30 ` Alexey Kardashevskiy
@ 2025-01-22 9:41 ` Xu Yilun
2025-01-22 16:43 ` Peter Xu
0 siblings, 1 reply; 98+ messages in thread
From: Xu Yilun @ 2025-01-22 9:41 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Peter Xu, Chenyi Qiang, David Hildenbrand, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Wed, Jan 22, 2025 at 03:30:05PM +1100, Alexey Kardashevskiy wrote:
>
>
> On 22/1/25 02:18, Peter Xu wrote:
> > On Tue, Jun 25, 2024 at 12:31:13AM +0800, Xu Yilun wrote:
> > > On Mon, Jan 20, 2025 at 03:46:15PM -0500, Peter Xu wrote:
> > > > On Mon, Jan 20, 2025 at 09:22:50PM +1100, Alexey Kardashevskiy wrote:
> > > > > > It is still uncertain how to implement the private MMIO. Our assumption
> > > > > > is the private MMIO would also create a memory region with
> > > > > > guest_memfd-like backend. Its mr->ram is true and should be managed by
> > > > > > RamdDiscardManager which can skip doing DMA_MAP in VFIO's region_add
> > > > > > listener.
> > > > >
> > > > > My current working approach is to leave it as is in QEMU and VFIO.
> > > >
> > > > Agreed. Setting ram=true to even private MMIO sounds hackish, at least
> > >
> > > The private MMIO refers to assigned MMIO, not emulated MMIO. IIUC,
> > > normal assigned MMIO is always set ram=true,
> > >
> > > void memory_region_init_ram_device_ptr(MemoryRegion *mr,
> > > Object *owner,
> > > const char *name,
> > > uint64_t size,
> > > void *ptr)
> > > {
> > > memory_region_init(mr, owner, name, size);
> > > mr->ram = true;
> > >
> > >
> > > So I don't think ram=true is a problem here.
> >
> > I see. If there's always a host pointer then it looks valid. So it means
> > the device private MMIOs are always mappable since the start?
>
> Yes. VFIO owns the mapping and does not treat shared/private MMIO any
> different at the moment. Thanks,
mm.. I'm actually expecting private MMIO not to have a host pointer, just
as private memory does.
But I'm not sure why having a host pointer correlates with mr->ram == true.
Thanks,
Yilun
>
> >
> > Thanks,
> >
>
> --
> Alexey
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-22 9:41 ` Xu Yilun
@ 2025-01-22 16:43 ` Peter Xu
2025-01-23 9:33 ` Xu Yilun
0 siblings, 1 reply; 98+ messages in thread
From: Peter Xu @ 2025-01-22 16:43 UTC (permalink / raw)
To: Xu Yilun
Cc: Alexey Kardashevskiy, Chenyi Qiang, David Hildenbrand,
Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Wed, Jan 22, 2025 at 05:41:31PM +0800, Xu Yilun wrote:
> On Wed, Jan 22, 2025 at 03:30:05PM +1100, Alexey Kardashevskiy wrote:
> >
> >
> > On 22/1/25 02:18, Peter Xu wrote:
> > > On Tue, Jun 25, 2024 at 12:31:13AM +0800, Xu Yilun wrote:
> > > > On Mon, Jan 20, 2025 at 03:46:15PM -0500, Peter Xu wrote:
> > > > > On Mon, Jan 20, 2025 at 09:22:50PM +1100, Alexey Kardashevskiy wrote:
> > > > > > > It is still uncertain how to implement the private MMIO. Our assumption
> > > > > > > is the private MMIO would also create a memory region with
> > > > > > > guest_memfd-like backend. Its mr->ram is true and should be managed by
> > > > > > > RamdDiscardManager which can skip doing DMA_MAP in VFIO's region_add
> > > > > > > listener.
> > > > > >
> > > > > > My current working approach is to leave it as is in QEMU and VFIO.
> > > > >
> > > > > Agreed. Setting ram=true to even private MMIO sounds hackish, at least
> > > >
> > > > The private MMIO refers to assigned MMIO, not emulated MMIO. IIUC,
> > > > normal assigned MMIO is always set ram=true,
> > > >
> > > > void memory_region_init_ram_device_ptr(MemoryRegion *mr,
> > > > Object *owner,
> > > > const char *name,
> > > > uint64_t size,
> > > > void *ptr)
[1]
> > > > {
> > > > memory_region_init(mr, owner, name, size);
> > > > mr->ram = true;
> > > >
> > > >
> > > > So I don't think ram=true is a problem here.
> > >
> > > I see. If there's always a host pointer then it looks valid. So it means
> > > the device private MMIOs are always mappable since the start?
> >
> > Yes. VFIO owns the mapping and does not treat shared/private MMIO any
> > different at the moment. Thanks,
>
> mm.. I'm actually expecting private MMIO not have a host pointer, just
> as private memory do.
>
> But I'm not sure why having host pointer correlates mr->ram == true.
If there is no host pointer, what would you pass into "ptr" as referenced
at [1] above when creating the private MMIO memory region?
OTOH, IIUC guest private memory finally can also have a host pointer (aka,
mmap()-able), it's just that even if it exists, accessing it may crash QEMU
if it's private.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-22 16:43 ` Peter Xu
@ 2025-01-23 9:33 ` Xu Yilun
2025-01-23 16:47 ` Peter Xu
0 siblings, 1 reply; 98+ messages in thread
From: Xu Yilun @ 2025-01-23 9:33 UTC (permalink / raw)
To: Peter Xu
Cc: Alexey Kardashevskiy, Chenyi Qiang, David Hildenbrand,
Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Wed, Jan 22, 2025 at 11:43:01AM -0500, Peter Xu wrote:
> On Wed, Jan 22, 2025 at 05:41:31PM +0800, Xu Yilun wrote:
> > On Wed, Jan 22, 2025 at 03:30:05PM +1100, Alexey Kardashevskiy wrote:
> > >
> > >
> > > On 22/1/25 02:18, Peter Xu wrote:
> > > > On Tue, Jun 25, 2024 at 12:31:13AM +0800, Xu Yilun wrote:
> > > > > On Mon, Jan 20, 2025 at 03:46:15PM -0500, Peter Xu wrote:
> > > > > > On Mon, Jan 20, 2025 at 09:22:50PM +1100, Alexey Kardashevskiy wrote:
> > > > > > > > It is still uncertain how to implement the private MMIO. Our assumption
> > > > > > > > is the private MMIO would also create a memory region with
> > > > > > > > guest_memfd-like backend. Its mr->ram is true and should be managed by
> > > > > > > > RamdDiscardManager which can skip doing DMA_MAP in VFIO's region_add
> > > > > > > > listener.
> > > > > > >
> > > > > > > My current working approach is to leave it as is in QEMU and VFIO.
> > > > > >
> > > > > > Agreed. Setting ram=true to even private MMIO sounds hackish, at least
> > > > >
> > > > > The private MMIO refers to assigned MMIO, not emulated MMIO. IIUC,
> > > > > normal assigned MMIO is always set ram=true,
> > > > >
> > > > > void memory_region_init_ram_device_ptr(MemoryRegion *mr,
> > > > > Object *owner,
> > > > > const char *name,
> > > > > uint64_t size,
> > > > > void *ptr)
>
> [1]
>
> > > > > {
> > > > > memory_region_init(mr, owner, name, size);
> > > > > mr->ram = true;
> > > > >
> > > > >
> > > > > So I don't think ram=true is a problem here.
> > > >
> > > > I see. If there's always a host pointer then it looks valid. So it means
> > > > the device private MMIOs are always mappable since the start?
> > >
> > > Yes. VFIO owns the mapping and does not treat shared/private MMIO any
> > > different at the moment. Thanks,
> >
> > mm.. I'm actually expecting private MMIO not have a host pointer, just
> > as private memory do.
> >
> > But I'm not sure why having host pointer correlates mr->ram == true.
>
> If there is no host pointer, what would you pass into "ptr" as referenced
> at [1] above when creating the private MMIO memory region?
Sorry for the confusion. I mean existing MMIO regions already set mr->ram = true,
and unmappable regions (gmem) also set mr->ram = true. So I don't know why
mr->ram = true for private MMIO is hackish.
I think we could add another helper to create a memory region for private
MMIO.
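For context, the existing ram_device path for an assigned BAR is roughly the
following (heavily simplified, error handling and the real hw/vfio/ details
omitted); a private-MMIO helper would presumably differ mainly in whether and
how the host pointer is provided:

#include <sys/mman.h>
#include "exec/memory.h"

/* Simplified sketch: mmap() the BAR from the VFIO device fd and wrap the
 * pointer in a ram_device region -- ram=true, with a host pointer. */
static void map_assigned_bar(MemoryRegion *mr, Object *owner,
                             int device_fd, off_t offset, uint64_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   device_fd, offset);
    memory_region_init_ram_device_ptr(mr, owner, "assigned-bar", size, p);
}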
>
> OTOH, IIUC guest private memory finally can also have a host pointer (aka,
> mmap()-able), it's just that even if it exists, accessing it may crash QEMU
> if it's private.
Not sure if I got it right: when memory is converted to private, QEMU
should first unmap the host ptr, which means the host ptr doesn't always exist.
Thanks,
Yilun
>
> Thanks,
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-23 9:33 ` Xu Yilun
@ 2025-01-23 16:47 ` Peter Xu
2025-01-24 9:47 ` Xu Yilun
0 siblings, 1 reply; 98+ messages in thread
From: Peter Xu @ 2025-01-23 16:47 UTC (permalink / raw)
To: Xu Yilun
Cc: Alexey Kardashevskiy, Chenyi Qiang, David Hildenbrand,
Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Thu, Jan 23, 2025 at 05:33:53PM +0800, Xu Yilun wrote:
> On Wed, Jan 22, 2025 at 11:43:01AM -0500, Peter Xu wrote:
> > On Wed, Jan 22, 2025 at 05:41:31PM +0800, Xu Yilun wrote:
> > > On Wed, Jan 22, 2025 at 03:30:05PM +1100, Alexey Kardashevskiy wrote:
> > > >
> > > >
> > > > On 22/1/25 02:18, Peter Xu wrote:
> > > > > On Tue, Jun 25, 2024 at 12:31:13AM +0800, Xu Yilun wrote:
> > > > > > On Mon, Jan 20, 2025 at 03:46:15PM -0500, Peter Xu wrote:
> > > > > > > On Mon, Jan 20, 2025 at 09:22:50PM +1100, Alexey Kardashevskiy wrote:
> > > > > > > > > It is still uncertain how to implement the private MMIO. Our assumption
> > > > > > > > > is the private MMIO would also create a memory region with
> > > > > > > > > guest_memfd-like backend. Its mr->ram is true and should be managed by
> > > > > > > > > RamdDiscardManager which can skip doing DMA_MAP in VFIO's region_add
> > > > > > > > > listener.
> > > > > > > >
> > > > > > > > My current working approach is to leave it as is in QEMU and VFIO.
> > > > > > >
> > > > > > > Agreed. Setting ram=true to even private MMIO sounds hackish, at least
> > > > > >
> > > > > > The private MMIO refers to assigned MMIO, not emulated MMIO. IIUC,
> > > > > > normal assigned MMIO is always set ram=true,
> > > > > >
> > > > > > void memory_region_init_ram_device_ptr(MemoryRegion *mr,
> > > > > > Object *owner,
> > > > > > const char *name,
> > > > > > uint64_t size,
> > > > > > void *ptr)
> >
> > [1]
> >
> > > > > > {
> > > > > > memory_region_init(mr, owner, name, size);
> > > > > > mr->ram = true;
> > > > > >
> > > > > >
> > > > > > So I don't think ram=true is a problem here.
> > > > >
> > > > > I see. If there's always a host pointer then it looks valid. So it means
> > > > > the device private MMIOs are always mappable since the start?
> > > >
> > > > Yes. VFIO owns the mapping and does not treat shared/private MMIO any
> > > > different at the moment. Thanks,
> > >
> > > mm.. I'm actually expecting private MMIO not have a host pointer, just
> > > as private memory do.
> > >
> > > But I'm not sure why having host pointer correlates mr->ram == true.
> >
> > If there is no host pointer, what would you pass into "ptr" as referenced
> > at [1] above when creating the private MMIO memory region?
>
> Sorry for confusion. I mean existing MMIO region use set mr->ram = true,
> and unmappable region (gmem) also set mr->ram = true. So don't know why
> mr->ram = true for private MMIO is hackish.
That's exactly the question I had in the previous email - please
have a look at what QEMU does right now with memory_access_is_direct().
I'm not 100% sure it'll work if the host pointer doesn't exist.
Let's take one user of it to be explicit: flatview_write_continue_step()
will try to access the ram pointer if it's direct:
if (!memory_access_is_direct(mr, true)) {
    ...
} else {
    /* RAM case */
    uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
                                           false, true);
    memmove(ram_ptr, buf, *l);
    invalidate_and_set_dirty(mr, mr_addr, *l);
    return MEMTX_OK;
}
I don't see how QEMU could work yet if an MR sets ram=true but has no
host pointer..
As discussed previously, IMHO it's okay that the pointer is not accessible,
but I still assume QEMU expects the pointer to at least exist for a ram=on
MR. I don't know whether it's suitable to set ram=on if the pointer
doesn't ever exist.
>
> I think We could add another helper to create memory region for private
> MMIO.
>
> >
> > OTOH, IIUC guest private memory finally can also have a host pointer (aka,
> > mmap()-able), it's just that even if it exists, accessing it may crash QEMU
> > if it's private.
>
> Not sure if I get it correct: when memory will be converted to private, QEMU
> should firstly unmap the host ptr, which means host ptr doesn't alway exist.
At least current QEMU doesn't unmap it?
kvm_convert_memory() does ram_block_discard_range() indeed, but those are hole
punches, not unmaps. So the host pointer can always be there.
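(For the record, a simplified sketch of the difference -- not QEMU's actual
ram_block_discard_range(), which handles more cases, but the gist for an
fd-backed mapping:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>

/* Drop the backing pages of a still-mapped range: the VMA (and thus the
 * host pointer) stays valid, only the storage behind it is released. */
static void discard_range(int fd, void *host, off_t offset, size_t length)
{
    /* punch a hole in the backing file / memfd ... */
    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, length);
    /* ... and drop the pages behind the mapping; note: no munmap() */
    madvise(host, length, MADV_DONTNEED);
}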
Even if we could have in-place gmemfd conversions in the future for guest
mem, we would still need the host pointer to be around, in which case (per
my current understanding) it will even avoid hole punching but instead make
the page accessible (by being able to be faulted in).
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-22 5:38 ` Xiaoyao Li
@ 2025-01-24 0:15 ` Alexey Kardashevskiy
2025-01-24 3:09 ` Chenyi Qiang
0 siblings, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-24 0:15 UTC (permalink / raw)
To: Xiaoyao Li, Chenyi Qiang, Peter Xu
Cc: David Hildenbrand, Paolo Bonzini, Philippe Mathieu-Daudé,
Michael Roth, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
On 22/1/25 16:38, Xiaoyao Li wrote:
> On 1/22/2025 11:28 AM, Chenyi Qiang wrote:
>>
>>
>> On 1/22/2025 12:35 AM, Peter Xu wrote:
>>> On Tue, Jan 21, 2025 at 09:35:26AM +0800, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/21/2025 2:33 AM, Peter Xu wrote:
>>>>> On Mon, Jan 20, 2025 at 06:54:14PM +0100, David Hildenbrand wrote:
>>>>>> On 20.01.25 18:21, Peter Xu wrote:
>>>>>>> On Mon, Jan 20, 2025 at 11:48:39AM +0100, David Hildenbrand wrote:
>>>>>>>> Sorry, I was traveling end of last week. I wrote a mail on the
>>>>>>>> train and
>>>>>>>> apparently it was swallowed somehow ...
>>>>>>>>
>>>>>>>>>> Not sure that's the right place. Isn't it the (cc) machine
>>>>>>>>>> that controls
>>>>>>>>>> the state?
>>>>>>>>>
>>>>>>>>> KVM does, via MemoryRegion->RAMBlock->guest_memfd.
>>>>>>>>
>>>>>>>> Right; I consider KVM part of the machine.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> It's not really the memory backend, that's just the memory
>>>>>>>>>> provider.
>>>>>>>>>
>>>>>>>>> Sorry but is not "providing memory" the purpose of "memory
>>>>>>>>> backend"? :)
>>>>>>>>
>>>>>>>> Hehe, what I wanted to say is that a memory backend is just
>>>>>>>> something to
>>>>>>>> create a RAMBlock. There are different ways to create a
>>>>>>>> RAMBlock, even
>>>>>>>> guest_memfd ones.
>>>>>>>>
>>>>>>>> guest_memfd is stored per RAMBlock. I assume the state should be
>>>>>>>> stored per
>>>>>>>> RAMBlock as well, maybe as part of a "guest_memfd state" thing.
>>>>>>>>
>>>>>>>> Now, the question is, who is the manager?
>>>>>>>>
>>>>>>>> 1) The machine. KVM requests the machine to perform the
>>>>>>>> transition, and the
>>>>>>>> machine takes care of updating the guest_memfd state and
>>>>>>>> notifying any
>>>>>>>> listeners.
>>>>>>>>
>>>>>>>> 2) The RAMBlock. Then we need some other Object to trigger that.
>>>>>>>> Maybe
>>>>>>>> RAMBlock would have to become an object, or we allocate separate
>>>>>>>> objects.
>>>>>>>>
>>>>>>>> I'm leaning towards 1), but I might be missing something.
>>>>>>>
>>>>>>> A pure question: how do we process the bios gmemfds? I assume
>>>>>>> they're
>>>>>>> shared when VM starts if QEMU needs to load the bios into it, but
>>>>>>> are they
>>>>>>> always shared, or can they be converted to private later?
>>>>>>
>>>>>> You're probably looking for memory_region_init_ram_guest_memfd().
>>>>>
>>>>> Yes, but I didn't see whether such gmemfd needs conversions there.
>>>>> I saw
>>>>> an answer though from Chenyi in another email:
>>>>>
>>>>> https://lore.kernel.org/all/fc7194ee-ed21-4f6b-bf87-147a47f5f074@intel.com/
>>>>>
>>>>> So I suppose the BIOS region must support private / share
>>>>> conversions too,
>>>>> just like the rest part.
>>>>
>>>> Yes, the BIOS region can support conversion as well. I think
>>>> guest_memfd
>>>> backed memory regions all follow the same sequence during setup time:
>>>>
>>>> guest_memfd is shared when the guest_memfd fd is created by
>>>> kvm_create_guest_memfd() in ram_block_add(), But it will sooner be
>>>> converted to private just after kvm_set_user_memory_region() in
>>>> kvm_set_phys_mem(). So at the boot time of cc VM, the default attribute
>>>> is private. During runtime, the vBIOS can also do the conversion if it
>>>> wants.
>>>
>>> I see.
>>>
>>>>
>>>>>
>>>>> Though in that case, I'm not 100% sure whether that could also be
>>>>> done by
>>>>> reusing the major guest memfd with some specific offset regions.
>>>>
>>>> Not sure if I understand you clearly. guest_memfd is per-Ramblock. It
>>>> will have its own slot. So the vBIOS can use its own guest_memfd to get
>>>> the specific offset regions.
>>>
>>> Sorry to be confusing, please feel free to ignore my previous comment.
>>> That came from a very limited mindset that maybe one confidential VM
>>> should
>>> only have one gmemfd..
>>>
>>> Now I see it looks like it's by design open to multiple gmemfds for each
>>> VM, then it's definitely ok that bios has its own.
>>>
>>> Do you know why the bios needs to be convertable? I wonder whether
>>> the VM
>>> can copy it over to a private region and do whatever it wants, e.g.
>>> attest
>>> the bios being valid. However this is also more of a pure question..
>>> and
>>> it can be offtopic to this series, so feel free to ignore.
>>
>> AFAIK, the vBIOS won't do conversion after it is set as private at the
>> beginning. But in theory, the VM can do the conversion at runtime with
>> current implementation. As for why make the vBIOS convertable, I'm also
>> uncertain about it. Maybe convenient for managing the private/shared
>> status by guest_memfd as it's also converted once at the beginning.
>
> The reason is just that we are too lazy to implement a variant of guest
> memfd for vBIOS that is disallowed to be converted from private to shared.
What is the point in disallowing such conversion in QEMU? On AMD, a
malicious HV can try converting at any time, and if the guest did not ask
for it, it will continue accessing those pages as private and trigger an
RMP fault. But if the guest asked for conversion, then it should be no
problem to convert to shared. What am I missing about TDX here? Thanks,
>
>>>
>>> Thanks,
>>>
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-24 0:15 ` Alexey Kardashevskiy
@ 2025-01-24 3:09 ` Chenyi Qiang
2025-01-24 5:56 ` Alexey Kardashevskiy
0 siblings, 1 reply; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-24 3:09 UTC (permalink / raw)
To: Alexey Kardashevskiy, Xiaoyao Li, Peter Xu
Cc: David Hildenbrand, Paolo Bonzini, Philippe Mathieu-Daudé,
Michael Roth, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
On 1/24/2025 8:15 AM, Alexey Kardashevskiy wrote:
>
>
> On 22/1/25 16:38, Xiaoyao Li wrote:
>> On 1/22/2025 11:28 AM, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/22/2025 12:35 AM, Peter Xu wrote:
>>>> On Tue, Jan 21, 2025 at 09:35:26AM +0800, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 1/21/2025 2:33 AM, Peter Xu wrote:
>>>>>> On Mon, Jan 20, 2025 at 06:54:14PM +0100, David Hildenbrand wrote:
>>>>>>> On 20.01.25 18:21, Peter Xu wrote:
>>>>>>>> On Mon, Jan 20, 2025 at 11:48:39AM +0100, David Hildenbrand wrote:
>>>>>>>>> Sorry, I was traveling end of last week. I wrote a mail on the
>>>>>>>>> train and
>>>>>>>>> apparently it was swallowed somehow ...
>>>>>>>>>
>>>>>>>>>>> Not sure that's the right place. Isn't it the (cc) machine
>>>>>>>>>>> that controls
>>>>>>>>>>> the state?
>>>>>>>>>>
>>>>>>>>>> KVM does, via MemoryRegion->RAMBlock->guest_memfd.
>>>>>>>>>
>>>>>>>>> Right; I consider KVM part of the machine.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> It's not really the memory backend, that's just the memory
>>>>>>>>>>> provider.
>>>>>>>>>>
>>>>>>>>>> Sorry but is not "providing memory" the purpose of "memory
>>>>>>>>>> backend"? :)
>>>>>>>>>
>>>>>>>>> Hehe, what I wanted to say is that a memory backend is just
>>>>>>>>> something to
>>>>>>>>> create a RAMBlock. There are different ways to create a
>>>>>>>>> RAMBlock, even
>>>>>>>>> guest_memfd ones.
>>>>>>>>>
>>>>>>>>> guest_memfd is stored per RAMBlock. I assume the state should
>>>>>>>>> be stored per
>>>>>>>>> RAMBlock as well, maybe as part of a "guest_memfd state" thing.
>>>>>>>>>
>>>>>>>>> Now, the question is, who is the manager?
>>>>>>>>>
>>>>>>>>> 1) The machine. KVM requests the machine to perform the
>>>>>>>>> transition, and the
>>>>>>>>> machine takes care of updating the guest_memfd state and
>>>>>>>>> notifying any
>>>>>>>>> listeners.
>>>>>>>>>
>>>>>>>>> 2) The RAMBlock. Then we need some other Object to trigger
>>>>>>>>> that. Maybe
>>>>>>>>> RAMBlock would have to become an object, or we allocate
>>>>>>>>> separate objects.
>>>>>>>>>
>>>>>>>>> I'm leaning towards 1), but I might be missing something.
>>>>>>>>
>>>>>>>> A pure question: how do we process the bios gmemfds? I assume
>>>>>>>> they're
>>>>>>>> shared when VM starts if QEMU needs to load the bios into it,
>>>>>>>> but are they
>>>>>>>> always shared, or can they be converted to private later?
>>>>>>>
>>>>>>> You're probably looking for memory_region_init_ram_guest_memfd().
>>>>>>
>>>>>> Yes, but I didn't see whether such gmemfd needs conversions
>>>>>> there. I saw
>>>>>> an answer though from Chenyi in another email:
>>>>>>
>>>>>> https://lore.kernel.org/all/fc7194ee-ed21-4f6b-
>>>>>> bf87-147a47f5f074@intel.com/
>>>>>>
>>>>>> So I suppose the BIOS region must support private / share
>>>>>> conversions too,
>>>>>> just like the rest part.
>>>>>
>>>>> Yes, the BIOS region can support conversion as well. I think
>>>>> guest_memfd
>>>>> backed memory regions all follow the same sequence during setup time:
>>>>>
>>>>> guest_memfd is shared when the guest_memfd fd is created by
>>>>> kvm_create_guest_memfd() in ram_block_add(), But it will sooner be
>>>>> converted to private just after kvm_set_user_memory_region() in
>>>>> kvm_set_phys_mem(). So at the boot time of cc VM, the default
>>>>> attribute
>>>>> is private. During runtime, the vBIOS can also do the conversion if it
>>>>> wants.
>>>>
>>>> I see.
>>>>
>>>>>
>>>>>>
>>>>>> Though in that case, I'm not 100% sure whether that could also be
>>>>>> done by
>>>>>> reusing the major guest memfd with some specific offset regions.
>>>>>
>>>>> Not sure if I understand you clearly. guest_memfd is per-Ramblock. It
>>>>> will have its own slot. So the vBIOS can use its own guest_memfd to
>>>>> get
>>>>> the specific offset regions.
>>>>
>>>> Sorry to be confusing, please feel free to ignore my previous comment.
>>>> That came from a very limited mindset that maybe one confidential VM
>>>> should
>>>> only have one gmemfd..
>>>>
>>>> Now I see it looks like it's by design open to multiple gmemfds for
>>>> each
>>>> VM, then it's definitely ok that bios has its own.
>>>>
>>>> Do you know why the bios needs to be convertable? I wonder whether
>>>> the VM
>>>> can copy it over to a private region and do whatever it wants, e.g.
>>>> attest
>>>> the bios being valid. However this is also more of a pure
>>>> question.. and
>>>> it can be offtopic to this series, so feel free to ignore.
>>>
>>> AFAIK, the vBIOS won't do conversion after it is set as private at the
>>> beginning. But in theory, the VM can do the conversion at runtime with
>>> current implementation. As for why make the vBIOS convertable, I'm also
>>> uncertain about it. Maybe convenient for managing the private/shared
>>> status by guest_memfd as it's also converted once at the beginning.
>>
>> The reason is just that we are too lazy to implement a variant of
>> guest memfd for vBIOS that is disallowed to be converted from private
>> to shared.
>
> What is the point in disallowing such conversion in QEMU? On AMD, a
> malicious HV can try converting at any time and if the guest did not ask
> for it, it will continue accessing those pages as private and trigger an
> RMP fault. But if the guest asked for conversion, then it should be no
> problem to convert to shared. What do I miss about TDX here? Thanks,
Re-reading Peter's question, maybe I misunderstood it a little bit.
I thought Peter asked why the vBIOS needs to do page conversion, since it
would stay private and has no need to convert to shared at runtime. In
that case it would not be necessary to manage the vBIOS with a
guest_memfd-backed memory region, as it is only converted to private once
during the setup stage. Xiaoyao mentioned there is no need to implement a
variant of guest_memfd that forbids converting from private to shared. As
you said, allowing such conversion won't bring security issues.
Now, I assume Peter's real question is: can we copy the vBIOS to a
private region and avoid creating a specific guest_memfd-backed memory
region for it?
>
>
>>
>>>>
>>>> Thanks,
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation
[not found] ` <59bd0e82-f269-4567-8f75-a32c9c997ca9@redhat.com>
@ 2025-01-24 3:27 ` Alexey Kardashevskiy
2025-01-24 5:36 ` Chenyi Qiang
0 siblings, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-24 3:27 UTC (permalink / raw)
To: David Hildenbrand, Chenyi Qiang, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 21/1/25 00:06, David Hildenbrand wrote:
> On 10.01.25 06:13, Chenyi Qiang wrote:
>>
>>
>> On 1/9/2025 5:32 PM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 9/1/25 16:34, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>> Introduce the realize()/unrealize() callbacks to initialize/
>>>>>> uninitialize
>>>>>> the new guest_memfd_manager object and register/unregister it in the
>>>>>> target MemoryRegion.
>>>>>>
>>>>>> Guest_memfd was initially set to shared until the commit bd3bcf6962
>>>>>> ("kvm/memory: Make memory type private by default if it has guest
>>>>>> memfd
>>>>>> backend"). To align with this change, the default state in
>>>>>> guest_memfd_manager is set to private. (The bitmap is cleared to 0).
>>>>>> Additionally, setting the default to private can also reduce the
>>>>>> overhead of mapping shared pages into IOMMU by VFIO during the bootup
>>>>>> stage.
>>>>>>
>>>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>>>> ---
>>>>>> include/sysemu/guest-memfd-manager.h | 27 +++++++++++++++++++++++++++
>>>>>> system/guest-memfd-manager.c | 28 +++++++++++++++++++++++++++-
>>>>>> system/physmem.c | 7 +++++++
>>>>>> 3 files changed, 61 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>>>>>> guest-memfd-manager.h
>>>>>> index 9dc4e0346d..d1e7f698e8 100644
>>>>>> --- a/include/sysemu/guest-memfd-manager.h
>>>>>> +++ b/include/sysemu/guest-memfd-manager.h
>>>>>> @@ -42,6 +42,8 @@ struct GuestMemfdManager {
>>>>>> struct GuestMemfdManagerClass {
>>>>>> ObjectClass parent_class;
>>>>>> + void (*realize)(GuestMemfdManager *gmm, MemoryRegion *mr,
>>>>>> uint64_t region_size);
>>>>>> + void (*unrealize)(GuestMemfdManager *gmm);
>>>>>> int (*state_change)(GuestMemfdManager *gmm, uint64_t offset,
>>>>>> uint64_t size,
>>>>>> bool shared_to_private);
>>>>>> };
>>>>>> @@ -61,4 +63,29 @@ static inline int
>>>>>> guest_memfd_manager_state_change(GuestMemfdManager *gmm, uint6
>>>>>> return 0;
>>>>>> }
>>>>>> +static inline void guest_memfd_manager_realize(GuestMemfdManager
>>>>>> *gmm,
>>>>>> + MemoryRegion *mr,
>>>>>> uint64_t region_size)
>>>>>> +{
>>>>>> + GuestMemfdManagerClass *klass;
>>>>>> +
>>>>>> + g_assert(gmm);
>>>>>> + klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
>>>>>> +
>>>>>> + if (klass->realize) {
>>>>>> + klass->realize(gmm, mr, region_size);
>>>>>
>>>>> Ditch realize() hook and call guest_memfd_manager_realizefn()
>>>>> directly?
>>>>> Not clear why these new hooks are needed.
>>>>
>>>>>
>>>>>> + }
>>>>>> +}
>>>>>> +
>>>>>> +static inline void guest_memfd_manager_unrealize(GuestMemfdManager
>>>>>> *gmm)
>>>>>> +{
>>>>>> + GuestMemfdManagerClass *klass;
>>>>>> +
>>>>>> + g_assert(gmm);
>>>>>> + klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
>>>>>> +
>>>>>> + if (klass->unrealize) {
>>>>>> + klass->unrealize(gmm);
>>>>>> + }
>>>>>> +}
>>>>>
>>>>> guest_memfd_manager_unrealizefn()?
>>>>
>>>> Agree. Adding these wrappers seem unnecessary.
>>>>
>>>>>
>>>>>
>>>>>> +
>>>>>> #endif
>>>>>> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-
>>>>>> manager.c
>>>>>> index 6601df5f3f..b6a32f0bfb 100644
>>>>>> --- a/system/guest-memfd-manager.c
>>>>>> +++ b/system/guest-memfd-manager.c
>>>>>> @@ -366,6 +366,31 @@ static int
>>>>>> guest_memfd_state_change(GuestMemfdManager *gmm, uint64_t offset,
>>>>>> return ret;
>>>>>> }
>>>>>> +static void guest_memfd_manager_realizefn(GuestMemfdManager
>>>>>> *gmm,
>>>>>> MemoryRegion *mr,
>>>>>> + uint64_t region_size)
>>>>>> +{
>>>>>> + uint64_t bitmap_size;
>>>>>> +
>>>>>> + gmm->block_size = qemu_real_host_page_size();
>>>>>> + bitmap_size = ROUND_UP(region_size, gmm->block_size) / gmm-
>>>>>>> block_size;
>>>>>
>>>>> imho unaligned region_size should be an assert.
>>>>
>>>> There's no guarantee the region_size of the MemoryRegion is PAGE_SIZE
>>>> aligned. So the ROUND_UP() is more appropriate.
>>>
>>> It is all about DMA so the smallest you can map is PAGE_SIZE so even if
>>> you round up here, it is likely going to fail to DMA-map later anyway
>>> (or not?).
>>
>> Checked the handling of VFIO, if the size is less than PAGE_SIZE, it
>> will just return and won't do DMA-map.
>>
>> Here is a different thing. It tries to calculate the bitmap_size. The
>> bitmap is used to track the private/shared status of the page. So if the
>> size is less than PAGE_SIZE, we still use the one bit to track this
>> small-size range.
>>
>>>
>>>
>>>>>> +
>>>>>> + gmm->mr = mr;
>>>>>> + gmm->bitmap_size = bitmap_size;
>>>>>> + gmm->bitmap = bitmap_new(bitmap_size);
>>>>>> +
>>>>>> + memory_region_set_ram_discard_manager(gmm->mr,
>>>>>> RAM_DISCARD_MANAGER(gmm));
>>>>>> +}
>>>>>
>>>>> This belongs to 2/7.
>>>>>
>>>>>> +
>>>>>> +static void guest_memfd_manager_unrealizefn(GuestMemfdManager *gmm)
>>>>>> +{
>>>>>> + memory_region_set_ram_discard_manager(gmm->mr, NULL);
>>>>>> +
>>>>>> + g_free(gmm->bitmap);
>>>>>> + gmm->bitmap = NULL;
>>>>>> + gmm->bitmap_size = 0;
>>>>>> + gmm->mr = NULL;
>>>>>
>>>>> @gmm is being destroyed here, why bother zeroing?
>>>>
>>>> OK, will remove it.
>>>>
>>>>>
>>>>>> +}
>>>>>> +
>>>>>
>>>>> This function belongs to 2/7.
>>>>
>>>> Will move both realizefn() and unrealizefn().
>>>
>>> Yes.
>>>
>>>
>>>>>
>>>>>> static void guest_memfd_manager_init(Object *obj)
>>>>>> {
>>>>>> GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>>>> @@ -375,7 +400,6 @@ static void guest_memfd_manager_init(Object *obj)
>>>>>> static void guest_memfd_manager_finalize(Object *obj)
>>>>>> {
>>>>>> - g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>>>> }
>>>>>> static void guest_memfd_manager_class_init(ObjectClass *oc,
>>>>>> void
>>>>>> *data)
>>>>>> @@ -384,6 +408,8 @@ static void
>>>>>> guest_memfd_manager_class_init(ObjectClass *oc, void *data)
>>>>>> RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>>>> gmmc->state_change = guest_memfd_state_change;
>>>>>> + gmmc->realize = guest_memfd_manager_realizefn;
>>>>>> + gmmc->unrealize = guest_memfd_manager_unrealizefn;
>>>>>> rdmc->get_min_granularity =
>>>>>> guest_memfd_rdm_get_min_granularity;
>>>>>> rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>>> index dc1db3a384..532182a6dd 100644
>>>>>> --- a/system/physmem.c
>>>>>> +++ b/system/physmem.c
>>>>>> @@ -53,6 +53,7 @@
>>>>>> #include "sysemu/hostmem.h"
>>>>>> #include "sysemu/hw_accel.h"
>>>>>> #include "sysemu/xen-mapcache.h"
>>>>>> +#include "sysemu/guest-memfd-manager.h"
>>>>>> #include "trace.h"
>>>>>> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>>>>> @@ -1885,6 +1886,9 @@ static void ram_block_add(RAMBlock *new_block,
>>>>>> Error **errp)
>>>>>> qemu_mutex_unlock_ramlist();
>>>>>> goto out_free;
>>>>>> }
>>>>>> +
>>>>>> + GuestMemfdManager *gmm =
>>>>>> GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
>>>>>> + guest_memfd_manager_realize(gmm, new_block->mr, new_block-
>>>>>>> mr->size);
>>>>>
>>>>> Wow. Quite invasive.
>>>>
>>>> Yeah... It creates a manager object no matter whether the user wants to
>>>> us e shared passthru or not. We assume some fields like
>>>> private/shared
>>>> bitmap may also be helpful in other scenario for future usage, and
>>>> if no
>>>> passthru device, the listener would just return, so it is acceptable.
>>>
>>> Explain these other scenarios in the commit log please as otherwise
>>> making this an interface of HostMemoryBackendMemfd looks way cleaner.
>>> Thanks,
>>
>> Thanks for the suggestion. Until now, I think making this an interface
>> of HostMemoryBackend is cleaner. The potential future usage for
>> non-HostMemoryBackend guest_memfd-backed memory region I can think of is
>> the the TEE I/O for iommufd P2P support? when it tries to initialize RAM
>> device memory region with the attribute of shared/private. But I think
>> it would be a long term story and we are not sure what it will be like
>> in future.
>
> As raised in #2, I'm don't think this belongs into HostMemoryBackend. It
> kind-of belongs to the RAMBlock, but we could have another object
> (similar to virtio-mem currently managing a single
> HostMemoryBackend->RAMBlock) that takes care of that for multiple memory
> backends.
The vBIOS thingy confused me and then I confused others :) There are 2
things:
1) an interface or new subclass of HostMemoryBackendClass, which we need
in order to advertise and implement the ability to discard pages;
2) RamDiscardManagerClass, which belongs to the MR/RAMBlock and does not
really belong to HostMemoryBackend (as it does in what was posted ages ago).
I suggest Chenyi post a new version using the current approach with the
comments and commit logs fixed. Makes sense? Thanks,
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-21 15:38 ` Peter Xu
@ 2025-01-24 3:40 ` Chenyi Qiang
0 siblings, 0 replies; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-24 3:40 UTC (permalink / raw)
To: Peter Xu
Cc: David Hildenbrand, Paolo Bonzini, Philippe Mathieu-Daudé,
Michael Roth, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
Sorry I missed this mail.
On 1/21/2025 11:38 PM, Peter Xu wrote:
> On Tue, Jan 21, 2025 at 05:00:45PM +0800, Chenyi Qiang wrote:
>>>> +
>>>> + /* block size and alignment */
>>>> + uint64_t block_size;
>>>
>>> Can we always fetch it from the MR/ramblock? If this is needed, better add
>>> some comment explaining why.
>>
>> The block_size is the granularity used to track the private/shared
>> attribute in the bitmap. It is currently hardcoded to 4K as guest_memfd
>> may manipulate the page conversion in at least 4K size and alignment.
>> I think It is somewhat a variable to cache the size and can avoid many
>> getpagesize() calls.
>
> Though qemu does it frequently.. e.g. qemu_real_host_page_size() wraps
> that. So IIUC that's not a major concern, and if it's a concern maybe we
> can cache it globally instead.
>
> OTOH, this is not a per-ramblock limitation either, IIUC. So maybe instead
> of caching it per manager, we could have memory_attr_manager_get_psize()
> helper (or any better name..):
>
> memory_attr_manager_get_psize(MemoryAttrManager *mgr)
> {
> /* Due to limitation of ... always notify with host psize */
> return qemu_real_host_page_size();
> }
>
> Then in the future if necessary, switch to:
>
> memory_attr_manager_get_psize(MemoryAttrManager *mgr)
> {
> return mgr->mr->ramblock->pagesize;
> }
This looks good to me. I'll change it in this way.
>
> Thanks,
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation
2025-01-24 3:27 ` Alexey Kardashevskiy
@ 2025-01-24 5:36 ` Chenyi Qiang
0 siblings, 0 replies; 98+ messages in thread
From: Chenyi Qiang @ 2025-01-24 5:36 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 1/24/2025 11:27 AM, Alexey Kardashevskiy wrote:
>
>
> On 21/1/25 00:06, David Hildenbrand wrote:
>> On 10.01.25 06:13, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/9/2025 5:32 PM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 9/1/25 16:34, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>> Introduce the realize()/unrealize() callbacks to initialize/
>>>>>>> uninitialize
>>>>>>> the new guest_memfd_manager object and register/unregister it in the
>>>>>>> target MemoryRegion.
>>>>>>>
>>>>>>> Guest_memfd was initially set to shared until the commit bd3bcf6962
>>>>>>> ("kvm/memory: Make memory type private by default if it has guest
>>>>>>> memfd
>>>>>>> backend"). To align with this change, the default state in
>>>>>>> guest_memfd_manager is set to private. (The bitmap is cleared to 0).
>>>>>>> Additionally, setting the default to private can also reduce the
>>>>>>> overhead of mapping shared pages into IOMMU by VFIO during the
>>>>>>> bootup
>>>>>>> stage.
>>>>>>>
>>>>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>>>>> ---
>>>>>>> include/sysemu/guest-memfd-manager.h | 27 +++++++++++++++++++++++++++
>>>>>>> system/guest-memfd-manager.c | 28 +++++++++++++++++++++++++++-
>>>>>>> system/physmem.c | 7 +++++++
>>>>>>> 3 files changed, 61 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>>>>>>> guest-memfd-manager.h
>>>>>>> index 9dc4e0346d..d1e7f698e8 100644
>>>>>>> --- a/include/sysemu/guest-memfd-manager.h
>>>>>>> +++ b/include/sysemu/guest-memfd-manager.h
>>>>>>> @@ -42,6 +42,8 @@ struct GuestMemfdManager {
>>>>>>> struct GuestMemfdManagerClass {
>>>>>>> ObjectClass parent_class;
>>>>>>> + void (*realize)(GuestMemfdManager *gmm, MemoryRegion *mr,
>>>>>>> uint64_t region_size);
>>>>>>> + void (*unrealize)(GuestMemfdManager *gmm);
>>>>>>> int (*state_change)(GuestMemfdManager *gmm, uint64_t offset,
>>>>>>> uint64_t size,
>>>>>>> bool shared_to_private);
>>>>>>> };
>>>>>>> @@ -61,4 +63,29 @@ static inline int
>>>>>>> guest_memfd_manager_state_change(GuestMemfdManager *gmm, uint6
>>>>>>> return 0;
>>>>>>> }
>>>>>>> +static inline void
>>>>>>> guest_memfd_manager_realize(GuestMemfdManager
>>>>>>> *gmm,
>>>>>>> + MemoryRegion *mr,
>>>>>>> uint64_t region_size)
>>>>>>> +{
>>>>>>> + GuestMemfdManagerClass *klass;
>>>>>>> +
>>>>>>> + g_assert(gmm);
>>>>>>> + klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
>>>>>>> +
>>>>>>> + if (klass->realize) {
>>>>>>> + klass->realize(gmm, mr, region_size);
>>>>>>
>>>>>> Ditch realize() hook and call guest_memfd_manager_realizefn()
>>>>>> directly?
>>>>>> Not clear why these new hooks are needed.
>>>>>
>>>>>>
>>>>>>> + }
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline void guest_memfd_manager_unrealize(GuestMemfdManager
>>>>>>> *gmm)
>>>>>>> +{
>>>>>>> + GuestMemfdManagerClass *klass;
>>>>>>> +
>>>>>>> + g_assert(gmm);
>>>>>>> + klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
>>>>>>> +
>>>>>>> + if (klass->unrealize) {
>>>>>>> + klass->unrealize(gmm);
>>>>>>> + }
>>>>>>> +}
>>>>>>
>>>>>> guest_memfd_manager_unrealizefn()?
>>>>>
>>>>> Agree. Adding these wrappers seem unnecessary.
>>>>>
>>>>>>
>>>>>>
>>>>>>> +
>>>>>>> #endif
>>>>>>> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-
>>>>>>> manager.c
>>>>>>> index 6601df5f3f..b6a32f0bfb 100644
>>>>>>> --- a/system/guest-memfd-manager.c
>>>>>>> +++ b/system/guest-memfd-manager.c
>>>>>>> @@ -366,6 +366,31 @@ static int
>>>>>>> guest_memfd_state_change(GuestMemfdManager *gmm, uint64_t offset,
>>>>>>> return ret;
>>>>>>> }
>>>>>>> +static void guest_memfd_manager_realizefn(GuestMemfdManager
>>>>>>> *gmm,
>>>>>>> MemoryRegion *mr,
>>>>>>> + uint64_t region_size)
>>>>>>> +{
>>>>>>> + uint64_t bitmap_size;
>>>>>>> +
>>>>>>> + gmm->block_size = qemu_real_host_page_size();
>>>>>>> + bitmap_size = ROUND_UP(region_size, gmm->block_size) / gmm-
>>>>>>>> block_size;
>>>>>>
>>>>>> imho unaligned region_size should be an assert.
>>>>>
>>>>> There's no guarantee the region_size of the MemoryRegion is PAGE_SIZE
>>>>> aligned. So the ROUND_UP() is more appropriate.
>>>>
>>>> It is all about DMA so the smallest you can map is PAGE_SIZE so even if
>>>> you round up here, it is likely going to fail to DMA-map later anyway
>>>> (or not?).
>>>
>>> Checked the handling of VFIO, if the size is less than PAGE_SIZE, it
>>> will just return and won't do DMA-map.
>>>
>>> Here is a different thing. It tries to calculate the bitmap_size. The
>>> bitmap is used to track the private/shared status of the page. So if the
>>> size is less than PAGE_SIZE, we still use the one bit to track this
>>> small-size range.
>>>
>>>>
>>>>
>>>>>>> +
>>>>>>> + gmm->mr = mr;
>>>>>>> + gmm->bitmap_size = bitmap_size;
>>>>>>> + gmm->bitmap = bitmap_new(bitmap_size);
>>>>>>> +
>>>>>>> + memory_region_set_ram_discard_manager(gmm->mr,
>>>>>>> RAM_DISCARD_MANAGER(gmm));
>>>>>>> +}
>>>>>>
>>>>>> This belongs to 2/7.
>>>>>>
>>>>>>> +
>>>>>>> +static void guest_memfd_manager_unrealizefn(GuestMemfdManager *gmm)
>>>>>>> +{
>>>>>>> + memory_region_set_ram_discard_manager(gmm->mr, NULL);
>>>>>>> +
>>>>>>> + g_free(gmm->bitmap);
>>>>>>> + gmm->bitmap = NULL;
>>>>>>> + gmm->bitmap_size = 0;
>>>>>>> + gmm->mr = NULL;
>>>>>>
>>>>>> @gmm is being destroyed here, why bother zeroing?
>>>>>
>>>>> OK, will remove it.
>>>>>
>>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>
>>>>>> This function belongs to 2/7.
>>>>>
>>>>> Will move both realizefn() and unrealizefn().
>>>>
>>>> Yes.
>>>>
>>>>
>>>>>>
>>>>>>> static void guest_memfd_manager_init(Object *obj)
>>>>>>> {
>>>>>>> GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>>>>> @@ -375,7 +400,6 @@ static void guest_memfd_manager_init(Object
>>>>>>> *obj)
>>>>>>> static void guest_memfd_manager_finalize(Object *obj)
>>>>>>> {
>>>>>>> - g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>>>>> }
>>>>>>> static void guest_memfd_manager_class_init(ObjectClass *oc,
>>>>>>> void
>>>>>>> *data)
>>>>>>> @@ -384,6 +408,8 @@ static void
>>>>>>> guest_memfd_manager_class_init(ObjectClass *oc, void *data)
>>>>>>> RamDiscardManagerClass *rdmc =
>>>>>>> RAM_DISCARD_MANAGER_CLASS(oc);
>>>>>>> gmmc->state_change = guest_memfd_state_change;
>>>>>>> + gmmc->realize = guest_memfd_manager_realizefn;
>>>>>>> + gmmc->unrealize = guest_memfd_manager_unrealizefn;
>>>>>>> rdmc->get_min_granularity =
>>>>>>> guest_memfd_rdm_get_min_granularity;
>>>>>>> rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>>>> index dc1db3a384..532182a6dd 100644
>>>>>>> --- a/system/physmem.c
>>>>>>> +++ b/system/physmem.c
>>>>>>> @@ -53,6 +53,7 @@
>>>>>>> #include "sysemu/hostmem.h"
>>>>>>> #include "sysemu/hw_accel.h"
>>>>>>> #include "sysemu/xen-mapcache.h"
>>>>>>> +#include "sysemu/guest-memfd-manager.h"
>>>>>>> #include "trace.h"
>>>>>>> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>>>>>> @@ -1885,6 +1886,9 @@ static void ram_block_add(RAMBlock *new_block,
>>>>>>> Error **errp)
>>>>>>> qemu_mutex_unlock_ramlist();
>>>>>>> goto out_free;
>>>>>>> }
>>>>>>> +
>>>>>>> + GuestMemfdManager *gmm =
>>>>>>> GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
>>>>>>> + guest_memfd_manager_realize(gmm, new_block->mr, new_block-
>>>>>>>> mr->size);
>>>>>>
>>>>>> Wow. Quite invasive.
>>>>>
>>>>> Yeah... It creates a manager object no matter whether the user
>>>>> wants to
>>>>> us e shared passthru or not. We assume some fields like private/
>>>>> shared
>>>>> bitmap may also be helpful in other scenario for future usage, and
>>>>> if no
>>>>> passthru device, the listener would just return, so it is acceptable.
>>>>
>>>> Explain these other scenarios in the commit log please as otherwise
>>>> making this an interface of HostMemoryBackendMemfd looks way cleaner.
>>>> Thanks,
>>>
>>> Thanks for the suggestion. Until now, I think making this an interface
>>> of HostMemoryBackend is cleaner. The potential future usage for
>>> non-HostMemoryBackend guest_memfd-backed memory region I can think of is
>>> the the TEE I/O for iommufd P2P support? when it tries to initialize RAM
>>> device memory region with the attribute of shared/private. But I think
>>> it would be a long term story and we are not sure what it will be like
>>> in future.
>>
>> As raised in #2, I'm don't think this belongs into HostMemoryBackend.
>> It kind-of belongs to the RAMBlock, but we could have another object
>> (similar to virtio-mem currently managing a single HostMemoryBackend-
>> >RAMBlock) that takes care of that for multiple memory backends.
>
> The vBIOS thingy confused me and then I confused others :) There are 2
> things:
> 1) an interface or new subclass of HostMemoryBackendClass which we need
> to advertise and implement ability to discard pages;
> 2) RamDiscardManagerClass which is MR/Ramblock and does not really
> belong to HostMemoryBackend (as it is in what was posted ages ago).
>
> I suggest Chenyi post a new version using the current approach with the
> comments and commitlogs fixed. Makes sense? Thanks,
Sure, thanks Alexey! BTW, I'm going on vacation. I will continue to work
on it after I come back :)
>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-24 3:09 ` Chenyi Qiang
@ 2025-01-24 5:56 ` Alexey Kardashevskiy
2025-01-24 16:12 ` Peter Xu
0 siblings, 1 reply; 98+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-24 5:56 UTC (permalink / raw)
To: Chenyi Qiang, Xiaoyao Li, Peter Xu
Cc: David Hildenbrand, Paolo Bonzini, Philippe Mathieu-Daudé,
Michael Roth, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun
On 24/1/25 14:09, Chenyi Qiang wrote:
>
>
> On 1/24/2025 8:15 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 22/1/25 16:38, Xiaoyao Li wrote:
>>> On 1/22/2025 11:28 AM, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/22/2025 12:35 AM, Peter Xu wrote:
>>>>> On Tue, Jan 21, 2025 at 09:35:26AM +0800, Chenyi Qiang wrote:
>>>>>>
>>>>>>
>>>>>> On 1/21/2025 2:33 AM, Peter Xu wrote:
>>>>>>> On Mon, Jan 20, 2025 at 06:54:14PM +0100, David Hildenbrand wrote:
>>>>>>>> On 20.01.25 18:21, Peter Xu wrote:
>>>>>>>>> On Mon, Jan 20, 2025 at 11:48:39AM +0100, David Hildenbrand wrote:
>>>>>>>>>> Sorry, I was traveling end of last week. I wrote a mail on the
>>>>>>>>>> train and
>>>>>>>>>> apparently it was swallowed somehow ...
>>>>>>>>>>
>>>>>>>>>>>> Not sure that's the right place. Isn't it the (cc) machine
>>>>>>>>>>>> that controls
>>>>>>>>>>>> the state?
>>>>>>>>>>>
>>>>>>>>>>> KVM does, via MemoryRegion->RAMBlock->guest_memfd.
>>>>>>>>>>
>>>>>>>>>> Right; I consider KVM part of the machine.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> It's not really the memory backend, that's just the memory
>>>>>>>>>>>> provider.
>>>>>>>>>>>
>>>>>>>>>>> Sorry but is not "providing memory" the purpose of "memory
>>>>>>>>>>> backend"? :)
>>>>>>>>>>
>>>>>>>>>> Hehe, what I wanted to say is that a memory backend is just
>>>>>>>>>> something to
>>>>>>>>>> create a RAMBlock. There are different ways to create a
>>>>>>>>>> RAMBlock, even
>>>>>>>>>> guest_memfd ones.
>>>>>>>>>>
>>>>>>>>>> guest_memfd is stored per RAMBlock. I assume the state should
>>>>>>>>>> be stored per
>>>>>>>>>> RAMBlock as well, maybe as part of a "guest_memfd state" thing.
>>>>>>>>>>
>>>>>>>>>> Now, the question is, who is the manager?
>>>>>>>>>>
>>>>>>>>>> 1) The machine. KVM requests the machine to perform the
>>>>>>>>>> transition, and the
>>>>>>>>>> machine takes care of updating the guest_memfd state and
>>>>>>>>>> notifying any
>>>>>>>>>> listeners.
>>>>>>>>>>
>>>>>>>>>> 2) The RAMBlock. Then we need some other Object to trigger
>>>>>>>>>> that. Maybe
>>>>>>>>>> RAMBlock would have to become an object, or we allocate
>>>>>>>>>> separate objects.
>>>>>>>>>>
>>>>>>>>>> I'm leaning towards 1), but I might be missing something.
>>>>>>>>>
>>>>>>>>> A pure question: how do we process the bios gmemfds? I assume
>>>>>>>>> they're
>>>>>>>>> shared when VM starts if QEMU needs to load the bios into it,
>>>>>>>>> but are they
>>>>>>>>> always shared, or can they be converted to private later?
>>>>>>>>
>>>>>>>> You're probably looking for memory_region_init_ram_guest_memfd().
>>>>>>>
>>>>>>> Yes, but I didn't see whether such gmemfd needs conversions
>>>>>>> there. I saw
>>>>>>> an answer though from Chenyi in another email:
>>>>>>>
>>>>>>> https://lore.kernel.org/all/fc7194ee-ed21-4f6b-
>>>>>>> bf87-147a47f5f074@intel.com/
>>>>>>>
>>>>>>> So I suppose the BIOS region must support private / share
>>>>>>> conversions too,
>>>>>>> just like the rest part.
>>>>>>
>>>>>> Yes, the BIOS region can support conversion as well. I think
>>>>>> guest_memfd
>>>>>> backed memory regions all follow the same sequence during setup time:
>>>>>>
>>>>>> guest_memfd is shared when the guest_memfd fd is created by
>>>>>> kvm_create_guest_memfd() in ram_block_add(), But it will sooner be
>>>>>> converted to private just after kvm_set_user_memory_region() in
>>>>>> kvm_set_phys_mem(). So at the boot time of cc VM, the default
>>>>>> attribute
>>>>>> is private. During runtime, the vBIOS can also do the conversion if it
>>>>>> wants.
>>>>>
>>>>> I see.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Though in that case, I'm not 100% sure whether that could also be
>>>>>>> done by
>>>>>>> reusing the major guest memfd with some specific offset regions.
>>>>>>
>>>>>> Not sure if I understand you clearly. guest_memfd is per-Ramblock. It
>>>>>> will have its own slot. So the vBIOS can use its own guest_memfd to
>>>>>> get
>>>>>> the specific offset regions.
>>>>>
>>>>> Sorry to be confusing, please feel free to ignore my previous comment.
>>>>> That came from a very limited mindset that maybe one confidential VM
>>>>> should
>>>>> only have one gmemfd..
>>>>>
>>>>> Now I see it looks like it's by design open to multiple gmemfds for
>>>>> each
>>>>> VM, then it's definitely ok that bios has its own.
>>>>>
>>>>> Do you know why the bios needs to be convertable? I wonder whether
>>>>> the VM
>>>>> can copy it over to a private region and do whatever it wants, e.g.
>>>>> attest
>>>>> the bios being valid. However this is also more of a pure
>>>>> question.. and
>>>>> it can be offtopic to this series, so feel free to ignore.
>>>>
>>>> AFAIK, the vBIOS won't do conversion after it is set as private at the
>>>> beginning. But in theory, the VM can do the conversion at runtime with
>>>> current implementation. As for why make the vBIOS convertable, I'm also
>>>> uncertain about it. Maybe convenient for managing the private/shared
>>>> status by guest_memfd as it's also converted once at the beginning.
>>>
>>> The reason is just that we are too lazy to implement a variant of
>>> guest memfd for vBIOS that is disallowed to be converted from private
>>> to shared.
>>
>> What is the point in disallowing such conversion in QEMU? On AMD, a
>> malicious HV can try converting at any time and if the guest did not ask
>> for it, it will continue accessing those pages as private and trigger an
>> RMP fault. But if the guest asked for conversion, then it should be no
>> problem to convert to shared. What do I miss about TDX here? Thanks,
>
> Re-read Peter's question, maybe I misunderstood it a little bit.
>
> I thought Peter asked why the vBIOS need to do page conversion since it
> would keep private and no need to convert to shared at runtime. So it is
I suspect there is no need to convert the vBIOS, but there is also no need
to assume that some memory is never convertible.
> not necessary to manage the vBIOS with guest_memfd-backed memory region
> as it only converts to private once during setup stage. Xiaoyao
> mentioned no need to implement a variant of guest_memfd to convert from
> private to shared. As you said, allowing such conversion won't bring
> security issues.
>
> Now, I assume Peter's real question is, if we can copy the vBIOS to a
> private region and no need to create a specific guest_memfd-backed
> memory region for it?
I guess we could copy it, but we have had pc.bios and pc.rom in their own
memory regions for some reason, even for legacy non-secure VMs, for ages,
so it has little or nothing to do with whether the vBIOS is in private or
shared memory. Thanks,
>
>>
>>
>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>
>>>
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-23 16:47 ` Peter Xu
@ 2025-01-24 9:47 ` Xu Yilun
2025-01-24 15:55 ` Peter Xu
0 siblings, 1 reply; 98+ messages in thread
From: Xu Yilun @ 2025-01-24 9:47 UTC (permalink / raw)
To: Peter Xu
Cc: Alexey Kardashevskiy, Chenyi Qiang, David Hildenbrand,
Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Thu, Jan 23, 2025 at 11:47:17AM -0500, Peter Xu wrote:
> On Thu, Jan 23, 2025 at 05:33:53PM +0800, Xu Yilun wrote:
> > On Wed, Jan 22, 2025 at 11:43:01AM -0500, Peter Xu wrote:
> > > On Wed, Jan 22, 2025 at 05:41:31PM +0800, Xu Yilun wrote:
> > > > On Wed, Jan 22, 2025 at 03:30:05PM +1100, Alexey Kardashevskiy wrote:
> > > > >
> > > > >
> > > > > On 22/1/25 02:18, Peter Xu wrote:
> > > > > > On Tue, Jun 25, 2024 at 12:31:13AM +0800, Xu Yilun wrote:
> > > > > > > On Mon, Jan 20, 2025 at 03:46:15PM -0500, Peter Xu wrote:
> > > > > > > > On Mon, Jan 20, 2025 at 09:22:50PM +1100, Alexey Kardashevskiy wrote:
> > > > > > > > > > It is still uncertain how to implement the private MMIO. Our assumption
> > > > > > > > > > is the private MMIO would also create a memory region with
> > > > > > > > > > guest_memfd-like backend. Its mr->ram is true and should be managed by
> > > > > > > > > > RamdDiscardManager which can skip doing DMA_MAP in VFIO's region_add
> > > > > > > > > > listener.
> > > > > > > > >
> > > > > > > > > My current working approach is to leave it as is in QEMU and VFIO.
> > > > > > > >
> > > > > > > > Agreed. Setting ram=true to even private MMIO sounds hackish, at least
> > > > > > >
> > > > > > > The private MMIO refers to assigned MMIO, not emulated MMIO. IIUC,
> > > > > > > normal assigned MMIO is always set ram=true,
> > > > > > >
> > > > > > > void memory_region_init_ram_device_ptr(MemoryRegion *mr,
> > > > > > > Object *owner,
> > > > > > > const char *name,
> > > > > > > uint64_t size,
> > > > > > > void *ptr)
> > >
> > > [1]
> > >
> > > > > > > {
> > > > > > > memory_region_init(mr, owner, name, size);
> > > > > > > mr->ram = true;
> > > > > > >
> > > > > > >
> > > > > > > So I don't think ram=true is a problem here.
> > > > > >
> > > > > > I see. If there's always a host pointer then it looks valid. So it means
> > > > > > the device private MMIOs are always mappable since the start?
> > > > >
> > > > > Yes. VFIO owns the mapping and does not treat shared/private MMIO any
> > > > > different at the moment. Thanks,
> > > >
> > > > mm.. I'm actually expecting private MMIO not have a host pointer, just
> > > > as private memory do.
> > > >
> > > > But I'm not sure why having host pointer correlates mr->ram == true.
> > >
> > > If there is no host pointer, what would you pass into "ptr" as referenced
> > > at [1] above when creating the private MMIO memory region?
> >
> > Sorry for confusion. I mean existing MMIO region use set mr->ram = true,
> > and unmappable region (gmem) also set mr->ram = true. So don't know why
> > mr->ram = true for private MMIO is hackish.
>
> That's exactly what I had on the question in the previous email - please
> have a look at what QEMU does right now with memory_access_is_direct().
I see memory_access_is_direct() should exclude the mr->ram_device == true
case, which applies to both normal assigned MMIO and private assigned
MMIO. So this function is not a problem.
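For reference, the check looks roughly like this (paraphrased from
include/exec/memory.h; the exact form in current QEMU may differ a bit):
/* Rough sketch: ram_device MRs never take the direct memcpy()/memmove()
 * path, their accesses go through ram_device_mem_ops instead. */
static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
{
    if (is_write) {
        return memory_region_is_ram(mr) && !mr->readonly &&
               !mr->rom_device && !memory_region_is_ram_device(mr);
    } else {
        return (memory_region_is_ram(mr) &&
                !memory_region_is_ram_device(mr)) ||
               memory_region_is_romd(mr);
    }
}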
But I think flatview_access_allowed() is a problem, in that it doesn't
filter out private memory. When memory is converted to private, a host
access can't produce what you want and should be errored out. IOW, the
host ptr is sometimes invalid.
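To make it concrete, something like the sketch below is what I have in
mind. memory_region_has_guest_memfd() should already exist, while
ram_range_is_private() is a made-up helper just to show the idea, not an
existing API:
/* Sketch only: error out host accesses that hit currently-private memory.
 * Today flatview_access_allowed() never looks at private/shared state. */
static bool flatview_access_allowed(MemoryRegion *mr, MemTxAttrs attrs,
                                    hwaddr addr, hwaddr len)
{
    if (memory_region_has_guest_memfd(mr) &&
        ram_range_is_private(mr->ram_block, addr, len)) {
        return false;   /* host access to private memory: reject it */
    }
    /* ... existing checks based on MemTxAttrs ... */
    return true;
}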
> I'm not 100% sure it'll work if the host pointer doesn't exist.
>
> Let's take one user of it to be explicit: flatview_write_continue_step()
> will try to access the ram pointer if it's direct:
>
> if (!memory_access_is_direct(mr, true)) {
> ...
> } else {
> /* RAM case */
> uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> false, true);
>
> memmove(ram_ptr, buf, *l);
> invalidate_and_set_dirty(mr, mr_addr, *l);
>
> return MEMTX_OK;
> }
>
> I don't see how QEMU could work yet if one MR set ram=true but without a
> host pointer..
>
> As discussed previously, IMHO it's okay that the pointer is not accessible,
Maybe I missed something in the previous discussion; I assume it is OK
because no address_space_rw() happens on this host ptr while the memory is
private, is that right?
> but still I assume QEMU assumes the pointer at least existed for a ram=on
> MR. I don't know whether it's suitable to set ram=on if the pointer
> doesn't ever exist.
In theory, no code logic should depend on an invalid pointer. I think a
NULL pointer would be much better than an invalid pointer; at least you
can check it before deciding whether to access. So if you think an invalid
pointer is OK, a NULL pointer should also be OK.
Thanks,
Yilun
>
> >
> > I think We could add another helper to create memory region for private
> > MMIO.
> >
> > >
> > > OTOH, IIUC guest private memory finally can also have a host pointer (aka,
> > > mmap()-able), it's just that even if it exists, accessing it may crash QEMU
> > > if it's private.
> >
> > Not sure if I get it correct: when memory will be converted to private, QEMU
> > should firstly unmap the host ptr, which means host ptr doesn't alway exist.
>
> At least current QEMU doesn't unmap it?
>
> kvm_convert_memory() does ram_block_discard_range() indeed, but that's hole
> punches, not unmap. So the host pointer can always be there.
>
> Even if we could have in-place gmemfd conversions in the future for guest
> mem, we should also need the host pointer to be around, in which case (per
> my current understand) it will even avoid hole punching but instead make
> the page accessible (by being able to be faulted in).
>
> Thanks,
>
> --
> Peter Xu
>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-24 9:47 ` Xu Yilun
@ 2025-01-24 15:55 ` Peter Xu
2025-01-24 18:17 ` David Hildenbrand
2025-01-26 3:34 ` Xu Yilun
0 siblings, 2 replies; 98+ messages in thread
From: Peter Xu @ 2025-01-24 15:55 UTC (permalink / raw)
To: Xu Yilun
Cc: Alexey Kardashevskiy, Chenyi Qiang, David Hildenbrand,
Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Fri, Jan 24, 2025 at 05:47:45PM +0800, Xu Yilun wrote:
> On Thu, Jan 23, 2025 at 11:47:17AM -0500, Peter Xu wrote:
> > On Thu, Jan 23, 2025 at 05:33:53PM +0800, Xu Yilun wrote:
> > > On Wed, Jan 22, 2025 at 11:43:01AM -0500, Peter Xu wrote:
> > > > On Wed, Jan 22, 2025 at 05:41:31PM +0800, Xu Yilun wrote:
> > > > > On Wed, Jan 22, 2025 at 03:30:05PM +1100, Alexey Kardashevskiy wrote:
> > > > > >
> > > > > >
> > > > > > On 22/1/25 02:18, Peter Xu wrote:
> > > > > > > On Tue, Jun 25, 2024 at 12:31:13AM +0800, Xu Yilun wrote:
> > > > > > > > On Mon, Jan 20, 2025 at 03:46:15PM -0500, Peter Xu wrote:
> > > > > > > > > On Mon, Jan 20, 2025 at 09:22:50PM +1100, Alexey Kardashevskiy wrote:
> > > > > > > > > > > It is still uncertain how to implement the private MMIO. Our assumption
> > > > > > > > > > > is the private MMIO would also create a memory region with
> > > > > > > > > > > guest_memfd-like backend. Its mr->ram is true and should be managed by
> > > > > > > > > > > RamdDiscardManager which can skip doing DMA_MAP in VFIO's region_add
> > > > > > > > > > > listener.
> > > > > > > > > >
> > > > > > > > > > My current working approach is to leave it as is in QEMU and VFIO.
> > > > > > > > >
> > > > > > > > > Agreed. Setting ram=true to even private MMIO sounds hackish, at least
> > > > > > > >
> > > > > > > > The private MMIO refers to assigned MMIO, not emulated MMIO. IIUC,
> > > > > > > > normal assigned MMIO is always set ram=true,
> > > > > > > >
> > > > > > > > void memory_region_init_ram_device_ptr(MemoryRegion *mr,
> > > > > > > > Object *owner,
> > > > > > > > const char *name,
> > > > > > > > uint64_t size,
> > > > > > > > void *ptr)
> > > >
> > > > [1]
> > > >
> > > > > > > > {
> > > > > > > > memory_region_init(mr, owner, name, size);
> > > > > > > > mr->ram = true;
> > > > > > > >
> > > > > > > >
> > > > > > > > So I don't think ram=true is a problem here.
> > > > > > >
> > > > > > > I see. If there's always a host pointer then it looks valid. So it means
> > > > > > > the device private MMIOs are always mappable since the start?
> > > > > >
> > > > > > Yes. VFIO owns the mapping and does not treat shared/private MMIO any
> > > > > > different at the moment. Thanks,
> > > > >
> > > > > mm.. I'm actually expecting private MMIO not have a host pointer, just
> > > > > as private memory do.
> > > > >
> > > > > But I'm not sure why having host pointer correlates mr->ram == true.
> > > >
> > > > If there is no host pointer, what would you pass into "ptr" as referenced
> > > > at [1] above when creating the private MMIO memory region?
> > >
> > > Sorry for confusion. I mean existing MMIO region use set mr->ram = true,
> > > and unmappable region (gmem) also set mr->ram = true. So don't know why
> > > mr->ram = true for private MMIO is hackish.
> >
> > That's exactly what I had on the question in the previous email - please
> > have a look at what QEMU does right now with memory_access_is_direct().
>
> I see memory_access_is_direct() should exclude mr->ram_device == true, which
> is the case for normal assigned MMIO and for private assigned MMIO. So
> this func is not a problem.
I'm not sure, even if so.
VFIO's current use case is pretty special - it still has a host pointer,
it's just that things like memcpy() might not always be suitable for
MMIO-mapped regions. Alex explained the rationale in commit 4a2e242bbb3.
I mean, the host pointer is valid even if ram_device=true in this case.
Even when no direct access is allowed (memcpy, etc.), QEMU still operates
on the host address using ram_device_mem_ops.
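Roughly like this (paraphrased from system/memory.c, details may differ):
/* ram_device accesses still go through the mapped host pointer
 * (mr->ram_block->host), just via explicit load/store ops rather than a
 * plain memcpy()/memmove(). */
static uint64_t memory_region_ram_device_read(void *opaque,
                                              hwaddr addr, unsigned size)
{
    MemoryRegion *mr = opaque;
    return ldn_he_p(mr->ram_block->host + addr, size);
}
So "no direct access" here only means bypassing memcpy(), not that the
host mapping is absent.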
>
> But I think flatview_access_allowed() is a problem that it doesn't filter
> out the private memory. When memory is converted to private, the result
> of host access can't be what you want and should be errored out. IOW,
> the host ptr is sometimes invalid.
>
> > I'm not 100% sure it'll work if the host pointer doesn't exist.
> >
> > Let's take one user of it to be explicit: flatview_write_continue_step()
> > will try to access the ram pointer if it's direct:
> >
> > if (!memory_access_is_direct(mr, true)) {
> > ...
> > } else {
> > /* RAM case */
> > uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
> > false, true);
> >
> > memmove(ram_ptr, buf, *l);
> > invalidate_and_set_dirty(mr, mr_addr, *l);
> >
> > return MEMTX_OK;
> > }
> >
> > I don't see how QEMU could work yet if one MR set ram=true but without a
> > host pointer..
> >
> > As discussed previously, IMHO it's okay that the pointer is not accessible,
>
> Maybe I missed something in previous discussion, I assume it is OK cause
> no address_space_rw is happening on this host ptr when memory is
> private, is it?
Yes, and when there is a mapped host address and someone tries to access an
address that is bound to a private page, QEMU should get a SIGBUS. This
code is not ready yet for gmem, but I believe it'll work like that once
in-place gmem folio conversions are ready. So far QEMU's gmem works by
providing two layers of memory backends, which is IMHO pretty tricky.
>
> > but still I assume QEMU assumes the pointer at least existed for a ram=on
> > MR. I don't know whether it's suitable to set ram=on if the pointer
> > doesn't ever exist.
>
> In theory, any code logic should not depends on an invalid pointer. I
> think a NULL pointer would be much better than a invalid pointer, at
> least you can check whether to access. So if you think an invalid
> pointer is OK, a NULL pointer should be also OK.
Definitely not suggesting to install an invalid pointer anywhere. The
mapped pointer will still be valid for gmem, for example, but the fault
isn't. We need to differentiate two things: (1) the virtual address
mapping, and (2) permissions and accesses on the folios / pages of the
mapping. Here I think it's okay if the host pointer is correctly mapped.
For your private MMIO use case, my question is: if there's no host pointer
to be mapped anyway, then what's the benefit of making the MR ram=on?
Can we simply make it a normal IO memory region? The only benefit of a
ram=on MR is, IMHO, being able to be accessed as RAM-like. If there's no
host pointer at all, I don't yet understand how that helps private MMIO
work.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-24 5:56 ` Alexey Kardashevskiy
@ 2025-01-24 16:12 ` Peter Xu
0 siblings, 0 replies; 98+ messages in thread
From: Peter Xu @ 2025-01-24 16:12 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Chenyi Qiang, Xiaoyao Li, David Hildenbrand, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Fri, Jan 24, 2025 at 04:56:50PM +1100, Alexey Kardashevskiy wrote:
> > Now, I assume Peter's real question is, if we can copy the vBIOS to a
> > private region and no need to create a specific guest_memfd-backed
> > memory region for it?
Yes.
>
> I guess we can copy it but we have pc.bios and pc.rom in own memory regions
> for some reason even for legacy non-secure VMs, for ages, so it has little
> or nothing to do with whether vBIOS is in private or shared memory. Thanks,
My previous question was whether they are required to be converted to
guest-memfd backed memory regions, irrespective of whether they're separate
or not.
I think I found some answers in the commit logs here (they aren't hiding
too deep; I could have looked before asking):
===8<===
commit fc7a69e177e4ba26d11fcf47b853f85115b35a11
Author: Michael Roth <michael.roth@amd.com>
Date: Thu May 30 06:16:40 2024 -0500
hw/i386: Add support for loading BIOS using guest_memfd
When guest_memfd is enabled, the BIOS is generally part of the initial
encrypted guest image and will be accessed as private guest memory. Add
the necessary changes to set up the associated RAM region with a
guest_memfd backend to allow for this.
Current support centers around using -bios to load the BIOS data.
Support for loading the BIOS via pflash requires additional enablement
since those interfaces rely on the use of ROM memory regions which make
use of the KVM_MEM_READONLY memslot flag, which is not supported for
guest_memfd-backed memslots.
commit 413a67450750e0459efeffc3db3ba9759c3e381c
Author: Michael Roth <michael.roth@amd.com>
Date: Thu May 30 06:16:39 2024 -0500
hw/i386/sev: Use guest_memfd for legacy ROMs
Current SNP guest kernels will attempt to access these regions with
with C-bit set, so guest_memfd is needed to handle that. Otherwise,
kvm_convert_memory() will fail when the guest kernel tries to access it
and QEMU attempts to call KVM_SET_MEMORY_ATTRIBUTES to set these ranges
to private.
Whether guests should actually try to access ROM regions in this way (or
need to deal with legacy ROM regions at all), is a separate issue to be
addressed on kernel side, but current SNP guest kernels will exhibit
this behavior and so this handling is needed to allow QEMU to continue
running existing SNP guest kernels.
===8<===
So IIUC, CoCo VMs will assume these are somehow convertible memories, or
they'll stop working, at least on some existing hardware.
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-24 15:55 ` Peter Xu
@ 2025-01-24 18:17 ` David Hildenbrand
2025-01-26 3:34 ` Xu Yilun
1 sibling, 0 replies; 98+ messages in thread
From: David Hildenbrand @ 2025-01-24 18:17 UTC (permalink / raw)
To: Peter Xu, Xu Yilun
Cc: Alexey Kardashevskiy, Chenyi Qiang, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
> Definitely not suggesting to install an invalid pointer anywhere. The
> mapped pointer will still be valid for gmem for example, but the fault
> isn't. We need to differenciate two things (1) virtual address mapping,
> then (2) permission and accesses on the folios / pages of the mapping.
> Here I think it's okay if the host pointer is correctly mapped.
>
> For your private MMIO use case, my question is if there's no host pointer
> to be mapped anyway, then what's the benefit to make the MR to be ram=on?
> Can we simply make it a normal IO memory region? The only benefit of a
> ram=on MR is, IMHO, being able to be accessed as RAM-like. If there's no
> host pointer at all, I don't yet understand how that helps private MMIO
> from working.
Same here.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-24 15:55 ` Peter Xu
2025-01-24 18:17 ` David Hildenbrand
@ 2025-01-26 3:34 ` Xu Yilun
2025-01-30 16:28 ` Peter Xu
1 sibling, 1 reply; 98+ messages in thread
From: Xu Yilun @ 2025-01-26 3:34 UTC (permalink / raw)
To: Peter Xu
Cc: Alexey Kardashevskiy, Chenyi Qiang, David Hildenbrand,
Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
> Definitely not suggesting to install an invalid pointer anywhere. The
> mapped pointer will still be valid for gmem for example, but the fault
> isn't. We need to differenciate two things (1) virtual address mapping,
> then (2) permission and accesses on the folios / pages of the mapping.
> Here I think it's okay if the host pointer is correctly mapped.
>
> For your private MMIO use case, my question is if there's no host pointer
> to be mapped anyway, then what's the benefit to make the MR to be ram=on?
> Can we simply make it a normal IO memory region? The only benefit of a
Guest accesses to a normal IO memory region would be emulated by QEMU,
while private assigned MMIO requires direct guest access via Secure EPT.
It seems the existing code doesn't support direct guest access if
mr->ram == false:
static void kvm_set_phys_mem(KVMMemoryListener *kml,
MemoryRegionSection *section, bool add)
{
[...]
if (!memory_region_is_ram(mr)) {
if (writable || !kvm_readonly_mem_allowed) {
return;
} else if (!mr->romd_mode) {
/* If the memory device is not in romd_mode, then we actually want
* to remove the kvm memory slot so all accesses will trap. */
add = false;
}
}
[...]
/* register the new slot */
do {
[...]
err = kvm_set_user_memory_region(kml, mem, true);
}
}
> ram=on MR is, IMHO, being able to be accessed as RAM-like. If there's no
> host pointer at all, I don't yet understand how that helps private MMIO
> from working.
I expect private MMIO to be inaccessible from the host but accessible from
the guest, so it still has a kvm_userspace_memory_region2 set. That means
resolving its PFN during an EPT fault cannot depend on the host pointer.
Thanks,
Yilun
>
> Thanks,
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-26 3:34 ` Xu Yilun
@ 2025-01-30 16:28 ` Peter Xu
2025-01-30 16:51 ` David Hildenbrand
2025-02-06 10:41 ` Xu Yilun
0 siblings, 2 replies; 98+ messages in thread
From: Peter Xu @ 2025-01-30 16:28 UTC (permalink / raw)
To: Xu Yilun
Cc: Alexey Kardashevskiy, Chenyi Qiang, David Hildenbrand,
Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Sun, Jan 26, 2025 at 11:34:29AM +0800, Xu Yilun wrote:
> > Definitely not suggesting to install an invalid pointer anywhere. The
> > mapped pointer will still be valid for gmem for example, but the fault
> > isn't. We need to differenciate two things (1) virtual address mapping,
> > then (2) permission and accesses on the folios / pages of the mapping.
> > Here I think it's okay if the host pointer is correctly mapped.
> >
> > For your private MMIO use case, my question is if there's no host pointer
> > to be mapped anyway, then what's the benefit to make the MR to be ram=on?
> > Can we simply make it a normal IO memory region? The only benefit of a
>
> The guest access to normal IO memory region would be emulated by QEMU,
> while private assigned MMIO requires guest direct access via Secure EPT.
>
> Seems the existing code doesn't support guest direct access if
> mr->ram == false:
Ah, it's about this, ok.
I am not sure what the best approach is, but IMHO it's still better if we
stick with the host pointer always being available when ram=on. OTOH, VFIO
private regions may be able to provide a special mark somewhere, just like
what was done for romd_mode previously, as below (QEMU commit 235e8982ad39),
so that KVM would still apply these MRs even if they're not RAM.
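Something along these lines, purely as an illustration (the
memory_region_is_private_mmio() marker below is made up, not an existing
API):
    if (!memory_region_is_ram(mr)) {
        if (memory_region_is_private_mmio(mr)) {
            /* hypothetical: keep the slot so the guest can still map the
             * private MMIO via Secure EPT, similar to how romd_mode is
             * special-cased today */
        } else if (writable || !kvm_readonly_mem_allowed) {
            return;
        } else if (!mr->romd_mode) {
            add = false;
        }
    }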
>
> static void kvm_set_phys_mem(KVMMemoryListener *kml,
> MemoryRegionSection *section, bool add)
> {
> [...]
>
> if (!memory_region_is_ram(mr)) {
> if (writable || !kvm_readonly_mem_allowed) {
> return;
> } else if (!mr->romd_mode) {
> /* If the memory device is not in romd_mode, then we actually want
> * to remove the kvm memory slot so all accesses will trap. */
> add = false;
> }
> }
>
> [...]
>
> /* register the new slot */
> do {
>
> [...]
>
> err = kvm_set_user_memory_region(kml, mem, true);
> }
> }
>
> > ram=on MR is, IMHO, being able to be accessed as RAM-like. If there's no
> > host pointer at all, I don't yet understand how that helps private MMIO
> > from working.
>
> I expect private MMIO not accessible from host, but accessible from
> guest so has kvm_userspace_memory_region2 set. That means the resolving
> of its PFN during EPT fault cannot depends on host pointer.
>
> https://lore.kernel.org/all/20250107142719.179636-1-yilun.xu@linux.intel.com/
I'll leave this to the KVM experts, but I actually didn't follow exactly why
the mmu notifier is an issue here, as I thought that was per-mm anyway, and
KVM should logically be able to skip all VFIO private MMIO regions if affected.
This is a comment to this part of your commit message:
Rely on userspace mapping also means private MMIO mapping should
follow userspace mapping change via mmu_notifier. This conflicts
with the current design that mmu_notifier never impacts private
mapping. It also makes no sense to support mmu_notifier just for
private MMIO, private MMIO mapping should be fixed when CoCo-VM
accepts the private MMIO, any following mapping change without
guest permission should be invalid.
So I don't yet see a hard no to reusing the userspace mapping even if it's
not faultable as of now - what if it becomes faultable in the future? I
am not sure..
OTOH, I also don't think we need KVM_SET_USER_MEMORY_REGION3 anyway.. The
_REGION2 API is already smart enough to leave some reserved fields:
/* for KVM_SET_USER_MEMORY_REGION2 */
struct kvm_userspace_memory_region2 {
        __u32 slot;
        __u32 flags;
        __u64 guest_phys_addr;
        __u64 memory_size;
        __u64 userspace_addr;
        __u64 guest_memfd_offset;
        __u32 guest_memfd;
        __u32 pad1;
        __u64 pad2[14];
};
I think we _could_ reuse some pad*? Reusing guest_memfd field sounds error
prone to me.
Not sure whether it would have been easier if it were not guest_memfd* but a
generic fd + fd_offset from the start. But I guess when introducing _REGION2
we didn't expect private MMIO regions to come so soon..
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-30 16:28 ` Peter Xu
@ 2025-01-30 16:51 ` David Hildenbrand
2025-02-06 10:41 ` Xu Yilun
1 sibling, 0 replies; 98+ messages in thread
From: David Hildenbrand @ 2025-01-30 16:51 UTC (permalink / raw)
To: Peter Xu, Xu Yilun
Cc: Alexey Kardashevskiy, Chenyi Qiang, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On 30.01.25 17:28, Peter Xu wrote:
> On Sun, Jan 26, 2025 at 11:34:29AM +0800, Xu Yilun wrote:
>>> Definitely not suggesting to install an invalid pointer anywhere. The
>>> mapped pointer will still be valid for gmem for example, but the fault
>>> isn't. We need to differentiate two things (1) virtual address mapping,
>>> then (2) permission and accesses on the folios / pages of the mapping.
>>> Here I think it's okay if the host pointer is correctly mapped.
>>>
>>> For your private MMIO use case, my question is if there's no host pointer
>>> to be mapped anyway, then what's the benefit to make the MR to be ram=on?
>>> Can we simply make it a normal IO memory region? The only benefit of a
>>
>> The guest access to normal IO memory region would be emulated by QEMU,
>> while private assigned MMIO requires guest direct access via Secure EPT.
>>
>> Seems the existing code doesn't support guest direct access if
>> mr->ram == false:
>
> Ah it's about this, ok.
>
> I am not sure what's the best approach, but IMHO it's still better we stick
> with host pointer always available when ram=on. OTOH, VFIO private regions
> may be able to provide a special mark somewhere, just like when romd_mode
> was done previously as below (qemu 235e8982ad39), so that KVM should still
> apply these MRs even if they're not RAM.
I agree.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-01-30 16:28 ` Peter Xu
2025-01-30 16:51 ` David Hildenbrand
@ 2025-02-06 10:41 ` Xu Yilun
2025-02-06 20:03 ` Peter Xu
1 sibling, 1 reply; 98+ messages in thread
From: Xu Yilun @ 2025-02-06 10:41 UTC (permalink / raw)
To: Peter Xu
Cc: Alexey Kardashevskiy, Chenyi Qiang, David Hildenbrand,
Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Thu, Jan 30, 2025 at 11:28:11AM -0500, Peter Xu wrote:
> On Sun, Jan 26, 2025 at 11:34:29AM +0800, Xu Yilun wrote:
> > > Definitely not suggesting to install an invalid pointer anywhere. The
> > > mapped pointer will still be valid for gmem for example, but the fault
> > > isn't. We need to differentiate two things (1) virtual address mapping,
> > > then (2) permission and accesses on the folios / pages of the mapping.
> > > Here I think it's okay if the host pointer is correctly mapped.
> > >
> > > For your private MMIO use case, my question is if there's no host pointer
> > > to be mapped anyway, then what's the benefit to make the MR to be ram=on?
> > > Can we simply make it a normal IO memory region? The only benefit of a
> >
> > The guest access to normal IO memory region would be emulated by QEMU,
> > while private assigned MMIO requires guest direct access via Secure EPT.
> >
> > Seems the existing code doesn't support guest direct access if
> > mr->ram == false:
>
> Ah it's about this, ok.
>
> I am not sure what's the best approach, but IMHO it's still better we stick
> with host pointer always available when ram=on. OTOH, VFIO private regions
> may be able to provide a special mark somewhere, just like when romd_mode
> was done previously as below (qemu 235e8982ad39), so that KVM should still
> apply these MRs even if they're not RAM.
Also good to me.
>
> >
> > static void kvm_set_phys_mem(KVMMemoryListener *kml,
> > MemoryRegionSection *section, bool add)
> > {
> > [...]
> >
> > if (!memory_region_is_ram(mr)) {
> > if (writable || !kvm_readonly_mem_allowed) {
> > return;
> > } else if (!mr->romd_mode) {
> > /* If the memory device is not in romd_mode, then we actually want
> > * to remove the kvm memory slot so all accesses will trap. */
> > add = false;
> > }
> > }
> >
> > [...]
> >
> > /* register the new slot */
> > do {
> >
> > [...]
> >
> > err = kvm_set_user_memory_region(kml, mem, true);
> > }
> > }
> >
> > > The point of a ram=on MR is, IMHO, being able to access it as RAM-like.
> > > If there's no host pointer at all, I don't yet understand how that helps
> > > private MMIO work.
> >
> > I expect private MMIO to be inaccessible from the host but accessible from
> > the guest, so it has kvm_userspace_memory_region2 set. That means the
> > resolving of its PFN during an EPT fault cannot depend on the host pointer.
> >
> > https://lore.kernel.org/all/20250107142719.179636-1-yilun.xu@linux.intel.com/
>
> I'll leave this to the KVM experts, but I actually didn't follow exactly why
> the mmu notifier is an issue here, as I thought that was per-mm anyway, and
> KVM should logically be able to skip all VFIO private MMIO regions if affected.
I think this creates a logical inconsistency. You build the private MMIO
EPT mapping at fault time based on the HVA<->HPA mapping, but don't follow
the HVA<->HPA mapping change. Why should KVM trust the mapping at fault time
but not at mmu-notifier time?
> This is a comment to this part of your commit message:
>
> Rely on userspace mapping also means private MMIO mapping should
> follow userspace mapping change via mmu_notifier. This conflicts
> with the current design that mmu_notifier never impacts private
> mapping. It also makes no sense to support mmu_notifier just for
> private MMIO, private MMIO mapping should be fixed when CoCo-VM
> accepts the private MMIO, any following mapping change without
> guest permission should be invalid.
>
> So I don't yet see a hard no to reusing the userspace mapping even if it's
> not faultable as of now - what if it becomes faultable in the future? I
The first commit of guest_memfd emphasizes a lot the benefit of
decoupling the KVM mapping from the host mapping. My understanding is that
even if guest_memfd becomes faultable later, KVM should still work in a way
that does not depend on the userspace mapping.
> am not sure..
>
> OTOH, I also don't think we need KVM_SET_USER_MEMORY_REGION3 anyway.. The
> _REGION2 API is already smart enough to leave some reserved fields:
>
> /* for KVM_SET_USER_MEMORY_REGION2 */
> struct kvm_userspace_memory_region2 {
> __u32 slot;
> __u32 flags;
> __u64 guest_phys_addr;
> __u64 memory_size;
> __u64 userspace_addr;
> __u64 guest_memfd_offset;
> __u32 guest_memfd;
> __u32 pad1;
> __u64 pad2[14];
> };
>
> I think we _could_ reuse some pad*? Reusing guest_memfd field sounds error
> prone to me.
It truly is. I'm expecting some suggestions here.
Thanks,
Yilun
>
> Not sure whether it would have been easier if it were not guest_memfd* but a
> generic fd + fd_offset from the start. But I guess when introducing _REGION2
> we didn't expect private MMIO regions to come so soon..
>
> Thanks,
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
2025-02-06 10:41 ` Xu Yilun
@ 2025-02-06 20:03 ` Peter Xu
0 siblings, 0 replies; 98+ messages in thread
From: Peter Xu @ 2025-02-06 20:03 UTC (permalink / raw)
To: Xu Yilun
Cc: Alexey Kardashevskiy, Chenyi Qiang, David Hildenbrand,
Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
On Thu, Feb 06, 2025 at 06:41:09PM +0800, Xu Yilun wrote:
> On Thu, Jan 30, 2025 at 11:28:11AM -0500, Peter Xu wrote:
> > On Sun, Jan 26, 2025 at 11:34:29AM +0800, Xu Yilun wrote:
> > > > Definitely not suggesting to install an invalid pointer anywhere. The
> > > > mapped pointer will still be valid for gmem for example, but the fault
> > > > isn't. We need to differentiate two things (1) virtual address mapping,
> > > > then (2) permission and accesses on the folios / pages of the mapping.
> > > > Here I think it's okay if the host pointer is correctly mapped.
> > > >
> > > > For your private MMIO use case, my question is if there's no host pointer
> > > > to be mapped anyway, then what's the benefit to make the MR to be ram=on?
> > > > Can we simply make it a normal IO memory region? The only benefit of a
> > >
> > > The guest access to normal IO memory region would be emulated by QEMU,
> > > while private assigned MMIO requires guest direct access via Secure EPT.
> > >
> > > Seems the existing code doesn't support guest direct access if
> > > mr->ram == false:
> >
> > Ah it's about this, ok.
> >
> > I am not sure what's the best approach, but IMHO it's still better we stick
> > with host pointer always available when ram=on. OTOH, VFIO private regions
> > may be able to provide a special mark somewhere, just like when romd_mode
> > was done previously as below (qemu 235e8982ad39), so that KVM should still
> > apply these MRs even if they're not RAM.
>
> Also good to me.
>
> >
> > >
> > > static void kvm_set_phys_mem(KVMMemoryListener *kml,
> > > MemoryRegionSection *section, bool add)
> > > {
> > > [...]
> > >
> > > if (!memory_region_is_ram(mr)) {
> > > if (writable || !kvm_readonly_mem_allowed) {
> > > return;
> > > } else if (!mr->romd_mode) {
> > > /* If the memory device is not in romd_mode, then we actually want
> > > * to remove the kvm memory slot so all accesses will trap. */
> > > add = false;
> > > }
> > > }
> > >
> > > [...]
> > >
> > > /* register the new slot */
> > > do {
> > >
> > > [...]
> > >
> > > err = kvm_set_user_memory_region(kml, mem, true);
> > > }
> > > }
> > >
> > > > The point of a ram=on MR is, IMHO, being able to access it as RAM-like.
> > > > If there's no host pointer at all, I don't yet understand how that helps
> > > > private MMIO work.
> > >
> > > I expect private MMIO to be inaccessible from the host but accessible from
> > > the guest, so it has kvm_userspace_memory_region2 set. That means the
> > > resolving of its PFN during an EPT fault cannot depend on the host pointer.
> > >
> > > https://lore.kernel.org/all/20250107142719.179636-1-yilun.xu@linux.intel.com/
> >
> > I'll leave this to the KVM experts, but I actually didn't follow exactly why
> > the mmu notifier is an issue here, as I thought that was per-mm anyway, and
> > KVM should logically be able to skip all VFIO private MMIO regions if affected.
>
> I think this creates a logical inconsistency. You build the private MMIO
> EPT mapping at fault time based on the HVA<->HPA mapping, but don't follow
> the HVA<->HPA mapping change. Why should KVM trust the mapping at fault time
> but not at mmu-notifier time?
IMHO, as long as KVM knows it's private MMIO and it's guaranteed that there
is no host mapping under it, KVM can safely skip those ranges to speed up
the mmu notifier.
That said, I'm not suggesting we stick with HVAs if there are better
alternatives. It's only that paragraph that confused me a bit.
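A rough sketch of what that skip could look like on the KVM side follows;
every name in it is hypothetical (KVM_MEM_PRIVATE_MMIO does not exist, and the
helper is invented), it only illustrates the "nothing to tear down" argument.
/* Hypothetical: an mmu_notifier hva-range walk could skip memslots that a
 * future interface marks as private MMIO, since no host mapping backs them. */
static bool kvm_slot_needs_hva_invalidate(const struct kvm_memory_slot *slot)
{
        if (slot->flags & KVM_MEM_PRIVATE_MMIO)   /* made-up flag */
                return false;                     /* nothing to tear down */
        return true;
}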
>
> > This is a comment to this part of your commit message:
> >
> > Rely on userspace mapping also means private MMIO mapping should
> > follow userspace mapping change via mmu_notifier. This conflicts
> > with the current design that mmu_notifier never impacts private
> > mapping. It also makes no sense to support mmu_notifier just for
> > private MMIO, private MMIO mapping should be fixed when CoCo-VM
> > accepts the private MMIO, any following mapping change without
> > guest permission should be invalid.
> >
> > So I don't yet see a hard no to reusing the userspace mapping even if it's
> > not faultable as of now - what if it becomes faultable in the future? I
>
> The first commit of guest_memfd emphasizes a lot the benefit of
> decoupling the KVM mapping from the host mapping. My understanding is that
> even if guest_memfd becomes faultable later, KVM should still work in a way
> that does not depend on the userspace mapping.
I may have implied a suggestion to use HVAs; that was not my intention. I
agree an fd-based API is better in this case too, at least as of now.
What I'm not sure about is how the whole thing evolves with either gmemfd or
a device fd once they're used with shared and mappable pages. We can leave
that for later discussion, for sure.
>
> > am not sure..
> >
> > OTOH, I also don't think we need KVM_SET_USER_MEMORY_REGION3 anyway.. The
> > _REGION2 API is already smart enough to leave some reserved fields:
> >
> > /* for KVM_SET_USER_MEMORY_REGION2 */
> > struct kvm_userspace_memory_region2 {
> > __u32 slot;
> > __u32 flags;
> > __u64 guest_phys_addr;
> > __u64 memory_size;
> > __u64 userspace_addr;
> > __u64 guest_memfd_offset;
> > __u32 guest_memfd;
> > __u32 pad1;
> > __u64 pad2[14];
> > };
> >
> > I think we _could_ reuse some pad*? Reusing guest_memfd field sounds error
> > prone to me.
>
> It truly is. I'm expecting some suggestions here.
Maybe a generic fd+offset pair carved out of pad*? I'm not sure whether at
some point that could also cover guest_memfd; after all, it's easy for KVM
to check which file->f_op backs the fd, so logically KVM should allow backing
a memslot with any file, without an HVA. Just my 2 cents.
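A purely hypothetical layout for that idea, carving the pair out of the
reserved space (private_offset/private_fd are invented names; the existing
fields and the total size of the reserved area match the current UAPI struct):
/* Hypothetical only: reuse part of pad2[] for a generic fd+offset pair. */
struct kvm_userspace_memory_region2 {
        __u32 slot;
        __u32 flags;
        __u64 guest_phys_addr;
        __u64 memory_size;
        __u64 userspace_addr;
        __u64 guest_memfd_offset;
        __u32 guest_memfd;
        __u32 pad1;
        __u64 private_offset;   /* was pad2[0]: offset into private_fd */
        __u32 private_fd;       /* carved from pad2[1]: e.g. a VFIO/dma-buf fd */
        __u32 pad3;             /* remainder of pad2[1] */
        __u64 pad2[12];         /* remaining reserved space */
};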
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 98+ messages in thread
end of thread, other threads:[~2025-02-06 20:13 UTC | newest]
Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-13 7:08 [PATCH 0/7] Enable shared device assignment Chenyi Qiang
2024-12-13 7:08 ` [PATCH 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
2024-12-18 12:33 ` David Hildenbrand
2025-01-08 4:47 ` Alexey Kardashevskiy
2025-01-08 6:41 ` Chenyi Qiang
2024-12-13 7:08 ` [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Chenyi Qiang
2024-12-18 6:45 ` Chenyi Qiang
2025-01-08 4:48 ` Alexey Kardashevskiy
2025-01-08 10:56 ` Chenyi Qiang
2025-01-08 11:20 ` Alexey Kardashevskiy
2025-01-09 2:11 ` Chenyi Qiang
2025-01-09 2:55 ` Alexey Kardashevskiy
2025-01-09 4:29 ` Chenyi Qiang
2025-01-10 0:58 ` Alexey Kardashevskiy
2025-01-10 6:38 ` Chenyi Qiang
2025-01-09 21:00 ` Xu Yilun
2025-01-09 21:50 ` Xu Yilun
2025-01-13 3:34 ` Chenyi Qiang
2025-01-12 22:23 ` Xu Yilun
2025-01-14 1:14 ` Chenyi Qiang
2025-01-15 4:06 ` Alexey Kardashevskiy
2025-01-15 6:15 ` Chenyi Qiang
[not found] ` <2b2730f3-6e1a-4def-b126-078cf6249759@amd.com>
2025-01-20 20:46 ` Peter Xu
2024-06-24 16:31 ` Xu Yilun
2025-01-21 15:18 ` Peter Xu
2025-01-22 4:30 ` Alexey Kardashevskiy
2025-01-22 9:41 ` Xu Yilun
2025-01-22 16:43 ` Peter Xu
2025-01-23 9:33 ` Xu Yilun
2025-01-23 16:47 ` Peter Xu
2025-01-24 9:47 ` Xu Yilun
2025-01-24 15:55 ` Peter Xu
2025-01-24 18:17 ` David Hildenbrand
2025-01-26 3:34 ` Xu Yilun
2025-01-30 16:28 ` Peter Xu
2025-01-30 16:51 ` David Hildenbrand
2025-02-06 10:41 ` Xu Yilun
2025-02-06 20:03 ` Peter Xu
2025-01-14 6:45 ` Chenyi Qiang
2025-01-13 10:54 ` David Hildenbrand
2025-01-14 1:10 ` Chenyi Qiang
2025-01-15 4:05 ` Alexey Kardashevskiy
[not found] ` <f3aaffe7-7045-4288-8675-349115a867ce@redhat.com>
2025-01-20 17:21 ` Peter Xu
2025-01-20 17:54 ` David Hildenbrand
2025-01-20 18:33 ` Peter Xu
2025-01-20 18:47 ` David Hildenbrand
2025-01-20 20:19 ` Peter Xu
2025-01-20 20:25 ` David Hildenbrand
2025-01-20 20:43 ` Peter Xu
2025-01-21 1:35 ` Chenyi Qiang
2025-01-21 16:35 ` Peter Xu
2025-01-22 3:28 ` Chenyi Qiang
2025-01-22 5:38 ` Xiaoyao Li
2025-01-24 0:15 ` Alexey Kardashevskiy
2025-01-24 3:09 ` Chenyi Qiang
2025-01-24 5:56 ` Alexey Kardashevskiy
2025-01-24 16:12 ` Peter Xu
2025-01-20 18:09 ` Peter Xu
2025-01-21 9:00 ` Chenyi Qiang
2025-01-21 9:26 ` David Hildenbrand
2025-01-21 10:16 ` Chenyi Qiang
2025-01-21 10:26 ` David Hildenbrand
2025-01-22 6:43 ` Chenyi Qiang
2025-01-21 15:38 ` Peter Xu
2025-01-24 3:40 ` Chenyi Qiang
2024-12-13 7:08 ` [PATCH 3/7] guest_memfd: Introduce a callback to notify the shared/private state change Chenyi Qiang
2024-12-13 7:08 ` [PATCH 4/7] KVM: Notify the state change event during shared/private conversion Chenyi Qiang
2024-12-13 7:08 ` [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation Chenyi Qiang
2025-01-08 4:47 ` Alexey Kardashevskiy
2025-01-09 5:34 ` Chenyi Qiang
2025-01-09 9:32 ` Alexey Kardashevskiy
2025-01-10 5:13 ` Chenyi Qiang
[not found] ` <59bd0e82-f269-4567-8f75-a32c9c997ca9@redhat.com>
2025-01-24 3:27 ` Alexey Kardashevskiy
2025-01-24 5:36 ` Chenyi Qiang
2025-01-09 8:14 ` Zhao Liu
2025-01-09 8:17 ` Chenyi Qiang
2024-12-13 7:08 ` [PATCH 6/7] RAMBlock: make guest_memfd require coordinate discard Chenyi Qiang
2025-01-13 10:56 ` David Hildenbrand
2025-01-14 1:38 ` Chenyi Qiang
[not found] ` <e1141052-1dec-435b-8635-a41881fedd4c@redhat.com>
2025-01-21 6:26 ` Chenyi Qiang
2025-01-21 8:05 ` David Hildenbrand
2024-12-13 7:08 ` [RFC PATCH 7/7] memory: Add a new argument to indicate the request attribute in RamDismcardManager helpers Chenyi Qiang
2025-01-08 4:47 ` [PATCH 0/7] Enable shared device assignment Alexey Kardashevskiy
2025-01-08 6:28 ` Chenyi Qiang
2025-01-08 11:38 ` Alexey Kardashevskiy
2025-01-09 7:52 ` Chenyi Qiang
2025-01-09 8:18 ` Alexey Kardashevskiy
2025-01-09 8:49 ` Chenyi Qiang
2025-01-10 1:42 ` Alexey Kardashevskiy
2025-01-10 7:06 ` Chenyi Qiang
2025-01-10 8:26 ` David Hildenbrand
2025-01-10 13:20 ` Jason Gunthorpe
2025-01-10 13:45 ` David Hildenbrand
2025-01-10 14:14 ` Jason Gunthorpe
2025-01-10 14:50 ` David Hildenbrand
2025-01-15 3:39 ` Alexey Kardashevskiy
2025-01-15 12:49 ` Jason Gunthorpe
[not found] ` <cc3428b1-22b7-432a-9c74-12b7e36b6cc6@redhat.com>
2025-01-20 18:39 ` Jason Gunthorpe