* [PATCH v3 0/7] Enable shared device assignment
@ 2025-03-10  8:18 Chenyi Qiang
  2025-03-10  8:18 ` [PATCH v3 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-10  8:18 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
	Gao Chao, Xu Yilun, Li Xiaoyao

This is the v3 series of the shared device assignment support.

The overview of this series:
- Patch 1-3: preparation patches. They export a helper function and
  change some definitions to return result values.
- Patch 4-5: Introduce a new object to implement RamDiscardManager
  interface and a callback to notify the shared/private state change.
- Patch 6: Store the new object including guest_memfd information in
  RAMBlock. Register the RamDiscardManager instance to the target
  RAMBlock's MemoryRegion so that the object can notify other systems
  of page conversion events.
- Patch 7: Unlock coordinated discard so that shared device
  assignment (VFIO) can work with guest_memfd.

Compared with v2 series, the main changes are:

- Introduce the new patch 03 to unify the definition of ReplayRamPopulate()
  and ReplayRamDiscard().
- In the state_change() callback, fall back to "1 block at a time"
  handling in case of a mixed state. In addition, update the bitmap
  before calling the listeners so that they can act on the latest
  state.
- In kvm_convert_memory(), use the ReplayPopulated() and ReplayDiscard()
  interfaces to trigger set_attribute(), and add an undo operation for
  the case where state_change() fails (see the sketch after this list).
- v2: https://lore.kernel.org/qemu-devel/20250217081833.21568-1-chenyi.qiang@intel.com/
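
A hedged sketch of that undo path, for orientation only (the helper
names follow the series, but this is illustrative pseudocode rather
than the exact patch contents):

    /* Illustrative only: update the KVM memory attributes, then notify
     * listeners via state_change(); if notification fails, revert the
     * attribute update so KVM and the shared bitmap stay consistent. */
    ret = kvm_set_memory_attributes(start, size, to_private);
    if (!ret) {
        ret = memory_attribute_manager_state_change(mgr, offset, size,
                                                    to_private);
        if (ret) {
            kvm_set_memory_attributes(start, size, !to_private);
        }
    }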

More small changes or details can be found in the individual patches.

---
Original cover letter:

Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared/private at runtime.

"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. In current implementation, shared memory is allocated
with normal methods (e.g. mmap or fallocate) while private memory is
allocated from guest_memfd. When a VM performs memory conversions, QEMU
frees pages via madvise or via PUNCH_HOLE on memfd or guest_memfd from
one side, and allocates new pages from the other side. This will cause a
stale IOMMU mapping issue mentioned in [1] when we try to enable shared
device assignment in confidential VMs.
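
As a rough illustration of where the staleness comes from, the discard
half of a conversion boils down to freeing the old backing pages. A
minimal sketch, assuming a plain memfd-backed shared range (the helper
name is made up for illustration):

    #define _GNU_SOURCE
    #include <fcntl.h>          /* fallocate() */
    #include <linux/falloc.h>   /* FALLOC_FL_* */

    /*
     * Free the shared pages of a range the guest just converted to
     * private. Any IOMMU entry still referencing the freed pages is
     * now stale until someone (e.g. VFIO) unmaps it.
     */
    static int discard_shared_range(int memfd, off_t offset, off_t len)
    {
        return fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, len);
    }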

Solution
========
The key to enabling shared device assignment is updating the IOMMU mappings
on page conversion. RamDiscardManager, an existing interface currently
utilized by virtio-mem, offers a means to modify IOMMU mappings in
accordance with VM page assignment. Page conversion is similar to
hot-removing a page in one mode and adding it back in the other.

This series implements a RamDiscardManager for confidential VMs and
utilizes its infrastructure to notify VFIO of page conversions.
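
For orientation, a consumer such as VFIO hooks into this flow through
the existing RamDiscardListener API. A minimal sketch with placeholder
callback bodies (only ram_discard_listener_init() and
ram_discard_manager_register_listener() are existing QEMU APIs here):

    /* private -> shared: (re)create the DMA mapping for the section */
    static int listener_populate(RamDiscardListener *rdl,
                                 MemoryRegionSection *section)
    {
        return 0; /* placeholder: map the section into the IOMMU */
    }

    /* shared -> private: drop the now-stale DMA mapping */
    static void listener_discard(RamDiscardListener *rdl,
                                 MemoryRegionSection *section)
    {
        /* placeholder: unmap the section from the IOMMU */
    }

    ...
    RamDiscardListener rdl;

    ram_discard_listener_init(&rdl, listener_populate, listener_discard,
                              true /* double_discard_supported */);
    ram_discard_manager_register_listener(rdm, &rdl, section);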

Relationship with in-place page conversion
==========================================
To support 1G pages with guest_memfd [2], the current direction is to
allow mmap() of guest_memfd to userspace so that both private and shared
memory can use the same physical pages as the backend. This in-place page
conversion design eliminates the need to discard pages during shared/private
conversions. However, device assignment will still be blocked because the
in-place page conversion will reject the conversion when the page is pinned
by VFIO.

To address this, the key lies in the ordering of the VFIO map/unmap
operations relative to the page conversion. The sequence can be adjusted
to unmap-before-conversion-to-private and map-after-conversion-to-shared,
ensuring compatibility with guest_memfd.
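
In pseudocode, the adjusted ordering would look like this (the function
names are illustrative, not an existing API):

    if (to_private) {
        vfio_dma_unmap(range);      /* unpin/unmap first ...           */
        convert_to_private(range);  /* ... so the conversion cannot be
                                     * rejected due to pinned pages    */
    } else {
        convert_to_shared(range);
        vfio_dma_map(range);        /* map only once the range is shared */
    }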

Limitation
==========
One limitation is that VFIO expects the DMA mapping for a specific IOVA
to be mapped and unmapped with the same granularity. The guest may
perform partial conversions, such as converting a small region within a
larger region. To prevent such invalid cases, all operations are
performed with 4K granularity. This could be optimized after the
cut_mapping operation [3] is introduced in the future. We can always perform a
split-before-unmap if partial conversions happen. If the split succeeds,
the unmap will succeed and be atomic. If the split fails, the unmap
process fails.
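
To make the constraint concrete: with 4K mappings, a guest converting a
2M subrange of a larger region unmaps 2M / 4K = 512 entries, each at the
same granularity it was mapped with, so every unmap matches a prior map.
Had the region been mapped as a single large entry, VFIO would reject
the partial unmap. The trade-off is the sheer number of mappings, which
is why dma_entry_limit has to be raised in the Testing section below.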

Testing
=======
This patch series is tested based on TDX patches available at:
KVM: https://github.com/intel/tdx/tree/kvm-coco-queue-snapshot/kvm-coco-queue-snapshot-20250308
QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-03-10

To test shared device assignment with a NIC, use the legacy type1 VFIO
backend with the QEMU command:

qemu-system-x86_64 [...]
    -device vfio-pci,host=XX:XX.X

The dma_entry_limit parameter needs to be raised because every page is
mapped at 4K granularity. For example, a 16GB guest needs 16GB / 4K =
4194304 entries, e.g. vfio_iommu_type1.dma_entry_limit=4194304.

If using the iommufd-backed VFIO with the QEMU command:

qemu-system-x86_64 [...]
    -object iommufd,id=iommufd0 \
    -device vfio-pci,host=XX:XX.X,iommufd=iommufd0

no additional adjustment is required, as dma_entry_limit is specific to
the legacy type1 backend.

After the TD guest boots, its IP address becomes visible, and iperf
can successfully send and receive data.

Related link
============
[1] https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonzini@redhat.com/
[2] https://lore.kernel.org/lkml/cover.1726009989.git.ackerleytng@google.com/
[3] https://lore.kernel.org/linux-iommu/7-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/

Chenyi Qiang (7):
  memory: Export a helper to get intersection of a MemoryRegionSection
    with a given range
  memory: Change memory_region_set_ram_discard_manager() to return the
    result
  memory: Unify the definition of ReplayRamPopulate() and
    ReplayRamDiscard()
  memory-attribute-manager: Introduce MemoryAttributeManager to manage
    RAMBlock with guest_memfd
  memory-attribute-manager: Introduce a callback to notify the
    shared/private state change
  memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  RAMBlock: Make guest_memfd require coordinated discard

 accel/kvm/kvm-all.c                       |  50 ++-
 hw/virtio/virtio-mem.c                    |  81 ++--
 include/exec/memory.h                     |  64 ++-
 include/exec/ramblock.h                   |   2 +
 include/system/memory-attribute-manager.h |  60 +++
 migration/ram.c                           |   5 +-
 system/memory-attribute-manager.c         | 471 ++++++++++++++++++++++
 system/memory.c                           |  22 +-
 system/meson.build                        |   1 +
 system/physmem.c                          |  17 +-
 10 files changed, 690 insertions(+), 83 deletions(-)
 create mode 100644 include/system/memory-attribute-manager.h
 create mode 100644 system/memory-attribute-manager.c

-- 
2.43.5




* [PATCH v3 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range
  2025-03-10  8:18 [PATCH v3 0/7] Enable shared device assignment Chenyi Qiang
@ 2025-03-10  8:18 ` Chenyi Qiang
  2025-03-10  8:18 ` [PATCH v3 2/7] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-10  8:18 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
	Gao Chao, Xu Yilun, Li Xiaoyao

Rename the helper to memory_region_section_intersect_range() to make it
more generic. Meanwhile, define @end as Int128 and replace the related
operations with their Int128_* equivalents, since the helper is exported
as a wider API.

Suggested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v3:
    - No change

Changes in v2:
    - Make memory_region_section_intersect_range() an inline function.
    - Add Reviewed-by from David
    - Define the @end as Int128 and use the related Int128_* ops as a wider
      API (Alexey)
---
 hw/virtio/virtio-mem.c | 32 +++++---------------------------
 include/exec/memory.h  | 27 +++++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index b1a003736b..21f16e4912 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -244,28 +244,6 @@ static int virtio_mem_for_each_plugged_range(VirtIOMEM *vmem, void *arg,
     return ret;
 }
 
-/*
- * Adjust the memory section to cover the intersection with the given range.
- *
- * Returns false if the intersection is empty, otherwise returns true.
- */
-static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
-                                                uint64_t offset, uint64_t size)
-{
-    uint64_t start = MAX(s->offset_within_region, offset);
-    uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
-                       offset + size);
-
-    if (end <= start) {
-        return false;
-    }
-
-    s->offset_within_address_space += start - s->offset_within_region;
-    s->offset_within_region = start;
-    s->size = int128_make64(end - start);
-    return true;
-}
-
 typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void *arg);
 
 static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
@@ -287,7 +265,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
                                       first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -319,7 +297,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
                                  first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -355,7 +333,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
         rdl->notify_discard(rdl, &tmp);
@@ -371,7 +349,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
         ret = rdl->notify_populate(rdl, &tmp);
@@ -388,7 +366,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
             if (rdl2 == rdl) {
                 break;
             }
-            if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                 continue;
             }
             rdl2->notify_discard(rdl2, &tmp);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 3ee1901b52..3bebc43d59 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1202,6 +1202,33 @@ MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s);
  */
 void memory_region_section_free_copy(MemoryRegionSection *s);
 
+/**
+ * memory_region_section_intersect_range: Adjust the memory section to cover
+ * the intersection with the given range.
+ *
+ * @s: the #MemoryRegionSection to be adjusted
+ * @offset: the offset of the given range in the memory region
+ * @size: the size of the given range
+ *
+ * Returns false if the intersection is empty, otherwise returns true.
+ */
+static inline bool memory_region_section_intersect_range(MemoryRegionSection *s,
+                                                         uint64_t offset, uint64_t size)
+{
+    uint64_t start = MAX(s->offset_within_region, offset);
+    Int128 end = int128_min(int128_add(int128_make64(s->offset_within_region), s->size),
+                            int128_add(int128_make64(offset), int128_make64(size)));
+
+    if (int128_le(end, int128_make64(start))) {
+        return false;
+    }
+
+    s->offset_within_address_space += start - s->offset_within_region;
+    s->offset_within_region = start;
+    s->size = int128_sub(end, int128_make64(start));
+    return true;
+}
+
 /**
  * memory_region_init: Initialize a memory region
  *
-- 
2.43.5




* [PATCH v3 2/7] memory: Change memory_region_set_ram_discard_manager() to return the result
  2025-03-10  8:18 [PATCH v3 0/7] Enable shared device assignment Chenyi Qiang
  2025-03-10  8:18 ` [PATCH v3 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
@ 2025-03-10  8:18 ` Chenyi Qiang
  2025-03-10  8:18 ` [PATCH v3 3/7] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-10  8:18 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
	Gao Chao, Xu Yilun, Li Xiaoyao

Modify memory_region_set_ram_discard_manager() to return an error (-EBUSY)
if a RamDiscardManager is already set in the MemoryRegion. The caller must
handle this failure, such as having virtio-mem undo its actions and fail
the realize() process. Opportunistically move the call earlier to avoid
complex error handling.

This change is beneficial when introducing a new RamDiscardManager
instance besides virtio-mem. After
ram_block_coordinated_discard_require(true) unlocks all
RamDiscardManager instances, only one instance is allowed to be set for
a MemoryRegion at present.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v3:
    - Move set_ram_discard_manager() up to avoid a g_free()
    - Clean up set_ram_discard_manager() definition

Changes in v2:
    - newly added.
---
 hw/virtio/virtio-mem.c | 29 ++++++++++++++++-------------
 include/exec/memory.h  |  6 +++---
 system/memory.c        | 10 +++++++---
 3 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 21f16e4912..d0d3a0240f 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1049,6 +1049,17 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    /*
+     * Set ourselves as RamDiscardManager before the plug handler maps the
+     * memory region and exposes it via an address space.
+     */
+    if (memory_region_set_ram_discard_manager(&vmem->memdev->mr,
+                                              RAM_DISCARD_MANAGER(vmem))) {
+        error_setg(errp, "Failed to set RamDiscardManager");
+        ram_block_coordinated_discard_require(false);
+        return;
+    }
+
     /*
      * We don't know at this point whether shared RAM is migrated using
      * QEMU or migrated using the file content. "x-ignore-shared" will be
@@ -1124,13 +1135,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
     vmem->system_reset = VIRTIO_MEM_SYSTEM_RESET(obj);
     vmem->system_reset->vmem = vmem;
     qemu_register_resettable(obj);
-
-    /*
-     * Set ourselves as RamDiscardManager before the plug handler maps the
-     * memory region and exposes it via an address space.
-     */
-    memory_region_set_ram_discard_manager(&vmem->memdev->mr,
-                                          RAM_DISCARD_MANAGER(vmem));
 }
 
 static void virtio_mem_device_unrealize(DeviceState *dev)
@@ -1138,12 +1142,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOMEM *vmem = VIRTIO_MEM(dev);
 
-    /*
-     * The unplug handler unmapped the memory region, it cannot be
-     * found via an address space anymore. Unset ourselves.
-     */
-    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
-
     qemu_unregister_resettable(OBJECT(vmem->system_reset));
     object_unref(OBJECT(vmem->system_reset));
 
@@ -1156,6 +1154,11 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     virtio_del_queue(vdev, 0);
     virtio_cleanup(vdev);
     g_free(vmem->bitmap);
+    /*
+     * The unplug handler unmapped the memory region, it cannot be
+     * found via an address space anymore. Unset ourselves.
+     */
+    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
     ram_block_coordinated_discard_require(false);
 }
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 3bebc43d59..390477b588 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2487,13 +2487,13 @@ static inline bool memory_region_has_ram_discard_manager(MemoryRegion *mr)
  *
  * This function must not be called for a mapped #MemoryRegion, a #MemoryRegion
  * that does not cover RAM, or a #MemoryRegion that already has a
- * #RamDiscardManager assigned.
+ * #RamDiscardManager assigned. Return 0 if the rdm is set successfully.
  *
  * @mr: the #MemoryRegion
  * @rdm: #RamDiscardManager to set
  */
-void memory_region_set_ram_discard_manager(MemoryRegion *mr,
-                                           RamDiscardManager *rdm);
+int memory_region_set_ram_discard_manager(MemoryRegion *mr,
+                                          RamDiscardManager *rdm);
 
 /**
  * memory_region_find: translate an address/size relative to a
diff --git a/system/memory.c b/system/memory.c
index b17b5538ff..62d6b410f0 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2115,12 +2115,16 @@ RamDiscardManager *memory_region_get_ram_discard_manager(MemoryRegion *mr)
     return mr->rdm;
 }
 
-void memory_region_set_ram_discard_manager(MemoryRegion *mr,
-                                           RamDiscardManager *rdm)
+int memory_region_set_ram_discard_manager(MemoryRegion *mr,
+                                          RamDiscardManager *rdm)
 {
     g_assert(memory_region_is_ram(mr));
-    g_assert(!rdm || !mr->rdm);
+    if (mr->rdm && rdm) {
+        return -EBUSY;
+    }
+
     mr->rdm = rdm;
+    return 0;
 }
 
 uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
-- 
2.43.5




* [PATCH v3 3/7] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-03-10  8:18 [PATCH v3 0/7] Enable shared device assignment Chenyi Qiang
  2025-03-10  8:18 ` [PATCH v3 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
  2025-03-10  8:18 ` [PATCH v3 2/7] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
@ 2025-03-10  8:18 ` Chenyi Qiang
  2025-03-17  3:00   ` Kishen Maloor
  2025-03-10  8:18 ` [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBlock with guest_memfd Chenyi Qiang
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-10  8:18 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
	Gao Chao, Xu Yilun, Li Xiaoyao

Update the ReplayRamDiscard() function to return the result, and unify
ReplayRamPopulate() and ReplayRamDiscard() into a single
ReplayRamStateChange(), since their definitions are otherwise identical.
This unification simplifies related structures such as
VirtIOMEMReplayData, making them cleaner and more maintainable. It also
paves the way for future extensions that need the discard replay path to
report results.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v3:
    - Newly added.
---
 hw/virtio/virtio-mem.c | 20 ++++++++++----------
 include/exec/memory.h  | 31 ++++++++++++++++---------------
 migration/ram.c        |  5 +++--
 system/memory.c        | 12 ++++++------
 4 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index d0d3a0240f..816ae8abdd 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1733,7 +1733,7 @@ static bool virtio_mem_rdm_is_populated(const RamDiscardManager *rdm,
 }
 
 struct VirtIOMEMReplayData {
-    void *fn;
+    ReplayRamStateChange fn;
     void *opaque;
 };
 
@@ -1741,12 +1741,12 @@ static int virtio_mem_rdm_replay_populated_cb(MemoryRegionSection *s, void *arg)
 {
     struct VirtIOMEMReplayData *data = arg;
 
-    return ((ReplayRamPopulate)data->fn)(s, data->opaque);
+    return data->fn(s, data->opaque);
 }
 
 static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm,
                                            MemoryRegionSection *s,
-                                           ReplayRamPopulate replay_fn,
+                                           ReplayRamStateChange replay_fn,
                                            void *opaque)
 {
     const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
@@ -1765,14 +1765,14 @@ static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s,
 {
     struct VirtIOMEMReplayData *data = arg;
 
-    ((ReplayRamDiscard)data->fn)(s, data->opaque);
+    data->fn(s, data->opaque);
     return 0;
 }
 
-static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
-                                            MemoryRegionSection *s,
-                                            ReplayRamDiscard replay_fn,
-                                            void *opaque)
+static int virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
+                                           MemoryRegionSection *s,
+                                           ReplayRamStateChange replay_fn,
+                                           void *opaque)
 {
     const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
     struct VirtIOMEMReplayData data = {
@@ -1781,8 +1781,8 @@ static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
     };
 
     g_assert(s->mr == &vmem->memdev->mr);
-    virtio_mem_for_each_unplugged_section(vmem, s, &data,
-                                          virtio_mem_rdm_replay_discarded_cb);
+    return virtio_mem_for_each_unplugged_section(vmem, s, &data,
+                                                 virtio_mem_rdm_replay_discarded_cb);
 }
 
 static void virtio_mem_rdm_register_listener(RamDiscardManager *rdm,
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 390477b588..aa67d84329 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -566,8 +566,7 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
     rdl->double_discard_supported = double_discard_supported;
 }
 
-typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
-typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);
+typedef int (*ReplayRamStateChange)(MemoryRegionSection *section, void *opaque);
 
 /*
  * RamDiscardManagerClass:
@@ -641,36 +640,38 @@ struct RamDiscardManagerClass {
     /**
      * @replay_populated:
      *
-     * Call the #ReplayRamPopulate callback for all populated parts within the
+     * Call the #ReplayRamStateChange callback for all populated parts within the
      * #MemoryRegionSection via the #RamDiscardManager.
      *
      * In case any call fails, no further calls are made.
      *
      * @rdm: the #RamDiscardManager
      * @section: the #MemoryRegionSection
-     * @replay_fn: the #ReplayRamPopulate callback
+     * @replay_fn: the #ReplayRamStateChange callback
      * @opaque: pointer to forward to the callback
      *
      * Returns 0 on success, or a negative error if any notification failed.
      */
     int (*replay_populated)(const RamDiscardManager *rdm,
                             MemoryRegionSection *section,
-                            ReplayRamPopulate replay_fn, void *opaque);
+                            ReplayRamStateChange replay_fn, void *opaque);
 
     /**
      * @replay_discarded:
      *
-     * Call the #ReplayRamDiscard callback for all discarded parts within the
+     * Call the #ReplayRamStateChange callback for all discarded parts within the
      * #MemoryRegionSection via the #RamDiscardManager.
      *
      * @rdm: the #RamDiscardManager
      * @section: the #MemoryRegionSection
-     * @replay_fn: the #ReplayRamDiscard callback
+     * @replay_fn: the #ReplayRamStateChange callback
      * @opaque: pointer to forward to the callback
+     *
+     * Returns 0 on success, or a negative error if any notification failed.
      */
-    void (*replay_discarded)(const RamDiscardManager *rdm,
-                             MemoryRegionSection *section,
-                             ReplayRamDiscard replay_fn, void *opaque);
+    int (*replay_discarded)(const RamDiscardManager *rdm,
+                            MemoryRegionSection *section,
+                            ReplayRamStateChange replay_fn, void *opaque);
 
     /**
      * @register_listener:
@@ -713,13 +714,13 @@ bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
 
 int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
                                          MemoryRegionSection *section,
-                                         ReplayRamPopulate replay_fn,
+                                         ReplayRamStateChange replay_fn,
                                          void *opaque);
 
-void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
-                                          MemoryRegionSection *section,
-                                          ReplayRamDiscard replay_fn,
-                                          void *opaque);
+int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+                                         MemoryRegionSection *section,
+                                         ReplayRamStateChange replay_fn,
+                                         void *opaque);
 
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
                                            RamDiscardListener *rdl,
diff --git a/migration/ram.c b/migration/ram.c
index ce28328141..053730367b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -816,8 +816,8 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
     return ret;
 }
 
-static void dirty_bitmap_clear_section(MemoryRegionSection *section,
-                                       void *opaque)
+static int dirty_bitmap_clear_section(MemoryRegionSection *section,
+                                      void *opaque)
 {
     const hwaddr offset = section->offset_within_region;
     const hwaddr size = int128_get64(section->size);
@@ -836,6 +836,7 @@ static void dirty_bitmap_clear_section(MemoryRegionSection *section,
     }
     *cleared_bits += bitmap_count_one_with_offset(rb->bmap, start, npages);
     bitmap_clear(rb->bmap, start, npages);
+    return 0;
 }
 
 /*
diff --git a/system/memory.c b/system/memory.c
index 62d6b410f0..8622d17133 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2147,7 +2147,7 @@ bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
 
 int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
                                          MemoryRegionSection *section,
-                                         ReplayRamPopulate replay_fn,
+                                         ReplayRamStateChange replay_fn,
                                          void *opaque)
 {
     RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
@@ -2156,15 +2156,15 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
     return rdmc->replay_populated(rdm, section, replay_fn, opaque);
 }
 
-void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
-                                          MemoryRegionSection *section,
-                                          ReplayRamDiscard replay_fn,
-                                          void *opaque)
+int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+                                         MemoryRegionSection *section,
+                                         ReplayRamStateChange replay_fn,
+                                         void *opaque)
 {
     RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
 
     g_assert(rdmc->replay_discarded);
-    rdmc->replay_discarded(rdm, section, replay_fn, opaque);
+    return rdmc->replay_discarded(rdm, section, replay_fn, opaque);
 }
 
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
-- 
2.43.5




* [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBlock with guest_memfd
  2025-03-10  8:18 [PATCH v3 0/7] Enable shared device assignment Chenyi Qiang
                   ` (2 preceding siblings ...)
  2025-03-10  8:18 ` [PATCH v3 3/7] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
@ 2025-03-10  8:18 ` Chenyi Qiang
  2025-03-14 12:11   ` Gupta, Pankaj
  2025-03-10  8:18 ` [PATCH v3 5/7] memory-attribute-manager: Introduce a callback to notify the shared/private state change Chenyi Qiang
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-10  8:18 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
	Gao Chao, Xu Yilun, Li Xiaoyao

As commit 852f0048f3 ("RAMBlock: make guest_memfd require
uncoordinated discard") highlighted, some subsystems like VFIO may
disable ram block discard. However, guest_memfd relies on the discard
operation to perform page conversion between private and shared memory.
This can lead to a stale IOMMU mapping issue when assigning a hardware
device to a confidential VM via shared memory. To address this, it is
crucial to ensure that systems like VFIO refresh their IOMMU mappings.

RamDiscardManager is an existing concept (used by virtio-mem) to adjust
VFIO mappings in relation to VM page assignment. Effectively page
conversion is similar to hot-removing a page in one mode and adding it
back in the other. Therefore, similar actions are required for page
conversion events. Introduce the RamDiscardManager to guest_memfd to
facilitate this process.

Since guest_memfd is not an object, it cannot directly implement the
RamDiscardManager interface. One potential approach is to implement it in
HostMemoryBackend. This is not appropriate because guest_memfd is per
RAMBlock: some RAMBlocks have a memory backend while others do not. In
particular, RAMBlocks created via memory_region_init_ram_guest_memfd(),
such as the virtual BIOS one, do not.

To manage RAMBlocks with guest_memfd, define a new object named
MemoryAttributeManager that implements the RamDiscardManager interface.
The object stores guest_memfd information such as the shared_bitmap, and
handles page conversion notification. The name MemoryAttributeManager is
chosen to be generic. The term "Memory" encompasses not only RAM but
also private MMIO in TEE I/O, which might rely on this object/interface
to handle page conversion events in the future. The term "Attribute"
allows for the management of various attributes beyond shared and
private. For instance, it could support scenarios where discarded vs.
populated and shared vs. private states coexist, such as supporting
virtio-mem or something similar in the future.

In the current context, MemoryAttributeManager signifies the discarded
state as private and the populated state as shared. Memory state is
tracked at host page size granularity, as the minimum memory conversion
size can be one page per request. Additionally, VFIO expects the DMA
mapping for a specific iova to be mapped and unmapped with the same
granularity. Confidential VMs may perform partial conversions, e.g. a
conversion that targets a small region within a larger one. To prevent
such invalid cases, and until the cut_mapping operation support is
introduced, all operations are performed with 4K granularity.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v3:
    - Some rename (bitmap_size->shared_bitmap_size,
      first_one/zero_bit->first_bit, etc.)
    - Change shared_bitmap_size from uint32_t to unsigned
    - Return mgr->mr->ram_block->page_size in get_block_size()
    - Move set_ram_discard_manager() up to avoid a g_free() in failure
      case.
    - Add const for the memory_attribute_manager_get_block_size()
    - Unify the ReplayRamPopulate and ReplayRamDiscard and related
      callback.

Changes in v2:
    - Rename the object name to MemoryAttributeManager
    - Rename the bitmap to shared_bitmap to make it more clear.
    - Remove block_size field and get it from a helper. In future, we
      can get the page_size from RAMBlock if necessary.
    - Remove the unncessary "struct" before GuestMemfdReplayData
    - Remove the unncessary g_free() for the bitmap
    - Add some error report when the callback failure for
      populated/discarded section.
    - Move the realize()/unrealize() definition to this patch.
---
 include/system/memory-attribute-manager.h |  42 ++++
 system/memory-attribute-manager.c         | 283 ++++++++++++++++++++++
 system/meson.build                        |   1 +
 3 files changed, 326 insertions(+)
 create mode 100644 include/system/memory-attribute-manager.h
 create mode 100644 system/memory-attribute-manager.c

diff --git a/include/system/memory-attribute-manager.h b/include/system/memory-attribute-manager.h
new file mode 100644
index 0000000000..23375a14b8
--- /dev/null
+++ b/include/system/memory-attribute-manager.h
@@ -0,0 +1,42 @@
+/*
+ * QEMU memory attribute manager
+ *
+ * Copyright Intel
+ *
+ * Author:
+ *      Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#ifndef SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
+#define SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
+
+#include "system/hostmem.h"
+
+#define TYPE_MEMORY_ATTRIBUTE_MANAGER "memory-attribute-manager"
+
+OBJECT_DECLARE_TYPE(MemoryAttributeManager, MemoryAttributeManagerClass, MEMORY_ATTRIBUTE_MANAGER)
+
+struct MemoryAttributeManager {
+    Object parent;
+
+    MemoryRegion *mr;
+
+    /* A set bit means the corresponding page is populated (shared) */
+    unsigned shared_bitmap_size;
+    unsigned long *shared_bitmap;
+
+    QLIST_HEAD(, RamDiscardListener) rdl_list;
+};
+
+struct MemoryAttributeManagerClass {
+    ObjectClass parent_class;
+};
+
+int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr);
+void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
+
+#endif
diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
new file mode 100644
index 0000000000..7c3789cf49
--- /dev/null
+++ b/system/memory-attribute-manager.c
@@ -0,0 +1,283 @@
+/*
+ * QEMU memory attribute manager
+ *
+ * Copyright Intel
+ *
+ * Author:
+ *      Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "exec/ramblock.h"
+#include "system/memory-attribute-manager.h"
+
+OBJECT_DEFINE_TYPE_WITH_INTERFACES(MemoryAttributeManager,
+                                   memory_attribute_manager,
+                                   MEMORY_ATTRIBUTE_MANAGER,
+                                   OBJECT,
+                                   { TYPE_RAM_DISCARD_MANAGER },
+                                   { })
+
+static size_t memory_attribute_manager_get_block_size(const MemoryAttributeManager *mgr)
+{
+    /*
+     * Page conversions are requested with at least 4K size and alignment,
+     * so use the host page size as the granularity to track memory attributes.
+     */
+    g_assert(mgr && mgr->mr && mgr->mr->ram_block);
+    g_assert(mgr->mr->ram_block->page_size == qemu_real_host_page_size());
+    return mgr->mr->ram_block->page_size;
+}
+
+
+static bool memory_attribute_rdm_is_populated(const RamDiscardManager *rdm,
+                                              const MemoryRegionSection *section)
+{
+    const MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
+    const int block_size = memory_attribute_manager_get_block_size(mgr);
+    uint64_t first_bit = section->offset_within_region / block_size;
+    uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
+    unsigned long first_discard_bit;
+
+    first_discard_bit = find_next_zero_bit(mgr->shared_bitmap, last_bit + 1, first_bit);
+    return first_discard_bit > last_bit;
+}
+
+typedef int (*memory_attribute_section_cb)(MemoryRegionSection *s, void *arg);
+
+static int memory_attribute_notify_populate_cb(MemoryRegionSection *section, void *arg)
+{
+    RamDiscardListener *rdl = arg;
+
+    return rdl->notify_populate(rdl, section);
+}
+
+static int memory_attribute_notify_discard_cb(MemoryRegionSection *section, void *arg)
+{
+    RamDiscardListener *rdl = arg;
+
+    rdl->notify_discard(rdl, section);
+
+    return 0;
+}
+
+static int memory_attribute_for_each_populated_section(const MemoryAttributeManager *mgr,
+                                                       MemoryRegionSection *section,
+                                                       void *arg,
+                                                       memory_attribute_section_cb cb)
+{
+    unsigned long first_bit, last_bit;
+    uint64_t offset, size;
+    const int block_size = memory_attribute_manager_get_block_size(mgr);
+    int ret = 0;
+
+    first_bit = section->offset_within_region / block_size;
+    first_bit = find_next_bit(mgr->shared_bitmap, mgr->shared_bitmap_size, first_bit);
+
+    while (first_bit < mgr->shared_bitmap_size) {
+        MemoryRegionSection tmp = *section;
+
+        offset = first_bit * block_size;
+        last_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
+                                      first_bit + 1) - 1;
+        size = (last_bit - first_bit + 1) * block_size;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            break;
+        }
+
+        ret = cb(&tmp, arg);
+        if (ret) {
+            error_report("%s: Failed to notify RAM discard listener: %s", __func__,
+                         strerror(-ret));
+            break;
+        }
+
+        first_bit = find_next_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
+                                  last_bit + 2);
+    }
+
+    return ret;
+}
+
+static int memory_attribute_for_each_discarded_section(const MemoryAttributeManager *mgr,
+                                                       MemoryRegionSection *section,
+                                                       void *arg,
+                                                       memory_attribute_section_cb cb)
+{
+    unsigned long first_bit, last_bit;
+    uint64_t offset, size;
+    const int block_size = memory_attribute_manager_get_block_size(mgr);
+    int ret = 0;
+
+    first_bit = section->offset_within_region / block_size;
+    first_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
+                                   first_bit);
+
+    while (first_bit < mgr->shared_bitmap_size) {
+        MemoryRegionSection tmp = *section;
+
+        offset = first_bit * block_size;
+        last_bit = find_next_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
+                                      first_bit + 1) - 1;
+        size = (last_bit - first_bit + 1) * block_size;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            break;
+        }
+
+        ret = cb(&tmp, arg);
+        if (ret) {
+            error_report("%s: Failed to notify RAM discard listener: %s", __func__,
+                         strerror(-ret));
+            break;
+        }
+
+        first_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
+                                       last_bit + 2);
+    }
+
+    return ret;
+}
+
+static uint64_t memory_attribute_rdm_get_min_granularity(const RamDiscardManager *rdm,
+                                                         const MemoryRegion *mr)
+{
+    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
+
+    g_assert(mr == mgr->mr);
+    return memory_attribute_manager_get_block_size(mgr);
+}
+
+static void memory_attribute_rdm_register_listener(RamDiscardManager *rdm,
+                                                   RamDiscardListener *rdl,
+                                                   MemoryRegionSection *section)
+{
+    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
+    int ret;
+
+    g_assert(section->mr == mgr->mr);
+    rdl->section = memory_region_section_new_copy(section);
+
+    QLIST_INSERT_HEAD(&mgr->rdl_list, rdl, next);
+
+    ret = memory_attribute_for_each_populated_section(mgr, section, rdl,
+                                                      memory_attribute_notify_populate_cb);
+    if (ret) {
+        error_report("%s: Failed to register RAM discard listener: %s", __func__,
+                     strerror(-ret));
+    }
+}
+
+static void memory_attribute_rdm_unregister_listener(RamDiscardManager *rdm,
+                                                     RamDiscardListener *rdl)
+{
+    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
+    int ret;
+
+    g_assert(rdl->section);
+    g_assert(rdl->section->mr == mgr->mr);
+
+    ret = memory_attribute_for_each_populated_section(mgr, rdl->section, rdl,
+                                                      memory_attribute_notify_discard_cb);
+    if (ret) {
+        error_report("%s: Failed to unregister RAM discard listener: %s", __func__,
+                     strerror(-ret));
+    }
+
+    memory_region_section_free_copy(rdl->section);
+    rdl->section = NULL;
+    QLIST_REMOVE(rdl, next);
+
+}
+
+typedef struct MemoryAttributeReplayData {
+    ReplayRamStateChange fn;
+    void *opaque;
+} MemoryAttributeReplayData;
+
+static int memory_attribute_rdm_replay_cb(MemoryRegionSection *section, void *arg)
+{
+    MemoryAttributeReplayData *data = arg;
+
+    return data->fn(section, data->opaque);
+}
+
+static int memory_attribute_rdm_replay_populated(const RamDiscardManager *rdm,
+                                                 MemoryRegionSection *section,
+                                                 ReplayRamStateChange replay_fn,
+                                                 void *opaque)
+{
+    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
+    MemoryAttributeReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == mgr->mr);
+    return memory_attribute_for_each_populated_section(mgr, section, &data,
+                                                       memory_attribute_rdm_replay_cb);
+}
+
+static int memory_attribute_rdm_replay_discarded(const RamDiscardManager *rdm,
+                                                 MemoryRegionSection *section,
+                                                 ReplayRamStateChange replay_fn,
+                                                 void *opaque)
+{
+    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
+    MemoryAttributeReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == mgr->mr);
+    return memory_attribute_for_each_discarded_section(mgr, section, &data,
+                                                       memory_attribute_rdm_replay_cb);
+}
+
+int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr)
+{
+    uint64_t shared_bitmap_size;
+    const int block_size  = qemu_real_host_page_size();
+    int ret;
+
+    shared_bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
+
+    mgr->mr = mr;
+    ret = memory_region_set_ram_discard_manager(mgr->mr, RAM_DISCARD_MANAGER(mgr));
+    if (ret) {
+        return ret;
+    }
+    mgr->shared_bitmap_size = shared_bitmap_size;
+    mgr->shared_bitmap = bitmap_new(shared_bitmap_size);
+
+    return ret;
+}
+
+void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr)
+{
+    g_free(mgr->shared_bitmap);
+    memory_region_set_ram_discard_manager(mgr->mr, NULL);
+}
+
+static void memory_attribute_manager_init(Object *obj)
+{
+    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(obj);
+
+    QLIST_INIT(&mgr->rdl_list);
+}
+
+static void memory_attribute_manager_finalize(Object *obj)
+{
+}
+
+static void memory_attribute_manager_class_init(ObjectClass *oc, void *data)
+{
+    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
+
+    rdmc->get_min_granularity = memory_attribute_rdm_get_min_granularity;
+    rdmc->register_listener = memory_attribute_rdm_register_listener;
+    rdmc->unregister_listener = memory_attribute_rdm_unregister_listener;
+    rdmc->is_populated = memory_attribute_rdm_is_populated;
+    rdmc->replay_populated = memory_attribute_rdm_replay_populated;
+    rdmc->replay_discarded = memory_attribute_rdm_replay_discarded;
+}
diff --git a/system/meson.build b/system/meson.build
index 4952f4b2c7..ab07ff1442 100644
--- a/system/meson.build
+++ b/system/meson.build
@@ -15,6 +15,7 @@ system_ss.add(files(
   'dirtylimit.c',
   'dma-helpers.c',
   'globals.c',
+  'memory-attribute-manager.c',
   'memory_mapping.c',
   'qdev-monitor.c',
   'qtest.c',
-- 
2.43.5




* [PATCH v3 5/7] memory-attribute-manager: Introduce a callback to notify the shared/private state change
  2025-03-10  8:18 [PATCH v3 0/7] Enable shared device assignment Chenyi Qiang
                   ` (3 preceding siblings ...)
  2025-03-10  8:18 ` [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBlock with guest_memfd Chenyi Qiang
@ 2025-03-10  8:18 ` Chenyi Qiang
  2025-03-10  8:18 ` [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks Chenyi Qiang
  2025-03-10  8:18 ` [PATCH v3 7/7] RAMBlock: Make guest_memfd require coordinated discard Chenyi Qiang
  6 siblings, 0 replies; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-10  8:18 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
	Gao Chao, Xu Yilun, Li Xiaoyao

Introduce a new state_change() callback in MemoryAttributeManagerClass to
efficiently notify all registered RamDiscardListeners, including VFIO
listeners, about memory conversion events in guest_memfd. The existing
VFIO listener can dynamically DMA map/unmap the shared pages based on
the conversion type:
- For conversions from shared to private, the VFIO system ensures that
  the shared mapping is discarded from the IOMMU.
- For conversions from private to shared, it triggers the population of
  the shared mapping into the IOMMU.

Additionally, there can be some special conversion requests:
- When a conversion request is made for a page already in the desired
  state, the helper simply returns success.
- For requests involving a range only partially in the desired state,
  only the necessary segments are converted, so that the entire range
  ends up complying with the request. In this case, the helper falls
  back to "1 block at a time" handling.
- In scenarios where a conversion request is declined by other systems,
  such as a failure from VFIO during notify_populate(), the helper rolls
  back the request, maintaining consistency.

Note that the bitmap status is updated before the notifier callbacks so
that the listener can handle the memory based on the latest status.

Opportunistically introduce a helper to trigger the state_change()
callback of the class.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v3:
    - Move the bitmap update before notifier callbacks.
    - Call the notifier callbacks directly in notify_discard/populate()
      with the expectation that the request memory range is in the
      desired attribute.
    - For the case that only partial range in the desire status, handle
      the range with block_size granularity for ease of rollback
      (https://lore.kernel.org/qemu-devel/812768d7-a02d-4b29-95f3-fb7a125cf54e@redhat.com/)

Changes in v2:
    - Do the alignment changes due to the rename to MemoryAttributeManager
    - Move the state_change() helper definition in this patch.
---
 include/system/memory-attribute-manager.h |  18 +++
 system/memory-attribute-manager.c         | 188 ++++++++++++++++++++++
 2 files changed, 206 insertions(+)

diff --git a/include/system/memory-attribute-manager.h b/include/system/memory-attribute-manager.h
index 23375a14b8..3d9227d62a 100644
--- a/include/system/memory-attribute-manager.h
+++ b/include/system/memory-attribute-manager.h
@@ -34,8 +34,26 @@ struct MemoryAttributeManager {
 
 struct MemoryAttributeManagerClass {
     ObjectClass parent_class;
+
+    int (*state_change)(MemoryAttributeManager *mgr, uint64_t offset, uint64_t size,
+                        bool to_private);
 };
 
+static inline int memory_attribute_manager_state_change(MemoryAttributeManager *mgr, uint64_t offset,
+                                                        uint64_t size, bool to_private)
+{
+    MemoryAttributeManagerClass *klass;
+
+    if (mgr == NULL) {
+        return 0;
+    }
+
+    klass = MEMORY_ATTRIBUTE_MANAGER_GET_CLASS(mgr);
+
+    g_assert(klass->state_change);
+    return klass->state_change(mgr, offset, size, to_private);
+}
+
 int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr);
 void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
 
diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
index 7c3789cf49..6456babc95 100644
--- a/system/memory-attribute-manager.c
+++ b/system/memory-attribute-manager.c
@@ -234,6 +234,191 @@ static int memory_attribute_rdm_replay_discarded(const RamDiscardManager *rdm,
                                                        memory_attribute_rdm_replay_cb);
 }
 
+static bool memory_attribute_is_valid_range(MemoryAttributeManager *mgr,
+                                            uint64_t offset, uint64_t size)
+{
+    MemoryRegion *mr = mgr->mr;
+
+    g_assert(mr);
+
+    uint64_t region_size = memory_region_size(mr);
+    int block_size = memory_attribute_manager_get_block_size(mgr);
+
+    if (!QEMU_IS_ALIGNED(offset, block_size)) {
+        return false;
+    }
+    if (offset + size < offset || !size) {
+        return false;
+    }
+    if (offset >= region_size || offset + size > region_size) {
+        return false;
+    }
+    return true;
+}
+
+static void memory_attribute_notify_discard(MemoryAttributeManager *mgr,
+                                            uint64_t offset, uint64_t size)
+{
+    RamDiscardListener *rdl;
+
+    QLIST_FOREACH(rdl, &mgr->rdl_list, next) {
+        MemoryRegionSection tmp = *rdl->section;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            continue;
+        }
+        rdl->notify_discard(rdl, &tmp);
+    }
+}
+
+static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
+                                            uint64_t offset, uint64_t size)
+{
+    RamDiscardListener *rdl, *rdl2;
+    int ret = 0;
+
+    QLIST_FOREACH(rdl, &mgr->rdl_list, next) {
+        MemoryRegionSection tmp = *rdl->section;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            continue;
+        }
+        ret = rdl->notify_populate(rdl, &tmp);
+        if (ret) {
+            break;
+        }
+    }
+
+    if (ret) {
+        /* Notify all already-notified listeners. */
+        QLIST_FOREACH(rdl2, &mgr->rdl_list, next) {
+            MemoryRegionSection tmp = *rdl2->section;
+
+            if (rdl2 == rdl) {
+                break;
+            }
+            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+                continue;
+            }
+            rdl2->notify_discard(rdl2, &tmp);
+        }
+    }
+    return ret;
+}
+
+static bool memory_attribute_is_range_populated(MemoryAttributeManager *mgr,
+                                                uint64_t offset, uint64_t size)
+{
+    const int block_size = memory_attribute_manager_get_block_size(mgr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long last_bit = first_bit + (size / block_size) - 1;
+    unsigned long found_bit;
+
+    /* We fake a shorter bitmap to avoid searching too far. */
+    found_bit = find_next_zero_bit(mgr->shared_bitmap, last_bit + 1, first_bit);
+    return found_bit > last_bit;
+}
+
+static bool memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
+                                                uint64_t offset, uint64_t size)
+{
+    const int block_size = memory_attribute_manager_get_block_size(mgr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long last_bit = first_bit + (size / block_size) - 1;
+    unsigned long found_bit;
+
+    /* We fake a shorter bitmap to avoid searching too far. */
+    found_bit = find_next_bit(mgr->shared_bitmap, last_bit + 1, first_bit);
+    return found_bit > last_bit;
+}
+
+static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
+                                         uint64_t size, bool to_private)
+{
+    const int block_size = memory_attribute_manager_get_block_size(mgr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long nbits = size / block_size;
+    const uint64_t end = offset + size;
+    unsigned long bit;
+    uint64_t cur;
+    int ret = 0;
+
+    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
+        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
+                     __func__, offset, size);
+        return -1;
+    }
+
+    if (to_private) {
+        if (memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Already private */
+        } else if (!memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Unexpected mixture: process individual blocks */
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (!test_bit(bit, mgr->shared_bitmap)) {
+                    continue;
+                }
+                clear_bit(bit, mgr->shared_bitmap);
+                memory_attribute_notify_discard(mgr, cur, block_size);
+            }
+        } else {
+            /* Completely shared */
+            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            memory_attribute_notify_discard(mgr, offset, size);
+        }
+    } else {
+        if (memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Already shared */
+        } else if (!memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Unexpected mixture: process individual blocks */
+            unsigned long *modified_bitmap = bitmap_new(nbits);
+
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (test_bit(bit, mgr->shared_bitmap)) {
+                    continue;
+                }
+                set_bit(bit, mgr->shared_bitmap);
+                ret = memory_attribute_notify_populate(mgr, cur, block_size);
+                if (!ret) {
+                    set_bit(bit - first_bit, modified_bitmap);
+                    continue;
+                }
+                clear_bit(bit, mgr->shared_bitmap);
+                break;
+            }
+
+            if (ret) {
+                /*
+                 * Very unexpected: something went wrong. Revert to the old
+                 * state, marking as private only the blocks that we actually
+                 * converted to shared.
+                 */
+                for (cur = offset; cur < end; cur += block_size) {
+                    bit = cur / block_size;
+                    if (!test_bit(bit - first_bit, modified_bitmap)) {
+                        continue;
+                    }
+                    assert(test_bit(bit, mgr->shared_bitmap));
+                    clear_bit(bit, mgr->shared_bitmap);
+                    memory_attribute_notify_discard(mgr, cur, block_size);
+                }
+            }
+            g_free(modified_bitmap);
+        } else {
+            /* Completely private */
+            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+            ret = memory_attribute_notify_populate(mgr, offset, size);
+            if (ret) {
+                bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            }
+        }
+    }
+
+    return ret;
+}
+
 int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr)
 {
     uint64_t shared_bitmap_size;
@@ -272,8 +457,11 @@ static void memory_attribute_manager_finalize(Object *obj)
 
 static void memory_attribute_manager_class_init(ObjectClass *oc, void *data)
 {
+    MemoryAttributeManagerClass *mamc = MEMORY_ATTRIBUTE_MANAGER_CLASS(oc);
     RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
 
+    mamc->state_change = memory_attribute_state_change;
+
     rdmc->get_min_granularity = memory_attribute_rdm_get_min_granularity;
     rdmc->register_listener = memory_attribute_rdm_register_listener;
     rdmc->unregister_listener = memory_attribute_rdm_unregister_listener;
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  2025-03-10  8:18 [PATCH v3 0/7] Enable shared device assignment Chenyi Qiang
                   ` (4 preceding siblings ...)
  2025-03-10  8:18 ` [PATCH v3 5/7] memory-attribute-manager: Introduce a callback to notify the shared/private state change Chenyi Qiang
@ 2025-03-10  8:18 ` Chenyi Qiang
  2025-03-14  8:21   ` Chenyi Qiang
  2025-03-17  6:18   ` Tony Lindgren
  2025-03-10  8:18 ` [PATCH v3 7/7] RAMBlock: Make guest_memfd require coordinate discard Chenyi Qiang
  6 siblings, 2 replies; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-10  8:18 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
	Gao Chao, Xu Yilun, Li Xiaoyao

Introduce a new field, memory_attribute_manager, in RAMBlock to link to
a MemoryAttributeManager object. This change centralizes all
guest_memfd state information (like fd and shared_bitmap) within a
RAMBlock, making it easier to manage.

Use the realize()/unrealize() helpers to initialize/uninitialize the
MemoryAttributeManager object. Register/unregister the object in the
target RAMBlock's MemoryRegion when creating guest_memfd.

In the kvm_convert_memory() function, manage memory state changes by
using the shared_bitmap to call set_attribute() only on the specific
memory range. Additionally, use the
memory_attribute_manager_state_change() helper to notify the registered
RamDiscardListener of these changes.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v3:
    - Use ram_discard_manager_replay_populated/discarded() to set the
      memory attribute and add the undo support if state_change()
      failed.
    - Didn't add Reviewed-by from Alexey due to the new changes in this
      commit.

Changes in v2:
    - Introduce a new field memory_attribute_manager in RAMBlock.
    - Move the state_change() handling during page conversion in this patch.
    - Undo what we did if it fails to set.
    - Change the order of close(guest_memfd) and memory_attribute_manager cleanup.
---
 accel/kvm/kvm-all.c     | 50 +++++++++++++++++++++++++++++++++++++++--
 include/exec/ramblock.h |  2 ++
 system/physmem.c        | 13 +++++++++++
 3 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c1fea69d58..a89c5655e8 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -48,6 +48,7 @@
 #include "kvm-cpus.h"
 #include "system/dirtylimit.h"
 #include "qemu/range.h"
+#include "system/memory-attribute-manager.h"
 
 #include "hw/boards.h"
 #include "system/stats.h"
@@ -3018,6 +3019,25 @@ static void kvm_eat_signals(CPUState *cpu)
     } while (sigismember(&chkset, SIG_IPI));
 }
 
+typedef struct SetMemoryAttribute {
+    bool to_private;
+} SetMemoryAttribute;
+
+static int kvm_set_memory_attributes_cb(MemoryRegionSection *section,
+                                        void *opaque)
+{
+    hwaddr start = section->offset_within_address_space;
+    hwaddr size = section->size;
+    SetMemoryAttribute *args = opaque;
+    bool to_private = args->to_private;
+
+    if (to_private) {
+        return kvm_set_memory_attributes_private(start, size);
+    } else {
+        return kvm_set_memory_attributes_shared(start, size);
+    }
+}
+
 int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
 {
     MemoryRegionSection section;
@@ -3026,6 +3046,7 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
     RAMBlock *rb;
     void *addr;
     int ret = -EINVAL;
+    SetMemoryAttribute args = { .to_private = to_private };
 
     trace_kvm_convert_memory(start, size, to_private ? "shared_to_private" : "private_to_shared");
 
@@ -3077,9 +3098,13 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
     }
 
     if (to_private) {
-        ret = kvm_set_memory_attributes_private(start, size);
+        ret = ram_discard_manager_replay_populated(mr->rdm, &section,
+                                                   kvm_set_memory_attributes_cb,
+                                                   &args);
     } else {
-        ret = kvm_set_memory_attributes_shared(start, size);
+        ret = ram_discard_manager_replay_discarded(mr->rdm, &section,
+                                                   kvm_set_memory_attributes_cb,
+                                                   &args);
     }
     if (ret) {
         goto out_unref;
@@ -3088,6 +3113,27 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
     addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
     rb = qemu_ram_block_from_host(addr, false, &offset);
 
+    ret = memory_attribute_manager_state_change(MEMORY_ATTRIBUTE_MANAGER(mr->rdm),
+                                                offset, size, to_private);
+    if (ret) {
+        warn_report("Failed to notify the listener the state change of "
+                    "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
+                    start, size, to_private ? "private" : "shared");
+        args.to_private = !to_private;
+        if (to_private) {
+            ret = ram_discard_manager_replay_populated(mr->rdm, &section,
+                                                       kvm_set_memory_attributes_cb,
+                                                       &args);
+        } else {
+            ret = ram_discard_manager_replay_discarded(mr->rdm, &section,
+                                                       kvm_set_memory_attributes_cb,
+                                                       &args);
+        }
+        if (ret) {
+            goto out_unref;
+        }
+    }
+
     if (to_private) {
         if (rb->page_size != qemu_real_host_page_size()) {
             /*
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 0babd105c0..06fd365326 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -23,6 +23,7 @@
 #include "cpu-common.h"
 #include "qemu/rcu.h"
 #include "exec/ramlist.h"
+#include "system/memory-attribute-manager.h"
 
 struct RAMBlock {
     struct rcu_head rcu;
@@ -42,6 +43,7 @@ struct RAMBlock {
     int fd;
     uint64_t fd_offset;
     int guest_memfd;
+    MemoryAttributeManager *memory_attribute_manager;
     size_t page_size;
     /* dirty bitmap used during migration */
     unsigned long *bmap;
diff --git a/system/physmem.c b/system/physmem.c
index c76503aea8..0ed394c5d2 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -54,6 +54,7 @@
 #include "system/hostmem.h"
 #include "system/hw_accel.h"
 #include "system/xen-mapcache.h"
+#include "system/memory-attribute-manager.h"
 #include "trace.h"
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -1885,6 +1886,16 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
             qemu_mutex_unlock_ramlist();
             goto out_free;
         }
+
+        new_block->memory_attribute_manager = MEMORY_ATTRIBUTE_MANAGER(object_new(TYPE_MEMORY_ATTRIBUTE_MANAGER));
+        if (memory_attribute_manager_realize(new_block->memory_attribute_manager, new_block->mr)) {
+            error_setg(errp, "Failed to realize memory attribute manager");
+            object_unref(OBJECT(new_block->memory_attribute_manager));
+            close(new_block->guest_memfd);
+            ram_block_discard_require(false);
+            qemu_mutex_unlock_ramlist();
+            goto out_free;
+        }
     }
 
     ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
@@ -2138,6 +2149,8 @@ static void reclaim_ramblock(RAMBlock *block)
     }
 
     if (block->guest_memfd >= 0) {
+        memory_attribute_manager_unrealize(block->memory_attribute_manager);
+        object_unref(OBJECT(block->memory_attribute_manager));
         close(block->guest_memfd);
         ram_block_discard_require(false);
     }
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 7/7] RAMBlock: Make guest_memfd require coordinate discard
  2025-03-10  8:18 [PATCH v3 0/7] Enable shared device assignment Chenyi Qiang
                   ` (5 preceding siblings ...)
  2025-03-10  8:18 ` [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks Chenyi Qiang
@ 2025-03-10  8:18 ` Chenyi Qiang
  6 siblings, 0 replies; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-10  8:18 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
	Gao Chao, Xu Yilun, Li Xiaoyao

As guest_memfd is now managed by memory_attribute_manager with
RamDiscardManager, only block uncoordinated discard.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v3:
    - No change.

Changes in v2:
    - Change the ram_block_discard_require(false) to
      ram_block_coordinated_discard_require(false).
---
 system/physmem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 0ed394c5d2..a30cdd43ee 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1872,7 +1872,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         assert(kvm_enabled());
         assert(new_block->guest_memfd < 0);
 
-        ret = ram_block_discard_require(true);
+        ret = ram_block_coordinated_discard_require(true);
         if (ret < 0) {
             error_setg_errno(errp, -ret,
                              "cannot set up private guest memory: discard currently blocked");
@@ -1892,7 +1892,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
             error_setg(errp, "Failed to realize memory attribute manager");
             object_unref(OBJECT(new_block->memory_attribute_manager));
             close(new_block->guest_memfd);
-            ram_block_discard_require(false);
+            ram_block_coordinated_discard_require(false);
             qemu_mutex_unlock_ramlist();
             goto out_free;
         }
@@ -2152,7 +2152,7 @@ static void reclaim_ramblock(RAMBlock *block)
         memory_attribute_manager_unrealize(block->memory_attribute_manager);
         object_unref(OBJECT(block->memory_attribute_manager));
         close(block->guest_memfd);
-        ram_block_discard_require(false);
+        ram_block_coordinated_discard_require(false);
     }
 
     g_free(block);
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  2025-03-10  8:18 ` [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks Chenyi Qiang
@ 2025-03-14  8:21   ` Chenyi Qiang
  2025-03-14  9:00     ` David Hildenbrand
  2025-03-17  6:18   ` Tony Lindgren
  1 sibling, 1 reply; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-14  8:21 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao

Hi David & Alexey,

To keep the bitmap aligned, I add the undo operation for
set_memory_attributes() and use the bitmap + replay callback to do
set_memory_attributes(). Does this change make sense?

Alexey, I didn't add your Reviewed-by since this patch introduced some
new changes.

On 3/10/2025 4:18 PM, Chenyi Qiang wrote:
> Introduce a new field, memory_attribute_manager, in RAMBlock to link to
> a MemoryAttributeManager object. This change centralizes all
> guest_memfd state information (like fd and shared_bitmap) within a
> RAMBlock, making it easier to manage.
> 
> Use the realize()/unrealize() helpers to initialize/uninitialize the
> MemoryAttributeManager object. Register/unregister the object in the
> target RAMBlock's MemoryRegion when creating guest_memfd.
> 
> In the kvm_convert_memory() function, manage memory state changes by
> using the shared_bitmap to call set_attribute() only on the specific
> memory range. Additionally, use the
> memory_attribute_manager_state_change() helper to notify the registered
> RamDiscardListener of these changes.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v3:
>     - Use ram_discard_manager_replay_populated/discarded() to set the
>       memory attribute and add the undo support if state_change()
>       failed.
>     - Didn't add Reviewed-by from Alexey due to the new changes in this
>       commit.
> 
> Changes in v2:
>     - Introduce a new field memory_attribute_manager in RAMBlock.
>     - Move the state_change() handling during page conversion in this patch.
>     - Undo what we did if it fails to set.
>     - Change the order of close(guest_memfd) and memory_attribute_manager cleanup.
> ---
>  accel/kvm/kvm-all.c     | 50 +++++++++++++++++++++++++++++++++++++++--
>  include/exec/ramblock.h |  2 ++
>  system/physmem.c        | 13 +++++++++++
>  3 files changed, 63 insertions(+), 2 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index c1fea69d58..a89c5655e8 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -48,6 +48,7 @@
>  #include "kvm-cpus.h"
>  #include "system/dirtylimit.h"
>  #include "qemu/range.h"
> +#include "system/memory-attribute-manager.h"
>  
>  #include "hw/boards.h"
>  #include "system/stats.h"
> @@ -3018,6 +3019,25 @@ static void kvm_eat_signals(CPUState *cpu)
>      } while (sigismember(&chkset, SIG_IPI));
>  }
>  
> +typedef struct SetMemoryAttribute {
> +    bool to_private;
> +} SetMemoryAttribute;
> +
> +static int kvm_set_memory_attributes_cb(MemoryRegionSection *section,
> +                                        void *opaque)
> +{
> +    hwaddr start = section->offset_within_address_space;
> +    hwaddr size = section->size;
> +    SetMemoryAttribute *args = opaque;
> +    bool to_private = args->to_private;
> +
> +    if (to_private) {
> +        return kvm_set_memory_attributes_private(start, size);
> +    } else {
> +        return kvm_set_memory_attributes_shared(start, size);
> +    }
> +}
> +
>  int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
>  {
>      MemoryRegionSection section;
> @@ -3026,6 +3046,7 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
>      RAMBlock *rb;
>      void *addr;
>      int ret = -EINVAL;
> +    SetMemoryAttribute args = { .to_private = to_private };
>  
>      trace_kvm_convert_memory(start, size, to_private ? "shared_to_private" : "private_to_shared");
>  
> @@ -3077,9 +3098,13 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
>      }
>  
>      if (to_private) {
> -        ret = kvm_set_memory_attributes_private(start, size);
> +        ret = ram_discard_manager_replay_populated(mr->rdm, &section,
> +                                                   kvm_set_memory_attributes_cb,
> +                                                   &args);
>      } else {
> -        ret = kvm_set_memory_attributes_shared(start, size);
> +        ret = ram_discard_manager_replay_discarded(mr->rdm, &section,
> +                                                   kvm_set_memory_attributes_cb,
> +                                                   &args);
>      }
>      if (ret) {
>          goto out_unref;
> @@ -3088,6 +3113,27 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
>      addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
>      rb = qemu_ram_block_from_host(addr, false, &offset);
>  
> +    ret = memory_attribute_manager_state_change(MEMORY_ATTRIBUTE_MANAGER(mr->rdm),
> +                                                offset, size, to_private);
> +    if (ret) {
> +        warn_report("Failed to notify the listener the state change of "
> +                    "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
> +                    start, size, to_private ? "private" : "shared");
> +        args.to_private = !to_private;
> +        if (to_private) {
> +            ret = ram_discard_manager_replay_populated(mr->rdm, &section,
> +                                                       kvm_set_memory_attributes_cb,
> +                                                       &args);
> +        } else {
> +            ret = ram_discard_manager_replay_discarded(mr->rdm, &section,
> +                                                       kvm_set_memory_attributes_cb,
> +                                                       &args);
> +        }
> +        if (ret) {
> +            goto out_unref;
> +        }
> +    }
> +
>      if (to_private) {
>          if (rb->page_size != qemu_real_host_page_size()) {
>              /*
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index 0babd105c0..06fd365326 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -23,6 +23,7 @@
>  #include "cpu-common.h"
>  #include "qemu/rcu.h"
>  #include "exec/ramlist.h"
> +#include "system/memory-attribute-manager.h"
>  
>  struct RAMBlock {
>      struct rcu_head rcu;
> @@ -42,6 +43,7 @@ struct RAMBlock {
>      int fd;
>      uint64_t fd_offset;
>      int guest_memfd;
> +    MemoryAttributeManager *memory_attribute_manager;
>      size_t page_size;
>      /* dirty bitmap used during migration */
>      unsigned long *bmap;
> diff --git a/system/physmem.c b/system/physmem.c
> index c76503aea8..0ed394c5d2 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -54,6 +54,7 @@
>  #include "system/hostmem.h"
>  #include "system/hw_accel.h"
>  #include "system/xen-mapcache.h"
> +#include "system/memory-attribute-manager.h"
>  #include "trace.h"
>  
>  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> @@ -1885,6 +1886,16 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>              qemu_mutex_unlock_ramlist();
>              goto out_free;
>          }
> +
> +        new_block->memory_attribute_manager = MEMORY_ATTRIBUTE_MANAGER(object_new(TYPE_MEMORY_ATTRIBUTE_MANAGER));
> +        if (memory_attribute_manager_realize(new_block->memory_attribute_manager, new_block->mr)) {
> +            error_setg(errp, "Failed to realize memory attribute manager");
> +            object_unref(OBJECT(new_block->memory_attribute_manager));
> +            close(new_block->guest_memfd);
> +            ram_block_discard_require(false);
> +            qemu_mutex_unlock_ramlist();
> +            goto out_free;
> +        }
>      }
>  
>      ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
> @@ -2138,6 +2149,8 @@ static void reclaim_ramblock(RAMBlock *block)
>      }
>  
>      if (block->guest_memfd >= 0) {
> +        memory_attribute_manager_unrealize(block->memory_attribute_manager);
> +        object_unref(OBJECT(block->memory_attribute_manager));
>          close(block->guest_memfd);
>          ram_block_discard_require(false);
>      }



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  2025-03-14  8:21   ` Chenyi Qiang
@ 2025-03-14  9:00     ` David Hildenbrand
  2025-03-14  9:30       ` Chenyi Qiang
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-03-14  9:00 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao

On 14.03.25 09:21, Chenyi Qiang wrote:
> Hi David & Alexey,

Hi,

> 
> To keep the bitmap aligned, I add the undo operation for
> set_memory_attributes() and use the bitmap + replay callback to do
> set_memory_attributes(). Does this change make sense?

I assume you mean this hunk:

+    ret = memory_attribute_manager_state_change(MEMORY_ATTRIBUTE_MANAGER(mr->rdm),
+                                                offset, size, to_private);
+    if (ret) {
+        warn_report("Failed to notify the listener the state change of "
+                    "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
+                    start, size, to_private ? "private" : "shared");
+        args.to_private = !to_private;
+        if (to_private) {
+            ret = ram_discard_manager_replay_populated(mr->rdm, &section,
+                                                       kvm_set_memory_attributes_cb,
+                                                       &args);
+        } else {
+            ret = ram_discard_manager_replay_discarded(mr->rdm, &section,
+                                                       kvm_set_memory_attributes_cb,
+                                                       &args);
+        }
+        if (ret) {
+            goto out_unref;
+        }


Why is that undo necessary? The bitmap + listeners should be held in sync inside of
memory_attribute_manager_state_change(). Handling this in the caller looks wrong.

I thought the current implementation properly handles that internally? In which scenario is that not the case?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  2025-03-14  9:00     ` David Hildenbrand
@ 2025-03-14  9:30       ` Chenyi Qiang
  2025-03-14  9:50         ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-14  9:30 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao



On 3/14/2025 5:00 PM, David Hildenbrand wrote:
> On 14.03.25 09:21, Chenyi Qiang wrote:
>> Hi David & Alexey,
> 
> Hi,
> 
>>
>> To keep the bitmap aligned, I add the undo operation for
>> set_memory_attributes() and use the bitmap + replay callback to do
>> set_memory_attributes(). Does this change make sense?
> 
> I assume you mean this hunk:
> 
> +    ret = memory_attribute_manager_state_change(MEMORY_ATTRIBUTE_MANAGER(mr->rdm),
> +                                                offset, size, to_private);
> +    if (ret) {
> +        warn_report("Failed to notify the listener the state change of "
> +                    "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
> +                    start, size, to_private ? "private" : "shared");
> +        args.to_private = !to_private;
> +        if (to_private) {
> +            ret = ram_discard_manager_replay_populated(mr->rdm, &section,
> +                                                       kvm_set_memory_attributes_cb,
> +                                                       &args);
> +        } else {
> +            ret = ram_discard_manager_replay_discarded(mr->rdm, &section,
> +                                                       kvm_set_memory_attributes_cb,
> +                                                       &args);
> +        }
> +        if (ret) {
> +            goto out_unref;
> +        }
> 
> 
> Why is that undo necessary? The bitmap + listeners should be held in
> sync inside of
> memory_attribute_manager_state_change(). Handling this in the caller
> looks wrong.

state_change() handles the listeners, i.e. VFIO/IOMMU, while the caller
handles the undo on the core mm side (guest_memfd set_attribute()) if
state_change() failed. I just want to keep the attribute consistent with
the bitmap on both sides. Do we need this? If not, the bitmap can only
represent the status of the listeners.

> 
> I thought the current implementation properly handles that internally?
> In which scenario is that not the case?

As mentioned above, e.g. for private_to_shared:
1. set_attribute_shared() converts the range to shared, but the bitmap
doesn't change at that stage and still says private.
2. state_change() fails and undoes its own operation, so the bitmap
stays unchanged.
3. The core mm side status is now inconsistent with the bitmap.
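
To illustrate with a minimal sketch (hypothetical flow, reusing the
helpers from the hunk quoted above):

    kvm_set_memory_attributes_shared(start, size);   /* core mm: now shared */
    ret = memory_attribute_manager_state_change(mgr, offset, size, false);
    if (ret) {
        /*
         * state_change() already rolled back its own bitmap, so the bitmap
         * still says "private" while the KVM attribute says "shared". The
         * caller has to restore the attribute to resync:
         */
        kvm_set_memory_attributes_private(start, size);
    }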

> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  2025-03-14  9:30       ` Chenyi Qiang
@ 2025-03-14  9:50         ` David Hildenbrand
  2025-03-14 10:23           ` Chenyi Qiang
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-03-14  9:50 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao

On 14.03.25 10:30, Chenyi Qiang wrote:
> 
> 
> On 3/14/2025 5:00 PM, David Hildenbrand wrote:
>> On 14.03.25 09:21, Chenyi Qiang wrote:
>>> Hi David & Alexey,
>>
>> Hi,
>>
>>>
>>> To keep the bitmap aligned, I add the undo operation for
>>> set_memory_attributes() and use the bitmap + replay callback to do
>>> set_memory_attributes(). Does this change make sense?
>>
>> I assume you mean this hunk:
>>
>> +    ret = memory_attribute_manager_state_change(MEMORY_ATTRIBUTE_MANAGER(mr->rdm),
>> +                                                offset, size, to_private);
>> +    if (ret) {
>> +        warn_report("Failed to notify the listener the state change of "
>> +                    "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
>> +                    start, size, to_private ? "private" : "shared");
>> +        args.to_private = !to_private;
>> +        if (to_private) {
>> +            ret = ram_discard_manager_replay_populated(mr->rdm, &section,
>> +                                                       kvm_set_memory_attributes_cb,
>> +                                                       &args);
>> +        } else {
>> +            ret = ram_discard_manager_replay_discarded(mr->rdm, &section,
>> +                                                       kvm_set_memory_attributes_cb,
>> +                                                       &args);
>> +        }
>> +        if (ret) {
>> +            goto out_unref;
>> +        }
>>

We should probably document that memory_attribute_state_change() cannot 
fail with "to_private", so you can simplify it to only handle the "to 
shared" case.

>>
>> Why is that undo necessary? The bitmap + listeners should be held in
>> sync inside of
>> memory_attribute_manager_state_change(). Handling this in the caller
>> looks wrong.
> 
> state_change() handles the listener, i.e. VFIO/IOMMU. And the caller
> handles the core mm side (guest_memfd set_attribute()) undo if
> state_change() failed. Just want to keep the attribute consistent with
> the bitmap on both side. Do we need this? If not, the bitmap can only
> represent the status of listeners.

Ah, so you effectively want to undo the attribute change: because the
state change failed, the attribute has to be reverted to match.

That makes sense when we are converting private->shared.


BTW, I'm wondering whether the order should be the following (with
in-place conversion in mind, where we would mmap guest_memfd for the
shared memory parts).

On private -> shared conversion:

(1) change_attribute()
(2) state_change(): IOMMU pins shared memory
(3) restore_attribute() if it failed

On shared -> private conversion
(1) state_change(): IOMMU unpins shared memory
(2) change_attribute(): can convert in-place because there are no pins

I'm wondering if the whole attribute change could actually be a 
listener, invoked by the memory_attribute_manager_state_change() call 
itself in the right order.

We'd probably need listener priorities, and invoke them in reverse order 
on shared -> private conversion. Just an idea to get rid of the manual 
ram_discard_manager_replay_discarded() call in your code.
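
Roughly like this (only a sketch: the priority field and the sorted
insert are made up, and rdl_list would have to become a QTAILQ so it can
be walked backwards):

    /* Hypothetical: listeners kept sorted by ascending priority. */
    static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
                                                MemoryRegionSection *section)
    {
        RamDiscardListener *rdl;
        int ret = 0;

        /* Attribute change first, then VFIO pins the shared pages. */
        QTAILQ_FOREACH(rdl, &mgr->rdl_list, next) {
            ret = rdl->notify_populate(rdl, section);
            if (ret) {
                break;
            }
        }
        return ret;
    }

    static void memory_attribute_notify_discard(MemoryAttributeManager *mgr,
                                                MemoryRegionSection *section)
    {
        RamDiscardListener *rdl;

        /* Reverse order: VFIO unpins before the attribute flips to private. */
        QTAILQ_FOREACH_REVERSE(rdl, &mgr->rdl_list, next) {
            rdl->notify_discard(rdl, section);
        }
    }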

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  2025-03-14  9:50         ` David Hildenbrand
@ 2025-03-14 10:23           ` Chenyi Qiang
  2025-03-31  8:55             ` Chenyi Qiang
  0 siblings, 1 reply; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-14 10:23 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao



On 3/14/2025 5:50 PM, David Hildenbrand wrote:
> On 14.03.25 10:30, Chenyi Qiang wrote:
>>
>>
>> On 3/14/2025 5:00 PM, David Hildenbrand wrote:
>>> On 14.03.25 09:21, Chenyi Qiang wrote:
>>>> Hi David & Alexey,
>>>
>>> Hi,
>>>
>>>>
>>>> To keep the bitmap aligned, I add the undo operation for
>>>> set_memory_attributes() and use the bitmap + replay callback to do
>>>> set_memory_attributes(). Does this change make sense?
>>>
>>> I assume you mean this hunk:
>>>
>>> +    ret = memory_attribute_manager_state_change(MEMORY_ATTRIBUTE_MANAGER(mr->rdm),
>>> +                                                offset, size, to_private);
>>> +    if (ret) {
>>> +        warn_report("Failed to notify the listener the state change of "
>>> +                    "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
>>> +                    start, size, to_private ? "private" : "shared");
>>> +        args.to_private = !to_private;
>>> +        if (to_private) {
>>> +            ret = ram_discard_manager_replay_populated(mr->rdm, &section,
>>> +                                                       kvm_set_memory_attributes_cb,
>>> +                                                       &args);
>>> +        } else {
>>> +            ret = ram_discard_manager_replay_discarded(mr->rdm, &section,
>>> +                                                       kvm_set_memory_attributes_cb,
>>> +                                                       &args);
>>> +        }
>>> +        if (ret) {
>>> +            goto out_unref;
>>> +        }
>>>
> 
> We should probably document that memory_attribute_state_change() cannot
> fail with "to_private", so you can simplify it to only handle the "to
> shared" case.

Yes, "to_private" branch is unnecessary.

> 
>>>
>>> Why is that undo necessary? The bitmap + listeners should be held in
>>> sync inside of
>>> memory_attribute_manager_state_change(). Handling this in the caller
>>> looks wrong.
>>
>> state_change() handles the listener, i.e. VFIO/IOMMU. And the caller
>> handles the core mm side (guest_memfd set_attribute()) undo if
>> state_change() failed. Just want to keep the attribute consistent with
>> the bitmap on both side. Do we need this? If not, the bitmap can only
>> represent the status of listeners.
> 
> Ah, so you meant that you effectively want to undo the attribute change,
> because the state effectively cannot change, and we want to revert the
> attribute change.
> 
> That makes sense when we are converting private->shared.
> 
> 
> BTW, I'm thinking if the orders should be the following (with in-place
> conversion in mind where we would mmap guest_memfd for the shared memory
> parts).
> 
> On private -> shared conversion:
> 
> (1) change_attribute()
> (2) state_change(): IOMMU pins shared memory
> (3) restore_attribute() if it failed
> 
> On shared -> private conversion
> (1) state_change(): IOMMU unpins shared memory
> (2) change_attribute(): can convert in-place because there are not pins
> 
> I'm wondering if the whole attribute change could actually be a
> listener, invoked by the memory_attribute_manager_state_change() call
> itself in the right order.
> 
> We'd probably need listener priorities, and invoke them in reverse order
> on shared -> private conversion. Just an idea to get rid of the manual
> ram_discard_manager_replay_discarded() call in your code.

Good idea. I think listener priorities can make it more elegant with
in-place conversion. And the change_attribute() listener can be given
the highest or lowest priority. Maybe we can add this change in advance,
before in-place conversion becomes available.

> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd
  2025-03-10  8:18 ` [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd Chenyi Qiang
@ 2025-03-14 12:11   ` Gupta, Pankaj
  2025-03-17  2:54     ` Chenyi Qiang
  0 siblings, 1 reply; 31+ messages in thread
From: Gupta, Pankaj @ 2025-03-14 12:11 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao

On 3/10/2025 9:18 AM, Chenyi Qiang wrote:
> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
> uncoordinated discard") highlighted, some subsystems like VFIO may
> disable ram block discard. However, guest_memfd relies on the discard
> operation to perform page conversion between private and shared memory.
> This can lead to a stale IOMMU mapping issue when assigning a hardware
> device to a confidential VM via shared memory. To address this, it is
> crucial to ensure that systems like VFIO refresh their IOMMU mappings.
> 
> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
> VFIO mappings in relation to VM page assignment. Effectively page
> conversion is similar to hot-removing a page in one mode and adding it
> back in the other. Therefore, similar actions are required for page
> conversion events. Introduce the RamDiscardManager to guest_memfd to
> facilitate this process.
> 
> Since guest_memfd is not an object, it cannot directly implement the
> RamDiscardManager interface. One potential attempt is to implement it in
> HostMemoryBackend. This is not appropriate because guest_memfd is per
> RAMBlock. Some RAMBlocks have a memory backend but others do not. In
> particular, the ones like virtual BIOS calling
> memory_region_init_ram_guest_memfd() do not.
> 
> To manage the RAMBlocks with guest_memfd, define a new object named
> MemoryAttributeManager to implement the RamDiscardManager interface. The

Shouldn't this be the other way around: 'MemoryAttributeManager' as an
interface, and RamDiscardManager a type of it, i.e. an implementation?

MemoryAttributeManager has data like 'shared_bitmap' etc.; couldn't that
information also be used by other implementation types?

Or maybe I am getting it entirely wrong.

Thanks,
Pankaj

> object stores guest_memfd information such as shared_bitmap, and handles
> page conversion notification. Using the name of MemoryAttributeManager is
> aimed to make it more generic. The term "Memory" encompasses not only RAM
> but also private MMIO in TEE I/O, which might rely on this
> object/interface to handle page conversion events in the future. The
> term "Attribute" allows for the management of various attributes beyond
> shared and private. For instance, it could support scenarios where
> discard vs. populated and shared vs. private states co-exists, such as
> supporting virtio-mem or something similar in the future.
> 
> In the current context, MemoryAttributeManager signifies discarded state
> as private and populated state as shared. Memory state is tracked at the
> host page size granularity, as the minimum memory conversion size can be one
> page per request. Additionally, VFIO expects the DMA mapping for a
> specific iova to be mapped and unmapped with the same granularity.
> Confidential VMs may perform partial conversions, e.g. conversion
> happens on a small region within a large region. To prevent such invalid
> cases and until cut_mapping operation support is introduced, all
> operations are performed with 4K granularity.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v3:
>      - Some rename (bitmap_size->shared_bitmap_size,
>        first_one/zero_bit->first_bit, etc.)
>      - Change shared_bitmap_size from uint32_t to unsigned
>      - Return mgr->mr->ram_block->page_size in get_block_size()
>      - Move set_ram_discard_manager() up to avoid a g_free() in failure
>        case.
>      - Add const for the memory_attribute_manager_get_block_size()
>      - Unify the ReplayRamPopulate and ReplayRamDiscard and related
>        callback.
> 
> Changes in v2:
>      - Rename the object name to MemoryAttributeManager
>      - Rename the bitmap to shared_bitmap to make it more clear.
>      - Remove block_size field and get it from a helper. In future, we
>        can get the page_size from RAMBlock if necessary.
>      - Remove the unncessary "struct" before GuestMemfdReplayData
>      - Remove the unncessary g_free() for the bitmap
>      - Add some error report when the callback failure for
>        populated/discarded section.
>      - Move the realize()/unrealize() definition to this patch.
> ---
>   include/system/memory-attribute-manager.h |  42 ++++
>   system/memory-attribute-manager.c         | 283 ++++++++++++++++++++++
>   system/meson.build                        |   1 +
>   3 files changed, 326 insertions(+)
>   create mode 100644 include/system/memory-attribute-manager.h
>   create mode 100644 system/memory-attribute-manager.c
> 
> diff --git a/include/system/memory-attribute-manager.h b/include/system/memory-attribute-manager.h
> new file mode 100644
> index 0000000000..23375a14b8
> --- /dev/null
> +++ b/include/system/memory-attribute-manager.h
> @@ -0,0 +1,42 @@
> +/*
> + * QEMU memory attribute manager
> + *
> + * Copyright Intel
> + *
> + * Author:
> + *      Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#ifndef SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
> +#define SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
> +
> +#include "system/hostmem.h"
> +
> +#define TYPE_MEMORY_ATTRIBUTE_MANAGER "memory-attribute-manager"
> +
> +OBJECT_DECLARE_TYPE(MemoryAttributeManager, MemoryAttributeManagerClass, MEMORY_ATTRIBUTE_MANAGER)
> +
> +struct MemoryAttributeManager {
> +    Object parent;
> +
> +    MemoryRegion *mr;
> +
> +    /* 1-setting of the bit represents the memory is populated (shared) */
> +    unsigned shared_bitmap_size;
> +    unsigned long *shared_bitmap;
> +
> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
> +};
> +
> +struct MemoryAttributeManagerClass {
> +    ObjectClass parent_class;
> +};
> +
> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr);
> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
> +
> +#endif
> diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
> new file mode 100644
> index 0000000000..7c3789cf49
> --- /dev/null
> +++ b/system/memory-attribute-manager.c
> @@ -0,0 +1,283 @@
> +/*
> + * QEMU memory attribute manager
> + *
> + * Copyright Intel
> + *
> + * Author:
> + *      Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "exec/ramblock.h"
> +#include "system/memory-attribute-manager.h"
> +
> +OBJECT_DEFINE_TYPE_WITH_INTERFACES(MemoryAttributeManager,
> +                                   memory_attribute_manager,
> +                                   MEMORY_ATTRIBUTE_MANAGER,
> +                                   OBJECT,
> +                                   { TYPE_RAM_DISCARD_MANAGER },
> +                                   { })
> +
> +static size_t memory_attribute_manager_get_block_size(const MemoryAttributeManager *mgr)
> +{
> +    /*
> +     * Because page conversion requests are at least 4K in size and 4K-aligned,
> +     * use the host page size as the granularity to track the memory attribute.
> +     */
> +    g_assert(mgr && mgr->mr && mgr->mr->ram_block);
> +    g_assert(mgr->mr->ram_block->page_size == qemu_real_host_page_size());
> +    return mgr->mr->ram_block->page_size;
> +}
> +
> +
> +static bool memory_attribute_rdm_is_populated(const RamDiscardManager *rdm,
> +                                              const MemoryRegionSection *section)
> +{
> +    const MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
> +    const int block_size = memory_attribute_manager_get_block_size(mgr);
> +    uint64_t first_bit = section->offset_within_region / block_size;
> +    uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
> +    unsigned long first_discard_bit;
> +
> +    first_discard_bit = find_next_zero_bit(mgr->shared_bitmap, last_bit + 1, first_bit);
> +    return first_discard_bit > last_bit;
> +}
> +
> +typedef int (*memory_attribute_section_cb)(MemoryRegionSection *s, void *arg);
> +
> +static int memory_attribute_notify_populate_cb(MemoryRegionSection *section, void *arg)
> +{
> +    RamDiscardListener *rdl = arg;
> +
> +    return rdl->notify_populate(rdl, section);
> +}
> +
> +static int memory_attribute_notify_discard_cb(MemoryRegionSection *section, void *arg)
> +{
> +    RamDiscardListener *rdl = arg;
> +
> +    rdl->notify_discard(rdl, section);
> +
> +    return 0;
> +}
> +
> +static int memory_attribute_for_each_populated_section(const MemoryAttributeManager *mgr,
> +                                                       MemoryRegionSection *section,
> +                                                       void *arg,
> +                                                       memory_attribute_section_cb cb)
> +{
> +    unsigned long first_bit, last_bit;
> +    uint64_t offset, size;
> +    const int block_size = memory_attribute_manager_get_block_size(mgr);
> +    int ret = 0;
> +
> +    first_bit = section->offset_within_region / block_size;
> +    first_bit = find_next_bit(mgr->shared_bitmap, mgr->shared_bitmap_size, first_bit);
> +
> +    while (first_bit < mgr->shared_bitmap_size) {
> +        MemoryRegionSection tmp = *section;
> +
> +        offset = first_bit * block_size;
> +        last_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
> +                                      first_bit + 1) - 1;
> +        size = (last_bit - first_bit + 1) * block_size;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            break;
> +        }
> +
> +        ret = cb(&tmp, arg);
> +        if (ret) {
> +            error_report("%s: Failed to notify RAM discard listener: %s", __func__,
> +                         strerror(-ret));
> +            break;
> +        }
> +
> +        first_bit = find_next_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
> +                                  last_bit + 2);
> +    }
> +
> +    return ret;
> +}
> +
> +static int memory_attribute_for_each_discarded_section(const MemoryAttributeManager *mgr,
> +                                                       MemoryRegionSection *section,
> +                                                       void *arg,
> +                                                       memory_attribute_section_cb cb)
> +{
> +    unsigned long first_bit, last_bit;
> +    uint64_t offset, size;
> +    const int block_size = memory_attribute_manager_get_block_size(mgr);
> +    int ret = 0;
> +
> +    first_bit = section->offset_within_region / block_size;
> +    first_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
> +                                   first_bit);
> +
> +    while (first_bit < mgr->shared_bitmap_size) {
> +        MemoryRegionSection tmp = *section;
> +
> +        offset = first_bit * block_size;
> +        last_bit = find_next_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
> +                                      first_bit + 1) - 1;
> +        size = (last_bit - first_bit + 1) * block_size;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            break;
> +        }
> +
> +        ret = cb(&tmp, arg);
> +        if (ret) {
> +            error_report("%s: Failed to notify RAM discard listener: %s", __func__,
> +                         strerror(-ret));
> +            break;
> +        }
> +
> +        first_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
> +                                       last_bit + 2);
> +    }
> +
> +    return ret;
> +}
> +
> +static uint64_t memory_attribute_rdm_get_min_granularity(const RamDiscardManager *rdm,
> +                                                         const MemoryRegion *mr)
> +{
> +    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
> +
> +    g_assert(mr == mgr->mr);
> +    return memory_attribute_manager_get_block_size(mgr);
> +}
> +
> +static void memory_attribute_rdm_register_listener(RamDiscardManager *rdm,
> +                                                   RamDiscardListener *rdl,
> +                                                   MemoryRegionSection *section)
> +{
> +    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
> +    int ret;
> +
> +    g_assert(section->mr == mgr->mr);
> +    rdl->section = memory_region_section_new_copy(section);
> +
> +    QLIST_INSERT_HEAD(&mgr->rdl_list, rdl, next);
> +
> +    ret = memory_attribute_for_each_populated_section(mgr, section, rdl,
> +                                                      memory_attribute_notify_populate_cb);
> +    if (ret) {
> +        error_report("%s: Failed to register RAM discard listener: %s", __func__,
> +                     strerror(-ret));
> +    }
> +}
> +
> +static void memory_attribute_rdm_unregister_listener(RamDiscardManager *rdm,
> +                                                     RamDiscardListener *rdl)
> +{
> +    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
> +    int ret;
> +
> +    g_assert(rdl->section);
> +    g_assert(rdl->section->mr == mgr->mr);
> +
> +    ret = memory_attribute_for_each_populated_section(mgr, rdl->section, rdl,
> +                                                      memory_attribute_notify_discard_cb);
> +    if (ret) {
> +        error_report("%s: Failed to unregister RAM discard listener: %s", __func__,
> +                     strerror(-ret));
> +    }
> +
> +    memory_region_section_free_copy(rdl->section);
> +    rdl->section = NULL;
> +    QLIST_REMOVE(rdl, next);
> +
> +}
> +
> +typedef struct MemoryAttributeReplayData {
> +    ReplayRamStateChange fn;
> +    void *opaque;
> +} MemoryAttributeReplayData;
> +
> +static int memory_attribute_rdm_replay_cb(MemoryRegionSection *section, void *arg)
> +{
> +    MemoryAttributeReplayData *data = arg;
> +
> +    return data->fn(section, data->opaque);
> +}
> +
> +static int memory_attribute_rdm_replay_populated(const RamDiscardManager *rdm,
> +                                                 MemoryRegionSection *section,
> +                                                 ReplayRamStateChange replay_fn,
> +                                                 void *opaque)
> +{
> +    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
> +    MemoryAttributeReplayData data = { .fn = replay_fn, .opaque = opaque };
> +
> +    g_assert(section->mr == mgr->mr);
> +    return memory_attribute_for_each_populated_section(mgr, section, &data,
> +                                                       memory_attribute_rdm_replay_cb);
> +}
> +
> +static int memory_attribute_rdm_replay_discarded(const RamDiscardManager *rdm,
> +                                                 MemoryRegionSection *section,
> +                                                 ReplayRamStateChange replay_fn,
> +                                                 void *opaque)
> +{
> +    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
> +    MemoryAttributeReplayData data = { .fn = replay_fn, .opaque = opaque };
> +
> +    g_assert(section->mr == mgr->mr);
> +    return memory_attribute_for_each_discarded_section(mgr, section, &data,
> +                                                       memory_attribute_rdm_replay_cb);
> +}
> +
> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr)
> +{
> +    uint64_t shared_bitmap_size;
> +    const int block_size  = qemu_real_host_page_size();
> +    int ret;
> +
> +    shared_bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
> +
> +    mgr->mr = mr;
> +    ret = memory_region_set_ram_discard_manager(mgr->mr, RAM_DISCARD_MANAGER(mgr));
> +    if (ret) {
> +        return ret;
> +    }
> +    mgr->shared_bitmap_size = shared_bitmap_size;
> +    mgr->shared_bitmap = bitmap_new(shared_bitmap_size);
> +
> +    return ret;
> +}
> +
> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr)
> +{
> +    g_free(mgr->shared_bitmap);
> +    memory_region_set_ram_discard_manager(mgr->mr, NULL);
> +}
> +
> +static void memory_attribute_manager_init(Object *obj)
> +{
> +    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(obj);
> +
> +    QLIST_INIT(&mgr->rdl_list);
> +}
> +
> +static void memory_attribute_manager_finalize(Object *obj)
> +{
> +}
> +
> +static void memory_attribute_manager_class_init(ObjectClass *oc, void *data)
> +{
> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
> +
> +    rdmc->get_min_granularity = memory_attribute_rdm_get_min_granularity;
> +    rdmc->register_listener = memory_attribute_rdm_register_listener;
> +    rdmc->unregister_listener = memory_attribute_rdm_unregister_listener;
> +    rdmc->is_populated = memory_attribute_rdm_is_populated;
> +    rdmc->replay_populated = memory_attribute_rdm_replay_populated;
> +    rdmc->replay_discarded = memory_attribute_rdm_replay_discarded;
> +}

Would this initialization be for
> diff --git a/system/meson.build b/system/meson.build
> index 4952f4b2c7..ab07ff1442 100644
> --- a/system/meson.build
> +++ b/system/meson.build
> @@ -15,6 +15,7 @@ system_ss.add(files(
>     'dirtylimit.c',
>     'dma-helpers.c',
>     'globals.c',
> +  'memory-attribute-manager.c',
>     'memory_mapping.c',
>     'qdev-monitor.c',
>     'qtest.c',



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd
  2025-03-14 12:11   ` Gupta, Pankaj
@ 2025-03-17  2:54     ` Chenyi Qiang
  2025-03-17 10:36       ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-17  2:54 UTC (permalink / raw)
  To: Gupta, Pankaj, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao



On 3/14/2025 8:11 PM, Gupta, Pankaj wrote:
> On 3/10/2025 9:18 AM, Chenyi Qiang wrote:
>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>> uncoordinated discard") highlighted, some subsystems like VFIO may
>> disable ram block discard. However, guest_memfd relies on the discard
>> operation to perform page conversion between private and shared memory.
>> This can lead to a stale IOMMU mapping issue when assigning a hardware
>> device to a confidential VM via shared memory. To address this, it is
>> crucial to ensure that systems like VFIO refresh their IOMMU mappings.
>>
>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>> VFIO mappings in relation to VM page assignment. Effectively page
>> conversion is similar to hot-removing a page in one mode and adding it
>> back in the other. Therefore, similar actions are required for page
>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>> facilitate this process.
>>
>> Since guest_memfd is not an object, it cannot directly implement the
>> RamDiscardManager interface. One potential attempt is to implement it in
>> HostMemoryBackend. This is not appropriate because guest_memfd is per
>> RAMBlock. Some RAMBlocks have a memory backend but others do not. In
>> particular, the ones like virtual BIOS calling
>> memory_region_init_ram_guest_memfd() do not.
>>
>> To manage the RAMBlocks with guest_memfd, define a new object named
>> MemoryAttributeManager to implement the RamDiscardManager interface. The
> 
> Isn't this should be the other way around. 'MemoryAttributeManager'
> should be an interface and RamDiscardManager a type of it, an
> implementation?

We use 'MemoryAttributeManager' to stand in for the RAMBlock when
implementing the RamDiscardManager interface callbacks, because RAMBlock
itself is not an object. It also holds some guest_memfd metadata like
the shared_bitmap.

I don't see how making 'MemoryAttributeManager' an interface with
RamDiscardManager as a type of it would work. Can you elaborate a little
bit? At the very least, something still needs to implement the
RamDiscardManager interface.

> 
> MemoryAttributeManager have the data like 'shared_bitmap' etc that
> information can also be used by the other implementation types?

The shared_bitmap is just guest_memfd metadata. Other subsystems can
consult its status through the RamDiscardManager interface, such as
ram_discard_manager_is_populated().
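
A consumer-side sketch (assuming the caller already has a
MemoryRegionSection covering the range of interest):

    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);

    if (rdm && ram_discard_manager_is_populated(rdm, section)) {
        /* The whole range is currently shared (populated). */
    }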

> 
> Or maybe I am getting it entirely wrong.
> 
> Thanks,
> Pankaj
> 
>> object stores guest_memfd information such as shared_bitmap, and handles
>> page conversion notification. Using the name of MemoryAttributeManager is
>> aimed to make it more generic. The term "Memory" encompasses not only RAM
>> but also private MMIO in TEE I/O, which might rely on this
>> object/interface to handle page conversion events in the future. The
>> term "Attribute" allows for the management of various attributes beyond
>> shared and private. For instance, it could support scenarios where
>> discard vs. populated and shared vs. private states co-exists, such as
>> supporting virtio-mem or something similar in the future.
>>
>> In the current context, MemoryAttributeManager signifies discarded state
>> as private and populated state as shared. Memory state is tracked at the
>> host page size granularity, as the minimum memory conversion size can
>> be one
>> page per request. Additionally, VFIO expects the DMA mapping for a
>> specific iova to be mapped and unmapped with the same granularity.
>> Confidential VMs may perform partial conversions, e.g. conversion
>> happens on a small region within a large region. To prevent such invalid
>> cases and until cut_mapping operation support is introduced, all
>> operations are performed with 4K granularity.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Changes in v3:
>>      - Some rename (bitmap_size->shared_bitmap_size,
>>        first_one/zero_bit->first_bit, etc.)
>>      - Change shared_bitmap_size from uint32_t to unsigned
>>      - Return mgr->mr->ram_block->page_size in get_block_size()
>>      - Move set_ram_discard_manager() up to avoid a g_free() in failure
>>        case.
>>      - Add const for the memory_attribute_manager_get_block_size()
>>      - Unify the ReplayRamPopulate and ReplayRamDiscard and related
>>        callback.
>>
>> Changes in v2:
>>      - Rename the object name to MemoryAttributeManager
>>      - Rename the bitmap to shared_bitmap to make it more clear.
>>      - Remove block_size field and get it from a helper. In future, we
>>        can get the page_size from RAMBlock if necessary.
>>      - Remove the unnecessary "struct" before GuestMemfdReplayData
>>      - Remove the unnecessary g_free() for the bitmap
>>      - Add some error reporting when the callback fails for a
>>        populated/discarded section.
>>      - Move the realize()/unrealize() definition to this patch.
>> ---
>>   include/system/memory-attribute-manager.h |  42 ++++
>>   system/memory-attribute-manager.c         | 283 ++++++++++++++++++++++
>>   system/meson.build                        |   1 +
>>   3 files changed, 326 insertions(+)
>>   create mode 100644 include/system/memory-attribute-manager.h
>>   create mode 100644 system/memory-attribute-manager.c
>>
>> diff --git a/include/system/memory-attribute-manager.h b/include/system/memory-attribute-manager.h
>> new file mode 100644
>> index 0000000000..23375a14b8
>> --- /dev/null
>> +++ b/include/system/memory-attribute-manager.h
>> @@ -0,0 +1,42 @@
>> +/*
>> + * QEMU memory attribute manager
>> + *
>> + * Copyright Intel
>> + *
>> + * Author:
>> + *      Chenyi Qiang <chenyi.qiang@intel.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory
>> + *
>> + */
>> +
>> +#ifndef SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
>> +#define SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
>> +
>> +#include "system/hostmem.h"
>> +
>> +#define TYPE_MEMORY_ATTRIBUTE_MANAGER "memory-attribute-manager"
>> +
>> +OBJECT_DECLARE_TYPE(MemoryAttributeManager, MemoryAttributeManagerClass, MEMORY_ATTRIBUTE_MANAGER)
>> +
>> +struct MemoryAttributeManager {
>> +    Object parent;
>> +
>> +    MemoryRegion *mr;
>> +
>> +    /* 1-setting of the bit represents the memory is populated (shared) */
>> +    unsigned shared_bitmap_size;
>> +    unsigned long *shared_bitmap;
>> +
>> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
>> +};
>> +
>> +struct MemoryAttributeManagerClass {
>> +    ObjectClass parent_class;
>> +};
>> +
>> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr);
>> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
>> +
>> +#endif
>> diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
>> new file mode 100644
>> index 0000000000..7c3789cf49
>> --- /dev/null
>> +++ b/system/memory-attribute-manager.c
>> @@ -0,0 +1,283 @@
>> +/*
>> + * QEMU memory attribute manager
>> + *
>> + * Copyright Intel
>> + *
>> + * Author:
>> + *      Chenyi Qiang <chenyi.qiang@intel.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory
>> + *
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>> +#include "exec/ramblock.h"
>> +#include "system/memory-attribute-manager.h"
>> +
>> +OBJECT_DEFINE_TYPE_WITH_INTERFACES(MemoryAttributeManager,
>> +                                   memory_attribute_manager,
>> +                                   MEMORY_ATTRIBUTE_MANAGER,
>> +                                   OBJECT,
>> +                                   { TYPE_RAM_DISCARD_MANAGER },
>> +                                   { })
>> +
>> +static size_t memory_attribute_manager_get_block_size(const MemoryAttributeManager *mgr)
>> +{
>> +    /*
>> +     * Because page conversion is performed at a granularity of at least
>> +     * 4K (and 4K-aligned), use the host page size as the granularity to
>> +     * track the memory attribute.
>> +     */
>> +    g_assert(mgr && mgr->mr && mgr->mr->ram_block);
>> +    g_assert(mgr->mr->ram_block->page_size == qemu_real_host_page_size());
>> +    return mgr->mr->ram_block->page_size;
>> +}
>> +
>> +static bool memory_attribute_rdm_is_populated(const RamDiscardManager *rdm,
>> +                                              const MemoryRegionSection *section)
>> +{
>> +    const MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>> +    const int block_size = memory_attribute_manager_get_block_size(mgr);
>> +    uint64_t first_bit = section->offset_within_region / block_size;
>> +    uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
>> +    unsigned long first_discard_bit;
>> +
>> +    first_discard_bit = find_next_zero_bit(mgr->shared_bitmap, last_bit + 1, first_bit);
>> +    return first_discard_bit > last_bit;
>> +}
>> +
>> +typedef int (*memory_attribute_section_cb)(MemoryRegionSection *s, void *arg);
>> +
>> +static int memory_attribute_notify_populate_cb(MemoryRegionSection *section, void *arg)
>> +{
>> +    RamDiscardListener *rdl = arg;
>> +
>> +    return rdl->notify_populate(rdl, section);
>> +}
>> +
>> +static int memory_attribute_notify_discard_cb(MemoryRegionSection *section, void *arg)
>> +{
>> +    RamDiscardListener *rdl = arg;
>> +
>> +    rdl->notify_discard(rdl, section);
>> +
>> +    return 0;
>> +}
>> +
>> +static int memory_attribute_for_each_populated_section(const MemoryAttributeManager *mgr,
>> +                                                       MemoryRegionSection *section,
>> +                                                       void *arg,
>> +                                                       memory_attribute_section_cb cb)
>> +{
>> +    unsigned long first_bit, last_bit;
>> +    uint64_t offset, size;
>> +    const int block_size = memory_attribute_manager_get_block_size(mgr);
>> +    int ret = 0;
>> +
>> +    first_bit = section->offset_within_region / block_size;
>> +    first_bit = find_next_bit(mgr->shared_bitmap, mgr->shared_bitmap_size, first_bit);
>> +
>> +    while (first_bit < mgr->shared_bitmap_size) {
>> +        MemoryRegionSection tmp = *section;
>> +
>> +        offset = first_bit * block_size;
>> +        last_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
>> +                                      first_bit + 1) - 1;
>> +        size = (last_bit - first_bit + 1) * block_size;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>> +            break;
>> +        }
>> +
>> +        ret = cb(&tmp, arg);
>> +        if (ret) {
>> +            error_report("%s: Failed to notify RAM discard listener: %s", __func__,
>> +                         strerror(-ret));
>> +            break;
>> +        }
>> +
>> +        first_bit = find_next_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
>> +                                  last_bit + 2);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static int memory_attribute_for_each_discarded_section(const MemoryAttributeManager *mgr,
>> +                                                       MemoryRegionSection *section,
>> +                                                       void *arg,
>> +                                                       memory_attribute_section_cb cb)
>> +{
>> +    unsigned long first_bit, last_bit;
>> +    uint64_t offset, size;
>> +    const int block_size = memory_attribute_manager_get_block_size(mgr);
>> +    int ret = 0;
>> +
>> +    first_bit = section->offset_within_region / block_size;
>> +    first_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
>> +                                   first_bit);
>> +
>> +    while (first_bit < mgr->shared_bitmap_size) {
>> +        MemoryRegionSection tmp = *section;
>> +
>> +        offset = first_bit * block_size;
>> +        last_bit = find_next_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
>> +                                 first_bit + 1) - 1;
>> +        size = (last_bit - first_bit + 1) * block_size;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>> +            break;
>> +        }
>> +
>> +        ret = cb(&tmp, arg);
>> +        if (ret) {
>> +            error_report("%s: Failed to notify RAM discard listener: %s", __func__,
>> +                         strerror(-ret));
>> +            break;
>> +        }
>> +
>> +        first_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->shared_bitmap_size,
>> +                                       last_bit + 2);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static uint64_t memory_attribute_rdm_get_min_granularity(const RamDiscardManager *rdm,
>> +                                                         const MemoryRegion *mr)
>> +{
>> +    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>> +
>> +    g_assert(mr == mgr->mr);
>> +    return memory_attribute_manager_get_block_size(mgr);
>> +}
>> +
>> +static void memory_attribute_rdm_register_listener(RamDiscardManager *rdm,
>> +                                                   RamDiscardListener *rdl,
>> +                                                   MemoryRegionSection *section)
>> +{
>> +    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>> +    int ret;
>> +
>> +    g_assert(section->mr == mgr->mr);
>> +    rdl->section = memory_region_section_new_copy(section);
>> +
>> +    QLIST_INSERT_HEAD(&mgr->rdl_list, rdl, next);
>> +
>> +    ret = memory_attribute_for_each_populated_section(mgr, section, rdl,
>> +                                                      memory_attribute_notify_populate_cb);
>> +    if (ret) {
>> +        error_report("%s: Failed to register RAM discard listener: %s", __func__,
>> +                     strerror(-ret));
>> +    }
>> +}
>> +
>> +static void memory_attribute_rdm_unregister_listener(RamDiscardManager *rdm,
>> +                                                     RamDiscardListener *rdl)
>> +{
>> +    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>> +    int ret;
>> +
>> +    g_assert(rdl->section);
>> +    g_assert(rdl->section->mr == mgr->mr);
>> +
>> +    ret = memory_attribute_for_each_populated_section(mgr, rdl->section, rdl,
>> +                                                      memory_attribute_notify_discard_cb);
>> +    if (ret) {
>> +        error_report("%s: Failed to unregister RAM discard listener: %s", __func__,
>> +                     strerror(-ret));
>> +    }
>> +
>> +    memory_region_section_free_copy(rdl->section);
>> +    rdl->section = NULL;
>> +    QLIST_REMOVE(rdl, next);
>> +}
>> +
>> +typedef struct MemoryAttributeReplayData {
>> +    ReplayRamStateChange fn;
>> +    void *opaque;
>> +} MemoryAttributeReplayData;
>> +
>> +static int memory_attribute_rdm_replay_cb(MemoryRegionSection *section, void *arg)
>> +{
>> +    MemoryAttributeReplayData *data = arg;
>> +
>> +    return data->fn(section, data->opaque);
>> +}
>> +
>> +static int memory_attribute_rdm_replay_populated(const RamDiscardManager *rdm,
>> +                                                 MemoryRegionSection *section,
>> +                                                 ReplayRamStateChange replay_fn,
>> +                                                 void *opaque)
>> +{
>> +    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>> +    MemoryAttributeReplayData data = { .fn = replay_fn, .opaque = opaque };
>> +
>> +    g_assert(section->mr == mgr->mr);
>> +    return memory_attribute_for_each_populated_section(mgr, section, &data,
>> +                                                       memory_attribute_rdm_replay_cb);
>> +}
>> +
>> +static int memory_attribute_rdm_replay_discarded(const RamDiscardManager *rdm,
>> +                                                 MemoryRegionSection *section,
>> +                                                 ReplayRamStateChange replay_fn,
>> +                                                 void *opaque)
>> +{
>> +    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>> +    MemoryAttributeReplayData data = { .fn = replay_fn, .opaque = opaque };
>> +
>> +    g_assert(section->mr == mgr->mr);
>> +    return memory_attribute_for_each_discarded_section(mgr, section, &data,
>> +                                                       memory_attribute_rdm_replay_cb);
>> +}
>> +
>> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr)
>> +{
>> +    uint64_t shared_bitmap_size;
>> +    const int block_size = qemu_real_host_page_size();
>> +    int ret;
>> +
>> +    shared_bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
>> +
>> +    mgr->mr = mr;
>> +    ret = memory_region_set_ram_discard_manager(mgr->mr, RAM_DISCARD_MANAGER(mgr));
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +    mgr->shared_bitmap_size = shared_bitmap_size;
>> +    mgr->shared_bitmap = bitmap_new(shared_bitmap_size);
>> +
>> +    return ret;
>> +}
>> +
>> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr)
>> +{
>> +    g_free(mgr->shared_bitmap);
>> +    memory_region_set_ram_discard_manager(mgr->mr, NULL);
>> +}
>> +
>> +static void memory_attribute_manager_init(Object *obj)
>> +{
>> +    MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(obj);
>> +
>> +    QLIST_INIT(&mgr->rdl_list);
>> +}
>> +
>> +static void memory_attribute_manager_finalize(Object *obj)
>> +{
>> +}
>> +
>> +static void memory_attribute_manager_class_init(ObjectClass *oc, void *data)
>> +{
>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>> +
>> +    rdmc->get_min_granularity = memory_attribute_rdm_get_min_granularity;
>> +    rdmc->register_listener = memory_attribute_rdm_register_listener;
>> +    rdmc->unregister_listener = memory_attribute_rdm_unregister_listener;
>> +    rdmc->is_populated = memory_attribute_rdm_is_populated;
>> +    rdmc->replay_populated = memory_attribute_rdm_replay_populated;
>> +    rdmc->replay_discarded = memory_attribute_rdm_replay_discarded;
>> +}
> 
> Would this initialization be for
>> diff --git a/system/meson.build b/system/meson.build
>> index 4952f4b2c7..ab07ff1442 100644
>> --- a/system/meson.build
>> +++ b/system/meson.build
>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>     'dirtylimit.c',
>>     'dma-helpers.c',
>>     'globals.c',
>> +  'memory-attribute-manager.c',
>>     'memory_mapping.c',
>>     'qdev-monitor.c',
>>     'qtest.c',
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/7] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-03-10  8:18 ` [PATCH v3 3/7] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
@ 2025-03-17  3:00   ` Kishen Maloor
  2025-03-17  4:11     ` Chenyi Qiang
  0 siblings, 1 reply; 31+ messages in thread
From: Kishen Maloor @ 2025-03-17  3:00 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao

On 3/10/25 1:18 AM, Chenyi Qiang wrote:
> ...
> diff --git a/migration/ram.c b/migration/ram.c
> index ce28328141..053730367b 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -816,8 +816,8 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
>       return ret;
>   }
>   
> -static void dirty_bitmap_clear_section(MemoryRegionSection *section,
> -                                       void *opaque)
> +static int dirty_bitmap_clear_section(MemoryRegionSection *section,
> +                                      void *opaque)
>   {
>       const hwaddr offset = section->offset_within_region;
>       const hwaddr size = int128_get64(section->size);
> @@ -836,6 +836,7 @@ static void dirty_bitmap_clear_section(MemoryRegionSection *section,
>       }
>       *cleared_bits += bitmap_count_one_with_offset(rb->bmap, start, npages);
>       bitmap_clear(rb->bmap, start, npages);

It appears that the ram_discard_manager_replay_discarded() path would clear all private pages
from the dirty bitmap over here, and thus break migration.

Perhaps the MemoryAttributeManager should be excluded from this flow, something like below?

@@ -910,6 +909,9 @@ static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb)
              .size = int128_make64(qemu_ram_get_used_length(rb)),
          };
  
+        if (object_dynamic_cast(OBJECT(rdm), TYPE_MEMORY_ATTRIBUTE_MANAGER))
+            return 0;
+
          ram_discard_manager_replay_discarded(rdm, &section,
                                               dirty_bitmap_clear_section,
                                               &cleared_bits);



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/7] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-03-17  3:00   ` Kishen Maloor
@ 2025-03-17  4:11     ` Chenyi Qiang
  0 siblings, 0 replies; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-17  4:11 UTC (permalink / raw)
  To: Kishen Maloor, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao



On 3/17/2025 11:00 AM, Kishen Maloor wrote:
> On 3/10/25 1:18 AM, Chenyi Qiang wrote:
>> ...
>> diff --git a/migration/ram.c b/migration/ram.c
>> index ce28328141..053730367b 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -816,8 +816,8 @@ static inline bool
>> migration_bitmap_clear_dirty(RAMState *rs,
>>       return ret;
>>   }
>>   -static void dirty_bitmap_clear_section(MemoryRegionSection *section,
>> -                                       void *opaque)
>> +static int dirty_bitmap_clear_section(MemoryRegionSection *section,
>> +                                      void *opaque)
>>   {
>>       const hwaddr offset = section->offset_within_region;
>>       const hwaddr size = int128_get64(section->size);
>> @@ -836,6 +836,7 @@ static void
>> dirty_bitmap_clear_section(MemoryRegionSection *section,
>>       }
>>       *cleared_bits += bitmap_count_one_with_offset(rb->bmap, start,
>> npages);
>>       bitmap_clear(rb->bmap, start, npages);
> 
> It appears that the ram_discard_manager_replay_discarded() path would
> clear all private pages
> from the dirty bitmap over here, and thus break migration.

Exactly, but migration of confidential VMs with guest_memfd is not
supported yet.

> 
> Perhaps the MemoryAttributeManager should be excluded from this flow,
> something like below?

I think it would already fail before reaching this path when initiating
confidential VM migration. But adding such a check or an assert as a
hint is also OK with me. No strong preference.
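
For example, something like (untested sketch):

    /* guest_memfd (confidential VM) migration is not supported yet */
    g_assert(!object_dynamic_cast(OBJECT(rdm), TYPE_MEMORY_ATTRIBUTE_MANAGER));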

> 
> @@ -910,6 +909,9 @@ static uint64_t
> ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb)
>              .size = int128_make64(qemu_ram_get_used_length(rb)),
>          };
>  
> +        if (object_dynamic_cast(OBJECT(rdm),
> TYPE_MEMORY_ATTRIBUTE_MANAGER))
> +            return 0;
> +
>          ram_discard_manager_replay_discarded(rdm, &section,
>                                               dirty_bitmap_clear_section,
>                                               &cleared_bits);
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  2025-03-10  8:18 ` [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks Chenyi Qiang
  2025-03-14  8:21   ` Chenyi Qiang
@ 2025-03-17  6:18   ` Tony Lindgren
  2025-03-17  7:32     ` Chenyi Qiang
  1 sibling, 1 reply; 31+ messages in thread
From: Tony Lindgren @ 2025-03-17  6:18 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
	Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun, Li Xiaoyao,
	Maloor, Kishen

Hi,

On Mon, Mar 10, 2025 at 04:18:34PM +0800, Chenyi Qiang wrote:
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1885,6 +1886,16 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>              qemu_mutex_unlock_ramlist();
>              goto out_free;
>          }
> +
> +        new_block->memory_attribute_manager = MEMORY_ATTRIBUTE_MANAGER(object_new(TYPE_MEMORY_ATTRIBUTE_MANAGER));
> +        if (memory_attribute_manager_realize(new_block->memory_attribute_manager, new_block->mr)) {
> +            error_setg(errp, "Failed to realize memory attribute manager");
> +            object_unref(OBJECT(new_block->memory_attribute_manager));
> +            close(new_block->guest_memfd);
> +            ram_block_discard_require(false);
> +            qemu_mutex_unlock_ramlist();
> +            goto out_free;
> +        }
>      }
>  
>      ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;

Might as well put the above into a separate memory manager init function
to start with. It keeps the goto out_free error path unified, and makes
things more future proof if the rest of ram_block_add() ever develops a
need to check for errors.

Regards,

Tony


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  2025-03-17  6:18   ` Tony Lindgren
@ 2025-03-17  7:32     ` Chenyi Qiang
  2025-03-17  9:45       ` Tony Lindgren
  0 siblings, 1 reply; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-17  7:32 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
	Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun, Li Xiaoyao,
	Maloor, Kishen



On 3/17/2025 2:18 PM, Tony Lindgren wrote:
> Hi,
> 
> On Mon, Mar 10, 2025 at 04:18:34PM +0800, Chenyi Qiang wrote:
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -1885,6 +1886,16 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>              qemu_mutex_unlock_ramlist();
>>              goto out_free;
>>          }
>> +
>> +        new_block->memory_attribute_manager = MEMORY_ATTRIBUTE_MANAGER(object_new(TYPE_MEMORY_ATTRIBUTE_MANAGER));
>> +        if (memory_attribute_manager_realize(new_block->memory_attribute_manager, new_block->mr)) {
>> +            error_setg(errp, "Failed to realize memory attribute manager");
>> +            object_unref(OBJECT(new_block->memory_attribute_manager));
>> +            close(new_block->guest_memfd);
>> +            ram_block_discard_require(false);
>> +            qemu_mutex_unlock_ramlist();
>> +            goto out_free;
>> +        }
>>      }
>>  
>>      ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
> 
> Might as well put the above into a separate memory manager init function
> to start with. It keeps the goto out_free error path unified, and makes
> things more future proof if the rest of ram_block_add() ever develops a
> need to check for errors.

Which part should be defined in a separate function? The init part
(object_new() + realize()), or the error handling part (object_unref() +
close() + ram_block_discard_require(false))?

If we need to check for errors in the rest of ram_block_add() in the
future, how about adding a new label before out_free and moving the
error handling there?

> 
> Regards,
> 
> Tony



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  2025-03-17  7:32     ` Chenyi Qiang
@ 2025-03-17  9:45       ` Tony Lindgren
  2025-03-17 10:21         ` Chenyi Qiang
  0 siblings, 1 reply; 31+ messages in thread
From: Tony Lindgren @ 2025-03-17  9:45 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
	Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun, Li Xiaoyao,
	Maloor, Kishen

On Mon, Mar 17, 2025 at 03:32:16PM +0800, Chenyi Qiang wrote:
> 
> 
> On 3/17/2025 2:18 PM, Tony Lindgren wrote:
> > Hi,
> > 
> > On Mon, Mar 10, 2025 at 04:18:34PM +0800, Chenyi Qiang wrote:
> >> --- a/system/physmem.c
> >> +++ b/system/physmem.c
> >> @@ -1885,6 +1886,16 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> >>              qemu_mutex_unlock_ramlist();
> >>              goto out_free;
> >>          }
> >> +
> >> +        new_block->memory_attribute_manager = MEMORY_ATTRIBUTE_MANAGER(object_new(TYPE_MEMORY_ATTRIBUTE_MANAGER));
> >> +        if (memory_attribute_manager_realize(new_block->memory_attribute_manager, new_block->mr)) {
> >> +            error_setg(errp, "Failed to realize memory attribute manager");
> >> +            object_unref(OBJECT(new_block->memory_attribute_manager));
> >> +            close(new_block->guest_memfd);
> >> +            ram_block_discard_require(false);
> >> +            qemu_mutex_unlock_ramlist();
> >> +            goto out_free;
> >> +        }
> >>      }
> >>  
> >>      ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
> > 
> > Might as well put the above into a separate memory manager init function
> > to start with. It keeps the goto out_free error path unified, and makes
> > things more future proof if the rest of ram_block_add() ever develops a
> > need to check for errors.
> 
> Which part to be defined in a separate function? The init function of
> object_new() + realize(), or the error handling operation
> (object_unref() + close() + ram_block_discard_require(false))?

I was thinking the whole thing, including freeing :) But maybe there's
something more to consider to keep calls paired.

> If need to check for errors in the rest of ram_block_add() in future,
> how about adding a new label before out_free and move the error handling
> there?

Yeah that would work too.

Regards,

Tony


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  2025-03-17  9:45       ` Tony Lindgren
@ 2025-03-17 10:21         ` Chenyi Qiang
  2025-03-17 11:07           ` Tony Lindgren
  0 siblings, 1 reply; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-17 10:21 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
	Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun, Li Xiaoyao,
	Maloor, Kishen



On 3/17/2025 5:45 PM, Tony Lindgren wrote:
> On Mon, Mar 17, 2025 at 03:32:16PM +0800, Chenyi Qiang wrote:
>>
>>
>> On 3/17/2025 2:18 PM, Tony Lindgren wrote:
>>> Hi,
>>>
>>> On Mon, Mar 10, 2025 at 04:18:34PM +0800, Chenyi Qiang wrote:
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -1885,6 +1886,16 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>>              qemu_mutex_unlock_ramlist();
>>>>              goto out_free;
>>>>          }
>>>> +
>>>> +        new_block->memory_attribute_manager = MEMORY_ATTRIBUTE_MANAGER(object_new(TYPE_MEMORY_ATTRIBUTE_MANAGER));
>>>> +        if (memory_attribute_manager_realize(new_block->memory_attribute_manager, new_block->mr)) {
>>>> +            error_setg(errp, "Failed to realize memory attribute manager");
>>>> +            object_unref(OBJECT(new_block->memory_attribute_manager));
>>>> +            close(new_block->guest_memfd);
>>>> +            ram_block_discard_require(false);
>>>> +            qemu_mutex_unlock_ramlist();
>>>> +            goto out_free;
>>>> +        }
>>>>      }
>>>>  
>>>>      ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
>>>
>>> Might as well put the above into a separate memory manager init function
>>> to start with. It keeps the goto out_free error path unified, and makes
>>> things more future proof if the rest of ram_block_add() ever develops a
>>> need to check for errors.
>>
>> Which part to be defined in a separate function? The init function of
>> object_new() + realize(), or the error handling operation
>> (object_unref() + close() + ram_block_discard_require(false))?
> 
> I was thinking the whole thing, including freeing :) But maybe there's
> something more to consider to keep calls paired.

If we put the whole thing in a separate function, I think the remaining
error handling paths still need to perform the same operations. Or am I
misunderstanding something?

> 
>> If need to check for errors in the rest of ram_block_add() in future,
>> how about adding a new label before out_free and move the error handling
>> there?
> 
> Yeah that would work too.

I'm not sure if we should make such a change directly, or if we should
wait until a real error check is introduced in the future.

> 
> Regards,
> 
> Tony



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBlock with guest_memfd
  2025-03-17  2:54     ` Chenyi Qiang
@ 2025-03-17 10:36       ` David Hildenbrand
  2025-03-17 17:01         ` Gupta, Pankaj
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-03-17 10:36 UTC (permalink / raw)
  To: Chenyi Qiang, Gupta, Pankaj, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao

On 17.03.25 03:54, Chenyi Qiang wrote:
> 
> 
> On 3/14/2025 8:11 PM, Gupta, Pankaj wrote:
>> On 3/10/2025 9:18 AM, Chenyi Qiang wrote:
>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>> uncoordinated discard") highlighted, some subsystems like VFIO may
>>> disable ram block discard. However, guest_memfd relies on the discard
>>> operation to perform page conversion between private and shared memory.
>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>> device to a confidential VM via shared memory. To address this, it is
>>> crucial to ensure systems like VFIO refresh its IOMMU mappings.
>>>
>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>> VFIO mappings in relation to VM page assignment. Effectively page
>>> conversion is similar to hot-removing a page in one mode and adding it
>>> back in the other. Therefore, similar actions are required for page
>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>> facilitate this process.
>>>
>>> Since guest_memfd is not an object, it cannot directly implement the
>>> RamDiscardManager interface. One potential attempt is to implement it in
>>> HostMemoryBackend. This is not appropriate because guest_memfd is per
>>> RAMBlock. Some RAMBlocks have a memory backend but others do not. In
>>> particular, the ones like virtual BIOS calling
>>> memory_region_init_ram_guest_memfd() do not.
>>>
>>> To manage the RAMBlocks with guest_memfd, define a new object named
>>> MemoryAttributeManager to implement the RamDiscardManager interface. The
>>
>> Isn't this should be the other way around. 'MemoryAttributeManager'
>> should be an interface and RamDiscardManager a type of it, an
>> implementation?
> 
> We want to use 'MemoryAttributeManager' to represent RAMBlock to
> implement the RamDiscardManager interface callbacks because RAMBlock is
> not an object. It includes some metadata of guest_memfd like
> shared_bitmap at the same time.
> 
> I can't get it that make 'MemoryAttributeManager' an interface and
> RamDiscardManager a type of it. Can you elaborate it a little bit? I
> think at least we need someone to implement the RamDiscardManager interface.

shared <-> private is translated (abstracted) to "populated <-> 
discarded", which makes sense. The other way around would be wrong.

It's going to be interesting once we have more logical states, for 
example supporting virtio-mem for confidential VMs.

Then we'd have "shared+populated, private+populated, shared+discarded,
private+discarded". Not sure if this could simply be achieved by
allowing multiple RamDiscardManagers that are effectively chained, or if
we'd want a different interface.
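
One possible encoding if both dimensions were tracked in one manager
(purely hypothetical sketch):

    /* two bits per block: bit 0 = populated, bit 1 = shared */
    enum {
        BLOCK_PRIVATE_DISCARDED = 0x0,
        BLOCK_PRIVATE_POPULATED = 0x1,
        BLOCK_SHARED_DISCARDED  = 0x2,
        BLOCK_SHARED_POPULATED  = 0x3,
    };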

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  2025-03-17 10:21         ` Chenyi Qiang
@ 2025-03-17 11:07           ` Tony Lindgren
  0 siblings, 0 replies; 31+ messages in thread
From: Tony Lindgren @ 2025-03-17 11:07 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
	Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun, Li Xiaoyao,
	Maloor, Kishen

On Mon, Mar 17, 2025 at 06:21:13PM +0800, Chenyi Qiang wrote:
> 
> 
> On 3/17/2025 5:45 PM, Tony Lindgren wrote:
> > On Mon, Mar 17, 2025 at 03:32:16PM +0800, Chenyi Qiang wrote:
> >>
> >>
> >> On 3/17/2025 2:18 PM, Tony Lindgren wrote:
> >>> Hi,
> >>>
> >>> On Mon, Mar 10, 2025 at 04:18:34PM +0800, Chenyi Qiang wrote:
> >>>> --- a/system/physmem.c
> >>>> +++ b/system/physmem.c
> >>>> @@ -1885,6 +1886,16 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> >>>>              qemu_mutex_unlock_ramlist();
> >>>>              goto out_free;
> >>>>          }
> >>>> +
> >>>> +        new_block->memory_attribute_manager = MEMORY_ATTRIBUTE_MANAGER(object_new(TYPE_MEMORY_ATTRIBUTE_MANAGER));
> >>>> +        if (memory_attribute_manager_realize(new_block->memory_attribute_manager, new_block->mr)) {
> >>>> +            error_setg(errp, "Failed to realize memory attribute manager");
> >>>> +            object_unref(OBJECT(new_block->memory_attribute_manager));
> >>>> +            close(new_block->guest_memfd);
> >>>> +            ram_block_discard_require(false);
> >>>> +            qemu_mutex_unlock_ramlist();
> >>>> +            goto out_free;
> >>>> +        }
> >>>>      }
> >>>>  
> >>>>      ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
> >>>
> >>> Might as well put the above into a separate memory manager init function
> >>> to start with. It keeps the goto out_free error path unified, and makes
> >>> things more future proof if the rest of ram_block_add() ever develops a
> >>> need to check for errors.
> >>
> >> Which part to be defined in a separate function? The init function of
> >> object_new() + realize(), or the error handling operation
> >> (object_unref() + close() + ram_block_discard_require(false))?
> > 
> > I was thinking the whole thing, including freeing :) But maybe there's
> > something more to consider to keep calls paired.
> 
> If putting the whole thing separately, I think the rest part to do error
> handling still needs to add the same operation. Or I misunderstand
> something?

So maybe your suggestion of just a separate clean-up function would work:

new_block->memory_attribute_manager =
    MEMORY_ATTRIBUTE_MANAGER(object_new(TYPE_MEMORY_ATTRIBUTE_MANAGER));
if (memory_attribute_manager_realize(new_block->memory_attribute_manager,
    new_block->mr)) {
    memory_attribute_manager_cleanup(...);
    goto out_free;
}
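
where the clean-up helper could pair the earlier setup calls, roughly
like this (hypothetical sketch; error_setg() and the ramlist unlock
would stay at the call site):

    static void memory_attribute_manager_cleanup(RAMBlock *new_block)
    {
        object_unref(OBJECT(new_block->memory_attribute_manager));
        new_block->memory_attribute_manager = NULL;
        close(new_block->guest_memfd);
        ram_block_discard_require(false);
    }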

> >> If need to check for errors in the rest of ram_block_add() in future,
> >> how about adding a new label before out_free and move the error handling
> >> there?
> > 
> > Yeah that would work too.
> 
> I'm not sure if we should add such change directly, or we can wait for
> the real error check introduced in future.

Right, not sure either.

Regards,

Tony


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBlock with guest_memfd
  2025-03-17 10:36       ` David Hildenbrand
@ 2025-03-17 17:01         ` Gupta, Pankaj
  2025-03-18  1:54           ` Chenyi Qiang
  0 siblings, 1 reply; 31+ messages in thread
From: Gupta, Pankaj @ 2025-03-17 17:01 UTC (permalink / raw)
  To: David Hildenbrand, Chenyi Qiang, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao

On 3/17/2025 11:36 AM, David Hildenbrand wrote:
> On 17.03.25 03:54, Chenyi Qiang wrote:
>>
>>
>> On 3/14/2025 8:11 PM, Gupta, Pankaj wrote:
>>> On 3/10/2025 9:18 AM, Chenyi Qiang wrote:
>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>> uncoordinated discard") highlighted, some subsystems like VFIO may
>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>> operation to perform page conversion between private and shared memory.
>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>> device to a confidential VM via shared memory. To address this, it is
>>>> crucial to ensure systems like VFIO refresh its IOMMU mappings.
>>>>
>>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>> back in the other. Therefore, similar actions are required for page
>>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>>> facilitate this process.
>>>>
>>>> Since guest_memfd is not an object, it cannot directly implement the
>>>> RamDiscardManager interface. One potential attempt is to implement 
>>>> it in
>>>> HostMemoryBackend. This is not appropriate because guest_memfd is per
>>>> RAMBlock. Some RAMBlocks have a memory backend but others do not. In
>>>> particular, the ones like virtual BIOS calling
>>>> memory_region_init_ram_guest_memfd() do not.
>>>>
>>>> To manage the RAMBlocks with guest_memfd, define a new object named
>>>> MemoryAttributeManager to implement the RamDiscardManager interface. 
>>>> The
>>>
>>> Isn't this should be the other way around. 'MemoryAttributeManager'
>>> should be an interface and RamDiscardManager a type of it, an
>>> implementation?
>>
>> We want to use 'MemoryAttributeManager' to represent RAMBlock to
>> implement the RamDiscardManager interface callbacks because RAMBlock is
>> not an object. It includes some metadata of guest_memfd like
>> shared_bitmap at the same time.
>>
>> I can't get it that make 'MemoryAttributeManager' an interface and
>> RamDiscardManager a type of it. Can you elaborate it a little bit? I
>> think at least we need someone to implement the RamDiscardManager 
>> interface.
> 
> shared <-> private is translated (abstracted) to "populated <-> 
> discarded", which makes sense. The other way around would be wrong.
> 
> It's going to be interesting once we have more logical states, for 
> example supporting virtio-mem for confidential VMs.
> 
> Then we'd have "shared+populated, private+populated, shared+discard, 
> private+discarded". Not sure if this could simply be achieved by 
> allowing multiple RamDiscardManager that are effectively chained, or if 
> we'd want a different interface.

Exactly! In any case, a generic manager (parent class) that can work on
different operations/states implemented in child classes (which can be
chained as well) would make more sense.

Best regards,
Pankaj
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBlock with guest_memfd
  2025-03-17 17:01         ` Gupta, Pankaj
@ 2025-03-18  1:54           ` Chenyi Qiang
  2025-03-19  8:55             ` Gupta, Pankaj
  0 siblings, 1 reply; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-18  1:54 UTC (permalink / raw)
  To: Gupta, Pankaj, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao



On 3/18/2025 1:01 AM, Gupta, Pankaj wrote:
> On 3/17/2025 11:36 AM, David Hildenbrand wrote:
>> On 17.03.25 03:54, Chenyi Qiang wrote:
>>>
>>>
>>> On 3/14/2025 8:11 PM, Gupta, Pankaj wrote:
>>>> On 3/10/2025 9:18 AM, Chenyi Qiang wrote:
>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>> uncoordinated discard") highlighted, some subsystems like VFIO may
>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>> operation to perform page conversion between private and shared
>>>>> memory.
>>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>>> device to a confidential VM via shared memory. To address this, it is
>>>>> crucial to ensure systems like VFIO refresh its IOMMU mappings.
>>>>>
>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>> adjust
>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>>> back in the other. Therefore, similar actions are required for page
>>>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>>>> facilitate this process.
>>>>>
>>>>> Since guest_memfd is not an object, it cannot directly implement the
>>>>> RamDiscardManager interface. One potential attempt is to implement
>>>>> it in
>>>>> HostMemoryBackend. This is not appropriate because guest_memfd is per
>>>>> RAMBlock. Some RAMBlocks have a memory backend but others do not. In
>>>>> particular, the ones like virtual BIOS calling
>>>>> memory_region_init_ram_guest_memfd() do not.
>>>>>
>>>>> To manage the RAMBlocks with guest_memfd, define a new object named
>>>>> MemoryAttributeManager to implement the RamDiscardManager
>>>>> interface. The
>>>>
>>>> Isn't this should be the other way around. 'MemoryAttributeManager'
>>>> should be an interface and RamDiscardManager a type of it, an
>>>> implementation?
>>>
>>> We want to use 'MemoryAttributeManager' to represent RAMBlock to
>>> implement the RamDiscardManager interface callbacks because RAMBlock is
>>> not an object. It includes some metadata of guest_memfd like
>>> shared_bitmap at the same time.
>>>
>>> I can't get it that make 'MemoryAttributeManager' an interface and
>>> RamDiscardManager a type of it. Can you elaborate it a little bit? I
>>> think at least we need someone to implement the RamDiscardManager
>>> interface.
>>
>> shared <-> private is translated (abstracted) to "populated <->
>> discarded", which makes sense. The other way around would be wrong.
>>
>> It's going to be interesting once we have more logical states, for
>> example supporting virtio-mem for confidential VMs.
>>
>> Then we'd have "shared+populated, private+populated, shared+discard,
>> private+discarded". Not sure if this could simply be achieved by
>> allowing multiple RamDiscardManager that are effectively chained, or
>> if we'd want a different interface.
> 
> Exactly! In any case generic manager (parent class) would make more
> sense that can work on different operations/states implemented in child
> classes (can be chained as well).

Ah, we are talking about the generic state management. Sorry for my slow
reaction.

So we need to:
1. Define a generic manager interface, e.g.
MemoryStateManager/GenericStateManager.
2. Make RamDiscardManager a child of MemoryStateManager, which manages
the populated and discarded states.
3. Define a new child manager interface PrivateSharedManager, which
manages the private and shared states.
4. Define a new object ConfidentialMemoryAttribute to implement the
PrivateSharedManager interface.
(Feel free to rename the above interfaces/object; a rough sketch follows
below.)

Is my understanding correct?
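
Something along these lines, perhaps (all names hypothetical, untested
sketch):

    #define TYPE_GENERIC_STATE_MANAGER "generic-state-manager"

    typedef struct GenericStateManager GenericStateManager;
    typedef struct StateChangeListener StateChangeListener;

    /*
     * Parent interface: tracks a pair of opposite states over a
     * MemoryRegion and notifies listeners on state changes.
     */
    struct GenericStateManagerClass {
        InterfaceClass parent_class;

        uint64_t (*get_min_granularity)(const GenericStateManager *gsm,
                                        const MemoryRegion *mr);
        /*
         * "set" would mean populated for RamDiscardManager and shared
         * for PrivateSharedManager.
         */
        bool (*is_state_set)(const GenericStateManager *gsm,
                             const MemoryRegionSection *section);
        void (*register_listener)(GenericStateManager *gsm,
                                  StateChangeListener *scl,
                                  MemoryRegionSection *section);
        void (*unregister_listener)(GenericStateManager *gsm,
                                    StateChangeListener *scl);
    };

with RamDiscardManager and PrivateSharedManager both declared as child
interfaces of TYPE_GENERIC_STATE_MANAGER, and the new object (e.g.
ConfidentialMemoryAttribute) implementing PrivateSharedManager.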

> 
> Best regards,
> Pankaj
>>
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBlock with guest_memfd
  2025-03-18  1:54           ` Chenyi Qiang
@ 2025-03-19  8:55             ` Gupta, Pankaj
  2025-03-19 11:23               ` Chenyi Qiang
  0 siblings, 1 reply; 31+ messages in thread
From: Gupta, Pankaj @ 2025-03-19  8:55 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao


>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO may
>>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>>> operation to perform page conversion between private and shared
>>>>>> memory.
>>>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>>>> device to a confidential VM via shared memory. To address this, it is
>>>>>> crucial to ensure systems like VFIO refresh its IOMMU mappings.
>>>>>>
>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>> adjust
>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>>>> back in the other. Therefore, similar actions are required for page
>>>>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>>>>> facilitate this process.
>>>>>>
>>>>>> Since guest_memfd is not an object, it cannot directly implement the
>>>>>> RamDiscardManager interface. One potential attempt is to implement
>>>>>> it in
>>>>>> HostMemoryBackend. This is not appropriate because guest_memfd is per
>>>>>> RAMBlock. Some RAMBlocks have a memory backend but others do not. In
>>>>>> particular, the ones like virtual BIOS calling
>>>>>> memory_region_init_ram_guest_memfd() do not.
>>>>>>
>>>>>> To manage the RAMBlocks with guest_memfd, define a new object named
>>>>>> MemoryAttributeManager to implement the RamDiscardManager
>>>>>> interface. The
>>>>>
>>>>> Isn't this should be the other way around. 'MemoryAttributeManager'
>>>>> should be an interface and RamDiscardManager a type of it, an
>>>>> implementation?
>>>>
>>>> We want to use 'MemoryAttributeManager' to represent RAMBlock to
>>>> implement the RamDiscardManager interface callbacks because RAMBlock is
>>>> not an object. It includes some metadata of guest_memfd like
>>>> shared_bitmap at the same time.
>>>>
>>>> I can't get it that make 'MemoryAttributeManager' an interface and
>>>> RamDiscardManager a type of it. Can you elaborate it a little bit? I
>>>> think at least we need someone to implement the RamDiscardManager
>>>> interface.
>>>
>>> shared <-> private is translated (abstracted) to "populated <->
>>> discarded", which makes sense. The other way around would be wrong.
>>>
>>> It's going to be interesting once we have more logical states, for
>>> example supporting virtio-mem for confidential VMs.
>>>
>>> Then we'd have "shared+populated, private+populated, shared+discard,
>>> private+discarded". Not sure if this could simply be achieved by
>>> allowing multiple RamDiscardManager that are effectively chained, or
>>> if we'd want a different interface.
>>
>> Exactly! In any case generic manager (parent class) would make more
>> sense that can work on different operations/states implemented in child
>> classes (can be chained as well).
> 
> Ah, we are talking about the generic state management. Sorry for my slow
> reaction.
> 
> So we need to
> 1. Define a generic manager Interface, e.g.
> MemoryStateManager/GenericStateManager.
> 2. Make RamDiscardManager the child of MemoryStateManager which manages
> the state of populated and discarded.
> 3. Define a new child manager Interface PrivateSharedManager which
> manages the state of private and shared.
> 4. Define a new object ConfidentialMemoryAttribute to implement the
> PrivateSharedManager interface.
> (Welcome to rename the above Interface/Object)
> 
> Is my understanding correct?

Yes, in that direction. Where 'RamDiscardManager' & 
'PrivateSharedManager' are both child of 'GenericStateManager'.

Depending on listeners registered, corresponding handlers can be called.

Best regards,
Pankaj


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBlock with guest_memfd
  2025-03-19  8:55             ` Gupta, Pankaj
@ 2025-03-19 11:23               ` Chenyi Qiang
  2025-03-19 11:56                 ` Gupta, Pankaj
  0 siblings, 1 reply; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-19 11:23 UTC (permalink / raw)
  To: Gupta, Pankaj, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao



On 3/19/2025 4:55 PM, Gupta, Pankaj wrote:
> 
>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO may
>>>>>>> disable ram block discard. However, guest_memfd relies on the
>>>>>>> discard
>>>>>>> operation to perform page conversion between private and shared
>>>>>>> memory.
>>>>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>>>>> device to a confidential VM via shared memory. To address this,
>>>>>>> it is
>>>>>>> crucial to ensure systems like VFIO refresh its IOMMU mappings.
>>>>>>>
>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>> adjust
>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>> adding it
>>>>>>> back in the other. Therefore, similar actions are required for page
>>>>>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>>>>>> facilitate this process.
>>>>>>>
>>>>>>> Since guest_memfd is not an object, it cannot directly implement the
>>>>>>> RamDiscardManager interface. One potential attempt is to implement
>>>>>>> it in
>>>>>>> HostMemoryBackend. This is not appropriate because guest_memfd is
>>>>>>> per
>>>>>>> RAMBlock. Some RAMBlocks have a memory backend but others do not. In
>>>>>>> particular, the ones like virtual BIOS calling
>>>>>>> memory_region_init_ram_guest_memfd() do not.
>>>>>>>
>>>>>>> To manage the RAMBlocks with guest_memfd, define a new object named
>>>>>>> MemoryAttributeManager to implement the RamDiscardManager
>>>>>>> interface. The
>>>>>>
>>>>>> Isn't this should be the other way around. 'MemoryAttributeManager'
>>>>>> should be an interface and RamDiscardManager a type of it, an
>>>>>> implementation?
>>>>>
>>>>> We want to use 'MemoryAttributeManager' to represent RAMBlock to
>>>>> implement the RamDiscardManager interface callbacks because
>>>>> RAMBlock is
>>>>> not an object. It includes some metadata of guest_memfd like
>>>>> shared_bitmap at the same time.
>>>>>
>>>>> I can't get it that make 'MemoryAttributeManager' an interface and
>>>>> RamDiscardManager a type of it. Can you elaborate it a little bit? I
>>>>> think at least we need someone to implement the RamDiscardManager
>>>>> interface.
>>>>
>>>> shared <-> private is translated (abstracted) to "populated <->
>>>> discarded", which makes sense. The other way around would be wrong.
>>>>
>>>> It's going to be interesting once we have more logical states, for
>>>> example supporting virtio-mem for confidential VMs.
>>>>
>>>> Then we'd have "shared+populated, private+populated, shared+discard,
>>>> private+discarded". Not sure if this could simply be achieved by
>>>> allowing multiple RamDiscardManager that are effectively chained, or
>>>> if we'd want a different interface.
>>>
>>> Exactly! In any case generic manager (parent class) would make more
>>> sense that can work on different operations/states implemented in child
>>> classes (can be chained as well).
>>
>> Ah, we are talking about the generic state management. Sorry for my slow
>> reaction.
>>
>> So we need to
>> 1. Define a generic manager Interface, e.g.
>> MemoryStateManager/GenericStateManager.
>> 2. Make RamDiscardManager the child of MemoryStateManager which manages
>> the state of populated and discarded.
>> 3. Define a new child manager Interface PrivateSharedManager which
>> manages the state of private and shared.
>> 4. Define a new object ConfidentialMemoryAttribute to implement the
>> PrivateSharedManager interface.
>> (Welcome to rename the above Interface/Object)
>>
>> Is my understanding correct?
> 
> Yes, in that direction. Where 'RamDiscardManager' &
> 'PrivateSharedManager' are both child of 'GenericStateManager'.
> 
> Depending on listeners registered, corresponding handlers can be called.

Yes, it would be more generic and more extensible in the future.

Do we need to add this framework change directly? Or should we keep the
current structure (abstracting private/shared as discarded/populated)
and add the generic manager only once there is a real use case like
virtio-mem for confidential VMs?

> 
> Best regards,
> Pankaj
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBlock with guest_memfd
  2025-03-19 11:23               ` Chenyi Qiang
@ 2025-03-19 11:56                 ` Gupta, Pankaj
  2025-03-20  3:15                   ` Chenyi Qiang
  0 siblings, 1 reply; 31+ messages in thread
From: Gupta, Pankaj @ 2025-03-19 11:56 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao

On 3/19/2025 12:23 PM, Chenyi Qiang wrote:
> 
> 
> On 3/19/2025 4:55 PM, Gupta, Pankaj wrote:
>>
>>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO may
>>>>>>>> disable ram block discard. However, guest_memfd relies on the
>>>>>>>> discard
>>>>>>>> operation to perform page conversion between private and shared
>>>>>>>> memory.
>>>>>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>>>>>> device to a confidential VM via shared memory. To address this,
>>>>>>>> it is
>>>>>>>> crucial to ensure systems like VFIO refresh its IOMMU mappings.
>>>>>>>>
>>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>>> adjust
>>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>>> adding it
>>>>>>>> back in the other. Therefore, similar actions are required for page
>>>>>>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>>>>>>> facilitate this process.
>>>>>>>>
>>>>>>>> Since guest_memfd is not an object, it cannot directly implement the
>>>>>>>> RamDiscardManager interface. One potential attempt is to implement
>>>>>>>> it in
>>>>>>>> HostMemoryBackend. This is not appropriate because guest_memfd is
>>>>>>>> per
>>>>>>>> RAMBlock. Some RAMBlocks have a memory backend but others do not. In
>>>>>>>> particular, the ones like virtual BIOS calling
>>>>>>>> memory_region_init_ram_guest_memfd() do not.
>>>>>>>>
>>>>>>>> To manage the RAMBlocks with guest_memfd, define a new object named
>>>>>>>> MemoryAttributeManager to implement the RamDiscardManager
>>>>>>>> interface. The
>>>>>>>
>>>>>>> Isn't this should be the other way around. 'MemoryAttributeManager'
>>>>>>> should be an interface and RamDiscardManager a type of it, an
>>>>>>> implementation?
>>>>>>
>>>>>> We want to use 'MemoryAttributeManager' to represent RAMBlock to
>>>>>> implement the RamDiscardManager interface callbacks because
>>>>>> RAMBlock is
>>>>>> not an object. It includes some metadata of guest_memfd like
>>>>>> shared_bitmap at the same time.
>>>>>>
>>>>>> I can't get it that make 'MemoryAttributeManager' an interface and
>>>>>> RamDiscardManager a type of it. Can you elaborate it a little bit? I
>>>>>> think at least we need someone to implement the RamDiscardManager
>>>>>> interface.
>>>>>
>>>>> shared <-> private is translated (abstracted) to "populated <->
>>>>> discarded", which makes sense. The other way around would be wrong.
>>>>>
>>>>> It's going to be interesting once we have more logical states, for
>>>>> example supporting virtio-mem for confidential VMs.
>>>>>
>>>>> Then we'd have "shared+populated, private+populated, shared+discard,
>>>>> private+discarded". Not sure if this could simply be achieved by
>>>>> allowing multiple RamDiscardManager that are effectively chained, or
>>>>> if we'd want a different interface.
>>>>
>>>> Exactly! In any case, a generic manager (parent class) that can work
>>>> on different operations/states implemented in child classes (which
>>>> can be chained as well) would make more sense.
>>>
>>> Ah, we are talking about the generic state management. Sorry for my slow
>>> reaction.
>>>
>>> So we need to
>>> 1. Define a generic manager Interface, e.g.
>>> MemoryStateManager/GenericStateManager.
>>> 2. Make RamDiscardManager the child of MemoryStateManager which manages
>>> the state of populated and discarded.
>>> 3. Define a new child manager Interface PrivateSharedManager which
>>> manages the state of private and shared.
>>> 4. Define a new object ConfidentialMemoryAttribute to implement the
>>> PrivateSharedManager interface.
>>> (Feel free to rename the above Interface/Object.)
>>>
>>> Is my understanding correct?
>>
>> Yes, in that direction, where 'RamDiscardManager' &
>> 'PrivateSharedManager' are both children of 'GenericStateManager'.
>>
>> Depending on listeners registered, corresponding handlers can be called.
> 
> Yes, it would be more generic and extensible for the future.
> 
> Do we need to add this framework change right away? Or keep the current
> structure (abstract private/shared as discard/populated) and add the
> generic manager once a real use case like virtio-mem for confidential
> VMs arrives?
> 

Yes, maybe to start with we should add the new (discard/populated)
changes on top of the new framework.

In the future, the framework can be extended for in-place private-shared
conversion (if it requires userspace help) and virtio-mem-like
interfaces. The important thing is to have a proper hierarchy with the
base bits in place.
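
A rough QOM sketch of such a hierarchy could look like the following
(hypothetical, not from the series: the type names are placeholders, and
it assumes QOM allows an interface type to parent another interface):

/* Register the proposed hierarchy: one generic parent interface and two
 * child interfaces that inherit its class layout. */
#define TYPE_GENERIC_STATE_MANAGER "generic-state-manager"
#define TYPE_PRIVATE_SHARED_MANAGER "private-shared-manager"

static const TypeInfo generic_state_manager_info = {
    .parent     = TYPE_INTERFACE,
    .name       = TYPE_GENERIC_STATE_MANAGER,
    .class_size = sizeof(GenericStateManagerClass),
};

static const TypeInfo ram_discard_manager_info = {
    .parent     = TYPE_GENERIC_STATE_MANAGER,
    .name       = TYPE_RAM_DISCARD_MANAGER,
    .class_size = sizeof(RamDiscardManagerClass),
};

static const TypeInfo private_shared_manager_info = {
    .parent     = TYPE_GENERIC_STATE_MANAGER,
    .name       = TYPE_PRIVATE_SHARED_MANAGER,
    .class_size = sizeof(PrivateSharedManagerClass),
};

static void state_manager_register_types(void)
{
    type_register_static(&generic_state_manager_info);
    type_register_static(&ram_discard_manager_info);
    type_register_static(&private_shared_manager_info);
}

type_init(state_manager_register_types)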

Thanks,
Pankaj



* Re: [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd
  2025-03-19 11:56                 ` Gupta, Pankaj
@ 2025-03-20  3:15                   ` Chenyi Qiang
  2025-03-24  4:04                     ` Chenyi Qiang
  0 siblings, 1 reply; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-20  3:15 UTC (permalink / raw)
  To: Gupta, Pankaj, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao



On 3/19/2025 7:56 PM, Gupta, Pankaj wrote:
> On 3/19/2025 12:23 PM, Chenyi Qiang wrote:
>>
>>
>> On 3/19/2025 4:55 PM, Gupta, Pankaj wrote:
>>>
>>>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO may
>>>>>>>>> disable ram block discard. However, guest_memfd relies on the
>>>>>>>>> discard
>>>>>>>>> operation to perform page conversion between private and shared
>>>>>>>>> memory.
>>>>>>>>> This can lead to a stale IOMMU mapping issue when assigning a
>>>>>>>>> hardware device to a confidential VM via shared memory. To address
>>>>>>>>> this, it is crucial to ensure systems like VFIO refresh their
>>>>>>>>> IOMMU mappings.
>>>>>>>>>
>>>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>>>> adjust
>>>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>>>> adding it
>>>>>>>>> back in the other. Therefore, similar actions are required for
>>>>>>>>> page
>>>>>>>>> conversion events. Introduce the RamDiscardManager to
>>>>>>>>> guest_memfd to
>>>>>>>>> facilitate this process.
>>>>>>>>>
>>>>>>>>> Since guest_memfd is not an object, it cannot directly
>>>>>>>>> implement the
>>>>>>>>> RamDiscardManager interface. One potential approach is to implement
>>>>>>>>> it in
>>>>>>>>> HostMemoryBackend. This is not appropriate because guest_memfd is
>>>>>>>>> per
>>>>>>>>> RAMBlock. Some RAMBlocks have a memory backend but others do
>>>>>>>>> not. In
>>>>>>>>> particular, the ones like virtual BIOS calling
>>>>>>>>> memory_region_init_ram_guest_memfd() do not.
>>>>>>>>>
>>>>>>>>> To manage the RAMBlocks with guest_memfd, define a new object
>>>>>>>>> named
>>>>>>>>> MemoryAttributeManager to implement the RamDiscardManager
>>>>>>>>> interface. The
>>>>>>>>
>>>>>>>> Shouldn't this be the other way around? 'MemoryAttributeManager'
>>>>>>>> should be an interface and RamDiscardManager a type of it, an
>>>>>>>> implementation?
>>>>>>>
>>>>>>> We want 'MemoryAttributeManager' to represent the RAMBlock and
>>>>>>> implement the RamDiscardManager interface callbacks, because
>>>>>>> RAMBlock itself is not an object. It also holds some guest_memfd
>>>>>>> metadata, such as the shared_bitmap.
>>>>>>>
>>>>>>> I don't quite follow how to make 'MemoryAttributeManager' an
>>>>>>> interface and RamDiscardManager a type of it. Can you elaborate a
>>>>>>> little? I think at least someone needs to implement the
>>>>>>> RamDiscardManager interface.
>>>>>>
>>>>>> shared <-> private is translated (abstracted) to "populated <->
>>>>>> discarded", which makes sense. The other way around would be wrong.
>>>>>>
>>>>>> It's going to be interesting once we have more logical states, for
>>>>>> example supporting virtio-mem for confidential VMs.
>>>>>>
>>>>>> Then we'd have "shared+populated, private+populated, shared+discarded,
>>>>>> private+discarded". Not sure if this could simply be achieved by
>>>>>> allowing multiple RamDiscardManager that are effectively chained, or
>>>>>> if we'd want a different interface.
>>>>>
>>>>> Exactly! In any case, a generic manager (parent class) that can work
>>>>> on different operations/states implemented in child classes (which
>>>>> can be chained as well) would make more sense.
>>>>
>>>> Ah, we are talking about the generic state management. Sorry for my
>>>> slow
>>>> reaction.
>>>>
>>>> So we need to
>>>> 1. Define a generic manager Interface, e.g.
>>>> MemoryStateManager/GenericStateManager.
>>>> 2. Make RamDiscardManager the child of MemoryStateManager which manages
>>>> the state of populated and discarded.
>>>> 3. Define a new child manager Interface PrivateSharedManager which
>>>> manages the state of private and shared.
>>>> 4. Define a new object ConfidentialMemoryAttribute to implement the
>>>> PrivateSharedManager interface.
>>>> (Feel free to rename the above Interface/Object.)
>>>>
>>>> Is my understanding correct?
>>>
>>> Yes, in that direction, where 'RamDiscardManager' &
>>> 'PrivateSharedManager' are both children of 'GenericStateManager'.
>>>
>>> Depending on listeners registered, corresponding handlers can be called.
>>
>> Yes, it would be more generic and extensible for the future.
>>
>> Do we need to add this framework change right away? Or keep the current
>> structure (abstract private/shared as discard/populated) and add the
>> generic manager once a real use case like virtio-mem for confidential
>> VMs arrives?
>>
> 
> Yes, maybe to start with we should add the new (discard/populated)
> changes on top of the new framework.
> 
> In the future, the framework can be extended for in-place private-shared
> conversion (if it requires userspace help) and virtio-mem-like
> interfaces. The important thing is to have a proper hierarchy with the
> base bits in place.

Thanks. Then I will follow this direction.

For the common parent class, what I can think of is to have it manage a
pair of opposite states (state set and state clear, like populate and
discard) and define common callbacks such as notify_set() and
notify_clear(), as long as we never use it to manage more than two
states in the future. Otherwise I may define a stub parent class
instead.
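
For illustration, a minimal sketch of that two-state pair (hypothetical
names, not part of the series):

/* The parent class would only know about one pair of opposite states;
 * each child interface gives the pair its meaning:
 *   RamDiscardManager:    set = populated, clear = discarded
 *   PrivateSharedManager: set = shared,    clear = private
 */
typedef enum {
    GENERIC_STATE_CLEAR = 0,
    GENERIC_STATE_SET   = 1,
} GenericStateValue;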

> 
> Thanks,
> Pankaj




* Re: [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd
  2025-03-20  3:15                   ` Chenyi Qiang
@ 2025-03-24  4:04                     ` Chenyi Qiang
  0 siblings, 0 replies; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-24  4:04 UTC (permalink / raw)
  To: Gupta, Pankaj, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao



On 3/20/2025 11:15 AM, Chenyi Qiang wrote:
> 
> 
> On 3/19/2025 7:56 PM, Gupta, Pankaj wrote:
>> On 3/19/2025 12:23 PM, Chenyi Qiang wrote:
>>>
>>>
>>> On 3/19/2025 4:55 PM, Gupta, Pankaj wrote:
>>>>
>>>>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO may
>>>>>>>>>> disable ram block discard. However, guest_memfd relies on the
>>>>>>>>>> discard
>>>>>>>>>> operation to perform page conversion between private and shared
>>>>>>>>>> memory.
>>>>>>>>>> This can lead to a stale IOMMU mapping issue when assigning a
>>>>>>>>>> hardware device to a confidential VM via shared memory. To address
>>>>>>>>>> this, it is crucial to ensure systems like VFIO refresh their
>>>>>>>>>> IOMMU mappings.
>>>>>>>>>>
>>>>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>>>>> adjust
>>>>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>>>>> adding it
>>>>>>>>>> back in the other. Therefore, similar actions are required for
>>>>>>>>>> page
>>>>>>>>>> conversion events. Introduce the RamDiscardManager to
>>>>>>>>>> guest_memfd to
>>>>>>>>>> facilitate this process.
>>>>>>>>>>
>>>>>>>>>> Since guest_memfd is not an object, it cannot directly
>>>>>>>>>> implement the
>>>>>>>>>> RamDiscardManager interface. One potential approach is to implement
>>>>>>>>>> it in
>>>>>>>>>> HostMemoryBackend. This is not appropriate because guest_memfd is
>>>>>>>>>> per
>>>>>>>>>> RAMBlock. Some RAMBlocks have a memory backend but others do
>>>>>>>>>> not. In
>>>>>>>>>> particular, the ones like virtual BIOS calling
>>>>>>>>>> memory_region_init_ram_guest_memfd() do not.
>>>>>>>>>>
>>>>>>>>>> To manage the RAMBlocks with guest_memfd, define a new object
>>>>>>>>>> named
>>>>>>>>>> MemoryAttributeManager to implement the RamDiscardManager
>>>>>>>>>> interface. The
>>>>>>>>>
>>>>>>>>> Shouldn't this be the other way around? 'MemoryAttributeManager'
>>>>>>>>> should be an interface and RamDiscardManager a type of it, an
>>>>>>>>> implementation?
>>>>>>>>
>>>>>>>> We want 'MemoryAttributeManager' to represent the RAMBlock and
>>>>>>>> implement the RamDiscardManager interface callbacks, because
>>>>>>>> RAMBlock itself is not an object. It also holds some guest_memfd
>>>>>>>> metadata, such as the shared_bitmap.
>>>>>>>>
>>>>>>>> I don't quite follow how to make 'MemoryAttributeManager' an
>>>>>>>> interface and RamDiscardManager a type of it. Can you elaborate a
>>>>>>>> little? I think at least someone needs to implement the
>>>>>>>> RamDiscardManager interface.
>>>>>>>
>>>>>>> shared <-> private is translated (abstracted) to "populated <->
>>>>>>> discarded", which makes sense. The other way around would be wrong.
>>>>>>>
>>>>>>> It's going to be interesting once we have more logical states, for
>>>>>>> example supporting virtio-mem for confidential VMs.
>>>>>>>
>>>>>>> Then we'd have "shared+populated, private+populated, shared+discarded,
>>>>>>> private+discarded". Not sure if this could simply be achieved by
>>>>>>> allowing multiple RamDiscardManager that are effectively chained, or
>>>>>>> if we'd want a different interface.
>>>>>>
>>>>>> Exactly! In any case, a generic manager (parent class) that can work
>>>>>> on different operations/states implemented in child classes (which
>>>>>> can be chained as well) would make more sense.
>>>>>
>>>>> Ah, we are talking about the generic state management. Sorry for my
>>>>> slow
>>>>> reaction.
>>>>>
>>>>> So we need to
>>>>> 1. Define a generic manager Interface, e.g.
>>>>> MemoryStateManager/GenericStateManager.
>>>>> 2. Make RamDiscardManager the child of MemoryStateManager which manages
>>>>> the state of populated and discarded.
>>>>> 3. Define a new child manager Interface PrivateSharedManager which
>>>>> manages the state of private and shared.
>>>>> 4. Define a new object ConfidentialMemoryAttribute to implement the
>>>>> PrivateSharedManager interface.
>>>>> (Feel free to rename the above Interface/Object.)
>>>>>
>>>>> Is my understanding correct?
>>>>
>>>> Yes, in that direction, where 'RamDiscardManager' &
>>>> 'PrivateSharedManager' are both children of 'GenericStateManager'.
>>>>
>>>> Depending on listeners registered, corresponding handlers can be called.
>>>
>>> Yes, it would be more generic and extensible for the future.
>>>
>>> Do we need to add this framework change right away? Or keep the current
>>> structure (abstract private/shared as discard/populated) and add the
>>> generic manager once a real use case like virtio-mem for confidential
>>> VMs arrives?
>>>
>>
>> Yes, maybe to start with we should add the new (discard/populated)
>> changes on top of the new framework.
>>
>> In the future, the framework can be extended for in-place private-shared
>> conversion (if it requires userspace help) and virtio-mem-like
>> interfaces. The important thing is to have a proper hierarchy with the
>> base bits in place.
> 
> Thanks. Then I will follow this direction.
> 
> For the common parent class, what I can think of is to have it manage a
> pair of opposite states (state set and state clear, like populate and
> discard) and define common callbacks such as notify_set() and
> notify_clear(), as long as we never use it to manage more than two
> states in the future. Otherwise I may define a stub parent class
> instead.

Hi Pankaj & David,

How about defining them like:

---
For class definition: extract all the callbacks into parent class:

typedef int (*ReplayStateChange)(MemoryRegionSection *section, void *opaque);

struct GenericStateManagerClass {
    InterfaceClass parent_class;
    uint64_t (*get_min_granularity)(const GenericStateManager *gsm,
                                    const MemoryRegion *mr);
    bool (*is_state_set)(const GenericStateManager *gsm,
                         const MemoryRegionSection *section);
    int (*replay_state_set)(const GenericStateManager *gsm,
                            MemoryRegionSection *section,
                            ReplayStateChange replay_fn, void *opaque);
    int (*replay_state_clear)(const GenericStateManager *gsm,
                              MemoryRegionSection *section,
                              ReplayStateChange replay_fn, void *opaque);
    void (*register_listener)(GenericStateManager *gsm,
                              void *listener,
                              MemoryRegionSection *section);
    void (*unregister_listener)(GenericStateManager *gsm,
                                void *listener);
};

struct RamDiscardManagerClass {
    /* private */
    GenericStateManagerClass parent_class;
};

struct PrivateSharedManagerClass {
    /* private */
    GenericStateManagerClass parent_class;
};

---

For listeners, embed the generic listener into specific listeners:

typedef int (*NotifyStateSet)(void *listener, MemoryRegionSection *section);
typedef int (*NotifyStateClear)(void *listener, MemoryRegionSection *section);

struct StateChangeListener {
    NotifyStateSet notify_state_set;
    NotifyStateClear notify_state_clear;

    MemoryRegionSection *section;
};

struct RamDiscardListener {
    struct StateChangeListener scl;
    bool double_discard_supported;

    QLIST_ENTRY(RamDiscardListener) next;
};

struct PrivateSharedListener {
    struct StateChangeListener scl;

    QLIST_ENTRY(PrivateSharedListener) next;
};

---

For helpers,

- define the generic helpers:

void generic_state_manager_register_listener(GenericStateManager *gsm,
                                             void *listener,
                                             MemoryRegionSection *s)
{
    GenericStateManagerClass *gsmc = GENERIC_STATE_MANAGER_GET_CLASS(gsm);

    g_assert(gsmc->register_listener);
    gsmc->register_listener(gsm, listener, s);
}

- Keep RamDiscardManager-specific helpers for compatibility if necessary:

void ram_discard_manager_register_listener(RamDiscardManager *rdm,
                                           RamDiscardListener *rdl,
                                           MemoryRegionSection *section)
{
    GenericStateManagerClass *gsmc = GENERIC_STATE_MANAGER_GET_CLASS(rdm);
    GenericStateManager *gsm = GENERIC_STATE_MANAGER(rdm);

    g_assert(gsmc->register_listener);
    gsmc->register_listener(gsm, (void *)rdl, section);
}
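
As a usage sketch (hypothetical consumer code, not from the series), a
VFIO-style listener could then be wired up through the generic helper:

/* Hypothetical sketch: register a private/shared listener through the
 * generic helper defined above. */
static int vfio_notify_shared(void *listener, MemoryRegionSection *section)
{
    /* e.g. map/pin the now-shared range in the IOMMU */
    return 0;
}

static int vfio_notify_private(void *listener, MemoryRegionSection *section)
{
    /* e.g. unmap/unpin the now-private range */
    return 0;
}

static void vfio_register_private_shared(GenericStateManager *gsm,
                                         MemoryRegionSection *section)
{
    PrivateSharedListener *psl = g_new0(PrivateSharedListener, 1);

    psl->scl.notify_state_set = vfio_notify_shared;    /* set == shared */
    psl->scl.notify_state_clear = vfio_notify_private; /* clear == private */
    generic_state_manager_register_listener(gsm, psl, section);
}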

> 
>>
>> Thanks,
>> Pankaj
> 




* Re: [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
  2025-03-14 10:23           ` Chenyi Qiang
@ 2025-03-31  8:55             ` Chenyi Qiang
  0 siblings, 0 replies; 31+ messages in thread
From: Chenyi Qiang @ 2025-03-31  8:55 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
	Li Xiaoyao



On 3/14/2025 6:23 PM, Chenyi Qiang wrote:
> 
> 
> On 3/14/2025 5:50 PM, David Hildenbrand wrote:
>> On 14.03.25 10:30, Chenyi Qiang wrote:
>>>
>>>
>>> On 3/14/2025 5:00 PM, David Hildenbrand wrote:
>>>> On 14.03.25 09:21, Chenyi Qiang wrote:
>>>>> Hi David & Alexey,
>>>>
>>>> Hi,
>>>>
>>>>>
>>>>> To keep the bitmap aligned, I add the undo operation for
>>>>> set_memory_attributes() and use the bitmap + replay callback to do
>>>>> set_memory_attributes(). Does this change make sense?
>>>>
>>>> I assume you mean this hunk:
>>>>
>>>> +    ret = memory_attribute_manager_state_change(
>>>> +                       MEMORY_ATTRIBUTE_MANAGER(mr->rdm),
>>>> +                       offset, size, to_private);
>>>> +    if (ret) {
>>>> +        warn_report("Failed to notify the listener the state change of "
>>>> +                    "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
>>>> +                    start, size, to_private ? "private" : "shared");
>>>> +        args.to_private = !to_private;
>>>> +        if (to_private) {
>>>> +            ret = ram_discard_manager_replay_populated(mr->rdm, &section,
>>>> +                                                kvm_set_memory_attributes_cb,
>>>> +                                                &args);
>>>> +        } else {
>>>> +            ret = ram_discard_manager_replay_discarded(mr->rdm, &section,
>>>> +                                                kvm_set_memory_attributes_cb,
>>>> +                                                &args);
>>>> +        }
>>>> +        if (ret) {
>>>> +            goto out_unref;
>>>> +        }
>>>>
>>
>> We should probably document that memory_attribute_state_change() cannot
>> fail with "to_private", so you can simplify it to only handle the "to
>> shared" case.
> 
> Yes, "to_private" branch is unnecessary.
> 
>>
>>>>
>>>> Why is that undo necessary? The bitmap + listeners should be held in
>>>> sync inside of
>>>> memory_attribute_manager_state_change(). Handling this in the caller
>>>> looks wrong.
>>>
>>> state_change() handles the listener, i.e. VFIO/IOMMU. And the caller
>>> handles the core mm side (guest_memfd set_attribute()) undo if
>>> state_change() failed. Just want to keep the attribute consistent with
>>> the bitmap on both side. Do we need this? If not, the bitmap can only
>>> represent the status of listeners.
>>
>> Ah, so you meant that when the state change fails, you effectively want
>> to undo (revert) the attribute change.
>>
>> That makes sense when we are converting private->shared.
>>
>>
>> BTW, I'm wondering if the order should be the following (with in-place
>> conversion in mind, where we would mmap guest_memfd for the shared memory
>> parts).
>>
>> On private -> shared conversion:
>>
>> (1) change_attribute()
>> (2) state_change(): IOMMU pins shared memory
>> (3) restore_attribute() if it failed
>>
>> On shared -> private conversion
>> (1) state_change(): IOMMU unpins shared memory
>> (2) change_attribute(): can convert in-place because there are no pins
>>
>> I'm wondering if the whole attribute change could actually be a
>> listener, invoked by the memory_attribute_manager_state_change() call
>> itself in the right order.
>>
>> We'd probably need listener priorities, and invoke them in reverse order
>> on shared -> private conversion. Just an idea to get rid of the manual
>> ram_discard_manager_replay_discarded() call in your code.
> 
> Good idea. I think listener priorities can make it more elegant with
> in-place conversion. The change_attribute() listener can be given the
> highest or lowest priority. Maybe we can add this change in advance,
> before in-place conversion is available.

Hi David,

To add the change_attribute() listener priority changes, I can think of
several steps (a rough sketch of the ordering follows below):

1) Change *NotifyRamDiscard() to return a result, because changing the
attribute to private could fail.
2) Add a list in the confidential_guest_support structure (or some other
place) to save the wrapped listener, like VFIORamDiscardListener, and
add the related listener register/unregister calls in
kvm_region_add/del().
3) Add a priority field to the listener, plus the related handling to
follow the expected listener ordering.

Regarding step 1), with a change_attribute() listener, the to_private
path could also fail. The tricky part is still the error handling. If we
simply assume it cannot fail, we could add a g_assert() to catch
unexpected cases. Otherwise we also need to add the undo operation for
the to_private case.
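
A rough sketch of the priority idea in 3) (hypothetical names, only to
illustrate the ordering; it assumes the QTAILQ macros from
qemu/queue.h):

/* Keep listeners sorted by priority and walk the list in reverse for
 * shared -> private, so e.g. a change_attribute() listener can run
 * first in one direction and last in the other. */
typedef struct PrioListener PrioListener;
struct PrioListener {
    int (*notify)(PrioListener *pl, MemoryRegionSection *section,
                  bool to_shared);
    int priority;                 /* smaller value runs earlier on "set" */
    QTAILQ_ENTRY(PrioListener) next;
};

typedef QTAILQ_HEAD(, PrioListener) PrioListenerList;

static int notify_state_change(PrioListenerList *list,
                               MemoryRegionSection *section, bool to_shared)
{
    PrioListener *pl;
    int ret;

    if (to_shared) {
        QTAILQ_FOREACH(pl, list, next) {
            ret = pl->notify(pl, section, true);
            if (ret) {
                return ret;       /* caller can then replay the undo */
            }
        }
    } else {
        QTAILQ_FOREACH_REVERSE(pl, list, next) {
            ret = pl->notify(pl, section, false);
            if (ret) {
                return ret;
            }
        }
    }
    return 0;
}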

> 
>>
> 




end of thread

Thread overview: 31+ messages
2025-03-10  8:18 [PATCH v3 0/7] Enable shared device assignment Chenyi Qiang
2025-03-10  8:18 ` [PATCH v3 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
2025-03-10  8:18 ` [PATCH v3 2/7] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
2025-03-10  8:18 ` [PATCH v3 3/7] memory: Unify the definiton of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
2025-03-17  3:00   ` Kishen Maloor
2025-03-17  4:11     ` Chenyi Qiang
2025-03-10  8:18 ` [PATCH v3 4/7] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd Chenyi Qiang
2025-03-14 12:11   ` Gupta, Pankaj
2025-03-17  2:54     ` Chenyi Qiang
2025-03-17 10:36       ` David Hildenbrand
2025-03-17 17:01         ` Gupta, Pankaj
2025-03-18  1:54           ` Chenyi Qiang
2025-03-19  8:55             ` Gupta, Pankaj
2025-03-19 11:23               ` Chenyi Qiang
2025-03-19 11:56                 ` Gupta, Pankaj
2025-03-20  3:15                   ` Chenyi Qiang
2025-03-24  4:04                     ` Chenyi Qiang
2025-03-10  8:18 ` [PATCH v3 5/7] memory-attribute-manager: Introduce a callback to notify the shared/private state change Chenyi Qiang
2025-03-10  8:18 ` [PATCH v3 6/7] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks Chenyi Qiang
2025-03-14  8:21   ` Chenyi Qiang
2025-03-14  9:00     ` David Hildenbrand
2025-03-14  9:30       ` Chenyi Qiang
2025-03-14  9:50         ` David Hildenbrand
2025-03-14 10:23           ` Chenyi Qiang
2025-03-31  8:55             ` Chenyi Qiang
2025-03-17  6:18   ` Tony Lindgren
2025-03-17  7:32     ` Chenyi Qiang
2025-03-17  9:45       ` Tony Lindgren
2025-03-17 10:21         ` Chenyi Qiang
2025-03-17 11:07           ` Tony Lindgren
2025-03-10  8:18 ` [PATCH v3 7/7] RAMBlock: Make guest_memfd require coordinate discard Chenyi Qiang
