* [PATCH v6 0/5] Enable shared device assignment
@ 2025-05-30  8:32 Chenyi Qiang
From: Chenyi Qiang @ 2025-05-30  8:32 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao, Cédric Le Goater,
	Alex Williamson

This is the v6 series of the shared device assignment support.

Compared with the last version [1], this series retains the basic
support but removes the additional complex error handling, which can be
added back when necessary. Meanwhile, the patchset has been reorganized
to be clearer.

Overview of this series:

- Patch 1-3: Preparation patches. These include function exposure and
  some function prototype changes.
- Patch 4: Introduce a new object to implement RamDiscardManager
  interface and a helper to notify the shared/private state change.
- Patch 5: Enable coordinated discarding of RAM with guest_memfd through
  the RamDiscardManager interface.

More details and smaller changes can be found in the individual patches.

---

Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared and private at runtime.

"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. In current implementation, shared memory is allocated
with normal methods (e.g. mmap or fallocate) while private memory is
allocated from guest_memfd. When a VM performs memory conversions, QEMU
frees pages via madvise or via PUNCH_HOLE on memfd or guest_memfd from
one side, and allocates new pages from the other side. This will cause a
stale IOMMU mapping issue mentioned in [2] when we try to enable shared
device assignment in confidential VMs.
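
As an illustration, consider a minimal sketch of the discard performed
on a shared-to-private conversion under the current (non-in-place)
design. This is not the actual QEMU code; the function and variable
names are placeholders:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>

    /*
     * Hypothetical sketch: on a shared -> private conversion, the shared
     * backing is freed and the page is re-allocated from guest_memfd.
     * Any IOMMU mapping created from the old host virtual address now
     * points at a freed page -- the stale mapping issue described above.
     */
    static int discard_shared_page(void *shared_hva, int memfd,
                                   off_t offset, size_t size)
    {
        if (memfd >= 0) {
            /* fd-backed shared memory: punch a hole in the memfd */
            return fallocate(memfd,
                             FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             offset, size);
        }
        /* anonymous shared memory: drop the pages */
        return madvise(shared_hva, size, MADV_DONTNEED);
    }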

Solution
========
The key to enabling shared device assignment is to update the IOMMU
mappings on page conversion. RamDiscardManager, an existing interface
currently utilized by virtio-mem, offers a means to modify IOMMU
mappings in accordance with VM page assignment. A page conversion is
similar to hot-removing a page in one mode and adding it back in the
other.

This series implements a RamDiscardManager for confidential VMs and
utilizes its infrastructure to notify VFIO of page conversions.
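
For illustration, a minimal sketch of how a listener such as VFIO hooks
into a RamDiscardManager; the callback bodies and names here are
placeholders (VFIO's real listener lives in hw/vfio):

    static int demo_notify_populate(RamDiscardListener *rdl,
                                    MemoryRegionSection *section)
    {
        /* e.g. VFIO DMA maps the now-shared (populated) range */
        return 0;
    }

    static void demo_notify_discard(RamDiscardListener *rdl,
                                    MemoryRegionSection *section)
    {
        /* e.g. VFIO DMA unmaps the now-private (discarded) range */
    }

    /* On region_add of a section backed by a RamDiscardManager: */
    RamDiscardManager *rdm =
        memory_region_get_ram_discard_manager(section->mr);
    RamDiscardListener *rdl = g_new0(RamDiscardListener, 1);

    ram_discard_listener_init(rdl, demo_notify_populate,
                              demo_notify_discard, true);
    ram_discard_manager_register_listener(rdm, rdl, section);

Registering also replays notify_populate() for the already-populated
parts, so a listener starts out in sync with the current state.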

Limitations and future extensions
=================================
This series only supports the basic shared device assignment functionality.
There are still some limitations and areas that can be extended and
optimized in the future.

Relationship with in-place conversion
-------------------------------------
In-place page conversion is ongoing work to allow mmap() of guest_memfd
to userspace so that both private and shared memory can use the same
physical memory as the backend. This new design eliminates the need to
discard pages during shared/private conversions. When it is ready,
shared device assignment will need to be adjusted to an
unmap-before-conversion-to-private and map-after-conversion-to-shared
sequence to stay compatible with the change.

Partial unmap limitation
------------------------
VFIO expects the DMA mapping for a specific IOVA to be mapped and
unmapped with the same granularity. The guest may perform partial
conversions, such as converting a small region within a larger one, e.g.
a single 4K page inside an existing 2M mapping. To prevent such invalid
cases, current operations are performed at 4K granularity. This could be
optimized after the DMA mapping cut operation [3] is introduced in the
future: we can always perform a split-before-unmap when a partial
conversion happens. If the split succeeds, the unmap will succeed and be
atomic; if the split fails, the unmap fails.
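
For illustration, the legacy type1 unmap below shows why the map and
unmap granularity must match. This is a hypothetical fragment; iova and
map_size stand for whatever was passed at map time:

    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /*
     * Legacy VFIO type1 rejects unmapping a 4K sub-range of a 2M
     * mapping; the unmap must cover the original mapping. This is why
     * the series performs all map/unmap operations at 4K granularity
     * until the mapping cut operation [3] is available.
     */
    static int unmap_whole_mapping(int container_fd, __u64 iova,
                                   __u64 map_size)
    {
        struct vfio_iommu_type1_dma_unmap unmap = {
            .argsz = sizeof(unmap),
            .iova  = iova,      /* must match the iova used for MAP_DMA */
            .size  = map_size,  /* must match the size used for MAP_DMA */
        };

        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
    }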

More attributes management
--------------------------
The current RamDiscardManager can only manage a pair of opposite states,
such as populated/discarded or shared/private. If more states need to be
considered, for example to support virtio-mem in confidential VMs, three
states would be possible (shared populated/private populated/discarded).
The current framework cannot handle such a scenario, and a new framework
will need to be designed at that point [4].

Memory overhead optimization
----------------------------
A comment from Baolu [5] suggests considering a Maple Tree or a generic
interval tree to manage the private/shared state instead of a bitmap,
which can reduce memory consumption. This optimization could also be
applied to other bitmap use cases, such as dirty bitmaps for guest RAM.

Testing
=======
This patch series is tested on the mainline kernel, since TDX base
support has been merged. The QEMU repo is available at:
https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-05-30-v2

To facilitate shared device assignment with a NIC, employ the legacy
type1 VFIO with the QEMU command:

qemu-system-x86_64 [...]
    -device vfio-pci,host=XX:XX.X

With 4K granularity, every mapped page consumes a DMA mapping entry, so
the vfio_iommu_type1 dma_entry_limit parameter needs to be raised. For
example, a 16GB guest needs 16 GiB / 4 KiB = 4,194,304 entries, i.e.
vfio_iommu_type1.dma_entry_limit=4194304.

To use the iommufd-backed VFIO instead, use the QEMU command:

qemu-system-x86_64 [...]
    -object iommufd,id=iommufd0 \
    -device vfio-pci,host=XX:XX.X,iommufd=iommufd0

Because new features such as the cut_mapping operation will only be
supported in iommufd, the iommufd-backed VFIO is recommended.

Related links
=============
[1] https://lore.kernel.org/qemu-devel/20250520102856.132417-1-chenyi.qiang@intel.com/
[2] https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonzini@redhat.com/
[3] https://lore.kernel.org/linux-iommu/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com/
[4] https://lore.kernel.org/qemu-devel/d1a71e00-243b-4751-ab73-c05a4e090d58@redhat.com/
[5] https://lore.kernel.org/qemu-devel/013b36a9-9310-4073-b54c-9c511f23decf@linux.intel.com/

Chenyi Qiang (5):
  memory: Export a helper to get intersection of a MemoryRegionSection
    with a given range
  memory: Change memory_region_set_ram_discard_manager() to return the
    result
  memory: Unify the definition of ReplayRamPopulate() and
    ReplayRamDiscard()
  ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock
    with guest_memfd
  physmem: Support coordinated discarding of RAM with guest_memfd

 MAINTAINERS                   |   1 +
 accel/kvm/kvm-all.c           |   9 +
 hw/virtio/virtio-mem.c        |  83 +++---
 include/system/memory.h       | 100 +++++--
 include/system/ramblock.h     |  22 ++
 migration/ram.c               |   5 +-
 system/memory.c               |  22 +-
 system/meson.build            |   1 +
 system/physmem.c              |  18 +-
 system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
 system/trace-events           |   3 +
 11 files changed, 660 insertions(+), 84 deletions(-)
 create mode 100644 system/ram-block-attributes.c

-- 
2.43.5



* [PATCH v6 1/5] memory: Export a helper to get intersection of a MemoryRegionSection with a given range
From: Chenyi Qiang @ 2025-05-30  8:32 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao, Cédric Le Goater,
	Alex Williamson

Rename the helper to memory_region_section_intersect_range() to make it
more generic. Meanwhile, define @end as Int128 and replace the related
operations with the int128_*() helpers, since the helper is exported as
a wider API.

Suggested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v6:
    - No change.

Changes in v5:
    - Indent change for int128 ops to avoid the line over 80
    - Add two Review-by from Alexey and Zhao

Changes in v4:
    - No change.
---
 hw/virtio/virtio-mem.c  | 32 +++++---------------------------
 include/system/memory.h | 30 ++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index a3d1a676e7..b3c126ea1e 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -244,28 +244,6 @@ static int virtio_mem_for_each_plugged_range(VirtIOMEM *vmem, void *arg,
     return ret;
 }
 
-/*
- * Adjust the memory section to cover the intersection with the given range.
- *
- * Returns false if the intersection is empty, otherwise returns true.
- */
-static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
-                                                uint64_t offset, uint64_t size)
-{
-    uint64_t start = MAX(s->offset_within_region, offset);
-    uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
-                       offset + size);
-
-    if (end <= start) {
-        return false;
-    }
-
-    s->offset_within_address_space += start - s->offset_within_region;
-    s->offset_within_region = start;
-    s->size = int128_make64(end - start);
-    return true;
-}
-
 typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void *arg);
 
 static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
@@ -287,7 +265,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
                                       first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -319,7 +297,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
                                  first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -355,7 +333,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
         rdl->notify_discard(rdl, &tmp);
@@ -371,7 +349,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
         ret = rdl->notify_populate(rdl, &tmp);
@@ -388,7 +366,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
             if (rdl2 == rdl) {
                 break;
             }
-            if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                 continue;
             }
             rdl2->notify_discard(rdl2, &tmp);
diff --git a/include/system/memory.h b/include/system/memory.h
index fbbf4cf911..b961c4076a 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -1211,6 +1211,36 @@ MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s);
  */
 void memory_region_section_free_copy(MemoryRegionSection *s);
 
+/**
+ * memory_region_section_intersect_range: Adjust the memory section to cover
+ * the intersection with the given range.
+ *
+ * @s: the #MemoryRegionSection to be adjusted
+ * @offset: the offset of the given range in the memory region
+ * @size: the size of the given range
+ *
+ * Returns false if the intersection is empty, otherwise returns true.
+ */
+static inline bool memory_region_section_intersect_range(MemoryRegionSection *s,
+                                                         uint64_t offset,
+                                                         uint64_t size)
+{
+    uint64_t start = MAX(s->offset_within_region, offset);
+    Int128 end = int128_min(int128_add(int128_make64(s->offset_within_region),
+                                       s->size),
+                            int128_add(int128_make64(offset),
+                                       int128_make64(size)));
+
+    if (int128_le(end, int128_make64(start))) {
+        return false;
+    }
+
+    s->offset_within_address_space += start - s->offset_within_region;
+    s->offset_within_region = start;
+    s->size = int128_sub(end, int128_make64(start));
+    return true;
+}
+
 /**
  * memory_region_init: Initialize a memory region
  *
-- 
2.43.5



* [PATCH v6 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result
From: Chenyi Qiang @ 2025-05-30  8:32 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao, Cédric Le Goater,
	Alex Williamson

Modify memory_region_set_ram_discard_manager() to return -EBUSY if a
RamDiscardManager is already set in the MemoryRegion. The caller must
handle this failure, such as having virtio-mem undo its actions and fail
the realize() process. Opportunistically move the call earlier to avoid
complex error handling.

This change is beneficial when introducing a new RamDiscardManager
instance besides virtio-mem. After
ram_block_coordinated_discard_require(true) unlocks all
RamDiscardManager instances, only one instance is allowed to be set for
one MemoryRegion at present.

Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v6:
    - Add Reviewed-by from David.

Changes in v5:
    - Nit in commit message (return false -> -EBUSY)
    - Add set_ram_discard_manager(NULL) when ram_block_discard_range()
      fails.

Changes in v3:
    - Move set_ram_discard_manager() up to avoid a g_free()
    - Clean up set_ram_discard_manager() definition
---
 hw/virtio/virtio-mem.c  | 30 +++++++++++++++++-------------
 include/system/memory.h |  6 +++---
 system/memory.c         | 10 +++++++---
 3 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index b3c126ea1e..2e491e8c44 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1047,6 +1047,17 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    /*
+     * Set ourselves as RamDiscardManager before the plug handler maps the
+     * memory region and exposes it via an address space.
+     */
+    if (memory_region_set_ram_discard_manager(&vmem->memdev->mr,
+                                              RAM_DISCARD_MANAGER(vmem))) {
+        error_setg(errp, "Failed to set RamDiscardManager");
+        ram_block_coordinated_discard_require(false);
+        return;
+    }
+
     /*
      * We don't know at this point whether shared RAM is migrated using
      * QEMU or migrated using the file content. "x-ignore-shared" will be
@@ -1061,6 +1072,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
         if (ret) {
             error_setg_errno(errp, -ret, "Unexpected error discarding RAM");
+            memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
             ram_block_coordinated_discard_require(false);
             return;
         }
@@ -1122,13 +1134,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
     vmem->system_reset = VIRTIO_MEM_SYSTEM_RESET(obj);
     vmem->system_reset->vmem = vmem;
     qemu_register_resettable(obj);
-
-    /*
-     * Set ourselves as RamDiscardManager before the plug handler maps the
-     * memory region and exposes it via an address space.
-     */
-    memory_region_set_ram_discard_manager(&vmem->memdev->mr,
-                                          RAM_DISCARD_MANAGER(vmem));
 }
 
 static void virtio_mem_device_unrealize(DeviceState *dev)
@@ -1136,12 +1141,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOMEM *vmem = VIRTIO_MEM(dev);
 
-    /*
-     * The unplug handler unmapped the memory region, it cannot be
-     * found via an address space anymore. Unset ourselves.
-     */
-    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
-
     qemu_unregister_resettable(OBJECT(vmem->system_reset));
     object_unref(OBJECT(vmem->system_reset));
 
@@ -1154,6 +1153,11 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     virtio_del_queue(vdev, 0);
     virtio_cleanup(vdev);
     g_free(vmem->bitmap);
+    /*
+     * The unplug handler unmapped the memory region, it cannot be
+     * found via an address space anymore. Unset ourselves.
+     */
+    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
     ram_block_coordinated_discard_require(false);
 }
 
diff --git a/include/system/memory.h b/include/system/memory.h
index b961c4076a..896948deb1 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2499,13 +2499,13 @@ static inline bool memory_region_has_ram_discard_manager(MemoryRegion *mr)
  *
  * This function must not be called for a mapped #MemoryRegion, a #MemoryRegion
  * that does not cover RAM, or a #MemoryRegion that already has a
- * #RamDiscardManager assigned.
+ * #RamDiscardManager assigned. Returns 0 if the rdm is set successfully.
  *
  * @mr: the #MemoryRegion
  * @rdm: #RamDiscardManager to set
  */
-void memory_region_set_ram_discard_manager(MemoryRegion *mr,
-                                           RamDiscardManager *rdm);
+int memory_region_set_ram_discard_manager(MemoryRegion *mr,
+                                          RamDiscardManager *rdm);
 
 /**
  * memory_region_find: translate an address/size relative to a
diff --git a/system/memory.c b/system/memory.c
index 63b983efcd..b45b508dce 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2106,12 +2106,16 @@ RamDiscardManager *memory_region_get_ram_discard_manager(MemoryRegion *mr)
     return mr->rdm;
 }
 
-void memory_region_set_ram_discard_manager(MemoryRegion *mr,
-                                           RamDiscardManager *rdm)
+int memory_region_set_ram_discard_manager(MemoryRegion *mr,
+                                          RamDiscardManager *rdm)
 {
     g_assert(memory_region_is_ram(mr));
-    g_assert(!rdm || !mr->rdm);
+    if (mr->rdm && rdm) {
+        return -EBUSY;
+    }
+
     mr->rdm = rdm;
+    return 0;
 }
 
 uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
-- 
2.43.5



* [PATCH v6 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
From: Chenyi Qiang @ 2025-05-30  8:32 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao, Cédric Le Goater,
	Alex Williamson

Update the ReplayRamDiscard() function to return the result, and unify
ReplayRamPopulate() and ReplayRamDiscard() into ReplayRamDiscardState()
at the same time, since their definitions are identical. This
unification simplifies related structures, such as VirtIOMEMReplayData,
making them cleaner.

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v6:
    - Add Reviewed-by from David
    - Add a documentation comment for the prototype change

Changes in v5:
    - Rename ReplayRamStateChange to ReplayRamDiscardState (David)
    - return data->fn(s, data->opaque) instead of 0 in
      virtio_mem_rdm_replay_discarded_cb(). (Alexey)

Changes in v4:
    - Modify the commit message. We won't use Replay() operation when
      doing the attribute change like v3.
---
 hw/virtio/virtio-mem.c  | 21 +++++++-------
 include/system/memory.h | 64 ++++++++++++++++++++++++++++++-----------
 migration/ram.c         |  5 ++--
 system/memory.c         | 12 ++++----
 4 files changed, 66 insertions(+), 36 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 2e491e8c44..c46f6f9c3e 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1732,7 +1732,7 @@ static bool virtio_mem_rdm_is_populated(const RamDiscardManager *rdm,
 }
 
 struct VirtIOMEMReplayData {
-    void *fn;
+    ReplayRamDiscardState fn;
     void *opaque;
 };
 
@@ -1740,12 +1740,12 @@ static int virtio_mem_rdm_replay_populated_cb(MemoryRegionSection *s, void *arg)
 {
     struct VirtIOMEMReplayData *data = arg;
 
-    return ((ReplayRamPopulate)data->fn)(s, data->opaque);
+    return data->fn(s, data->opaque);
 }
 
 static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm,
                                            MemoryRegionSection *s,
-                                           ReplayRamPopulate replay_fn,
+                                           ReplayRamDiscardState replay_fn,
                                            void *opaque)
 {
     const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
@@ -1764,14 +1764,13 @@ static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s,
 {
     struct VirtIOMEMReplayData *data = arg;
 
-    ((ReplayRamDiscard)data->fn)(s, data->opaque);
-    return 0;
+    return data->fn(s, data->opaque);
 }
 
-static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
-                                            MemoryRegionSection *s,
-                                            ReplayRamDiscard replay_fn,
-                                            void *opaque)
+static int virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
+                                           MemoryRegionSection *s,
+                                           ReplayRamDiscardState replay_fn,
+                                           void *opaque)
 {
     const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
     struct VirtIOMEMReplayData data = {
@@ -1780,8 +1779,8 @@ static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
     };
 
     g_assert(s->mr == &vmem->memdev->mr);
-    virtio_mem_for_each_unplugged_section(vmem, s, &data,
-                                          virtio_mem_rdm_replay_discarded_cb);
+    return virtio_mem_for_each_unplugged_section(vmem, s, &data,
+                                                 virtio_mem_rdm_replay_discarded_cb);
 }
 
 static void virtio_mem_rdm_register_listener(RamDiscardManager *rdm,
diff --git a/include/system/memory.h b/include/system/memory.h
index 896948deb1..4f45a187d6 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -575,8 +575,20 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
     rdl->double_discard_supported = double_discard_supported;
 }
 
-typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
-typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);
+/**
+ * ReplayRamDiscardState:
+ *
+ * The callback handler for #RamDiscardManagerClass.replay_populated/
+ * #RamDiscardManagerClass.replay_discarded to invoke on populated/discarded
+ * parts.
+ *
+ * @section: the #MemoryRegionSection of populated/discarded part
+ * @opaque: pointer to forward to the callback
+ *
+ * Returns 0 on success, or a negative error if failed.
+ */
+typedef int (*ReplayRamDiscardState)(MemoryRegionSection *section,
+                                     void *opaque);
 
 /*
  * RamDiscardManagerClass:
@@ -650,36 +662,38 @@ struct RamDiscardManagerClass {
     /**
      * @replay_populated:
      *
-     * Call the #ReplayRamPopulate callback for all populated parts within the
-     * #MemoryRegionSection via the #RamDiscardManager.
+     * Call the #ReplayRamDiscardState callback for all populated parts within
+     * the #MemoryRegionSection via the #RamDiscardManager.
      *
      * In case any call fails, no further calls are made.
      *
      * @rdm: the #RamDiscardManager
      * @section: the #MemoryRegionSection
-     * @replay_fn: the #ReplayRamPopulate callback
+     * @replay_fn: the #ReplayRamDiscardState callback
      * @opaque: pointer to forward to the callback
      *
      * Returns 0 on success, or a negative error if any notification failed.
      */
     int (*replay_populated)(const RamDiscardManager *rdm,
                             MemoryRegionSection *section,
-                            ReplayRamPopulate replay_fn, void *opaque);
+                            ReplayRamDiscardState replay_fn, void *opaque);
 
     /**
      * @replay_discarded:
      *
-     * Call the #ReplayRamDiscard callback for all discarded parts within the
-     * #MemoryRegionSection via the #RamDiscardManager.
+     * Call the #ReplayRamDiscardState callback for all discarded parts within
+     * the #MemoryRegionSection via the #RamDiscardManager.
      *
      * @rdm: the #RamDiscardManager
      * @section: the #MemoryRegionSection
-     * @replay_fn: the #ReplayRamDiscard callback
+     * @replay_fn: the #ReplayRamDiscardState callback
      * @opaque: pointer to forward to the callback
+     *
+     * Returns 0 on success, or a negative error if any notification failed.
      */
-    void (*replay_discarded)(const RamDiscardManager *rdm,
-                             MemoryRegionSection *section,
-                             ReplayRamDiscard replay_fn, void *opaque);
+    int (*replay_discarded)(const RamDiscardManager *rdm,
+                            MemoryRegionSection *section,
+                            ReplayRamDiscardState replay_fn, void *opaque);
 
     /**
      * @register_listener:
@@ -720,15 +734,31 @@ uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
 bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
                                       const MemoryRegionSection *section);
 
+/**
+ * ram_discard_manager_replay_populated:
+ *
+ * A wrapper to call the #RamDiscardManagerClass.replay_populated callback
+ * of the #RamDiscardManager.
+ *
+ * Returns 0 on success, or a negative error if any notification failed.
+ */
 int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
                                          MemoryRegionSection *section,
-                                         ReplayRamPopulate replay_fn,
+                                         ReplayRamDiscardState replay_fn,
                                          void *opaque);
 
-void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
-                                          MemoryRegionSection *section,
-                                          ReplayRamDiscard replay_fn,
-                                          void *opaque);
+/**
+ * ram_discard_manager_replay_discarded:
+ *
+ * A wrapper to call the #RamDiscardManagerClass.replay_discarded callback
+ * of the #RamDiscardManager.
+ *
+ * Returns 0 on success, or a negative error if any notification failed.
+ */
+int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+                                         MemoryRegionSection *section,
+                                         ReplayRamDiscardState replay_fn,
+                                         void *opaque);
 
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
                                            RamDiscardListener *rdl,
diff --git a/migration/ram.c b/migration/ram.c
index e12913b43e..c004f37060 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -848,8 +848,8 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
     return ret;
 }
 
-static void dirty_bitmap_clear_section(MemoryRegionSection *section,
-                                       void *opaque)
+static int dirty_bitmap_clear_section(MemoryRegionSection *section,
+                                      void *opaque)
 {
     const hwaddr offset = section->offset_within_region;
     const hwaddr size = int128_get64(section->size);
@@ -868,6 +868,7 @@ static void dirty_bitmap_clear_section(MemoryRegionSection *section,
     }
     *cleared_bits += bitmap_count_one_with_offset(rb->bmap, start, npages);
     bitmap_clear(rb->bmap, start, npages);
+    return 0;
 }
 
 /*
diff --git a/system/memory.c b/system/memory.c
index b45b508dce..de45fbdd3f 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2138,7 +2138,7 @@ bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
 
 int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
                                          MemoryRegionSection *section,
-                                         ReplayRamPopulate replay_fn,
+                                         ReplayRamDiscardState replay_fn,
                                          void *opaque)
 {
     RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
@@ -2147,15 +2147,15 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
     return rdmc->replay_populated(rdm, section, replay_fn, opaque);
 }
 
-void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
-                                          MemoryRegionSection *section,
-                                          ReplayRamDiscard replay_fn,
-                                          void *opaque)
+int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+                                         MemoryRegionSection *section,
+                                         ReplayRamDiscardState replay_fn,
+                                         void *opaque)
 {
     RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
 
     g_assert(rdmc->replay_discarded);
-    rdmc->replay_discarded(rdm, section, replay_fn, opaque);
+    return rdmc->replay_discarded(rdm, section, replay_fn, opaque);
 }
 
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
-- 
2.43.5



* [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
From: Chenyi Qiang @ 2025-05-30  8:32 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao, Cédric Le Goater,
	Alex Williamson

Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
discard") highlighted that subsystems like VFIO may disable RAM block
discard. However, guest_memfd relies on discard operations for page
conversion between private and shared memory, potentially leading to
the stale IOMMU mapping issue when assigning hardware devices to
confidential VMs via shared memory. To address this and allow shared
device assignment, it is crucial to ensure the VFIO system refreshes
its IOMMU mappings.

RamDiscardManager is an existing interface (used by virtio-mem) to
adjust VFIO mappings in relation to VM page assignment. Effectively page
conversion is similar to hot-removing a page in one mode and adding it
back in the other. Therefore, similar actions are required for page
conversion events. Introduce the RamDiscardManager to guest_memfd to
facilitate this process.

Since guest_memfd is not an object, it cannot directly implement the
RamDiscardManager interface. Implementing it in HostMemoryBackend is
not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
have a memory backend while others do not. Notably, virtual BIOS
RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
backend.

To manage RAMBlocks with guest_memfd, define a new object named
RamBlockAttributes to implement the RamDiscardManager interface. This
object can store guest_memfd information such as the bitmap for shared
memory and the registered listeners for event notification. In the
context of RamDiscardManager, the shared state is treated as populated
and the private state as discarded. To notify listeners of conversion
events, a new state_change() helper is exported so that listeners like
VFIO can dynamically DMA map/unmap the shared mappings.

Note that the memory state is tracked at the host page size granularity,
as the minimum conversion size can be one page per request and VFIO
expects the DMA mapping for a specific IOVA to be mapped and unmapped
with the same granularity. Confidential VMs may perform partial
conversions, such as conversions on small regions within larger ones.
To prevent such invalid cases and until DMA mapping cut operation
support is available, all operations are performed with 4K granularity.

At present, memory conversion failures cause QEMU to quit instead of
resuming the guest or retrying the operation. It would be
future work to add more error handling or rollback mechanisms once
conversion failures are allowed. For example, in-place conversion of
guest_memfd could retry the unmap operation during the conversion from
shared to private. For now, keep the complex error handling out of the
picture as it is not required.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v6:
    - Change the object type name from RamBlockAttribute to
      RamBlockAttributes. (David)
    - Save the associated RAMBlock instead of the MemoryRegion in
      RamBlockAttributes. (David)
    - Squash the state_change() helper introduction in this commit as
      well as the mixture conversion case handling. (David)
    - Change the block_size type from int to size_t and some cleanup in
      validation check. (Alexey)
    - Add a tracepoint to track the state changes. (Alexey)

Changes in v5:
    - Revert to use RamDiscardManager interface instead of introducing
      new hierarchy of class to manage private/shared state, and keep
      using the new name of RamBlockAttribute compared with the
      MemoryAttributeManager in v3.
    - Use *simple* version of object_define and object_declare since the
      state_change() function is changed as an exported function instead
      of a virtual function in later patch.
    - Move the introduction of RamBlockAttribute field to this patch and
      rename it to ram_shared. (Alexey)
    - call the exit() when register/unregister failed. (Zhao)
    - Add the ram-block-attribute.c to Memory API related part in
      MAINTAINERS.

Changes in v4:
    - Change the name from memory-attribute-manager to
      ram-block-attribute.
    - Implement the newly-introduced PrivateSharedManager instead of
      RamDiscardManager and change related commit message.
    - Define the new object in ramblock.h instead of adding a new file.
---
 MAINTAINERS                   |   1 +
 include/system/ramblock.h     |  21 ++
 system/meson.build            |   1 +
 system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
 system/trace-events           |   3 +
 5 files changed, 506 insertions(+)
 create mode 100644 system/ram-block-attributes.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 6dacd6d004..8ec39aa7f8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3149,6 +3149,7 @@ F: system/memory.c
 F: system/memory_mapping.c
 F: system/physmem.c
 F: system/memory-internal.h
+F: system/ram-block-attributes.c
 F: scripts/coccinelle/memory-region-housekeeping.cocci
 
 Memory devices
diff --git a/include/system/ramblock.h b/include/system/ramblock.h
index d8a116ba99..1bab9e2dac 100644
--- a/include/system/ramblock.h
+++ b/include/system/ramblock.h
@@ -22,6 +22,10 @@
 #include "exec/cpu-common.h"
 #include "qemu/rcu.h"
 #include "exec/ramlist.h"
+#include "system/hostmem.h"
+
+#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes"
+OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES)
 
 struct RAMBlock {
     struct rcu_head rcu;
@@ -91,4 +95,21 @@ struct RAMBlock {
     ram_addr_t postcopy_length;
 };
 
+struct RamBlockAttributes {
+    Object parent;
+
+    RAMBlock *ram_block;
+
+    /* 1-setting of the bitmap represents ram is populated (shared) */
+    unsigned bitmap_size;
+    unsigned long *bitmap;
+
+    QLIST_HEAD(, RamDiscardListener) rdl_list;
+};
+
+RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block);
+void ram_block_attributes_destroy(RamBlockAttributes *attr);
+int ram_block_attributes_state_change(RamBlockAttributes *attr, uint64_t offset,
+                                      uint64_t size, bool to_discard);
+
 #endif
diff --git a/system/meson.build b/system/meson.build
index c2f0082766..2747dbde80 100644
--- a/system/meson.build
+++ b/system/meson.build
@@ -17,6 +17,7 @@ libsystem_ss.add(files(
   'dma-helpers.c',
   'globals.c',
   'ioport.c',
+  'ram-block-attributes.c',
   'memory_mapping.c',
   'memory.c',
   'physmem.c',
diff --git a/system/ram-block-attributes.c b/system/ram-block-attributes.c
new file mode 100644
index 0000000000..514252413f
--- /dev/null
+++ b/system/ram-block-attributes.c
@@ -0,0 +1,480 @@
+/*
+ * QEMU ram block attributes
+ *
+ * Copyright Intel
+ *
+ * Author:
+ *      Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "system/ramblock.h"
+#include "trace.h"
+
+OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(RamBlockAttributes,
+                                          ram_block_attributes,
+                                          RAM_BLOCK_ATTRIBUTES,
+                                          OBJECT,
+                                          { TYPE_RAM_DISCARD_MANAGER },
+                                          { })
+
+static size_t
+ram_block_attributes_get_block_size(const RamBlockAttributes *attr)
+{
+    /*
+     * Page conversions are requested at a granularity of at least 4K and
+     * must be 4K aligned, so use the host page size as the granularity to
+     * track the memory attributes.
+     */
+    g_assert(attr && attr->ram_block);
+    g_assert(attr->ram_block->page_size == qemu_real_host_page_size());
+    return attr->ram_block->page_size;
+}
+
+
+static bool
+ram_block_attributes_rdm_is_populated(const RamDiscardManager *rdm,
+                                      const MemoryRegionSection *section)
+{
+    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
+    const size_t block_size = ram_block_attributes_get_block_size(attr);
+    const uint64_t first_bit = section->offset_within_region / block_size;
+    const uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
+    unsigned long first_discarded_bit;
+
+    first_discarded_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
+                                           first_bit);
+    return first_discarded_bit > last_bit;
+}
+
+typedef int (*ram_block_attributes_section_cb)(MemoryRegionSection *s,
+                                               void *arg);
+
+static int
+ram_block_attributes_notify_populate_cb(MemoryRegionSection *section,
+                                        void *arg)
+{
+    RamDiscardListener *rdl = arg;
+
+    return rdl->notify_populate(rdl, section);
+}
+
+static int
+ram_block_attributes_notify_discard_cb(MemoryRegionSection *section,
+                                       void *arg)
+{
+    RamDiscardListener *rdl = arg;
+
+    rdl->notify_discard(rdl, section);
+    return 0;
+}
+
+static int
+ram_block_attributes_for_each_populated_section(const RamBlockAttributes *attr,
+                                                MemoryRegionSection *section,
+                                                void *arg,
+                                                ram_block_attributes_section_cb cb)
+{
+    unsigned long first_bit, last_bit;
+    uint64_t offset, size;
+    const size_t block_size = ram_block_attributes_get_block_size(attr);
+    int ret = 0;
+
+    first_bit = section->offset_within_region / block_size;
+    first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
+                              first_bit);
+
+    while (first_bit < attr->bitmap_size) {
+        MemoryRegionSection tmp = *section;
+
+        offset = first_bit * block_size;
+        last_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
+                                      first_bit + 1) - 1;
+        size = (last_bit - first_bit + 1) * block_size;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            break;
+        }
+
+        ret = cb(&tmp, arg);
+        if (ret) {
+            error_report("%s: Failed to notify RAM discard listener: %s",
+                         __func__, strerror(-ret));
+            break;
+        }
+
+        first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
+                                  last_bit + 2);
+    }
+
+    return ret;
+}
+
+static int
+ram_block_attributes_for_each_discarded_section(const RamBlockAttributes *attr,
+                                                MemoryRegionSection *section,
+                                                void *arg,
+                                                ram_block_attributes_section_cb cb)
+{
+    unsigned long first_bit, last_bit;
+    uint64_t offset, size;
+    const size_t block_size = ram_block_attributes_get_block_size(attr);
+    int ret = 0;
+
+    first_bit = section->offset_within_region / block_size;
+    first_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
+                                   first_bit);
+
+    while (first_bit < attr->bitmap_size) {
+        MemoryRegionSection tmp = *section;
+
+        offset = first_bit * block_size;
+        last_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
+                                 first_bit + 1) - 1;
+        size = (last_bit - first_bit + 1) * block_size;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            break;
+        }
+
+        ret = cb(&tmp, arg);
+        if (ret) {
+            error_report("%s: Failed to notify RAM discard listener: %s",
+                         __func__, strerror(-ret));
+            break;
+        }
+
+        first_bit = find_next_zero_bit(attr->bitmap,
+                                       attr->bitmap_size,
+                                       last_bit + 2);
+    }
+
+    return ret;
+}
+
+static uint64_t
+ram_block_attributes_rdm_get_min_granularity(const RamDiscardManager *rdm,
+                                             const MemoryRegion *mr)
+{
+    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
+
+    g_assert(mr == attr->ram_block->mr);
+    return ram_block_attributes_get_block_size(attr);
+}
+
+static void
+ram_block_attributes_rdm_register_listener(RamDiscardManager *rdm,
+                                           RamDiscardListener *rdl,
+                                           MemoryRegionSection *section)
+{
+    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
+    int ret;
+
+    g_assert(section->mr == attr->ram_block->mr);
+    rdl->section = memory_region_section_new_copy(section);
+
+    QLIST_INSERT_HEAD(&attr->rdl_list, rdl, next);
+
+    ret = ram_block_attributes_for_each_populated_section(attr, section, rdl,
+                                    ram_block_attributes_notify_populate_cb);
+    if (ret) {
+        error_report("%s: Failed to register RAM discard listener: %s",
+                     __func__, strerror(-ret));
+        exit(1);
+    }
+}
+
+static void
+ram_block_attributes_rdm_unregister_listener(RamDiscardManager *rdm,
+                                             RamDiscardListener *rdl)
+{
+    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
+    int ret;
+
+    g_assert(rdl->section);
+    g_assert(rdl->section->mr == attr->ram_block->mr);
+
+    if (rdl->double_discard_supported) {
+        rdl->notify_discard(rdl, rdl->section);
+    } else {
+        ret = ram_block_attributes_for_each_populated_section(attr,
+                rdl->section, rdl, ram_block_attributes_notify_discard_cb);
+        if (ret) {
+            error_report("%s: Failed to unregister RAM discard listener: %s",
+                         __func__, strerror(-ret));
+            exit(1);
+        }
+    }
+
+    memory_region_section_free_copy(rdl->section);
+    rdl->section = NULL;
+    QLIST_REMOVE(rdl, next);
+}
+
+typedef struct RamBlockAttributesReplayData {
+    ReplayRamDiscardState fn;
+    void *opaque;
+} RamBlockAttributesReplayData;
+
+static int ram_block_attributes_rdm_replay_cb(MemoryRegionSection *section,
+                                              void *arg)
+{
+    RamBlockAttributesReplayData *data = arg;
+
+    return data->fn(section, data->opaque);
+}
+
+static int
+ram_block_attributes_rdm_replay_populated(const RamDiscardManager *rdm,
+                                          MemoryRegionSection *section,
+                                          ReplayRamDiscardState replay_fn,
+                                          void *opaque)
+{
+    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
+    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == attr->ram_block->mr);
+    return ram_block_attributes_for_each_populated_section(attr, section, &data,
+                                            ram_block_attributes_rdm_replay_cb);
+}
+
+static int
+ram_block_attributes_rdm_replay_discarded(const RamDiscardManager *rdm,
+                                          MemoryRegionSection *section,
+                                          ReplayRamDiscardState replay_fn,
+                                          void *opaque)
+{
+    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
+    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == attr->ram_block->mr);
+    return ram_block_attributes_for_each_discarded_section(attr, section, &data,
+                                            ram_block_attributes_rdm_replay_cb);
+}
+
+static bool
+ram_block_attributes_is_valid_range(RamBlockAttributes *attr, uint64_t offset,
+                                    uint64_t size)
+{
+    MemoryRegion *mr = attr->ram_block->mr;
+
+    g_assert(mr);
+
+    uint64_t region_size = memory_region_size(mr);
+    const size_t block_size = ram_block_attributes_get_block_size(attr);
+
+    if (!QEMU_IS_ALIGNED(offset, block_size) ||
+        !QEMU_IS_ALIGNED(size, block_size)) {
+        return false;
+    }
+    if (offset + size <= offset) {
+        return false;
+    }
+    if (offset + size > region_size) {
+        return false;
+    }
+    return true;
+}
+
+static void ram_block_attributes_notify_discard(RamBlockAttributes *attr,
+                                                uint64_t offset,
+                                                uint64_t size)
+{
+    RamDiscardListener *rdl;
+
+    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
+        MemoryRegionSection tmp = *rdl->section;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            continue;
+        }
+        rdl->notify_discard(rdl, &tmp);
+    }
+}
+
+static int
+ram_block_attributes_notify_populate(RamBlockAttributes *attr,
+                                     uint64_t offset, uint64_t size)
+{
+    RamDiscardListener *rdl;
+    int ret = 0;
+
+    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
+        MemoryRegionSection tmp = *rdl->section;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            continue;
+        }
+        ret = rdl->notify_populate(rdl, &tmp);
+        if (ret) {
+            break;
+        }
+    }
+
+    return ret;
+}
+
+static bool ram_block_attributes_is_range_populated(RamBlockAttributes *attr,
+                                                    uint64_t offset,
+                                                    uint64_t size)
+{
+    const size_t block_size = ram_block_attributes_get_block_size(attr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long last_bit = first_bit + (size / block_size) - 1;
+    unsigned long found_bit;
+
+    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
+                                   first_bit);
+    return found_bit > last_bit;
+}
+
+static bool
+ram_block_attributes_is_range_discarded(RamBlockAttributes *attr,
+                                        uint64_t offset, uint64_t size)
+{
+    const size_t block_size = ram_block_attributes_get_block_size(attr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long last_bit = first_bit + (size / block_size) - 1;
+    unsigned long found_bit;
+
+    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
+    return found_bit > last_bit;
+}
+
+int ram_block_attributes_state_change(RamBlockAttributes *attr,
+                                      uint64_t offset, uint64_t size,
+                                      bool to_discard)
+{
+    const size_t block_size = ram_block_attributes_get_block_size(attr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long nbits = size / block_size;
+    bool is_range_discarded, is_range_populated;
+    const uint64_t end = offset + size;
+    unsigned long bit;
+    uint64_t cur;
+    int ret = 0;
+
+    if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
+        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
+                     __func__, offset, size);
+        return -EINVAL;
+    }
+
+    is_range_discarded = ram_block_attributes_is_range_discarded(attr, offset,
+                                                                 size);
+    is_range_populated = ram_block_attributes_is_range_populated(attr, offset,
+                                                                 size);
+
+    trace_ram_block_attributes_state_change(offset, size,
+                                            is_range_discarded ? "discarded" :
+                                            is_range_populated ? "populated" :
+                                            "mixture",
+                                            to_discard ? "discarded" :
+                                            "populated");
+    if (to_discard) {
+        if (is_range_discarded) {
+            /* Already private */
+        } else if (is_range_populated) {
+            /* Completely shared */
+            bitmap_clear(attr->bitmap, first_bit, nbits);
+            ram_block_attributes_notify_discard(attr, offset, size);
+        } else {
+            /* Unexpected mixture: process individual blocks */
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (!test_bit(bit, attr->bitmap)) {
+                    continue;
+                }
+                clear_bit(bit, attr->bitmap);
+                ram_block_attributes_notify_discard(attr, cur, block_size);
+            }
+        }
+    } else {
+        if (is_range_populated) {
+            /* Already shared */
+        } else if (is_range_discarded) {
+            /* Complete private */
+            bitmap_set(attr->bitmap, first_bit, nbits);
+            ret = ram_block_attributes_notify_populate(attr, offset, size);
+        } else {
+            /* Unexpected mixture: process individual blocks */
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (test_bit(bit, attr->bitmap)) {
+                    continue;
+                }
+                set_bit(bit, attr->bitmap);
+                ret = ram_block_attributes_notify_populate(attr, cur,
+                                                           block_size);
+                if (ret) {
+                    break;
+                }
+            }
+        }
+    }
+
+    return ret;
+}
+
+RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block)
+{
+    uint64_t bitmap_size;
+    const int block_size  = qemu_real_host_page_size();
+    RamBlockAttributes *attr;
+    int ret;
+    MemoryRegion *mr = ram_block->mr;
+
+    attr = RAM_BLOCK_ATTRIBUTES(object_new(TYPE_RAM_BLOCK_ATTRIBUTES));
+
+    attr->ram_block = ram_block;
+    ret = memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(attr));
+    if (ret) {
+        object_unref(OBJECT(attr));
+        return NULL;
+    }
+    bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
+    attr->bitmap_size = bitmap_size;
+    attr->bitmap = bitmap_new(bitmap_size);
+
+    return attr;
+}
+
+void ram_block_attributes_destroy(RamBlockAttributes *attr)
+{
+    if (!attr) {
+        return;
+    }
+
+    g_free(attr->bitmap);
+    memory_region_set_ram_discard_manager(attr->ram_block->mr, NULL);
+    object_unref(OBJECT(attr));
+}
+
+static void ram_block_attributes_init(Object *obj)
+{
+    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(obj);
+
+    QLIST_INIT(&attr->rdl_list);
+}
+
+static void ram_block_attributes_finalize(Object *obj)
+{
+}
+
+static void ram_block_attributes_class_init(ObjectClass *klass,
+                                            const void *data)
+{
+    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(klass);
+
+    rdmc->get_min_granularity = ram_block_attributes_rdm_get_min_granularity;
+    rdmc->register_listener = ram_block_attributes_rdm_register_listener;
+    rdmc->unregister_listener = ram_block_attributes_rdm_unregister_listener;
+    rdmc->is_populated = ram_block_attributes_rdm_is_populated;
+    rdmc->replay_populated = ram_block_attributes_rdm_replay_populated;
+    rdmc->replay_discarded = ram_block_attributes_rdm_replay_discarded;
+}
diff --git a/system/trace-events b/system/trace-events
index be12ebfb41..82856e44f2 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -52,3 +52,6 @@ dirtylimit_state_finalize(void)
 dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
 dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
 dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
+
+# ram-block-attributes.c
+ram_block_attributes_state_change(uint64_t offset, uint64_t size, const char *from, const char *to) "offset 0x%"PRIx64" size 0x%"PRIx64" from '%s' to '%s'"
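
For reference (not part of the patch), with this tracepoint enabled, a
conversion of one fully shared 4K page to private would emit a line shaped
like:

    ram_block_attributes_state_change offset 0x1000 size 0x1000 from 'populated' to 'discarded'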
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v6 5/5] physmem: Support coordinated discarding of RAM with guest_memfd
  2025-05-30  8:32 [PATCH v6 0/5] Enable shared device assignment Chenyi Qiang
                   ` (3 preceding siblings ...)
  2025-05-30  8:32 ` [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd Chenyi Qiang
@ 2025-05-30  8:32 ` Chenyi Qiang
  2025-06-04 11:04   ` Alexey Kardashevskiy
  2025-06-04 11:29   ` David Hildenbrand
  2025-06-09  3:23 ` [PATCH v6 0/5] Enable shared device assignment Chenyi Qiang
  5 siblings, 2 replies; 30+ messages in thread
From: Chenyi Qiang @ 2025-05-30  8:32 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao, Cédric Le Goater,
	Alex Williamson

A new field, attributes, is introduced in RAMBlock to link to a
RamBlockAttributes object, which centralizes all guest_memfd related
information (such as the fd and the shared bitmap) within a RAMBlock.

Create and initialize the RamBlockAttributes object upon ram_block_add().
Meanwhile, register the object in the target RAMBlock's MemoryRegion.
After that, a guest_memfd-backed RAMBlock is associated with the
RamDiscardManager interface, and users can perform RamDiscardManager
specific handling. For example, VFIO will register a RamDiscardListener
and receive notifications when the state_change() helper is invoked.

As coordinated discarding of RAM with guest_memfd is now supported, block
only uncoordinated discard.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v6:
    - Squash the unblocking of coordinated discard into this commit.
    - Remove the checks in migration path.

Changes in v5:
    - Revert to use RamDiscardManager interface.
    - Move the object_new() into the ram_block_attribute_create()
      helper.
    - Add some check in migration path.

Changes in v4:
    - Remove the replay operations for attribute changes which will be
      handled in a listener in following patches.
    - Add some comment in the error path of realize() to remind the
      future development of the unified error path.

Changes in v3:
    - Use ram_discard_manager_replay_populated/discarded() to set the
      memory attribute and add undo support if state_change()
      fails.
    - Didn't add Reviewed-by from Alexey due to the new changes in this
      commit.
---
 accel/kvm/kvm-all.c       |  9 +++++++++
 include/system/ramblock.h |  1 +
 system/physmem.c          | 18 ++++++++++++++++--
 3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 51526d301b..3b390bbb09 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3089,6 +3089,15 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
     addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
     rb = qemu_ram_block_from_host(addr, false, &offset);
 
+    ret = ram_block_attributes_state_change(RAM_BLOCK_ATTRIBUTES(mr->rdm),
+                                            offset, size, to_private);
+    if (ret) {
+        error_report("Failed to notify the listener the state change of "
+                     "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
+                     start, size, to_private ? "private" : "shared");
+        goto out_unref;
+    }
+
     if (to_private) {
         if (rb->page_size != qemu_real_host_page_size()) {
             /*
diff --git a/include/system/ramblock.h b/include/system/ramblock.h
index 1bab9e2dac..87e847e184 100644
--- a/include/system/ramblock.h
+++ b/include/system/ramblock.h
@@ -46,6 +46,7 @@ struct RAMBlock {
     int fd;
     uint64_t fd_offset;
     int guest_memfd;
+    RamBlockAttributes *attributes;
     size_t page_size;
     /* dirty bitmap used during migration */
     unsigned long *bmap;
diff --git a/system/physmem.c b/system/physmem.c
index a8a9ca309e..1f1217fa0a 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1916,7 +1916,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         }
         assert(new_block->guest_memfd < 0);
 
-        ret = ram_block_discard_require(true);
+        ret = ram_block_coordinated_discard_require(true);
         if (ret < 0) {
             error_setg_errno(errp, -ret,
                              "cannot set up private guest memory: discard currently blocked");
@@ -1931,6 +1931,19 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
             goto out_free;
         }
 
+        new_block->attributes = ram_block_attributes_create(new_block);
+        if (!new_block->attributes) {
+            error_setg(errp, "Failed to create ram block attribute");
+            /*
+             * The error path could be unified if the rest of ram_block_add()
+             * ever develops a need to check for errors.
+             */
+            close(new_block->guest_memfd);
+            ram_block_coordinated_discard_require(false);
+            qemu_mutex_unlock_ramlist();
+            goto out_free;
+        }
+
         /*
          * Add a specific guest_memfd blocker if a generic one would not be
          * added by ram_block_add_cpr_blocker.
@@ -2287,8 +2300,9 @@ static void reclaim_ramblock(RAMBlock *block)
     }
 
     if (block->guest_memfd >= 0) {
+        ram_block_attributes_destroy(block->attributes);
         close(block->guest_memfd);
-        ram_block_discard_require(false);
+        ram_block_coordinated_discard_require(false);
     }
 
     g_free(block);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 1/5] memory: Export a helper to get intersection of a MemoryRegionSection with a given range
  2025-05-30  8:32 ` [PATCH v6 1/5] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
@ 2025-06-01  9:19   ` Gupta, Pankaj
  2025-06-01 15:44   ` Xiaoyao Li
  1 sibling, 0 replies; 30+ messages in thread
From: Gupta, Pankaj @ 2025-06-01  9:19 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson


> Rename the helper to memory_region_section_intersect_range() to make it
> more generic. Meanwhile, define @end as an Int128 and switch the related
> operations to the int128_*() helpers, since the helper is now exported
> as a wider API.
> 
> Suggested-by: Alexey Kardashevskiy <aik@amd.com>
> Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>

> ---
> Changes in v6:
>      - No change.
> 
> Changes in v5:
>      - Indent change for int128 ops to avoid the line over 80
>      - Add two Review-by from Alexey and Zhao
> 
> Changes in v4:
>      - No change.
> ---
>   hw/virtio/virtio-mem.c  | 32 +++++---------------------------
>   include/system/memory.h | 30 ++++++++++++++++++++++++++++++
>   2 files changed, 35 insertions(+), 27 deletions(-)
> 
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index a3d1a676e7..b3c126ea1e 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -244,28 +244,6 @@ static int virtio_mem_for_each_plugged_range(VirtIOMEM *vmem, void *arg,
>       return ret;
>   }
>   
> -/*
> - * Adjust the memory section to cover the intersection with the given range.
> - *
> - * Returns false if the intersection is empty, otherwise returns true.
> - */
> -static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
> -                                                uint64_t offset, uint64_t size)
> -{
> -    uint64_t start = MAX(s->offset_within_region, offset);
> -    uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
> -                       offset + size);
> -
> -    if (end <= start) {
> -        return false;
> -    }
> -
> -    s->offset_within_address_space += start - s->offset_within_region;
> -    s->offset_within_region = start;
> -    s->size = int128_make64(end - start);
> -    return true;
> -}
> -
>   typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void *arg);
>   
>   static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
> @@ -287,7 +265,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
>                                         first_bit + 1) - 1;
>           size = (last_bit - first_bit + 1) * vmem->block_size;
>   
> -        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>               break;
>           }
>           ret = cb(&tmp, arg);
> @@ -319,7 +297,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
>                                    first_bit + 1) - 1;
>           size = (last_bit - first_bit + 1) * vmem->block_size;
>   
> -        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>               break;
>           }
>           ret = cb(&tmp, arg);
> @@ -355,7 +333,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
>       QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
>           MemoryRegionSection tmp = *rdl->section;
>   
> -        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>               continue;
>           }
>           rdl->notify_discard(rdl, &tmp);
> @@ -371,7 +349,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
>       QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
>           MemoryRegionSection tmp = *rdl->section;
>   
> -        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>               continue;
>           }
>           ret = rdl->notify_populate(rdl, &tmp);
> @@ -388,7 +366,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
>               if (rdl2 == rdl) {
>                   break;
>               }
> -            if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
> +            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>                   continue;
>               }
>               rdl2->notify_discard(rdl2, &tmp);
> diff --git a/include/system/memory.h b/include/system/memory.h
> index fbbf4cf911..b961c4076a 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -1211,6 +1211,36 @@ MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s);
>    */
>   void memory_region_section_free_copy(MemoryRegionSection *s);
>   
> +/**
> + * memory_region_section_intersect_range: Adjust the memory section to cover
> + * the intersection with the given range.
> + *
> + * @s: the #MemoryRegionSection to be adjusted
> + * @offset: the offset of the given range in the memory region
> + * @size: the size of the given range
> + *
> + * Returns false if the intersection is empty, otherwise returns true.
> + */
> +static inline bool memory_region_section_intersect_range(MemoryRegionSection *s,
> +                                                         uint64_t offset,
> +                                                         uint64_t size)
> +{
> +    uint64_t start = MAX(s->offset_within_region, offset);
> +    Int128 end = int128_min(int128_add(int128_make64(s->offset_within_region),
> +                                       s->size),
> +                            int128_add(int128_make64(offset),
> +                                       int128_make64(size)));
> +
> +    if (int128_le(end, int128_make64(start))) {
> +        return false;
> +    }
> +
> +    s->offset_within_address_space += start - s->offset_within_region;
> +    s->offset_within_region = start;
> +    s->size = int128_sub(end, int128_make64(start));
> +    return true;
> +}
> +
>   /**
>    * memory_region_init: Initialize a memory region
>    *
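
As a worked example (not from the patch): a section covering
[0x0, 0x4000) of its region, mapped at 0x100000 in the address space,
intersected with the range [0x1000, 0x2000), is narrowed as follows:

    MemoryRegionSection s = {
        .offset_within_region = 0x0,
        .offset_within_address_space = 0x100000,
        .size = int128_make64(0x4000),
    };

    memory_region_section_intersect_range(&s, 0x1000, 0x1000);
    /*
     * Returns true; s is now the intersection:
     *   .offset_within_region        == 0x1000
     *   .offset_within_address_space == 0x101000
     *   .size                        == 0x1000
     */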


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result
  2025-05-30  8:32 ` [PATCH v6 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
@ 2025-06-01  9:21   ` Gupta, Pankaj
  2025-06-01 15:53   ` Xiaoyao Li
  2025-06-04 14:37   ` Alexey Kardashevskiy
  2 siblings, 0 replies; 30+ messages in thread
From: Gupta, Pankaj @ 2025-06-01  9:21 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson


> Modify memory_region_set_ram_discard_manager() to return -EBUSY if a
> RamDiscardManager is already set in the MemoryRegion. The caller must
> handle this failure, such as having virtio-mem undo its actions and fail
> the realize() process. Opportunistically move the call earlier to avoid
> complex error handling.
> 
> This change is beneficial when introducing a new RamDiscardManager
> instance besides virtio-mem. After
> ram_block_coordinated_discard_require(true) unlocks all
> RamDiscardManager instances, only one instance is allowed to be set for
> one MemoryRegion at present.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>

> ---
> Changes in v6:
>      - Add Reviewed-by from David.
> 
> Changes in v5:
>      - Nit in commit message (return false -> -EBUSY)
>      - Add set_ram_discard_manager(NULL) when ram_block_discard_range()
>        fails.
> 
> Changes in v3:
>      - Move set_ram_discard_manager() up to avoid a g_free()
>      - Clean up set_ram_discard_manager() definition
> ---
>   hw/virtio/virtio-mem.c  | 30 +++++++++++++++++-------------
>   include/system/memory.h |  6 +++---
>   system/memory.c         | 10 +++++++---
>   3 files changed, 27 insertions(+), 19 deletions(-)
> 
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index b3c126ea1e..2e491e8c44 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -1047,6 +1047,17 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>           return;
>       }
>   
> +    /*
> +     * Set ourselves as RamDiscardManager before the plug handler maps the
> +     * memory region and exposes it via an address space.
> +     */
> +    if (memory_region_set_ram_discard_manager(&vmem->memdev->mr,
> +                                              RAM_DISCARD_MANAGER(vmem))) {
> +        error_setg(errp, "Failed to set RamDiscardManager");
> +        ram_block_coordinated_discard_require(false);
> +        return;
> +    }
> +
>       /*
>        * We don't know at this point whether shared RAM is migrated using
>        * QEMU or migrated using the file content. "x-ignore-shared" will be
> @@ -1061,6 +1072,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>           ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
>           if (ret) {
>               error_setg_errno(errp, -ret, "Unexpected error discarding RAM");
> +            memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
>               ram_block_coordinated_discard_require(false);
>               return;
>           }
> @@ -1122,13 +1134,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>       vmem->system_reset = VIRTIO_MEM_SYSTEM_RESET(obj);
>       vmem->system_reset->vmem = vmem;
>       qemu_register_resettable(obj);
> -
> -    /*
> -     * Set ourselves as RamDiscardManager before the plug handler maps the
> -     * memory region and exposes it via an address space.
> -     */
> -    memory_region_set_ram_discard_manager(&vmem->memdev->mr,
> -                                          RAM_DISCARD_MANAGER(vmem));
>   }
>   
>   static void virtio_mem_device_unrealize(DeviceState *dev)
> @@ -1136,12 +1141,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
>       VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>       VirtIOMEM *vmem = VIRTIO_MEM(dev);
>   
> -    /*
> -     * The unplug handler unmapped the memory region, it cannot be
> -     * found via an address space anymore. Unset ourselves.
> -     */
> -    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
> -
>       qemu_unregister_resettable(OBJECT(vmem->system_reset));
>       object_unref(OBJECT(vmem->system_reset));
>   
> @@ -1154,6 +1153,11 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
>       virtio_del_queue(vdev, 0);
>       virtio_cleanup(vdev);
>       g_free(vmem->bitmap);
> +    /*
> +     * The unplug handler unmapped the memory region, it cannot be
> +     * found via an address space anymore. Unset ourselves.
> +     */
> +    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
>       ram_block_coordinated_discard_require(false);
>   }
>   
> diff --git a/include/system/memory.h b/include/system/memory.h
> index b961c4076a..896948deb1 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2499,13 +2499,13 @@ static inline bool memory_region_has_ram_discard_manager(MemoryRegion *mr)
>    *
>    * This function must not be called for a mapped #MemoryRegion, a #MemoryRegion
>    * that does not cover RAM, or a #MemoryRegion that already has a
> - * #RamDiscardManager assigned.
> + * #RamDiscardManager assigned. Return 0 if the rdm is set successfully.
>    *
>    * @mr: the #MemoryRegion
>    * @rdm: #RamDiscardManager to set
>    */
> -void memory_region_set_ram_discard_manager(MemoryRegion *mr,
> -                                           RamDiscardManager *rdm);
> +int memory_region_set_ram_discard_manager(MemoryRegion *mr,
> +                                          RamDiscardManager *rdm);
>   
>   /**
>    * memory_region_find: translate an address/size relative to a
> diff --git a/system/memory.c b/system/memory.c
> index 63b983efcd..b45b508dce 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2106,12 +2106,16 @@ RamDiscardManager *memory_region_get_ram_discard_manager(MemoryRegion *mr)
>       return mr->rdm;
>   }
>   
> -void memory_region_set_ram_discard_manager(MemoryRegion *mr,
> -                                           RamDiscardManager *rdm)
> +int memory_region_set_ram_discard_manager(MemoryRegion *mr,
> +                                          RamDiscardManager *rdm)
>   {
>       g_assert(memory_region_is_ram(mr));
> -    g_assert(!rdm || !mr->rdm);
> +    if (mr->rdm && rdm) {
> +        return -EBUSY;
> +    }
> +
>       mr->rdm = rdm;
> +    return 0;
>   }
>   
>   uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-05-30  8:32 ` [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd Chenyi Qiang
@ 2025-06-01  9:58   ` Gupta, Pankaj
  2025-06-03  1:26     ` Chenyi Qiang
  2025-06-02 21:10   ` David Hildenbrand
  2025-06-04 11:04   ` Alexey Kardashevskiy
  2 siblings, 1 reply; 30+ messages in thread
From: Gupta, Pankaj @ 2025-06-01  9:58 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson

On 5/30/2025 10:32 AM, Chenyi Qiang wrote:
> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> discard") highlighted that subsystems like VFIO may disable RAM block
> discard. However, guest_memfd relies on discard operations for page
> conversion between private and shared memory, potentially leading to
> the stale IOMMU mapping issue when assigning hardware devices to
> confidential VMs via shared memory. To address this and allow shared
> device assignment, it is crucial to ensure the VFIO system refreshes
> its IOMMU mappings.
> 
> RamDiscardManager is an existing interface (used by virtio-mem) to
> adjust VFIO mappings in relation to VM page assignment. Effectively page
> conversion is similar to hot-removing a page in one mode and adding it
> back in the other. Therefore, similar actions are required for page
> conversion events. Introduce the RamDiscardManager to guest_memfd to
> facilitate this process.
> 
> Since guest_memfd is not an object, it cannot directly implement the
> RamDiscardManager interface. Implementing it in HostMemoryBackend is
> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
> have a memory backend while others do not. Notably, virtual BIOS
> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
> backend.
> 
> To manage RAMBlocks with guest_memfd, define a new object named
> RamBlockAttributes to implement the RamDiscardManager interface. This
> object can store the guest_memfd information such as bitmap for shared
> memory and the registered listeners for event notification. In the
> context of RamDiscardManager, shared state is analogous to populated, and
> private state is signified as discarded. To notify the conversion events,
> a new state_change() helper is exported for the users to notify the
> listeners like VFIO, so that VFIO can dynamically DMA map/unmap the
> shared mapping.
> 
> Note that the memory state is tracked at the host page size granularity,
> as the minimum conversion size can be one page per request and VFIO
> expects the DMA mapping for a specific iova to be mapped and unmapped
> with the same granularity. Confidential VMs may perform partial
> conversions, such as conversions on small regions within larger ones.
> To prevent such invalid cases and until DMA mapping cut operation
> support is available, all operations are performed with 4K granularity.
> 
> In addition, memory conversion failures cause QEMU to quit instead of
> resuming the guest or retrying the operation at present. It would be
> future work to add more error handling or rollback mechanisms once
> conversion failures are allowed. For example, in-place conversion of
> guest_memfd could retry the unmap operation during the conversion from
> shared to private. For now, keep the complex error handling out of the
> picture as it is not required.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v6:
>      - Change the object type name from RamBlockAttribute to
>        RamBlockAttributes. (David)
>      - Save the associated RAMBlock instead of the MemoryRegion in
>        RamBlockAttributes. (David)
>      - Squash the state_change() helper introduction into this commit, as
>        well as the mixture conversion case handling. (David)
>      - Change the block_size type from int to size_t and some cleanup in
>        validation check. (Alexey)
>      - Add a tracepoint to track the state changes. (Alexey)
> 
> Changes in v5:
>      - Revert to use RamDiscardManager interface instead of introducing
>        new hierarchy of class to manage private/shared state, and keep
>        using the new name of RamBlockAttribute compared with the
>        MemoryAttributeManager in v3.
>      - Use *simple* version of object_define and object_declare since the
>        state_change() function is changed as an exported function instead
>        of a virtual function in later patch.
>      - Move the introduction of RamBlockAttribute field to this patch and
>        rename it to ram_shared. (Alexey)
>      - Call exit() when register/unregister fails. (Zhao)
>      - Add the ram-block-attribute.c to Memory API related part in
>        MAINTAINERS.
> 
> Changes in v4:
>      - Change the name from memory-attribute-manager to
>        ram-block-attribute.
>      - Implement the newly-introduced PrivateSharedManager instead of
>        RamDiscardManager and change related commit message.
>      - Define the new object in ramblock.h instead of adding a new file.
> ---
>   MAINTAINERS                   |   1 +
>   include/system/ramblock.h     |  21 ++
>   system/meson.build            |   1 +
>   system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
>   system/trace-events           |   3 +
>   5 files changed, 506 insertions(+)
>   create mode 100644 system/ram-block-attributes.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 6dacd6d004..8ec39aa7f8 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3149,6 +3149,7 @@ F: system/memory.c
>   F: system/memory_mapping.c
>   F: system/physmem.c
>   F: system/memory-internal.h
> +F: system/ram-block-attributes.c
>   F: scripts/coccinelle/memory-region-housekeeping.cocci
>   
>   Memory devices
> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
> index d8a116ba99..1bab9e2dac 100644
> --- a/include/system/ramblock.h
> +++ b/include/system/ramblock.h
> @@ -22,6 +22,10 @@
>   #include "exec/cpu-common.h"
>   #include "qemu/rcu.h"
>   #include "exec/ramlist.h"
> +#include "system/hostmem.h"
> +
> +#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes"
> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES)
>   
>   struct RAMBlock {
>       struct rcu_head rcu;
> @@ -91,4 +95,21 @@ struct RAMBlock {
>       ram_addr_t postcopy_length;
>   };
>   
> +struct RamBlockAttributes {
> +    Object parent;
> +
> +    RAMBlock *ram_block;
> +
> +    /* A set bit in the bitmap means the RAM is populated (shared) */
> +    unsigned bitmap_size;
> +    unsigned long *bitmap;
> +
> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
> +};
> +
> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block);
> +void ram_block_attributes_destroy(RamBlockAttributes *attr);
> +int ram_block_attributes_state_change(RamBlockAttributes *attr, uint64_t offset,
> +                                      uint64_t size, bool to_discard);
> +
>   #endif
> diff --git a/system/meson.build b/system/meson.build
> index c2f0082766..2747dbde80 100644
> --- a/system/meson.build
> +++ b/system/meson.build
> @@ -17,6 +17,7 @@ libsystem_ss.add(files(
>     'dma-helpers.c',
>     'globals.c',
>     'ioport.c',
> +  'ram-block-attributes.c',
>     'memory_mapping.c',
>     'memory.c',
>     'physmem.c',
> diff --git a/system/ram-block-attributes.c b/system/ram-block-attributes.c
> new file mode 100644
> index 0000000000..514252413f
> --- /dev/null
> +++ b/system/ram-block-attributes.c
> @@ -0,0 +1,480 @@
> +/*
> + * QEMU ram block attributes
> + *
> + * Copyright Intel
> + *
> + * Author:
> + *      Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "system/ramblock.h"
> +#include "trace.h"
> +
> +OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(RamBlockAttributes,
> +                                          ram_block_attributes,
> +                                          RAM_BLOCK_ATTRIBUTES,
> +                                          OBJECT,
> +                                          { TYPE_RAM_DISCARD_MANAGER },
> +                                          { })
> +
> +static size_t
> +ram_block_attributes_get_block_size(const RamBlockAttributes *attr)
> +{
> +    /*
> +     * Because page conversions are performed at a granularity of at least
> +     * 4K (4K-aligned), use the host page size as the granularity to track
> +     * the memory attributes.
> +     */
> +    g_assert(attr && attr->ram_block);
> +    g_assert(attr->ram_block->page_size == qemu_real_host_page_size());
> +    return attr->ram_block->page_size;
> +}
> +
> +
> +static bool
> +ram_block_attributes_rdm_is_populated(const RamDiscardManager *rdm,
> +                                      const MemoryRegionSection *section)
> +{
> +    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +    const uint64_t first_bit = section->offset_within_region / block_size;
> +    const uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
> +    unsigned long first_discarded_bit;
> +
> +    first_discarded_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
> +                                           first_bit);
> +    return first_discarded_bit > last_bit;
> +}
> +
> +typedef int (*ram_block_attributes_section_cb)(MemoryRegionSection *s,
> +                                               void *arg);
> +
> +static int
> +ram_block_attributes_notify_populate_cb(MemoryRegionSection *section,
> +                                        void *arg)
> +{
> +    RamDiscardListener *rdl = arg;
> +
> +    return rdl->notify_populate(rdl, section);
> +}
> +
> +static int
> +ram_block_attributes_notify_discard_cb(MemoryRegionSection *section,
> +                                       void *arg)
> +{
> +    RamDiscardListener *rdl = arg;
> +
> +    rdl->notify_discard(rdl, section);
> +    return 0;
> +}
> +
> +static int
> +ram_block_attributes_for_each_populated_section(const RamBlockAttributes *attr,
> +                                                MemoryRegionSection *section,
> +                                                void *arg,
> +                                                ram_block_attributes_section_cb cb)
> +{
> +    unsigned long first_bit, last_bit;
> +    uint64_t offset, size;
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +    int ret = 0;
> +
> +    first_bit = section->offset_within_region / block_size;
> +    first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
> +                              first_bit);
> +
> +    while (first_bit < attr->bitmap_size) {
> +        MemoryRegionSection tmp = *section;
> +
> +        offset = first_bit * block_size;
> +        last_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
> +                                      first_bit + 1) - 1;
> +        size = (last_bit - first_bit + 1) * block_size;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            break;
> +        }
> +
> +        ret = cb(&tmp, arg);
> +        if (ret) {
> +            error_report("%s: Failed to notify RAM discard listener: %s",
> +                         __func__, strerror(-ret));
> +            break;
> +        }
> +
> +        first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
> +                                  last_bit + 2);
> +    }
> +
> +    return ret;
> +}
> +
> +static int
> +ram_block_attributes_for_each_discarded_section(const RamBlockAttributes *attr,
> +                                                MemoryRegionSection *section,
> +                                                void *arg,
> +                                                ram_block_attributes_section_cb cb)
> +{
> +    unsigned long first_bit, last_bit;
> +    uint64_t offset, size;
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +    int ret = 0;
> +
> +    first_bit = section->offset_within_region / block_size;
> +    first_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
> +                                   first_bit);
> +
> +    while (first_bit < attr->bitmap_size) {
> +        MemoryRegionSection tmp = *section;
> +
> +        offset = first_bit * block_size;
> +        last_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
> +                                 first_bit + 1) - 1;
> +        size = (last_bit - first_bit + 1) * block_size;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            break;
> +        }
> +
> +        ret = cb(&tmp, arg);
> +        if (ret) {
> +            error_report("%s: Failed to notify RAM discard listener: %s",
> +                         __func__, strerror(-ret));
> +            break;
> +        }
> +
> +        first_bit = find_next_zero_bit(attr->bitmap,
> +                                       attr->bitmap_size,
> +                                       last_bit + 2);
> +    }
> +
> +    return ret;
> +}
> +
> +static uint64_t
> +ram_block_attributes_rdm_get_min_granularity(const RamDiscardManager *rdm,
> +                                             const MemoryRegion *mr)
> +{
> +    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
> +
> +    g_assert(mr == attr->ram_block->mr);
> +    return ram_block_attributes_get_block_size(attr);
> +}
> +
> +static void
> +ram_block_attributes_rdm_register_listener(RamDiscardManager *rdm,
> +                                           RamDiscardListener *rdl,
> +                                           MemoryRegionSection *section)
> +{
> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
> +    int ret;
> +
> +    g_assert(section->mr == attr->ram_block->mr);
> +    rdl->section = memory_region_section_new_copy(section);
> +
> +    QLIST_INSERT_HEAD(&attr->rdl_list, rdl, next);
> +
> +    ret = ram_block_attributes_for_each_populated_section(attr, section, rdl,
> +                                    ram_block_attributes_notify_populate_cb);
> +    if (ret) {
> +        error_report("%s: Failed to register RAM discard listener: %s",
> +                     __func__, strerror(-ret));
> +        exit(1);
> +    }
> +}
> +
> +static void
> +ram_block_attributes_rdm_unregister_listener(RamDiscardManager *rdm,
> +                                             RamDiscardListener *rdl)
> +{
> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
> +    int ret;
> +
> +    g_assert(rdl->section);
> +    g_assert(rdl->section->mr == attr->ram_block->mr);
> +
> +    if (rdl->double_discard_supported) {
> +        rdl->notify_discard(rdl, rdl->section);
> +    } else {
> +        ret = ram_block_attributes_for_each_populated_section(attr,
> +                rdl->section, rdl, ram_block_attributes_notify_discard_cb);
> +        if (ret) {
> +            error_report("%s: Failed to unregister RAM discard listener: %s",
> +                         __func__, strerror(-ret));
> +            exit(1);
> +        }
> +    }
> +
> +    memory_region_section_free_copy(rdl->section);
> +    rdl->section = NULL;
> +    QLIST_REMOVE(rdl, next);
> +}
> +
> +typedef struct RamBlockAttributesReplayData {
> +    ReplayRamDiscardState fn;
> +    void *opaque;
> +} RamBlockAttributesReplayData;
> +
> +static int ram_block_attributes_rdm_replay_cb(MemoryRegionSection *section,
> +                                              void *arg)
> +{
> +    RamBlockAttributesReplayData *data = arg;
> +
> +    return data->fn(section, data->opaque);
> +}
> +
> +static int
> +ram_block_attributes_rdm_replay_populated(const RamDiscardManager *rdm,
> +                                          MemoryRegionSection *section,
> +                                          ReplayRamDiscardState replay_fn,
> +                                          void *opaque)
> +{
> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
> +    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
> +
> +    g_assert(section->mr == attr->ram_block->mr);
> +    return ram_block_attributes_for_each_populated_section(attr, section, &data,
> +                                            ram_block_attributes_rdm_replay_cb);
> +}
> +
> +static int
> +ram_block_attributes_rdm_replay_discarded(const RamDiscardManager *rdm,
> +                                          MemoryRegionSection *section,
> +                                          ReplayRamDiscardState replay_fn,
> +                                          void *opaque)
> +{
> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
> +    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
> +
> +    g_assert(section->mr == attr->ram_block->mr);
> +    return ram_block_attributes_for_each_discarded_section(attr, section, &data,
> +                                            ram_block_attributes_rdm_replay_cb);
> +}
> +
> +static bool
> +ram_block_attributes_is_valid_range(RamBlockAttributes *attr, uint64_t offset,
> +                                    uint64_t size)
> +{
> +    MemoryRegion *mr = attr->ram_block->mr;
> +
> +    g_assert(mr);
> +
> +    uint64_t region_size = memory_region_size(mr);
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +
> +    if (!QEMU_IS_ALIGNED(offset, block_size) ||
> +        !QEMU_IS_ALIGNED(size, block_size)) {
> +        return false;
> +    }
> +    if (offset + size <= offset) {
> +        return false;
> +    }
> +    if (offset + size > region_size) {
> +        return false;
> +    }
> +    return true;
> +}
> +
> +static void ram_block_attributes_notify_discard(RamBlockAttributes *attr,
> +                                                uint64_t offset,
> +                                                uint64_t size)
> +{
> +    RamDiscardListener *rdl;
> +
> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
> +        MemoryRegionSection tmp = *rdl->section;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            continue;
> +        }
> +        rdl->notify_discard(rdl, &tmp);
> +    }
> +}
> +
> +static int
> +ram_block_attributes_notify_populate(RamBlockAttributes *attr,
> +                                     uint64_t offset, uint64_t size)
> +{
> +    RamDiscardListener *rdl;
> +    int ret = 0;
> +
> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
> +        MemoryRegionSection tmp = *rdl->section;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            continue;
> +        }
> +        ret = rdl->notify_populate(rdl, &tmp);
> +        if (ret) {
> +            break;
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +static bool ram_block_attributes_is_range_populated(RamBlockAttributes *attr,
> +                                                    uint64_t offset,
> +                                                    uint64_t size)
> +{
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +    const unsigned long first_bit = offset / block_size;
> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
> +    unsigned long found_bit;
> +
> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
> +                                   first_bit);
> +    return found_bit > last_bit;
> +}
> +
> +static bool
> +ram_block_attributes_is_range_discarded(RamBlockAttributes *attr,
> +                                        uint64_t offset, uint64_t size)
> +{
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +    const unsigned long first_bit = offset / block_size;
> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
> +    unsigned long found_bit;
> +
> +    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
> +    return found_bit > last_bit;
> +}
> +
> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
> +                                      uint64_t offset, uint64_t size,
> +                                      bool to_discard)
> +{
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +    const unsigned long first_bit = offset / block_size;
> +    const unsigned long nbits = size / block_size;
> +    bool is_range_discarded, is_range_populated;
> +    const uint64_t end = offset + size;
> +    unsigned long bit;
> +    uint64_t cur;
> +    int ret = 0;
> +
> +    if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
> +                     __func__, offset, size);
> +        return -EINVAL;
> +    }
> +
> +    is_range_discarded = ram_block_attributes_is_range_discarded(attr, offset,
> +                                                                 size);
> +    is_range_populated = ram_block_attributes_is_range_populated(attr, offset,
> +                                                                 size);
> +
> +    trace_ram_block_attributes_state_change(offset, size,
> +                                            is_range_discarded ? "discarded" :
> +                                            is_range_populated ? "populated" :
> +                                            "mixture",
> +                                            to_discard ? "discarded" :
> +                                            "populated");
> +    if (to_discard) {
> +        if (is_range_discarded) {
> +            /* Already private */
> +        } else if (is_range_populated) {
> +            /* Completely shared */
> +            bitmap_clear(attr->bitmap, first_bit, nbits);
> +            ram_block_attributes_notify_discard(attr, offset, size);

In this patch series, are we only maintaining the bitmap for the RAM
discard/populate state, not for the regular guest_memfd private/shared
state?

> +        } else {
> +            /* Unexpected mixture: process individual blocks */
> +            for (cur = offset; cur < end; cur += block_size) {
> +                bit = cur / block_size;
> +                if (!test_bit(bit, attr->bitmap)) {
> +                    continue;
> +                }
> +                clear_bit(bit, attr->bitmap);
> +                ram_block_attributes_notify_discard(attr, cur, block_size);
> +            }
> +        }
> +    } else {
> +        if (is_range_populated) {
> +            /* Already shared */
> +        } else if (is_range_discarded) {
> +            /* Completely private */
> +            bitmap_set(attr->bitmap, first_bit, nbits);
> +            ret = ram_block_attributes_notify_populate(attr, offset, size);
> +        } else {
> +            /* Unexpected mixture: process individual blocks */
> +            for (cur = offset; cur < end; cur += block_size) {
> +                bit = cur / block_size;
> +                if (test_bit(bit, attr->bitmap)) {
> +                    continue;
> +                }
> +                set_bit(bit, attr->bitmap);
> +                ret = ram_block_attributes_notify_populate(attr, cur,
> +                                                           block_size);
> +                if (ret) {
> +                    break;
> +                }
> +            }
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block)
> +{
> +    uint64_t bitmap_size;
> +    const int block_size  = qemu_real_host_page_size();
> +    RamBlockAttributes *attr;
> +    int ret;
> +    MemoryRegion *mr = ram_block->mr;
> +
> +    attr = RAM_BLOCK_ATTRIBUTES(object_new(TYPE_RAM_BLOCK_ATTRIBUTES));
> +
> +    attr->ram_block = ram_block;
> +    ret = memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(attr));
> +    if (ret) {
> +        object_unref(OBJECT(attr));
> +        return NULL;
> +    }
> +    bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
> +    attr->bitmap_size = bitmap_size;
> +    attr->bitmap = bitmap_new(bitmap_size);
> +
> +    return attr;
> +}
> +
> +void ram_block_attributes_destroy(RamBlockAttributes *attr)
> +{
> +    if (!attr) {
> +        return;
> +    }
> +
> +    g_free(attr->bitmap);
> +    memory_region_set_ram_discard_manager(attr->ram_block->mr, NULL);
> +    object_unref(OBJECT(attr));
> +}
> +
> +static void ram_block_attributes_init(Object *obj)
> +{
> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(obj);
> +
> +    QLIST_INIT(&attr->rdl_list);
> +}
> +
> +static void ram_block_attributes_finalize(Object *obj)
> +{
> +}
> +
> +static void ram_block_attributes_class_init(ObjectClass *klass,
> +                                            const void *data)
> +{
> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(klass);
> +
> +    rdmc->get_min_granularity = ram_block_attributes_rdm_get_min_granularity;
> +    rdmc->register_listener = ram_block_attributes_rdm_register_listener;
> +    rdmc->unregister_listener = ram_block_attributes_rdm_unregister_listener;
> +    rdmc->is_populated = ram_block_attributes_rdm_is_populated;
> +    rdmc->replay_populated = ram_block_attributes_rdm_replay_populated;
> +    rdmc->replay_discarded = ram_block_attributes_rdm_replay_discarded;
> +}
> diff --git a/system/trace-events b/system/trace-events
> index be12ebfb41..82856e44f2 100644
> --- a/system/trace-events
> +++ b/system/trace-events
> @@ -52,3 +52,6 @@ dirtylimit_state_finalize(void)
>   dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
>   dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
>   dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
> +
> +# ram-block-attributes.c
> +ram_block_attributes_state_change(uint64_t offset, uint64_t size, const char *from, const char *to) "offset 0x%"PRIx64" size 0x%"PRIx64" from '%s' to '%s'"


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 1/5] memory: Export a helper to get intersection of a MemoryRegionSection with a given range
  2025-05-30  8:32 ` [PATCH v6 1/5] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
  2025-06-01  9:19   ` Gupta, Pankaj
@ 2025-06-01 15:44   ` Xiaoyao Li
  1 sibling, 0 replies; 30+ messages in thread
From: Xiaoyao Li @ 2025-06-01 15:44 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Gupta Pankaj, Paolo Bonzini, Philippe Mathieu-Daudé,
	Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Cédric Le Goater, Alex Williamson

On 5/30/2025 4:32 PM, Chenyi Qiang wrote:
> Rename the helper to memory_region_section_intersect_range() to make it
> more generic. Meanwhile, define @end as an Int128 and switch the related
> operations to the int128_*() helpers, since the helper is now exported
> as a wider API.
> 
> Suggested-by: Alexey Kardashevskiy<aik@amd.com>
> Reviewed-by: Alexey Kardashevskiy<aik@amd.com>
> Reviewed-by: David Hildenbrand<david@redhat.com>
> Reviewed-by: Zhao Liu<zhao1.liu@intel.com>
> Signed-off-by: Chenyi Qiang<chenyi.qiang@intel.com>

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result
  2025-05-30  8:32 ` [PATCH v6 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
  2025-06-01  9:21   ` Gupta, Pankaj
@ 2025-06-01 15:53   ` Xiaoyao Li
  2025-06-04 14:37   ` Alexey Kardashevskiy
  2 siblings, 0 replies; 30+ messages in thread
From: Xiaoyao Li @ 2025-06-01 15:53 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Gupta Pankaj, Paolo Bonzini, Philippe Mathieu-Daudé,
	Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Cédric Le Goater, Alex Williamson

On 5/30/2025 4:32 PM, Chenyi Qiang wrote:
> Modify memory_region_set_ram_discard_manager() to return -EBUSY if a
> RamDiscardManager is already set in the MemoryRegion. The caller must
> handle this failure, such as having virtio-mem undo its actions and fail
> the realize() process. Opportunistically move the call earlier to avoid
> complex error handling.
> 
> This change is beneficial when introducing a new RamDiscardManager
> instance besides virtio-mem. After
> ram_block_coordinated_discard_require(true) unlocks all
> RamDiscardManager instances, only one instance is allowed to be set for
> one MemoryRegion at present.
> 
> Suggested-by: David Hildenbrand<david@redhat.com>
> Reviewed-by: David Hildenbrand<david@redhat.com>
> Signed-off-by: Chenyi Qiang<chenyi.qiang@intel.com>

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-05-30  8:32 ` [PATCH v6 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
@ 2025-06-01 16:12   ` Xiaoyao Li
  2025-06-02  9:20   ` Gupta, Pankaj
  1 sibling, 0 replies; 30+ messages in thread
From: Xiaoyao Li @ 2025-06-01 16:12 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Gupta Pankaj, Paolo Bonzini, Philippe Mathieu-Daudé,
	Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Cédric Le Goater, Alex Williamson

On 5/30/2025 4:32 PM, Chenyi Qiang wrote:
> Update the ReplayRamDiscard() function to return a result, and unify
> ReplayRamPopulate() and ReplayRamDiscard() into ReplayRamDiscardState(),
> since their definitions are now identical. This unification simplifies
> related structures such as VirtIOMEMReplayData.
> 
> Reviewed-by: David Hildenbrand<david@redhat.com>
> Signed-off-by: Chenyi Qiang<chenyi.qiang@intel.com>

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
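
As a minimal sketch (not from the patch) of the unified callback type in
use, a caller could tally the populated bytes of a section with a single
ReplayRamDiscardState function; accumulate_bytes_cb() is a hypothetical
name:

    static int accumulate_bytes_cb(MemoryRegionSection *s, void *opaque)
    {
        uint64_t *total = opaque;

        *total += int128_get64(s->size);
        return 0;    /* a non-zero return would abort the replay */
    }

    /* ... */
    uint64_t total = 0;
    ram_discard_manager_replay_populated(rdm, &section,
                                         accumulate_bytes_cb, &total);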

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-05-30  8:32 ` [PATCH v6 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
  2025-06-01 16:12   ` Xiaoyao Li
@ 2025-06-02  9:20   ` Gupta, Pankaj
  1 sibling, 0 replies; 30+ messages in thread
From: Gupta, Pankaj @ 2025-06-02  9:20 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson


> Update the ReplayRamDiscard() function to return a result, and unify
> ReplayRamPopulate() and ReplayRamDiscard() into ReplayRamDiscardState(),
> since their definitions are now identical. This unification simplifies
> related structures such as VirtIOMEMReplayData.
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>

> ---
> Changes in v6:
>      - Add Reviewed-by from David
>      - Add a documentation comment for the prototype change
> 
> Changes in v5:
>      - Rename ReplayRamStateChange to ReplayRamDiscardState (David)
>      - return data->fn(s, data->opaque) instead of 0 in
>        virtio_mem_rdm_replay_discarded_cb(). (Alexey)
> 
> Changes in v4:
>      - Modify the commit message. We won't use Replay() operation when
>        doing the attribute change like v3.
> ---
>   hw/virtio/virtio-mem.c  | 21 +++++++-------
>   include/system/memory.h | 64 ++++++++++++++++++++++++++++++-----------
>   migration/ram.c         |  5 ++--
>   system/memory.c         | 12 ++++----
>   4 files changed, 66 insertions(+), 36 deletions(-)
> 
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 2e491e8c44..c46f6f9c3e 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -1732,7 +1732,7 @@ static bool virtio_mem_rdm_is_populated(const RamDiscardManager *rdm,
>   }
>   
>   struct VirtIOMEMReplayData {
> -    void *fn;
> +    ReplayRamDiscardState fn;
>       void *opaque;
>   };
>   
> @@ -1740,12 +1740,12 @@ static int virtio_mem_rdm_replay_populated_cb(MemoryRegionSection *s, void *arg)
>   {
>       struct VirtIOMEMReplayData *data = arg;
>   
> -    return ((ReplayRamPopulate)data->fn)(s, data->opaque);
> +    return data->fn(s, data->opaque);
>   }
>   
>   static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm,
>                                              MemoryRegionSection *s,
> -                                           ReplayRamPopulate replay_fn,
> +                                           ReplayRamDiscardState replay_fn,
>                                              void *opaque)
>   {
>       const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
> @@ -1764,14 +1764,13 @@ static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s,
>   {
>       struct VirtIOMEMReplayData *data = arg;
>   
> -    ((ReplayRamDiscard)data->fn)(s, data->opaque);
> -    return 0;
> +    return data->fn(s, data->opaque);
>   }
>   
> -static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
> -                                            MemoryRegionSection *s,
> -                                            ReplayRamDiscard replay_fn,
> -                                            void *opaque)
> +static int virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
> +                                           MemoryRegionSection *s,
> +                                           ReplayRamDiscardState replay_fn,
> +                                           void *opaque)
>   {
>       const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
>       struct VirtIOMEMReplayData data = {
> @@ -1780,8 +1779,8 @@ static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
>       };
>   
>       g_assert(s->mr == &vmem->memdev->mr);
> -    virtio_mem_for_each_unplugged_section(vmem, s, &data,
> -                                          virtio_mem_rdm_replay_discarded_cb);
> +    return virtio_mem_for_each_unplugged_section(vmem, s, &data,
> +                                                 virtio_mem_rdm_replay_discarded_cb);
>   }
>   
>   static void virtio_mem_rdm_register_listener(RamDiscardManager *rdm,
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 896948deb1..4f45a187d6 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -575,8 +575,20 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
>       rdl->double_discard_supported = double_discard_supported;
>   }
>   
> -typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
> -typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);
> +/**
> + * ReplayRamDiscardState:
> + *
> + * The callback handler for #RamDiscardManagerClass.replay_populated/
> + * #RamDiscardManagerClass.replay_discarded to invoke on populated/discarded
> + * parts.
> + *
> + * @section: the #MemoryRegionSection of populated/discarded part
> + * @opaque: pointer to forward to the callback
> + *
> + * Returns 0 on success, or a negative error if failed.
> + */
> +typedef int (*ReplayRamDiscardState)(MemoryRegionSection *section,
> +                                     void *opaque);
>   
>   /*
>    * RamDiscardManagerClass:
> @@ -650,36 +662,38 @@ struct RamDiscardManagerClass {
>       /**
>        * @replay_populated:
>        *
> -     * Call the #ReplayRamPopulate callback for all populated parts within the
> -     * #MemoryRegionSection via the #RamDiscardManager.
> +     * Call the #ReplayRamDiscardState callback for all populated parts within
> +     * the #MemoryRegionSection via the #RamDiscardManager.
>        *
>        * In case any call fails, no further calls are made.
>        *
>        * @rdm: the #RamDiscardManager
>        * @section: the #MemoryRegionSection
> -     * @replay_fn: the #ReplayRamPopulate callback
> +     * @replay_fn: the #ReplayRamDiscardState callback
>        * @opaque: pointer to forward to the callback
>        *
>        * Returns 0 on success, or a negative error if any notification failed.
>        */
>       int (*replay_populated)(const RamDiscardManager *rdm,
>                               MemoryRegionSection *section,
> -                            ReplayRamPopulate replay_fn, void *opaque);
> +                            ReplayRamDiscardState replay_fn, void *opaque);
>   
>       /**
>        * @replay_discarded:
>        *
> -     * Call the #ReplayRamDiscard callback for all discarded parts within the
> -     * #MemoryRegionSection via the #RamDiscardManager.
> +     * Call the #ReplayRamDiscardState callback for all discarded parts within
> +     * the #MemoryRegionSection via the #RamDiscardManager.
>        *
>        * @rdm: the #RamDiscardManager
>        * @section: the #MemoryRegionSection
> -     * @replay_fn: the #ReplayRamDiscard callback
> +     * @replay_fn: the #ReplayRamDiscardState callback
>        * @opaque: pointer to forward to the callback
> +     *
> +     * Returns 0 on success, or a negative error if any notification failed.
>        */
> -    void (*replay_discarded)(const RamDiscardManager *rdm,
> -                             MemoryRegionSection *section,
> -                             ReplayRamDiscard replay_fn, void *opaque);
> +    int (*replay_discarded)(const RamDiscardManager *rdm,
> +                            MemoryRegionSection *section,
> +                            ReplayRamDiscardState replay_fn, void *opaque);
>   
>       /**
>        * @register_listener:
> @@ -720,15 +734,31 @@ uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
>   bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
>                                         const MemoryRegionSection *section);
>   
> +/**
> + * ram_discard_manager_replay_populated:
> + *
> + * A wrapper to call the #RamDiscardManagerClass.replay_populated callback
> + * of the #RamDiscardManager.
> + *
> + * Returns 0 on success, or a negative error if any notification failed.
> + */
>   int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
>                                            MemoryRegionSection *section,
> -                                         ReplayRamPopulate replay_fn,
> +                                         ReplayRamDiscardState replay_fn,
>                                            void *opaque);
>   
> -void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
> -                                          MemoryRegionSection *section,
> -                                          ReplayRamDiscard replay_fn,
> -                                          void *opaque);
> +/**
> + * ram_discard_manager_replay_discarded:
> + *
> + * A wrapper to call the #RamDiscardManagerClass.replay_discarded callback
> + * of the #RamDiscardManager.
> + *
> + * Returns 0 on success, or a negative error if any notification failed.
> + */
> +int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
> +                                         MemoryRegionSection *section,
> +                                         ReplayRamDiscardState replay_fn,
> +                                         void *opaque);
>   
>   void ram_discard_manager_register_listener(RamDiscardManager *rdm,
>                                              RamDiscardListener *rdl,
> diff --git a/migration/ram.c b/migration/ram.c
> index e12913b43e..c004f37060 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -848,8 +848,8 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
>       return ret;
>   }
>   
> -static void dirty_bitmap_clear_section(MemoryRegionSection *section,
> -                                       void *opaque)
> +static int dirty_bitmap_clear_section(MemoryRegionSection *section,
> +                                      void *opaque)
>   {
>       const hwaddr offset = section->offset_within_region;
>       const hwaddr size = int128_get64(section->size);
> @@ -868,6 +868,7 @@ static void dirty_bitmap_clear_section(MemoryRegionSection *section,
>       }
>       *cleared_bits += bitmap_count_one_with_offset(rb->bmap, start, npages);
>       bitmap_clear(rb->bmap, start, npages);
> +    return 0;
>   }
>   
>   /*
> diff --git a/system/memory.c b/system/memory.c
> index b45b508dce..de45fbdd3f 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2138,7 +2138,7 @@ bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
>   
>   int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
>                                            MemoryRegionSection *section,
> -                                         ReplayRamPopulate replay_fn,
> +                                         ReplayRamDiscardState replay_fn,
>                                            void *opaque)
>   {
>       RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
> @@ -2147,15 +2147,15 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
>       return rdmc->replay_populated(rdm, section, replay_fn, opaque);
>   }
>   
> -void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
> -                                          MemoryRegionSection *section,
> -                                          ReplayRamDiscard replay_fn,
> -                                          void *opaque)
> +int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
> +                                         MemoryRegionSection *section,
> +                                         ReplayRamDiscardState replay_fn,
> +                                         void *opaque)
>   {
>       RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
>   
>       g_assert(rdmc->replay_discarded);
> -    rdmc->replay_discarded(rdm, section, replay_fn, opaque);
> +    return rdmc->replay_discarded(rdm, section, replay_fn, opaque);
>   }
>   
>   void ram_discard_manager_register_listener(RamDiscardManager *rdm,
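
For illustration, a minimal sketch (not part of this patch) of how a
caller consumes the unified ReplayRamDiscardState type now that both
replay paths propagate errors; the callback and its running total are
hypothetical:

    /* Hypothetical callback: sum up the populated bytes in a section. */
    static int count_populated_cb(MemoryRegionSection *section, void *opaque)
    {
        uint64_t *total = opaque;

        *total += int128_get64(section->size);
        return 0; /* a negative errno would stop the replay */
    }

    ...
        uint64_t total = 0;
        int ret;

        ret = ram_discard_manager_replay_populated(rdm, section,
                                                   count_populated_cb, &total);
        if (ret) {
            /* some notification failed; propagate the error */
        }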


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-05-30  8:32 ` [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd Chenyi Qiang
  2025-06-01  9:58   ` Gupta, Pankaj
@ 2025-06-02 21:10   ` David Hildenbrand
  2025-06-03  1:57     ` Chenyi Qiang
  2025-06-04 11:04   ` Alexey Kardashevskiy
  2 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-06-02 21:10 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson

On 30.05.25 10:32, Chenyi Qiang wrote:
> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> discard") highlighted that subsystems like VFIO may disable RAM block
> discard. However, guest_memfd relies on discard operations for page
> conversion between private and shared memory, potentially leading to
> the stale IOMMU mapping issue when assigning hardware devices to
> confidential VMs via shared memory. To address this and allow shared
> device assignment, it is crucial to ensure the VFIO system refreshes
> its IOMMU mappings.
> 
> RamDiscardManager is an existing interface (used by virtio-mem) to
> adjust VFIO mappings in relation to VM page assignment. Effectively page
> conversion is similar to hot-removing a page in one mode and adding it
> back in the other. Therefore, similar actions are required for page
> conversion events. Introduce the RamDiscardManager to guest_memfd to
> facilitate this process.
> 
> Since guest_memfd is not an object, it cannot directly implement the
> RamDiscardManager interface. Implementing it in HostMemoryBackend is
> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
> have a memory backend while others do not. Notably, virtual BIOS
> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
> backend.
> 
> To manage RAMBlocks with guest_memfd, define a new object named
> RamBlockAttributes to implement the RamDiscardManager interface. This
> object can store the guest_memfd information such as bitmap for shared
> memory and the registered listeners for event notification. In the
> context of RamDiscardManager, shared state is analogous to populated, and
> private state is signified as discarded. To notify the conversion events,
> a new state_change() helper is exported for the users to notify the
> listeners like VFIO, so that VFIO can dynamically DMA map/unmap the
> shared mapping.
> 
> Note that the memory state is tracked at the host page size granularity,
> as the minimum conversion size can be one page per request and VFIO
> expects the DMA mapping for a specific iova to be mapped and unmapped
> with the same granularity. Confidential VMs may perform partial
> conversions, such as conversions on small regions within larger ones.
> To prevent such invalid cases and until DMA mapping cut operation
> support is available, all operations are performed with 4K granularity.
> 
> In addition, memory conversion failures cause QEMU to quit instead of
> resuming the guest or retrying the operation at present. It would be
> future work to add more error handling or rollback mechanisms once
> conversion failures are allowed. For example, in-place conversion of
> guest_memfd could retry the unmap operation during the conversion from
> shared to private. For now, keep the complex error handling out of the
> picture as it is not required.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v6:
>      - Change the object type name from RamBlockAttribute to
>        RamBlockAttributes. (David)
>      - Save the associated RAMBlock instead of MemoryRegion in
>        RamBlockAttributes. (David)
>      - Squash the state_change() helper introduction in this commit as
>        well as the mixture conversion case handling. (David)
>      - Change the block_size type from int to size_t and some cleanup in
>        validation check. (Alexey)
>      - Add a tracepoint to track the state changes. (Alexey)
> 
> Changes in v5:
>      - Revert to use RamDiscardManager interface instead of introducing
>        new hierarchy of class to manage private/shared state, and keep
>        using the new name of RamBlockAttribute compared with the
>        MemoryAttributeManager in v3.
>      - Use *simple* version of object_define and object_declare since the
>        state_change() function is changed as an exported function instead
>        of a virtual function in later patch.
>      - Move the introduction of RamBlockAttribute field to this patch and
>        rename it to ram_shared. (Alexey)
>      - Call exit() when register/unregister fails. (Zhao)
>      - Add the ram-block-attribute.c to Memory API related part in
>        MAINTAINERS.
> 
> Changes in v4:
>      - Change the name from memory-attribute-manager to
>        ram-block-attribute.
>      - Implement the newly-introduced PrivateSharedManager instead of
>        RamDiscardManager and change related commit message.
>      - Define the new object in ramblock.h instead of adding a new file.
> ---
>   MAINTAINERS                   |   1 +
>   include/system/ramblock.h     |  21 ++
>   system/meson.build            |   1 +
>   system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
>   system/trace-events           |   3 +
>   5 files changed, 506 insertions(+)
>   create mode 100644 system/ram-block-attributes.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 6dacd6d004..8ec39aa7f8 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3149,6 +3149,7 @@ F: system/memory.c
>   F: system/memory_mapping.c
>   F: system/physmem.c
>   F: system/memory-internal.h
> +F: system/ram-block-attributes.c
>   F: scripts/coccinelle/memory-region-housekeeping.cocci
>   
>   Memory devices
> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
> index d8a116ba99..1bab9e2dac 100644
> --- a/include/system/ramblock.h
> +++ b/include/system/ramblock.h
> @@ -22,6 +22,10 @@
>   #include "exec/cpu-common.h"
>   #include "qemu/rcu.h"
>   #include "exec/ramlist.h"
> +#include "system/hostmem.h"
> +
> +#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes"
> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES)
>   
>   struct RAMBlock {
>       struct rcu_head rcu;
> @@ -91,4 +95,21 @@ struct RAMBlock {
>       ram_addr_t postcopy_length;
>   };
>   
> +struct RamBlockAttributes {
> +    Object parent;
> +
> +    RAMBlock *ram_block;
> +
> +    /* A set bit in the bitmap means the RAM is populated (shared) */
> +    unsigned bitmap_size;
> +    unsigned long *bitmap;

So, initially, all memory starts out as private, correct?

I guess this mimics what kvm_set_phys_mem() ends up doing, when it does 
the kvm_set_memory_attributes_private() call.

So there is a short period of inconsistency between creating the
RAMBlock and mapping it into the PA space.

It might be worth spelling that out / documenting it somewhere.
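
Roughly, the ordering looks like this (just a sketch; the exact
creation path is an assumption on my side):

    ram_block_add()
        -> ram_block_attributes_create()    /* bitmap all zero: private */
    ... window: the attributes object already says "private", but KVM's
        memory attributes for the range have not been set yet ...
    kvm_set_phys_mem()
        -> kvm_set_memory_attributes_private()    /* KVM catches up */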

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-01  9:58   ` Gupta, Pankaj
@ 2025-06-03  1:26     ` Chenyi Qiang
  2025-06-03  5:26       ` Gupta, Pankaj
  0 siblings, 1 reply; 30+ messages in thread
From: Chenyi Qiang @ 2025-06-03  1:26 UTC (permalink / raw)
  To: Gupta, Pankaj, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 6/1/2025 5:58 PM, Gupta, Pankaj wrote:
> On 5/30/2025 10:32 AM, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") highlighted that subsystems like VFIO may disable RAM block
>> discard. However, guest_memfd relies on discard operations for page
>> conversion between private and shared memory, potentially leading to
>> the stale IOMMU mapping issue when assigning hardware devices to
>> confidential VMs via shared memory. To address this and allow shared
>> device assignment, it is crucial to ensure the VFIO system refreshes
>> its IOMMU mappings.
>>
>> RamDiscardManager is an existing interface (used by virtio-mem) to
>> adjust VFIO mappings in relation to VM page assignment. Effectively page
>> conversion is similar to hot-removing a page in one mode and adding it
>> back in the other. Therefore, similar actions are required for page
>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>> facilitate this process.
>>
>> Since guest_memfd is not an object, it cannot directly implement the
>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
>> have a memory backend while others do not. Notably, virtual BIOS
>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>> backend.
>>
>> To manage RAMBlocks with guest_memfd, define a new object named
>> RamBlockAttributes to implement the RamDiscardManager interface. This
>> object can store the guest_memfd information such as bitmap for shared
>> memory and the registered listeners for event notification. In the
>> context of RamDiscardManager, shared state is analogous to populated, and
>> private state is signified as discarded. To notify the conversion events,
>> a new state_change() helper is exported for the users to notify the
>> listeners like VFIO, so that VFIO can dynamically DMA map/unmap the
>> shared mapping.
>>
>> Note that the memory state is tracked at the host page size granularity,
>> as the minimum conversion size can be one page per request and VFIO
>> expects the DMA mapping for a specific iova to be mapped and unmapped
>> with the same granularity. Confidential VMs may perform partial
>> conversions, such as conversions on small regions within larger ones.
>> To prevent such invalid cases and until DMA mapping cut operation
>> support is available, all operations are performed with 4K granularity.
>>
>> In addition, memory conversion failures cause QEMU to quit instead of
>> resuming the guest or retrying the operation at present. It would be
>> future work to add more error handling or rollback mechanisms once
>> conversion failures are allowed. For example, in-place conversion of
>> guest_memfd could retry the unmap operation during the conversion from
>> shared to private. For now, keep the complex error handling out of the
>> picture as it is not required.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Changes in v6:
>>      - Change the object type name from RamBlockAttribute to
>>        RamBlockAttributes. (David)
>>      - Save the associated RAMBlock instead of MemoryRegion in
>>        RamBlockAttributes. (David)
>>      - Squash the state_change() helper introduction in this commit as
>>        well as the mixture conversion case handling. (David)
>>      - Change the block_size type from int to size_t and some cleanup in
>>        validation check. (Alexey)
>>      - Add a tracepoint to track the state changes. (Alexey)
>>
>> Changes in v5:
>>      - Revert to use RamDiscardManager interface instead of introducing
>>        new hierarchy of class to manage private/shared state, and keep
>>        using the new name of RamBlockAttribute compared with the
>>        MemoryAttributeManager in v3.
>>      - Use *simple* version of object_define and object_declare since the
>>        state_change() function is changed as an exported function instead
>>        of a virtual function in later patch.
>>      - Move the introduction of RamBlockAttribute field to this patch and
>>        rename it to ram_shared. (Alexey)
>>      - Call exit() when register/unregister fails. (Zhao)
>>      - Add the ram-block-attribute.c to Memory API related part in
>>        MAINTAINERS.
>>
>> Changes in v4:
>>      - Change the name from memory-attribute-manager to
>>        ram-block-attribute.
>>      - Implement the newly-introduced PrivateSharedManager instead of
>>        RamDiscardManager and change related commit message.
>>      - Define the new object in ramblock.h instead of adding a new file.
>> ---
>>   MAINTAINERS                   |   1 +
>>   include/system/ramblock.h     |  21 ++
>>   system/meson.build            |   1 +
>>   system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
>>   system/trace-events           |   3 +
>>   5 files changed, 506 insertions(+)
>>   create mode 100644 system/ram-block-attributes.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 6dacd6d004..8ec39aa7f8 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -3149,6 +3149,7 @@ F: system/memory.c
>>   F: system/memory_mapping.c
>>   F: system/physmem.c
>>   F: system/memory-internal.h
>> +F: system/ram-block-attributes.c
>>   F: scripts/coccinelle/memory-region-housekeeping.cocci
>>     Memory devices
>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>> index d8a116ba99..1bab9e2dac 100644
>> --- a/include/system/ramblock.h
>> +++ b/include/system/ramblock.h
>> @@ -22,6 +22,10 @@
>>   #include "exec/cpu-common.h"
>>   #include "qemu/rcu.h"
>>   #include "exec/ramlist.h"
>> +#include "system/hostmem.h"
>> +
>> +#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes"
>> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES)
>>     struct RAMBlock {
>>       struct rcu_head rcu;
>> @@ -91,4 +95,21 @@ struct RAMBlock {
>>       ram_addr_t postcopy_length;
>>   };
>>   +struct RamBlockAttributes {
>> +    Object parent;
>> +
>> +    RAMBlock *ram_block;
>> +
>> +    /* A set bit in the bitmap means the RAM is populated (shared) */
>> +    unsigned bitmap_size;
>> +    unsigned long *bitmap;
>> +
>> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
>> +};
>> +
>> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block);
>> +void ram_block_attributes_destroy(RamBlockAttributes *attr);
>> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
>> uint64_t offset,
>> +                                      uint64_t size, bool to_discard);
>> +
>>   #endif
>> diff --git a/system/meson.build b/system/meson.build
>> index c2f0082766..2747dbde80 100644
>> --- a/system/meson.build
>> +++ b/system/meson.build
>> @@ -17,6 +17,7 @@ libsystem_ss.add(files(
>>     'dma-helpers.c',
>>     'globals.c',
>>     'ioport.c',
>> +  'ram-block-attributes.c',
>>     'memory_mapping.c',
>>     'memory.c',
>>     'physmem.c',
>> diff --git a/system/ram-block-attributes.c b/system/ram-block-attributes.c
>> new file mode 100644
>> index 0000000000..514252413f
>> --- /dev/null
>> +++ b/system/ram-block-attributes.c
>> @@ -0,0 +1,480 @@
>> +/*
>> + * QEMU ram block attributes
>> + *
>> + * Copyright Intel
>> + *
>> + * Author:
>> + *      Chenyi Qiang <chenyi.qiang@intel.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> later.
>> + * See the COPYING file in the top-level directory
>> + *
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>> +#include "system/ramblock.h"
>> +#include "trace.h"
>> +
>> +OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(RamBlockAttributes,
>> +                                          ram_block_attributes,
>> +                                          RAM_BLOCK_ATTRIBUTES,
>> +                                          OBJECT,
>> +                                          { TYPE_RAM_DISCARD_MANAGER },
>> +                                          { })
>> +
>> +static size_t
>> +ram_block_attributes_get_block_size(const RamBlockAttributes *attr)
>> +{
>> +    /*
>> +     * Page conversions can be requested at a granularity as small as
>> +     * 4K (4K-sized and 4K-aligned), so use the host page size as the
>> +     * granularity to track the memory attributes.
>> +     */
>> +    g_assert(attr && attr->ram_block);
>> +    g_assert(attr->ram_block->page_size == qemu_real_host_page_size());
>> +    return attr->ram_block->page_size;
>> +}
>> +
>> +
>> +static bool
>> +ram_block_attributes_rdm_is_populated(const RamDiscardManager *rdm,
>> +                                      const MemoryRegionSection
>> *section)
>> +{
>> +    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    const uint64_t first_bit = section->offset_within_region /
>> block_size;
>> +    const uint64_t last_bit = first_bit +
>> +        int128_get64(section->size) / block_size - 1;
>> +    unsigned long first_discarded_bit;
>> +
>> +    first_discarded_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>> +                                           first_bit);
>> +    return first_discarded_bit > last_bit;
>> +}
>> +
>> +typedef int (*ram_block_attributes_section_cb)(MemoryRegionSection *s,
>> +                                               void *arg);
>> +
>> +static int
>> +ram_block_attributes_notify_populate_cb(MemoryRegionSection *section,
>> +                                        void *arg)
>> +{
>> +    RamDiscardListener *rdl = arg;
>> +
>> +    return rdl->notify_populate(rdl, section);
>> +}
>> +
>> +static int
>> +ram_block_attributes_notify_discard_cb(MemoryRegionSection *section,
>> +                                       void *arg)
>> +{
>> +    RamDiscardListener *rdl = arg;
>> +
>> +    rdl->notify_discard(rdl, section);
>> +    return 0;
>> +}
>> +
>> +static int
>> +ram_block_attributes_for_each_populated_section(const
>> RamBlockAttributes *attr,
>> +                                                MemoryRegionSection
>> *section,
>> +                                                void *arg,
>> +                                               
>> ram_block_attributes_section_cb cb)
>> +{
>> +    unsigned long first_bit, last_bit;
>> +    uint64_t offset, size;
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    int ret = 0;
>> +
>> +    first_bit = section->offset_within_region / block_size;
>> +    first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>> +                              first_bit);
>> +
>> +    while (first_bit < attr->bitmap_size) {
>> +        MemoryRegionSection tmp = *section;
>> +
>> +        offset = first_bit * block_size;
>> +        last_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
>> +                                      first_bit + 1) - 1;
>> +        size = (last_bit - first_bit + 1) * block_size;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> +            break;
>> +        }
>> +
>> +        ret = cb(&tmp, arg);
>> +        if (ret) {
>> +            error_report("%s: Failed to notify RAM discard listener:
>> %s",
>> +                         __func__, strerror(-ret));
>> +            break;
>> +        }
>> +
>> +        first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>> +                                  last_bit + 2);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static int
>> +ram_block_attributes_for_each_discarded_section(const
>> RamBlockAttributes *attr,
>> +                                                MemoryRegionSection
>> *section,
>> +                                                void *arg,
>> +                                               
>> ram_block_attributes_section_cb cb)
>> +{
>> +    unsigned long first_bit, last_bit;
>> +    uint64_t offset, size;
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    int ret = 0;
>> +
>> +    first_bit = section->offset_within_region / block_size;
>> +    first_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
>> +                                   first_bit);
>> +
>> +    while (first_bit < attr->bitmap_size) {
>> +        MemoryRegionSection tmp = *section;
>> +
>> +        offset = first_bit * block_size;
>> +        last_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>> +                                 first_bit + 1) - 1;
>> +        size = (last_bit - first_bit + 1) * block_size;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> +            break;
>> +        }
>> +
>> +        ret = cb(&tmp, arg);
>> +        if (ret) {
>> +            error_report("%s: Failed to notify RAM discard listener:
>> %s",
>> +                         __func__, strerror(-ret));
>> +            break;
>> +        }
>> +
>> +        first_bit = find_next_zero_bit(attr->bitmap,
>> +                                       attr->bitmap_size,
>> +                                       last_bit + 2);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static uint64_t
>> +ram_block_attributes_rdm_get_min_granularity(const RamDiscardManager
>> *rdm,
>> +                                             const MemoryRegion *mr)
>> +{
>> +    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>> +
>> +    g_assert(mr == attr->ram_block->mr);
>> +    return ram_block_attributes_get_block_size(attr);
>> +}
>> +
>> +static void
>> +ram_block_attributes_rdm_register_listener(RamDiscardManager *rdm,
>> +                                           RamDiscardListener *rdl,
>> +                                           MemoryRegionSection *section)
>> +{
>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>> +    int ret;
>> +
>> +    g_assert(section->mr == attr->ram_block->mr);
>> +    rdl->section = memory_region_section_new_copy(section);
>> +
>> +    QLIST_INSERT_HEAD(&attr->rdl_list, rdl, next);
>> +
>> +    ret = ram_block_attributes_for_each_populated_section(attr,
>> section, rdl,
>> +                                   
>> ram_block_attributes_notify_populate_cb);
>> +    if (ret) {
>> +        error_report("%s: Failed to register RAM discard listener: %s",
>> +                     __func__, strerror(-ret));
>> +        exit(1);
>> +    }
>> +}
>> +
>> +static void
>> +ram_block_attributes_rdm_unregister_listener(RamDiscardManager *rdm,
>> +                                             RamDiscardListener *rdl)
>> +{
>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>> +    int ret;
>> +
>> +    g_assert(rdl->section);
>> +    g_assert(rdl->section->mr == attr->ram_block->mr);
>> +
>> +    if (rdl->double_discard_supported) {
>> +        rdl->notify_discard(rdl, rdl->section);
>> +    } else {
>> +        ret = ram_block_attributes_for_each_populated_section(attr,
>> +                rdl->section, rdl,
>> ram_block_attributes_notify_discard_cb);
>> +        if (ret) {
>> +            error_report("%s: Failed to unregister RAM discard
>> listener: %s",
>> +                         __func__, strerror(-ret));
>> +            exit(1);
>> +        }
>> +    }
>> +
>> +    memory_region_section_free_copy(rdl->section);
>> +    rdl->section = NULL;
>> +    QLIST_REMOVE(rdl, next);
>> +}
>> +
>> +typedef struct RamBlockAttributesReplayData {
>> +    ReplayRamDiscardState fn;
>> +    void *opaque;
>> +} RamBlockAttributesReplayData;
>> +
>> +static int ram_block_attributes_rdm_replay_cb(MemoryRegionSection
>> *section,
>> +                                              void *arg)
>> +{
>> +    RamBlockAttributesReplayData *data = arg;
>> +
>> +    return data->fn(section, data->opaque);
>> +}
>> +
>> +static int
>> +ram_block_attributes_rdm_replay_populated(const RamDiscardManager *rdm,
>> +                                          MemoryRegionSection *section,
>> +                                          ReplayRamDiscardState
>> replay_fn,
>> +                                          void *opaque)
>> +{
>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>> +    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque =
>> opaque };
>> +
>> +    g_assert(section->mr == attr->ram_block->mr);
>> +    return ram_block_attributes_for_each_populated_section(attr,
>> section, &data,
>> +                                           
>> ram_block_attributes_rdm_replay_cb);
>> +}
>> +
>> +static int
>> +ram_block_attributes_rdm_replay_discarded(const RamDiscardManager *rdm,
>> +                                          MemoryRegionSection *section,
>> +                                          ReplayRamDiscardState
>> replay_fn,
>> +                                          void *opaque)
>> +{
>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>> +    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque =
>> opaque };
>> +
>> +    g_assert(section->mr == attr->ram_block->mr);
>> +    return ram_block_attributes_for_each_discarded_section(attr,
>> section, &data,
>> +                                           
>> ram_block_attributes_rdm_replay_cb);
>> +}
>> +
>> +static bool
>> +ram_block_attributes_is_valid_range(RamBlockAttributes *attr,
>> uint64_t offset,
>> +                                    uint64_t size)
>> +{
>> +    MemoryRegion *mr = attr->ram_block->mr;
>> +
>> +    g_assert(mr);
>> +
>> +    uint64_t region_size = memory_region_size(mr);
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +
>> +    if (!QEMU_IS_ALIGNED(offset, block_size) ||
>> +        !QEMU_IS_ALIGNED(size, block_size)) {
>> +        return false;
>> +    }
>> +    if (offset + size <= offset) {
>> +        return false;
>> +    }
>> +    if (offset + size > region_size) {
>> +        return false;
>> +    }
>> +    return true;
>> +}
>> +
>> +static void ram_block_attributes_notify_discard(RamBlockAttributes
>> *attr,
>> +                                                uint64_t offset,
>> +                                                uint64_t size)
>> +{
>> +    RamDiscardListener *rdl;
>> +
>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> +        MemoryRegionSection tmp = *rdl->section;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> +            continue;
>> +        }
>> +        rdl->notify_discard(rdl, &tmp);
>> +    }
>> +}
>> +
>> +static int
>> +ram_block_attributes_notify_populate(RamBlockAttributes *attr,
>> +                                     uint64_t offset, uint64_t size)
>> +{
>> +    RamDiscardListener *rdl;
>> +    int ret = 0;
>> +
>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> +        MemoryRegionSection tmp = *rdl->section;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> +            continue;
>> +        }
>> +        ret = rdl->notify_populate(rdl, &tmp);
>> +        if (ret) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static bool
>> ram_block_attributes_is_range_populated(RamBlockAttributes *attr,
>> +                                                    uint64_t offset,
>> +                                                    uint64_t size)
>> +{
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> +    unsigned long found_bit;
>> +
>> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>> +                                   first_bit);
>> +    return found_bit > last_bit;
>> +}
>> +
>> +static bool
>> +ram_block_attributes_is_range_discarded(RamBlockAttributes *attr,
>> +                                        uint64_t offset, uint64_t size)
>> +{
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> +    unsigned long found_bit;
>> +
>> +    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
>> +    return found_bit > last_bit;
>> +}
>> +
>> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
>> +                                      uint64_t offset, uint64_t size,
>> +                                      bool to_discard)
>> +{
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long nbits = size / block_size;
>> +    bool is_range_discarded, is_range_populated;
>> +    const uint64_t end = offset + size;
>> +    unsigned long bit;
>> +    uint64_t cur;
>> +    int ret = 0;
>> +
>> +    if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>> +                     __func__, offset, size);
>> +        return -EINVAL;
>> +    }
>> +
>> +    is_range_discarded =
>> ram_block_attributes_is_range_discarded(attr, offset,
>> +                                                                 size);
>> +    is_range_populated =
>> ram_block_attributes_is_range_populated(attr, offset,
>> +                                                                 size);
>> +
>> +    trace_ram_block_attributes_state_change(offset, size,
>> +                                            is_range_discarded ?
>> "discarded" :
>> +                                            is_range_populated ?
>> "populated" :
>> +                                            "mixture",
>> +                                            to_discard ? "discarded" :
>> +                                            "populated");
>> +    if (to_discard) {
>> +        if (is_range_discarded) {
>> +            /* Already private */
>> +        } else if (is_range_populated) {
>> +            /* Completely shared */
>> +            bitmap_clear(attr->bitmap, first_bit, nbits);
>> +            ram_block_attributes_notify_discard(attr, offset, size);
> 
> In this patch series, are we only maintaining the bitmap for the RAM
> discard/populate state, not for the regular guest_memfd private/shared
> state?

As mentioned in the changelog, "In the context of RamDiscardManager,
shared state is analogous to populated, and private state is signified
as discarded." To stay consistent with RamDiscardManager, I used the RAM
"populated/discarded" terms in variable and function names.

Of course, we could use private/shared if we renamed RamDiscardManager
to something like RamStateManager, but I haven't done that in this
series, because at present the bitmap can also be viewed as the state of
the shared memory (shared discarded/shared populated). The VFIO user
only manipulates the DMA map/unmap of the shared mapping. (We will need
to consider how to extend the RDM framework to manage
shared/private/discarded states in the future, once the private and
discarded states must be distinguished.)
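
Concretely, the correspondence is as follows (a sketch only: the
shared->private leg matches the state_change() hunk quoted above, and
the private->shared leg is assumed to be its symmetric counterpart):

    to_discard == true   (shared -> private conversion):
        bitmap_clear(attr->bitmap, first_bit, nbits);
        ram_block_attributes_notify_discard(attr, offset, size);
        /* VFIO listeners react with a DMA unmap of the range */

    to_discard == false  (private -> shared conversion):
        bitmap_set(attr->bitmap, first_bit, nbits);
        ram_block_attributes_notify_populate(attr, offset, size);
        /* VFIO listeners react with a DMA map of the range */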

> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-02 21:10   ` David Hildenbrand
@ 2025-06-03  1:57     ` Chenyi Qiang
  0 siblings, 0 replies; 30+ messages in thread
From: Chenyi Qiang @ 2025-06-03  1:57 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 6/3/2025 5:10 AM, David Hildenbrand wrote:
> On 30.05.25 10:32, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") highlighted that subsystems like VFIO may disable RAM block
>> discard. However, guest_memfd relies on discard operations for page
>> conversion between private and shared memory, potentially leading to
>> the stale IOMMU mapping issue when assigning hardware devices to
>> confidential VMs via shared memory. To address this and allow shared
>> device assignment, it is crucial to ensure the VFIO system refreshes
>> its IOMMU mappings.
>>
>> RamDiscardManager is an existing interface (used by virtio-mem) to
>> adjust VFIO mappings in relation to VM page assignment. Effectively page
>> conversion is similar to hot-removing a page in one mode and adding it
>> back in the other. Therefore, similar actions are required for page
>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>> facilitate this process.
>>
>> Since guest_memfd is not an object, it cannot directly implement the
>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
>> have a memory backend while others do not. Notably, virtual BIOS
>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>> backend.
>>
>> To manage RAMBlocks with guest_memfd, define a new object named
>> RamBlockAttributes to implement the RamDiscardManager interface. This
>> object can store the guest_memfd information such as bitmap for shared
>> memory and the registered listeners for event notification. In the
>> context of RamDiscardManager, shared state is analogous to populated, and
>> private state is signified as discarded. To notify the conversion events,
>> a new state_change() helper is exported for the users to notify the
>> listeners like VFIO, so that VFIO can dynamically DMA map/unmap the
>> shared mapping.
>>
>> Note that the memory state is tracked at the host page size granularity,
>> as the minimum conversion size can be one page per request and VFIO
>> expects the DMA mapping for a specific iova to be mapped and unmapped
>> with the same granularity. Confidential VMs may perform partial
>> conversions, such as conversions on small regions within larger ones.
>> To prevent such invalid cases and until DMA mapping cut operation
>> support is available, all operations are performed with 4K granularity.
>>
>> In addition, memory conversion failures cause QEMU to quit instead of
>> resuming the guest or retrying the operation at present. It would be
>> future work to add more error handling or rollback mechanisms once
>> conversion failures are allowed. For example, in-place conversion of
>> guest_memfd could retry the unmap operation during the conversion from
>> shared to private. For now, keep the complex error handling out of the
>> picture as it is not required.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Changes in v6:
>>      - Change the object type name from RamBlockAttribute to
>>        RamBlockAttributes. (David)
>>      - Save the associated RAMBlock instead of MemoryRegion in
>>        RamBlockAttributes. (David)
>>      - Squash the state_change() helper introduction in this commit as
>>        well as the mixture conversion case handling. (David)
>>      - Change the block_size type from int to size_t and some cleanup in
>>        validation check. (Alexey)
>>      - Add a tracepoint to track the state changes. (Alexey)
>>
>> Changes in v5:
>>      - Revert to use RamDiscardManager interface instead of introducing
>>        new hierarchy of class to manage private/shared state, and keep
>>        using the new name of RamBlockAttribute compared with the
>>        MemoryAttributeManager in v3.
>>      - Use *simple* version of object_define and object_declare since the
>>        state_change() function is changed as an exported function instead
>>        of a virtual function in later patch.
>>      - Move the introduction of RamBlockAttribute field to this patch and
>>        rename it to ram_shared. (Alexey)
>>      - Call exit() when register/unregister fails. (Zhao)
>>      - Add the ram-block-attribute.c to Memory API related part in
>>        MAINTAINERS.
>>
>> Changes in v4:
>>      - Change the name from memory-attribute-manager to
>>        ram-block-attribute.
>>      - Implement the newly-introduced PrivateSharedManager instead of
>>        RamDiscardManager and change related commit message.
>>      - Define the new object in ramblock.h instead of adding a new file.
>> ---
>>   MAINTAINERS                   |   1 +
>>   include/system/ramblock.h     |  21 ++
>>   system/meson.build            |   1 +
>>   system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
>>   system/trace-events           |   3 +
>>   5 files changed, 506 insertions(+)
>>   create mode 100644 system/ram-block-attributes.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 6dacd6d004..8ec39aa7f8 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -3149,6 +3149,7 @@ F: system/memory.c
>>   F: system/memory_mapping.c
>>   F: system/physmem.c
>>   F: system/memory-internal.h
>> +F: system/ram-block-attributes.c
>>   F: scripts/coccinelle/memory-region-housekeeping.cocci
>>     Memory devices
>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>> index d8a116ba99..1bab9e2dac 100644
>> --- a/include/system/ramblock.h
>> +++ b/include/system/ramblock.h
>> @@ -22,6 +22,10 @@
>>   #include "exec/cpu-common.h"
>>   #include "qemu/rcu.h"
>>   #include "exec/ramlist.h"
>> +#include "system/hostmem.h"
>> +
>> +#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes"
>> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES)
>>     struct RAMBlock {
>>       struct rcu_head rcu;
>> @@ -91,4 +95,21 @@ struct RAMBlock {
>>       ram_addr_t postcopy_length;
>>   };
>>   +struct RamBlockAttributes {
>> +    Object parent;
>> +
>> +    RAMBlock *ram_block;
>> +
>> +    /* A set bit in the bitmap means the RAM is populated (shared) */
>> +    unsigned bitmap_size;
>> +    unsigned long *bitmap;
> 
> So, initially, all memory starts out as private, correct?

Yes.

> 
> I guess this mimics what kvm_set_phys_mem() ends up doing, when it does
> the kvm_set_memory_attributes_private() call.
> 
> So there is a short period of inconsistency, between creating the
> RAMBlock and mapping it into the PA space.

I initially had such a patch [1] describing the inconsistency in the RFC
series. The bitmap was a 1-setting private bitmap at that time, to stay
consistent with guest_memfd, and it required an explicit bitmap_fill().

[1]
https://lore.kernel.org/qemu-devel/20240725072118.358923-6-chenyi.qiang@intel.com/

> 
> It might be worth spelling that out / documenting it somewhere.

OK. I dropped that documentation from the commit after changing the
bitmap to shared. I can add it back if a new version is needed.
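
The difference is only in the initialization. A minimal sketch, relying
on bitmap_new() returning a zero-filled allocation:

    /* RFC version: 1 = private, so "all private" needed an explicit fill */
    attr->bitmap = bitmap_new(attr->bitmap_size);
    bitmap_fill(attr->bitmap, attr->bitmap_size);

    /* v6 version: 1 = shared (populated), so the zeroed allocation
     * already means "all private" and no fill is needed */
    attr->bitmap = bitmap_new(attr->bitmap_size);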

> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-03  1:26     ` Chenyi Qiang
@ 2025-06-03  5:26       ` Gupta, Pankaj
  2025-06-03  6:33         ` Chenyi Qiang
  0 siblings, 1 reply; 30+ messages in thread
From: Gupta, Pankaj @ 2025-06-03  5:26 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson

On 6/3/2025 3:26 AM, Chenyi Qiang wrote:
> 
> 
> On 6/1/2025 5:58 PM, Gupta, Pankaj wrote:
>> On 5/30/2025 10:32 AM, Chenyi Qiang wrote:
>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>> discard") highlighted that subsystems like VFIO may disable RAM block
>>> discard. However, guest_memfd relies on discard operations for page
>>> conversion between private and shared memory, potentially leading to
>>> the stale IOMMU mapping issue when assigning hardware devices to
>>> confidential VMs via shared memory. To address this and allow shared
>>> device assignment, it is crucial to ensure the VFIO system refreshes
>>> its IOMMU mappings.
>>>
>>> RamDiscardManager is an existing interface (used by virtio-mem) to
>>> adjust VFIO mappings in relation to VM page assignment. Effectively page
>>> conversion is similar to hot-removing a page in one mode and adding it
>>> back in the other. Therefore, similar actions are required for page
>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>> facilitate this process.
>>>
>>> Since guest_memfd is not an object, it cannot directly implement the
>>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>>> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
>>> have a memory backend while others do not. Notably, virtual BIOS
>>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>>> backend.
>>>
>>> To manage RAMBlocks with guest_memfd, define a new object named
>>> RamBlockAttributes to implement the RamDiscardManager interface. This
>>> object can store the guest_memfd information such as bitmap for shared
>>> memory and the registered listeners for event notification. In the
>>> context of RamDiscardManager, shared state is analogous to populated, and
>>> private state is signified as discarded. To notify the conversion events,
>>> a new state_change() helper is exported for the users to notify the
>>> listeners like VFIO, so that VFIO can dynamically DMA map/unmap the
>>> shared mapping.
>>>
>>> Note that the memory state is tracked at the host page size granularity,
>>> as the minimum conversion size can be one page per request and VFIO
>>> expects the DMA mapping for a specific iova to be mapped and unmapped
>>> with the same granularity. Confidential VMs may perform partial
>>> conversions, such as conversions on small regions within larger ones.
>>> To prevent such invalid cases and until DMA mapping cut operation
>>> support is available, all operations are performed with 4K granularity.
>>>
>>> In addition, memory conversion failures cause QEMU to quit instead of
>>> resuming the guest or retrying the operation at present. It would be
>>> future work to add more error handling or rollback mechanisms once
>>> conversion failures are allowed. For example, in-place conversion of
>>> guest_memfd could retry the unmap operation during the conversion from
>>> shared to private. For now, keep the complex error handling out of the
>>> picture as it is not required.
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> ---
>>> Changes in v6:
>>>       - Change the object type name from RamBlockAttribute to
>>>         RamBlockAttributes. (David)
>>>       - Save the associated RAMBlock instead of MemoryRegion in
>>>         RamBlockAttributes. (David)
>>>       - Squash the state_change() helper introduction in this commit as
>>>         well as the mixture conversion case handling. (David)
>>>       - Change the block_size type from int to size_t and some cleanup in
>>>         validation check. (Alexey)
>>>       - Add a tracepoint to track the state changes. (Alexey)
>>>
>>> Changes in v5:
>>>       - Revert to use RamDiscardManager interface instead of introducing
>>>         new hierarchy of class to manage private/shared state, and keep
>>>         using the new name of RamBlockAttribute compared with the
>>>         MemoryAttributeManager in v3.
>>>       - Use *simple* version of object_define and object_declare since the
>>>         state_change() function is changed as an exported function instead
>>>         of a virtual function in later patch.
>>>       - Move the introduction of RamBlockAttribute field to this patch and
>>>         rename it to ram_shared. (Alexey)
>>>       - Call exit() when register/unregister fails. (Zhao)
>>>       - Add the ram-block-attribute.c to Memory API related part in
>>>         MAINTAINERS.
>>>
>>> Changes in v4:
>>>       - Change the name from memory-attribute-manager to
>>>         ram-block-attribute.
>>>       - Implement the newly-introduced PrivateSharedManager instead of
>>>         RamDiscardManager and change related commit message.
>>>       - Define the new object in ramblock.h instead of adding a new file.
>>> ---
>>>    MAINTAINERS                   |   1 +
>>>    include/system/ramblock.h     |  21 ++
>>>    system/meson.build            |   1 +
>>>    system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
>>>    system/trace-events           |   3 +
>>>    5 files changed, 506 insertions(+)
>>>    create mode 100644 system/ram-block-attributes.c
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 6dacd6d004..8ec39aa7f8 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -3149,6 +3149,7 @@ F: system/memory.c
>>>    F: system/memory_mapping.c
>>>    F: system/physmem.c
>>>    F: system/memory-internal.h
>>> +F: system/ram-block-attributes.c
>>>    F: scripts/coccinelle/memory-region-housekeeping.cocci
>>>      Memory devices
>>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>>> index d8a116ba99..1bab9e2dac 100644
>>> --- a/include/system/ramblock.h
>>> +++ b/include/system/ramblock.h
>>> @@ -22,6 +22,10 @@
>>>    #include "exec/cpu-common.h"
>>>    #include "qemu/rcu.h"
>>>    #include "exec/ramlist.h"
>>> +#include "system/hostmem.h"
>>> +
>>> +#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes"
>>> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES)
>>>      struct RAMBlock {
>>>        struct rcu_head rcu;
>>> @@ -91,4 +95,21 @@ struct RAMBlock {
>>>        ram_addr_t postcopy_length;
>>>    };
>>>    +struct RamBlockAttributes {
>>> +    Object parent;
>>> +
>>> +    RAMBlock *ram_block;
>>> +
>>> +    /* A set bit in the bitmap means the RAM is populated (shared) */
>>> +    unsigned bitmap_size;
>>> +    unsigned long *bitmap;
>>> +
>>> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
>>> +};
>>> +
>>> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block);
>>> +void ram_block_attributes_destroy(RamBlockAttributes *attr);
>>> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
>>> uint64_t offset,
>>> +                                      uint64_t size, bool to_discard);
>>> +
>>>    #endif
>>> diff --git a/system/meson.build b/system/meson.build
>>> index c2f0082766..2747dbde80 100644
>>> --- a/system/meson.build
>>> +++ b/system/meson.build
>>> @@ -17,6 +17,7 @@ libsystem_ss.add(files(
>>>      'dma-helpers.c',
>>>      'globals.c',
>>>      'ioport.c',
>>> +  'ram-block-attributes.c',
>>>      'memory_mapping.c',
>>>      'memory.c',
>>>      'physmem.c',
>>> diff --git a/system/ram-block-attributes.c b/system/ram-block-attributes.c
>>> new file mode 100644
>>> index 0000000000..514252413f
>>> --- /dev/null
>>> +++ b/system/ram-block-attributes.c
>>> @@ -0,0 +1,480 @@
>>> +/*
>>> + * QEMU ram block attributes
>>> + *
>>> + * Copyright Intel
>>> + *
>>> + * Author:
>>> + *      Chenyi Qiang <chenyi.qiang@intel.com>
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>> later.
>>> + * See the COPYING file in the top-level directory
>>> + *
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "qemu/error-report.h"
>>> +#include "system/ramblock.h"
>>> +#include "trace.h"
>>> +
>>> +OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(RamBlockAttributes,
>>> +                                          ram_block_attributes,
>>> +                                          RAM_BLOCK_ATTRIBUTES,
>>> +                                          OBJECT,
>>> +                                          { TYPE_RAM_DISCARD_MANAGER },
>>> +                                          { })
>>> +
>>> +static size_t
>>> +ram_block_attributes_get_block_size(const RamBlockAttributes *attr)
>>> +{
>>> +    /*
>>> +     * Page conversions can be requested at a granularity as small as
>>> +     * 4K (4K-sized and 4K-aligned), so use the host page size as the
>>> +     * granularity to track the memory attributes.
>>> +     */
>>> +    g_assert(attr && attr->ram_block);
>>> +    g_assert(attr->ram_block->page_size == qemu_real_host_page_size());
>>> +    return attr->ram_block->page_size;
>>> +}
>>> +
>>> +
>>> +static bool
>>> +ram_block_attributes_rdm_is_populated(const RamDiscardManager *rdm,
>>> +                                      const MemoryRegionSection
>>> *section)
>>> +{
>>> +    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    const uint64_t first_bit = section->offset_within_region /
>>> block_size;
>>> +    const uint64_t last_bit = first_bit +
>>> +        int128_get64(section->size) / block_size - 1;
>>> +    unsigned long first_discarded_bit;
>>> +
>>> +    first_discarded_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>>> +                                           first_bit);
>>> +    return first_discarded_bit > last_bit;
>>> +}
>>> +
>>> +typedef int (*ram_block_attributes_section_cb)(MemoryRegionSection *s,
>>> +                                               void *arg);
>>> +
>>> +static int
>>> +ram_block_attributes_notify_populate_cb(MemoryRegionSection *section,
>>> +                                        void *arg)
>>> +{
>>> +    RamDiscardListener *rdl = arg;
>>> +
>>> +    return rdl->notify_populate(rdl, section);
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_notify_discard_cb(MemoryRegionSection *section,
>>> +                                       void *arg)
>>> +{
>>> +    RamDiscardListener *rdl = arg;
>>> +
>>> +    rdl->notify_discard(rdl, section);
>>> +    return 0;
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_for_each_populated_section(const
>>> RamBlockAttributes *attr,
>>> +                                                MemoryRegionSection
>>> *section,
>>> +                                                void *arg,
>>> +
>>> ram_block_attributes_section_cb cb)
>>> +{
>>> +    unsigned long first_bit, last_bit;
>>> +    uint64_t offset, size;
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    int ret = 0;
>>> +
>>> +    first_bit = section->offset_within_region / block_size;
>>> +    first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>>> +                              first_bit);
>>> +
>>> +    while (first_bit < attr->bitmap_size) {
>>> +        MemoryRegionSection tmp = *section;
>>> +
>>> +        offset = first_bit * block_size;
>>> +        last_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
>>> +                                      first_bit + 1) - 1;
>>> +        size = (last_bit - first_bit + 1) * block_size;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>>> size)) {
>>> +            break;
>>> +        }
>>> +
>>> +        ret = cb(&tmp, arg);
>>> +        if (ret) {
>>> +            error_report("%s: Failed to notify RAM discard listener:
>>> %s",
>>> +                         __func__, strerror(-ret));
>>> +            break;
>>> +        }
>>> +
>>> +        first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>>> +                                  last_bit + 2);
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_for_each_discarded_section(const RamBlockAttributes *attr,
>>> +                                                MemoryRegionSection *section,
>>> +                                                void *arg,
>>> +                                                ram_block_attributes_section_cb cb)
>>> +{
>>> +    unsigned long first_bit, last_bit;
>>> +    uint64_t offset, size;
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    int ret = 0;
>>> +
>>> +    first_bit = section->offset_within_region / block_size;
>>> +    first_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
>>> +                                   first_bit);
>>> +
>>> +    while (first_bit < attr->bitmap_size) {
>>> +        MemoryRegionSection tmp = *section;
>>> +
>>> +        offset = first_bit * block_size;
>>> +        last_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>>> +                                 first_bit + 1) - 1;
>>> +        size = (last_bit - first_bit + 1) * block_size;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            break;
>>> +        }
>>> +
>>> +        ret = cb(&tmp, arg);
>>> +        if (ret) {
>>> +            error_report("%s: Failed to notify RAM discard listener:
>>> %s",
>>> +                         __func__, strerror(-ret));
>>> +            break;
>>> +        }
>>> +
>>> +        first_bit = find_next_zero_bit(attr->bitmap,
>>> +                                       attr->bitmap_size,
>>> +                                       last_bit + 2);
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static uint64_t
>>> +ram_block_attributes_rdm_get_min_granularity(const RamDiscardManager *rdm,
>>> +                                             const MemoryRegion *mr)
>>> +{
>>> +    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +
>>> +    g_assert(mr == attr->ram_block->mr);
>>> +    return ram_block_attributes_get_block_size(attr);
>>> +}
>>> +
>>> +static void
>>> +ram_block_attributes_rdm_register_listener(RamDiscardManager *rdm,
>>> +                                           RamDiscardListener *rdl,
>>> +                                           MemoryRegionSection *section)
>>> +{
>>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    int ret;
>>> +
>>> +    g_assert(section->mr == attr->ram_block->mr);
>>> +    rdl->section = memory_region_section_new_copy(section);
>>> +
>>> +    QLIST_INSERT_HEAD(&attr->rdl_list, rdl, next);
>>> +
>>> +    ret = ram_block_attributes_for_each_populated_section(attr, section, rdl,
>>> +                               ram_block_attributes_notify_populate_cb);
>>> +    if (ret) {
>>> +        error_report("%s: Failed to register RAM discard listener: %s",
>>> +                     __func__, strerror(-ret));
>>> +        exit(1);
>>> +    }
>>> +}
>>> +
>>> +static void
>>> +ram_block_attributes_rdm_unregister_listener(RamDiscardManager *rdm,
>>> +                                             RamDiscardListener *rdl)
>>> +{
>>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    int ret;
>>> +
>>> +    g_assert(rdl->section);
>>> +    g_assert(rdl->section->mr == attr->ram_block->mr);
>>> +
>>> +    if (rdl->double_discard_supported) {
>>> +        rdl->notify_discard(rdl, rdl->section);
>>> +    } else {
>>> +        ret = ram_block_attributes_for_each_populated_section(attr,
>>> +                rdl->section, rdl, ram_block_attributes_notify_discard_cb);
>>> +        if (ret) {
>>> +            error_report("%s: Failed to unregister RAM discard
>>> listener: %s",
>>> +                         __func__, strerror(-ret));
>>> +            exit(1);
>>> +        }
>>> +    }
>>> +
>>> +    memory_region_section_free_copy(rdl->section);
>>> +    rdl->section = NULL;
>>> +    QLIST_REMOVE(rdl, next);
>>> +}
>>> +
>>> +typedef struct RamBlockAttributesReplayData {
>>> +    ReplayRamDiscardState fn;
>>> +    void *opaque;
>>> +} RamBlockAttributesReplayData;
>>> +
>>> +static int ram_block_attributes_rdm_replay_cb(MemoryRegionSection *section,
>>> +                                              void *arg)
>>> +{
>>> +    RamBlockAttributesReplayData *data = arg;
>>> +
>>> +    return data->fn(section, data->opaque);
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_rdm_replay_populated(const RamDiscardManager *rdm,
>>> +                                          MemoryRegionSection *section,
>>> +                                          ReplayRamDiscardState replay_fn,
>>> +                                          void *opaque)
>>> +{
>>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
>>> +
>>> +    g_assert(section->mr == attr->ram_block->mr);
>>> +    return ram_block_attributes_for_each_populated_section(attr, section, &data,
>>> +                                     ram_block_attributes_rdm_replay_cb);
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_rdm_replay_discarded(const RamDiscardManager *rdm,
>>> +                                          MemoryRegionSection *section,
>>> +                                          ReplayRamDiscardState replay_fn,
>>> +                                          void *opaque)
>>> +{
>>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
>>> +
>>> +    g_assert(section->mr == attr->ram_block->mr);
>>> +    return ram_block_attributes_for_each_discarded_section(attr, section, &data,
>>> +                                     ram_block_attributes_rdm_replay_cb);
>>> +}
>>> +
>>> +static bool
>>> +ram_block_attributes_is_valid_range(RamBlockAttributes *attr, uint64_t offset,
>>> +                                    uint64_t size)
>>> +{
>>> +    MemoryRegion *mr = attr->ram_block->mr;
>>> +
>>> +    g_assert(mr);
>>> +
>>> +    uint64_t region_size = memory_region_size(mr);
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +
>>> +    if (!QEMU_IS_ALIGNED(offset, block_size) ||
>>> +        !QEMU_IS_ALIGNED(size, block_size)) {
>>> +        return false;
>>> +    }
>>> +    if (offset + size <= offset) {
>>> +        return false;
>>> +    }
>>> +    if (offset + size > region_size) {
>>> +        return false;
>>> +    }
>>> +    return true;
>>> +}
>>> +
>>> +static void ram_block_attributes_notify_discard(RamBlockAttributes *attr,
>>> +                                                uint64_t offset,
>>> +                                                uint64_t size)
>>> +{
>>> +    RamDiscardListener *rdl;
>>> +
>>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>>> +        MemoryRegionSection tmp = *rdl->section;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            continue;
>>> +        }
>>> +        rdl->notify_discard(rdl, &tmp);
>>> +    }
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_notify_populate(RamBlockAttributes *attr,
>>> +                                     uint64_t offset, uint64_t size)
>>> +{
>>> +    RamDiscardListener *rdl;
>>> +    int ret = 0;
>>> +
>>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>>> +        MemoryRegionSection tmp = *rdl->section;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            continue;
>>> +        }
>>> +        ret = rdl->notify_populate(rdl, &tmp);
>>> +        if (ret) {
>>> +            break;
>>> +        }
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static bool
>>> +ram_block_attributes_is_range_populated(RamBlockAttributes *attr,
>>> +                                                    uint64_t offset,
>>> +                                                    uint64_t size)
>>> +{
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    const unsigned long first_bit = offset / block_size;
>>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>>> +    unsigned long found_bit;
>>> +
>>> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>>> +                                   first_bit);
>>> +    return found_bit > last_bit;
>>> +}
>>> +
>>> +static bool
>>> +ram_block_attributes_is_range_discarded(RamBlockAttributes *attr,
>>> +                                        uint64_t offset, uint64_t size)
>>> +{
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    const unsigned long first_bit = offset / block_size;
>>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>>> +    unsigned long found_bit;
>>> +
>>> +    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
>>> +    return found_bit > last_bit;
>>> +}
>>> +
>>> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
>>> +                                      uint64_t offset, uint64_t size,
>>> +                                      bool to_discard)
>>> +{
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    const unsigned long first_bit = offset / block_size;
>>> +    const unsigned long nbits = size / block_size;
>>> +    bool is_range_discarded, is_range_populated;
>>> +    const uint64_t end = offset + size;
>>> +    unsigned long bit;
>>> +    uint64_t cur;
>>> +    int ret = 0;
>>> +
>>> +    if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>> +                     __func__, offset, size);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    is_range_discarded = ram_block_attributes_is_range_discarded(attr, offset,
>>> +                                                                 size);
>>> +    is_range_populated = ram_block_attributes_is_range_populated(attr, offset,
>>> +                                                                 size);
>>> +
>>> +    trace_ram_block_attributes_state_change(offset, size,
>>> +                                            is_range_discarded ? "discarded" :
>>> +                                            is_range_populated ? "populated" :
>>> +                                            "mixture",
>>> +                                            to_discard ? "discarded" :
>>> +                                            "populated");
>>> +    if (to_discard) {
>>> +        if (is_range_discarded) {
>>> +            /* Already private */
>>> +        } else if (is_range_populated) {
>>> +            /* Completely shared */
>>> +            bitmap_clear(attr->bitmap, first_bit, nbits);
>>> +            ram_block_attributes_notify_discard(attr, offset, size);
>>
>> In this patch series we are only maintaining the bitmap for RAM discard/
>> populate state, not for regular guest_memfd private/shared?
> 
> As mentioned in the changelog, "In the context of RamDiscardManager,
> shared state is analogous to populated, and private state is signified
> as discarded." To keep consistent with RamDiscardManager, I used the
> RAM "populated/discarded" terms in variable and function names.
> 
> Of course, we could use private/shared if we renamed RamDiscardManager
> to something like RamStateManager, but I haven't done that in this
> series, because I think we can also view the bitmap as the state of
> shared memory (shared discarded/shared populated) at present. The VFIO
> user only manipulates the DMA map/unmap of the shared mapping. (We need
> to consider how to extend the RDM framework to manage the
> shared/private/discard states in the future, when we need to
> distinguish private and discard states.)

As the function name 'ram_block_attributes_state_change' is generic,
maybe for now a metadata update for only two states (shared/private) is
enough, as it also aligns with the discard vs. populate states?
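
To make the two-state view concrete, here is a minimal sketch, assuming
the bitmap semantics of this patch (a set bit means populated/shared, a
clear bit means discarded/private, one bit per host page); the helper
name is invented for illustration:

static inline bool block_is_shared(const RamBlockAttributes *attr,
                                   uint64_t offset)
{
    /* bit set == populated (shared); bit clear == discarded (private) */
    return test_bit(offset / attr->ram_block->page_size, attr->bitmap);
}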

As we would also need the shared vs. private state metadata for other
CoCo operations, e.g. live migration, I am wondering whether having
this metadata already there would be helpful. This would also keep the
legacy interface (prior to in-place conversion) consistent (as
memory-attributes handling is a generic operation anyway).

I saw you had it in previous versions (v3) but removed it later?

https://lore.kernel.org/qemu-devel/20250310081837.13123-6-chenyi.qiang@intel.com/ 


Best regards,
Pankaj



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-03  5:26       ` Gupta, Pankaj
@ 2025-06-03  6:33         ` Chenyi Qiang
  2025-06-03  7:17           ` Gupta, Pankaj
  0 siblings, 1 reply; 30+ messages in thread
From: Chenyi Qiang @ 2025-06-03  6:33 UTC (permalink / raw)
  To: Gupta, Pankaj, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 6/3/2025 1:26 PM, Gupta, Pankaj wrote:
> On 6/3/2025 3:26 AM, Chenyi Qiang wrote:
>>
>>
>> On 6/1/2025 5:58 PM, Gupta, Pankaj wrote:
>>> On 5/30/2025 10:32 AM, Chenyi Qiang wrote:
>>>> [... full patch commit message, changelog, and code trimmed ...]
>>>> +    if (to_discard) {
>>>> +        if (is_range_discarded) {
>>>> +            /* Already private */
>>>> +        } else if (is_range_populated) {
>>>> +            /* Completely shared */
>>>> +            bitmap_clear(attr->bitmap, first_bit, nbits);
>>>> +            ram_block_attributes_notify_discard(attr, offset, size);
>>>
>>> In this patch series we are only maintaining the bitmap for RAM discard/
>>> populate state, not for regular guest_memfd private/shared?
>>
>> As mentioned in the changelog, "In the context of RamDiscardManager,
>> shared state is analogous to populated, and private state is signified
>> as discarded." To keep consistent with RamDiscardManager, I used the
>> RAM "populated/discarded" terms in variable and function names.
>>
>> Of course, we could use private/shared if we renamed RamDiscardManager
>> to something like RamStateManager, but I haven't done that in this
>> series, because I think we can also view the bitmap as the state of
>> shared memory (shared discarded/shared populated) at present. The VFIO
>> user only manipulates the DMA map/unmap of the shared mapping. (We need
>> to consider how to extend the RDM framework to manage the
>> shared/private/discard states in the future, when we need to
>> distinguish private and discard states.)
> 
> As the function name 'ram_block_attributes_state_change' is generic,
> maybe for now a metadata update for only two states (shared/private) is
> enough, as it also aligns with the discard vs. populate states?

Yes, it is enough to treat the shared/private states as aligned with
populate/discard at present, as the only user is the VFIO shared mapping.
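
For reference, a simplified sketch of what the VFIO listener side does
with these events (the real code lives in hw/vfio; dma_map_section() and
dma_unmap_section() are placeholders, not actual API names):

static int shared_notify_populate(RamDiscardListener *rdl,
                                  MemoryRegionSection *section)
{
    /* the range became shared: establish the IOMMU mapping */
    return dma_map_section(section);
}

static void shared_notify_discard(RamDiscardListener *rdl,
                                  MemoryRegionSection *section)
{
    /* the range became private: tear the IOMMU mapping down */
    dma_unmap_section(section);
}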

> 
> As we would also need the shared vs. private state metadata for other
> CoCo operations, e.g. live migration, I am wondering whether having
> this metadata already there would be helpful. This would also keep the
> legacy interface (prior to in-place conversion) consistent (as
> memory-attributes handling is a generic operation anyway).

When live migration in CoCo VMs is introduced, I think it needs to
distinguish between the discard and private states. It cannot simply
skip the discarded parts any more and needs special handling for the
private parts. So still, we would have to extend the interface if we
had to make it available in advance.

> 
> I saw you had it in previous versions (v3) but removed it later?

In the newer versions, I changed the generic wrapper around
klass->state_change() (ram_block_attributes_state_change()) into an
exported function and squashed it into this patch, because there is no
derived class of RamBlockAttributes that would implement another callback.
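
As a usage sketch, a conversion handler would simply call the exported
helper with the prototype this patch adds to ramblock.h; the
handle_conversion() wrapper itself is hypothetical:

static int handle_conversion(RamBlockAttributes *attr, uint64_t offset,
                             uint64_t size, bool to_private)
{
    /* shared -> private is a "discard" from the RDM point of view */
    return ram_block_attributes_state_change(attr, offset, size,
                                             to_private /* to_discard */);
}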

> 
> https://lore.kernel.org/qemu-devel/20250310081837.13123-6-chenyi.qiang@intel.com/
> 
> Best regards,
> Pankaj
> 
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-03  6:33         ` Chenyi Qiang
@ 2025-06-03  7:17           ` Gupta, Pankaj
  2025-06-03  7:41             ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Gupta, Pankaj @ 2025-06-03  7:17 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson,
	Lindgren, Tony, Maloor, Kishen

+CC Tony & Kishen

>>>> In this patch series we are only maintaining the bitmap for RAM discard/
>>>> populate state, not for regular guest_memfd private/shared?
>>>
>>> As mentioned in the changelog, "In the context of RamDiscardManager,
>>> shared state is analogous to populated, and private state is signified
>>> as discarded." To keep consistent with RamDiscardManager, I used the
>>> RAM "populated/discarded" terms in variable and function names.
>>>
>>> Of course, we could use private/shared if we renamed RamDiscardManager
>>> to something like RamStateManager, but I haven't done that in this
>>> series, because I think we can also view the bitmap as the state of
>>> shared memory (shared discarded/shared populated) at present. The VFIO
>>> user only manipulates the DMA map/unmap of the shared mapping. (We need
>>> to consider how to extend the RDM framework to manage the
>>> shared/private/discard states in the future, when we need to
>>> distinguish private and discard states.)
>>
>> As the function name 'ram_block_attributes_state_change' is generic,
>> maybe for now a metadata update for only two states (shared/private) is
>> enough, as it also aligns with the discard vs. populate states?
> 
> Yes, it is enough to treat the shared/private states as aligned with
> populate/discard at present, as the only user is the VFIO shared mapping.
> 
>>
>> As we would also need the shared vs. private state metadata for other
>> CoCo operations, e.g. live migration, I am wondering whether having
>> this metadata already there would be helpful. This would also keep the
>> legacy interface (prior to in-place conversion) consistent (as
>> memory-attributes handling is a generic operation anyway).
> 
> When live migration in CoCo VMs is introduced, I think it needs to
> distinguish between the discard and private states. It cannot simply
> skip the discarded parts any more and needs special handling for the
> private parts. So still, we would have to extend the interface if we
> had to make it available in advance.

You mean even the discard and private states would need different
handling, and we cannot use a common per-RAMBlock object metadata
store? That was the reason I suggested in v4 to go with an abstract
base class holding the common bits, with the implementation based on a
specific derived class. That seemed cleaner and extensible for the
future; otherwise we would need a major overhaul of this code later.
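
Roughly, the idea would look something like the following sketch; all
names here are invented, and the v4 series expressed this as a QOM class
hierarchy rather than a plain ops table:

typedef struct RamStateManagerOps {
    /* common hook: flip a range between states; each derived
     * implementation decides what the transition means */
    int (*state_change)(void *opaque, uint64_t offset, uint64_t size,
                        bool to_private);
} RamStateManagerOps;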


Thanks,
Pankaj


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-03  7:17           ` Gupta, Pankaj
@ 2025-06-03  7:41             ` David Hildenbrand
  2025-06-03  9:45               ` Gupta, Pankaj
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-06-03  7:41 UTC (permalink / raw)
  To: Gupta, Pankaj, Chenyi Qiang, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson,
	Lindgren, Tony, Maloor, Kishen

On 03.06.25 09:17, Gupta, Pankaj wrote:
> +CC Tony & Kishen
> 
>>>>> In this patch series we are only maintaining the bitmap for RAM discard/
>>>>> populate state, not for regular guest_memfd private/shared?
>>>>
>>>> As mentioned in the changelog, "In the context of RamDiscardManager,
>>>> shared state is analogous to populated, and private state is signified
>>>> as discarded." To keep consistent with RamDiscardManager, I used the
>>>> RAM "populated/discarded" terms in variable and function names.
>>>>
>>>> Of course, we could use private/shared if we renamed RamDiscardManager
>>>> to something like RamStateManager, but I haven't done that in this
>>>> series, because I think we can also view the bitmap as the state of
>>>> shared memory (shared discarded/shared populated) at present. The VFIO
>>>> user only manipulates the DMA map/unmap of the shared mapping. (We need
>>>> to consider how to extend the RDM framework to manage the
>>>> shared/private/discard states in the future, when we need to
>>>> distinguish private and discard states.)
>>>
>>> As the function name 'ram_block_attributes_state_change' is generic,
>>> maybe for now a metadata update for only two states (shared/private) is
>>> enough, as it also aligns with the discard vs. populate states?
>>
>> Yes, it is enough to treat the shared/private states as aligned with
>> populate/discard at present, as the only user is the VFIO shared mapping.
>>
>>>
>>> As we would also need the shared vs. private state metadata for other
>>> CoCo operations, e.g. live migration, I am wondering whether having
>>> this metadata already there would be helpful. This would also keep the
>>> legacy interface (prior to in-place conversion) consistent (as
>>> memory-attributes handling is a generic operation anyway).
>>
>> When live migration in CoCo VMs is introduced, I think it needs to
>> distinguish between the discard and private states. It cannot simply
>> skip the discarded parts any more and needs special handling for the
>> private parts. So still, we would have to extend the interface if we
>> had to make it available in advance.
> 
> You mean even the discard and private states would need different handling

I am pretty sure they would in any case? Shared memory you can simply
copy; private memory has to be extracted and placed differently.
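
Purely as an invented illustration of why two states may not be enough
for migration, the three cases would dispatch differently:

typedef enum {
    GMEM_STATE_SHARED,    /* populated and host-accessible: copy directly */
    GMEM_STATE_PRIVATE,   /* encrypted: needs a platform-specific export path */
    GMEM_STATE_DISCARDED, /* no backing at all: skip */
} GmemPageState;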

If we run into problems with live-migration, we can investigate how to 
extend the current approach.

Just like with memory hotplug / virtio-mem, I shared some ideas on how
to make it work, but holding up this work when we don't even know what
exactly we will need for other future use cases does not sound too
plausible.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-03  7:41             ` David Hildenbrand
@ 2025-06-03  9:45               ` Gupta, Pankaj
  2025-06-03 12:05                 ` Chenyi Qiang
  0 siblings, 1 reply; 30+ messages in thread
From: Gupta, Pankaj @ 2025-06-03  9:45 UTC (permalink / raw)
  To: David Hildenbrand, Chenyi Qiang, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson,
	Lindgren, Tony, Maloor, Kishen

On 6/3/2025 9:41 AM, David Hildenbrand wrote:
> On 03.06.25 09:17, Gupta, Pankaj wrote:
>> +CC Tony & Kishen
>>
>>>>>> In this patch series we are only maintaining the bitmap for RAM
>>>>>> discard/populate state, not for regular guest_memfd private/shared?
>>>>>
>>>>> As mentioned in the changelog, "In the context of RamDiscardManager,
>>>>> shared state is analogous to populated, and private state is signified
>>>>> as discarded." To keep consistent with RamDiscardManager, I used the
>>>>> RAM "populated/discarded" terms in variable and function names.
>>>>>
>>>>> Of course, we could use private/shared if we renamed RamDiscardManager
>>>>> to something like RamStateManager, but I haven't done that in this
>>>>> series, because I think we can also view the bitmap as the state of
>>>>> shared memory (shared discarded/shared populated) at present. The VFIO
>>>>> user only manipulates the DMA map/unmap of the shared mapping. (We need
>>>>> to consider how to extend the RDM framework to manage the
>>>>> shared/private/discard states in the future, when we need to
>>>>> distinguish private and discard states.)
>>>>
>>>> As the function name 'ram_block_attributes_state_change' is generic,
>>>> maybe for now a metadata update for only two states (shared/private) is
>>>> enough, as it also aligns with the discard vs. populate states?
>>>
>>> Yes, it is enough to treat the shared/private states as aligned with
>>> populate/discard at present, as the only user is the VFIO shared mapping.
>>>
>>>>
>>>> As we would also need the shared vs. private state metadata for other
>>>> CoCo operations, e.g. live migration, I am wondering whether having
>>>> this metadata already there would be helpful. This would also keep the
>>>> legacy interface (prior to in-place conversion) consistent (as
>>>> memory-attributes handling is a generic operation anyway).
>>>
>>> When live migration in CoCo VMs is introduced, I think it needs to
>>> distinguish between the discard and private states. It cannot simply
>>> skip the discarded parts any more and needs special handling for the
>>> private parts. So still, we would have to extend the interface if we
>>> had to make it available in advance.
>>
>> You mean even the discard and private states would need different handling
> 
> I am pretty sure they would in any case? Shared memory you can simply
> copy; private memory has to be extracted and placed differently.
> 
> If we run into problems with live-migration, we can investigate how to 
> extend the current approach.

Not problems. My understanding was: the newly introduced per-RAMBlock
bitmap gets maintained for the RAMBlock's shared <-> private
conversions in addition to the VFIO discard <-> populate conversions.
Since the per-RAMBlock bitmap sets are disjoint for both of the above
cases, the bitmap could be reused for the live migration use case as
well when deciding which page is private vs. shared.

Seems it was part of the series till v3 & v4 (in a different design),
not anymore though. Of course, it can be added later :)
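
A minimal sketch of that reuse idea, assuming the bitmap stays as
defined in this patch (set bit == shared) and that migration code simply
consults it; the function below is hypothetical:

static void migrate_gmem_pages(const RamBlockAttributes *attr,
                               uint64_t npages)
{
    for (uint64_t i = 0; i < npages; i++) {
        if (test_bit(i, attr->bitmap)) {
            /* shared/populated: migrate as normal RAM */
        } else {
            /* private: would need CoCo-specific handling */
        }
    }
}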

> 
> Just like with memory hotplug / virtio-mem, I shared some ideas on how
> to make it work, but holding up this work when we don't even know what
> exactly we will need for other future use cases does not sound too
> plausible.
> 

Of course we should not hold up this series. But thanks, Chenyi Qiang,
for your efforts in trying different implementations based on the
information we had!

With or without the shared <-> private bitmap update, feel free to add:

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>


Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-03  9:45               ` Gupta, Pankaj
@ 2025-06-03 12:05                 ` Chenyi Qiang
  0 siblings, 0 replies; 30+ messages in thread
From: Chenyi Qiang @ 2025-06-03 12:05 UTC (permalink / raw)
  To: Gupta, Pankaj, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson,
	Lindgren, Tony, Maloor, Kishen



On 6/3/2025 5:45 PM, Gupta, Pankaj wrote:
> On 6/3/2025 9:41 AM, David Hildenbrand wrote:
>> On 03.06.25 09:17, Gupta, Pankaj wrote:
>>> +CC Tony & Kishen
>>>
>>>>>>> In this patch series we are only maintaining the bitmap for RAM
>>>>>>> discard/populate state, not for regular guest_memfd private/shared?
>>>>>>
>>>>>> As mentioned in the changelog, "In the context of RamDiscardManager,
>>>>>> shared state is analogous to populated, and private state is
>>>>>> signified as discarded." To keep consistent with RamDiscardManager,
>>>>>> I used the RAM "populated/discarded" terms in variable and function
>>>>>> names.
>>>>>>
>>>>>> Of course, we could use private/shared if we renamed
>>>>>> RamDiscardManager to something like RamStateManager, but I haven't
>>>>>> done that in this series, because I think we can also view the
>>>>>> bitmap as the state of shared memory (shared discarded/shared
>>>>>> populated) at present. The VFIO user only manipulates the DMA
>>>>>> map/unmap of the shared mapping. (We need to consider how to extend
>>>>>> the RDM framework to manage the shared/private/discard states in
>>>>>> the future, when we need to distinguish private and discard states.)
>>>>>
>>>>> As the function name 'ram_block_attributes_state_change' is generic,
>>>>> maybe for now a metadata update for only two states (shared/private)
>>>>> is enough, as it also aligns with the discard vs. populate states?
>>>>
>>>> Yes, it is enough to treat the shared/private states as aligned with
>>>> populate/discard at present, as the only user is the VFIO shared mapping.
>>>>
>>>>>
>>>>> As we would also need the shared vs. private state metadata for other
>>>>> CoCo operations, e.g. live migration, I am wondering whether having
>>>>> this metadata already there would be helpful. This would also keep
>>>>> the legacy interface (prior to in-place conversion) consistent (as
>>>>> memory-attributes handling is a generic operation anyway).
>>>>
>>>> When live migration in CoCo VMs is introduced, I think it needs to
>>>> distinguish between the discard and private states. It cannot simply
>>>> skip the discarded parts any more and needs special handling for the
>>>> private parts. So still, we would have to extend the interface if we
>>>> had to make it available in advance.
>>>
>>> You mean even the discard and private states would need different handling
>>
>> I am pretty sure they would in any case? Shared memory you can simply
>> copy; private memory has to be extracted and placed differently.
>>
>> If we run into problems with live-migration, we can investigate how to
>> extend the current approach.
> 
> Not problems. My understanding was: the newly introduced per-RAMBlock
> bitmap gets maintained for the RAMBlock's shared <-> private
> conversions in addition to the VFIO discard <-> populate conversions.
> Since the per-RAMBlock bitmap sets are disjoint for both of the above
> cases, the bitmap could be reused for the live migration use case as
> well when deciding which page is private vs. shared.
> 
> Seems it was part of the series till v3 & v4 (in a different design),
> not anymore though. Of course, it can be added later :)

Yeah. I think we can consider the extension in a separate series and
view it as preparation work for CoCo live migration / virtio-mem
support. Since v4 was considered a wrong direction, maybe David's idea
[1] is worth a try.


[1]
https://lore.kernel.org/qemu-devel/d1a71e00-243b-4751-ab73-c05a4e090d58@redhat.com/

> 
>>
>> Just like with memory hotplug / virtio-mem, I shared some ideas on how
>> to make it work, but holding up this work when we don't even know what
>> exactly we will need for other future use cases does not sound too
>> plausible.
>>
> 
> Of course we should not hold up this series. But thanks, Chenyi Qiang,
> for your efforts in trying different implementations based on the
> information we had!
> 
> With or without the shared <-> private bitmap update, feel free to add:
> 
> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>

Thanks Pankaj for your review!

> 
> 
> Thanks,
> Pankaj
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-05-30  8:32 ` [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd Chenyi Qiang
  2025-06-01  9:58   ` Gupta, Pankaj
  2025-06-02 21:10   ` David Hildenbrand
@ 2025-06-04 11:04   ` Alexey Kardashevskiy
  2025-06-05  0:35     ` Alexey Kardashevskiy
  2025-06-05  6:01     ` Chenyi Qiang
  2 siblings, 2 replies; 30+ messages in thread
From: Alexey Kardashevskiy @ 2025-06-04 11:04 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 30/5/25 18:32, Chenyi Qiang wrote:
> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> discard") highlighted that subsystems like VFIO may disable RAM block
> discard. However, guest_memfd relies on discard operations for page
> conversion between private and shared memory, potentially leading to
> the stale IOMMU mapping issue when assigning hardware devices to
> confidential VMs via shared memory. To address this and allow shared
> device assignment, it is crucial to ensure the VFIO system refreshes
> its IOMMU mappings.
> 
> RamDiscardManager is an existing interface (used by virtio-mem) to
> adjust VFIO mappings in relation to VM page assignment. Effectively page
> conversion is similar to hot-removing a page in one mode and adding it
> back in the other. Therefore, similar actions are required for page
> conversion events. Introduce the RamDiscardManager to guest_memfd to
> facilitate this process.
> 
> Since guest_memfd is not an object, it cannot directly implement the
> RamDiscardManager interface. Implementing it in HostMemoryBackend is
> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
> have a memory backend while others do not. Notably, virtual BIOS
> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
> backend.
> 
> To manage RAMBlocks with guest_memfd, define a new object named
> RamBlockAttributes to implement the RamDiscardManager interface. This
> object can store the guest_memfd information such as bitmap for shared
> memory and the registered listeners for event notification. In the
> context of RamDiscardManager, shared state is analogous to populated, and
> private state is signified as discarded. To notify the conversion events,
> a new state_change() helper is exported for the users to notify the
> listeners like VFIO, so that VFIO can dynamically DMA map/unmap the
> shared mapping.
> 
> Note that the memory state is tracked at the host page size granularity,
> as the minimum conversion size can be one page per request and VFIO
> expects the DMA mapping for a specific iova to be mapped and unmapped
> with the same granularity. Confidential VMs may perform partial
> conversions, such as conversions on small regions within larger ones.
> To prevent such invalid cases and until DMA mapping cut operation
> support is available, all operations are performed with 4K granularity.
> 
> In addition, memory conversion failures cause QEMU to quit instead of
> resuming the guest or retrying the operation at present. It would be
> future work to add more error handling or rollback mechanisms once
> conversion failures are allowed. For example, in-place conversion of
> guest_memfd could retry the unmap operation during the conversion from
> shared to private. For now, keep the complex error handling out of the
> picture as it is not required.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v6:
>      - Change the object type name from RamBlockAttribute to
>        RamBlockAttributes. (David)
>      - Save the associated RAMBlock instead MemoryRegion in
>        RamBlockAttributes. (David)
>      - Squash the state_change() helper introduction in this commit as
>        well as the mixture conversion case handling. (David)
>      - Change the block_size type from int to size_t and some cleanup in
>        validation check. (Alexey)
>      - Add a tracepoint to track the state changes. (Alexey)
> 
> Changes in v5:
>      - Revert to use RamDiscardManager interface instead of introducing
>        new hierarchy of class to manage private/shared state, and keep
>        using the new name of RamBlockAttribute compared with the
>        MemoryAttributeManager in v3.
>      - Use *simple* version of object_define and object_declare since the
>        state_change() function is changed as an exported function instead
>        of a virtual function in later patch.
>      - Move the introduction of RamBlockAttribute field to this patch and
>        rename it to ram_shared. (Alexey)
>      - call the exit() when register/unregister failed. (Zhao)
>      - Add the ram-block-attribute.c to Memory API related part in
>        MAINTAINERS.
> 
> Changes in v4:
>      - Change the name from memory-attribute-manager to
>        ram-block-attribute.
>      - Implement the newly-introduced PrivateSharedManager instead of
>        RamDiscardManager and change related commit message.
>      - Define the new object in ramblock.h instead of adding a new file.
> ---
>   MAINTAINERS                   |   1 +
>   include/system/ramblock.h     |  21 ++
>   system/meson.build            |   1 +
>   system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
>   system/trace-events           |   3 +
>   5 files changed, 506 insertions(+)
>   create mode 100644 system/ram-block-attributes.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 6dacd6d004..8ec39aa7f8 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3149,6 +3149,7 @@ F: system/memory.c
>   F: system/memory_mapping.c
>   F: system/physmem.c
>   F: system/memory-internal.h
> +F: system/ram-block-attributes.c
>   F: scripts/coccinelle/memory-region-housekeeping.cocci
>   
>   Memory devices
> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
> index d8a116ba99..1bab9e2dac 100644
> --- a/include/system/ramblock.h
> +++ b/include/system/ramblock.h
> @@ -22,6 +22,10 @@
>   #include "exec/cpu-common.h"
>   #include "qemu/rcu.h"
>   #include "exec/ramlist.h"
> +#include "system/hostmem.h"
> +
> +#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes"
> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES)
>   
>   struct RAMBlock {
>       struct rcu_head rcu;
> @@ -91,4 +95,21 @@ struct RAMBlock {
>       ram_addr_t postcopy_length;
>   };
>   
> +struct RamBlockAttributes {
> +    Object parent;
> +
> +    RAMBlock *ram_block;
> +
> +    /* 1-setting of the bitmap represents ram is populated (shared) */
> +    unsigned bitmap_size;
> +    unsigned long *bitmap;
> +
> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
> +};
> +
> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block);
> +void ram_block_attributes_destroy(RamBlockAttributes *attr);
> +int ram_block_attributes_state_change(RamBlockAttributes *attr, uint64_t offset,
> +                                      uint64_t size, bool to_discard);
> +
>   #endif
> diff --git a/system/meson.build b/system/meson.build
> index c2f0082766..2747dbde80 100644
> --- a/system/meson.build
> +++ b/system/meson.build
> @@ -17,6 +17,7 @@ libsystem_ss.add(files(
>     'dma-helpers.c',
>     'globals.c',
>     'ioport.c',
> +  'ram-block-attributes.c',
>     'memory_mapping.c',
>     'memory.c',
>     'physmem.c',
> diff --git a/system/ram-block-attributes.c b/system/ram-block-attributes.c
> new file mode 100644
> index 0000000000..514252413f
> --- /dev/null
> +++ b/system/ram-block-attributes.c
> @@ -0,0 +1,480 @@
> +/*
> + * QEMU ram block attributes
> + *
> + * Copyright Intel
> + *
> + * Author:
> + *      Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "system/ramblock.h"
> +#include "trace.h"
> +
> +OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(RamBlockAttributes,
> +                                          ram_block_attributes,
> +                                          RAM_BLOCK_ATTRIBUTES,
> +                                          OBJECT,
> +                                          { TYPE_RAM_DISCARD_MANAGER },
> +                                          { })
> +
> +static size_t
> +ram_block_attributes_get_block_size(const RamBlockAttributes *attr)
> +{
> +    /*
> +     * Because page conversions are performed at a minimum size of 4K and
> +     * with 4K alignment, use the host page size as the granularity to
> +     * track the memory attribute.
> +     */
> +    g_assert(attr && attr->ram_block);
> +    g_assert(attr->ram_block->page_size == qemu_real_host_page_size());
> +    return attr->ram_block->page_size;
> +}
> +
> +
> +static bool
> +ram_block_attributes_rdm_is_populated(const RamDiscardManager *rdm,
> +                                      const MemoryRegionSection *section)
> +{
> +    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +    const uint64_t first_bit = section->offset_within_region / block_size;
> +    const uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
> +    unsigned long first_discarded_bit;
> +
> +    first_discarded_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
> +                                           first_bit);
> +    return first_discarded_bit > last_bit;
> +}
> +
> +typedef int (*ram_block_attributes_section_cb)(MemoryRegionSection *s,
> +                                               void *arg);
> +
> +static int
> +ram_block_attributes_notify_populate_cb(MemoryRegionSection *section,
> +                                        void *arg)
> +{
> +    RamDiscardListener *rdl = arg;
> +
> +    return rdl->notify_populate(rdl, section);
> +}
> +
> +static int
> +ram_block_attributes_notify_discard_cb(MemoryRegionSection *section,
> +                                       void *arg)
> +{
> +    RamDiscardListener *rdl = arg;
> +
> +    rdl->notify_discard(rdl, section);
> +    return 0;
> +}
> +
> +static int
> +ram_block_attributes_for_each_populated_section(const RamBlockAttributes *attr,
> +                                                MemoryRegionSection *section,
> +                                                void *arg,
> +                                                ram_block_attributes_section_cb cb)
> +{
> +    unsigned long first_bit, last_bit;
> +    uint64_t offset, size;
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +    int ret = 0;
> +
> +    first_bit = section->offset_within_region / block_size;
> +    first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
> +                              first_bit);
> +
> +    while (first_bit < attr->bitmap_size) {
> +        MemoryRegionSection tmp = *section;
> +
> +        offset = first_bit * block_size;
> +        last_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
> +                                      first_bit + 1) - 1;
> +        size = (last_bit - first_bit + 1) * block_size;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            break;
> +        }
> +
> +        ret = cb(&tmp, arg);
> +        if (ret) {
> +            error_report("%s: Failed to notify RAM discard listener: %s",
> +                         __func__, strerror(-ret));
> +            break;
> +        }
> +
> +        first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
> +                                  last_bit + 2);
> +    }
> +
> +    return ret;
> +}
> +
> +static int
> +ram_block_attributes_for_each_discarded_section(const RamBlockAttributes *attr,
> +                                                MemoryRegionSection *section,
> +                                                void *arg,
> +                                                ram_block_attributes_section_cb cb)
> +{
> +    unsigned long first_bit, last_bit;
> +    uint64_t offset, size;
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +    int ret = 0;
> +
> +    first_bit = section->offset_within_region / block_size;
> +    first_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
> +                                   first_bit);
> +
> +    while (first_bit < attr->bitmap_size) {
> +        MemoryRegionSection tmp = *section;
> +
> +        offset = first_bit * block_size;
> +        last_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
> +                                 first_bit + 1) - 1;
> +        size = (last_bit - first_bit + 1) * block_size;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            break;
> +        }
> +
> +        ret = cb(&tmp, arg);
> +        if (ret) {
> +            error_report("%s: Failed to notify RAM discard listener: %s",
> +                         __func__, strerror(-ret));
> +            break;
> +        }
> +
> +        first_bit = find_next_zero_bit(attr->bitmap,
> +                                       attr->bitmap_size,
> +                                       last_bit + 2);
> +    }
> +
> +    return ret;
> +}
> +
> +static uint64_t
> +ram_block_attributes_rdm_get_min_granularity(const RamDiscardManager *rdm,
> +                                             const MemoryRegion *mr)
> +{
> +    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
> +
> +    g_assert(mr == attr->ram_block->mr);
> +    return ram_block_attributes_get_block_size(attr);
> +}
> +
> +static void
> +ram_block_attributes_rdm_register_listener(RamDiscardManager *rdm,
> +                                           RamDiscardListener *rdl,
> +                                           MemoryRegionSection *section)
> +{
> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
> +    int ret;
> +
> +    g_assert(section->mr == attr->ram_block->mr);
> +    rdl->section = memory_region_section_new_copy(section);
> +
> +    QLIST_INSERT_HEAD(&attr->rdl_list, rdl, next);
> +
> +    ret = ram_block_attributes_for_each_populated_section(attr, section, rdl,
> +                                    ram_block_attributes_notify_populate_cb);
> +    if (ret) {
> +        error_report("%s: Failed to register RAM discard listener: %s",
> +                     __func__, strerror(-ret));
> +        exit(1);
> +    }
> +}
> +
> +static void
> +ram_block_attributes_rdm_unregister_listener(RamDiscardManager *rdm,
> +                                             RamDiscardListener *rdl)
> +{
> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
> +    int ret;
> +
> +    g_assert(rdl->section);
> +    g_assert(rdl->section->mr == attr->ram_block->mr);
> +
> +    if (rdl->double_discard_supported) {
> +        rdl->notify_discard(rdl, rdl->section);
> +    } else {
> +        ret = ram_block_attributes_for_each_populated_section(attr,
> +                rdl->section, rdl, ram_block_attributes_notify_discard_cb);
> +        if (ret) {
> +            error_report("%s: Failed to unregister RAM discard listener: %s",
> +                         __func__, strerror(-ret));
> +            exit(1);
> +        }
> +    }
> +
> +    memory_region_section_free_copy(rdl->section);
> +    rdl->section = NULL;
> +    QLIST_REMOVE(rdl, next);
> +}
> +
> +typedef struct RamBlockAttributesReplayData {
> +    ReplayRamDiscardState fn;
> +    void *opaque;
> +} RamBlockAttributesReplayData;
> +
> +static int ram_block_attributes_rdm_replay_cb(MemoryRegionSection *section,
> +                                              void *arg)
> +{
> +    RamBlockAttributesReplayData *data = arg;
> +
> +    return data->fn(section, data->opaque);
> +}
> +
> +static int
> +ram_block_attributes_rdm_replay_populated(const RamDiscardManager *rdm,
> +                                          MemoryRegionSection *section,
> +                                          ReplayRamDiscardState replay_fn,
> +                                          void *opaque)
> +{
> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
> +    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
> +
> +    g_assert(section->mr == attr->ram_block->mr);
> +    return ram_block_attributes_for_each_populated_section(attr, section, &data,
> +                                            ram_block_attributes_rdm_replay_cb);
> +}
> +
> +static int
> +ram_block_attributes_rdm_replay_discarded(const RamDiscardManager *rdm,
> +                                          MemoryRegionSection *section,
> +                                          ReplayRamDiscardState replay_fn,
> +                                          void *opaque)
> +{
> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
> +    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
> +
> +    g_assert(section->mr == attr->ram_block->mr);
> +    return ram_block_attributes_for_each_discarded_section(attr, section, &data,
> +                                            ram_block_attributes_rdm_replay_cb);
> +}
> +
> +static bool
> +ram_block_attributes_is_valid_range(RamBlockAttributes *attr, uint64_t offset,
> +                                    uint64_t size)
> +{
> +    MemoryRegion *mr = attr->ram_block->mr;
> +
> +    g_assert(mr);
> +
> +    uint64_t region_size = memory_region_size(mr);
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +
> +    if (!QEMU_IS_ALIGNED(offset, block_size) ||
> +        !QEMU_IS_ALIGNED(size, block_size)) {
> +        return false;
> +    }
> +    if (offset + size <= offset) {
> +        return false;
> +    }
> +    if (offset + size > region_size) {
> +        return false;
> +    }
> +    return true;
> +}
> +
> +static void ram_block_attributes_notify_discard(RamBlockAttributes *attr,
> +                                                uint64_t offset,
> +                                                uint64_t size)
> +{
> +    RamDiscardListener *rdl;
> +
> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
> +        MemoryRegionSection tmp = *rdl->section;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            continue;
> +        }
> +        rdl->notify_discard(rdl, &tmp);
> +    }
> +}
> +
> +static int
> +ram_block_attributes_notify_populate(RamBlockAttributes *attr,
> +                                     uint64_t offset, uint64_t size)
> +{
> +    RamDiscardListener *rdl;
> +    int ret = 0;
> +
> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
> +        MemoryRegionSection tmp = *rdl->section;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            continue;
> +        }
> +        ret = rdl->notify_populate(rdl, &tmp);
> +        if (ret) {
> +            break;
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +static bool ram_block_attributes_is_range_populated(RamBlockAttributes *attr,
> +                                                    uint64_t offset,
> +                                                    uint64_t size)
> +{
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +    const unsigned long first_bit = offset / block_size;
> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
> +    unsigned long found_bit;
> +
> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
> +                                   first_bit);
> +    return found_bit > last_bit;
> +}
> +
> +static bool
> +ram_block_attributes_is_range_discarded(RamBlockAttributes *attr,
> +                                        uint64_t offset, uint64_t size)
> +{
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +    const unsigned long first_bit = offset / block_size;
> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
> +    unsigned long found_bit;
> +
> +    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
> +    return found_bit > last_bit;
> +}
> +
> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
> +                                      uint64_t offset, uint64_t size,
> +                                      bool to_discard)
> +{
> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
> +    const unsigned long first_bit = offset / block_size;
> +    const unsigned long nbits = size / block_size;
> +    bool is_range_discarded, is_range_populated;

Can be reduced to "discarded" and "populated".

> +    const uint64_t end = offset + size;
> +    unsigned long bit;
> +    uint64_t cur;
> +    int ret = 0;
> +
> +    if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
> +                     __func__, offset, size);
> +        return -EINVAL;
> +    }
> +
> +    is_range_discarded = ram_block_attributes_is_range_discarded(attr, offset,
> +                                                                 size);

See - needlessly long line.

> +    is_range_populated = ram_block_attributes_is_range_populated(attr, offset,
> +                                                                 size);

If ram_block_attributes_is_range_populated() returned (found_bit*block_size), you could tell from a single call if it is populated (found_bit == size) or discarded (found_bit == 0), otherwise it is a mix (and dump just this number in the tracepoint below).

And then ditch ram_block_attributes_is_range_discarded() which is practically cut-n-paste. And then open code ram_block_attributes_is_range_populated().

These two are not used elsewhere anyway.
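
Something like this, as a sketch (the helper name here is made up, and note
that a return of 0 only proves the leading block is discarded, so the mixed
case still needs the per-block loop):

static uint64_t
ram_block_attributes_populated_bytes(const RamBlockAttributes *attr,
                                     uint64_t offset, uint64_t size)
{
    const size_t block_size = ram_block_attributes_get_block_size(attr);
    const unsigned long first_bit = offset / block_size;
    const unsigned long last_bit = first_bit + (size / block_size) - 1;
    /* First zero (discarded) bit in the range, or last_bit + 1 if none */
    const unsigned long found_bit = find_next_zero_bit(attr->bitmap,
                                                       last_bit + 1,
                                                       first_bit);

    /* == size: fully populated; == 0: leading block discarded; else mixed */
    return (found_bit - first_bit) * block_size;
}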

> +
> +    trace_ram_block_attributes_state_change(offset, size,
> +                                            is_range_discarded ? "discarded" :
> +                                            is_range_populated ? "populated" :
> +                                            "mixture",
> +                                            to_discard ? "discarded" :
> +                                            "populated");


I'd just dump 3 numbers (is_range_discarded, is_range_populated, to_discard) in the tracepoint as:

ram_block_attributes_state_change(uint64_t offset, uint64_t size, int discarded, int populated, int to_discard) "offset 0x%"PRIx64" size 0x%"PRIx64" discarded=%d populated=%d to_discard=%d"
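
The call site then becomes a one-liner (sketch):

trace_ram_block_attributes_state_change(offset, size, is_range_discarded,
                                        is_range_populated, to_discard);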



> +    if (to_discard) {
> +        if (is_range_discarded) {
> +            /* Already private */
> +        } else if (is_range_populated) {
> +            /* Completely shared */
> +            bitmap_clear(attr->bitmap, first_bit, nbits);
> +            ram_block_attributes_notify_discard(attr, offset, size);
> +        } else {
> +            /* Unexpected mixture: process individual blocks */
> +            for (cur = offset; cur < end; cur += block_size) {

imho a little bit more accurate to:

for (bit = first_bit; bit < first_bit + nbits; ++bit) {

as you already have calculated first_bit, nbits...

> +                bit = cur / block_size;

... and drop this ...

> +                if (!test_bit(bit, attr->bitmap)) {
> +                    continue;
> +                }
> +                clear_bit(bit, attr->bitmap);
> +                ram_block_attributes_notify_discard(attr, cur, block_size);

.. and do: ram_block_attributes_notify_discard(attr, bit * block_size, block_size);

Then you can drop @cur which is used in one place inside the loop.
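
So the discard-side mixture branch, reusing the locals already computed at
the top of the function, would shrink to (sketch):

for (bit = first_bit; bit < first_bit + nbits; bit++) {
    if (!test_bit(bit, attr->bitmap)) {
        continue;
    }
    clear_bit(bit, attr->bitmap);
    ram_block_attributes_notify_discard(attr, bit * block_size, block_size);
}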


> +            }
> +        }
> +    } else {
> +        if (is_range_populated) {
> +            /* Already shared */
> +        } else if (is_range_discarded) {
> +            /* Complete private */

s/Complete/Completely/

> +            bitmap_set(attr->bitmap, first_bit, nbits);
> +            ret = ram_block_attributes_notify_populate(attr, offset, size);
> +        } else {
> +            /* Unexpected mixture: process individual blocks */
> +            for (cur = offset; cur < end; cur += block_size) {
> +                bit = cur / block_size;
> +                if (test_bit(bit, attr->bitmap)) {
> +                    continue;
> +                }
> +                set_bit(bit, attr->bitmap);
> +                ret = ram_block_attributes_notify_populate(attr, cur,
> +                                                           block_size);
> +                if (ret) {
> +                    break;
> +                }
> +            }
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block)
> +{
> +    uint64_t bitmap_size;

Not really needed.

> +    const int block_size  = qemu_real_host_page_size();
> +    RamBlockAttributes *attr;
> +    int ret;
> +    MemoryRegion *mr = ram_block->mr;
> +
> +    attr = RAM_BLOCK_ATTRIBUTES(object_new(TYPE_RAM_BLOCK_ATTRIBUTES));
> +
> +    attr->ram_block = ram_block;
> +    ret = memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(attr));
> +    if (ret) {

Could just "if (memory_region_set_ram_discard_manager(...))".

> +        object_unref(OBJECT(attr));
> +        return NULL;
> +    }
> +    bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
> +    attr->bitmap_size = bitmap_size;
> +    attr->bitmap = bitmap_new(bitmap_size);
> +
> +    return attr;
> +}
> +
> +void ram_block_attributes_destroy(RamBlockAttributes *attr)
> +{
> +    if (!attr) {


Rather g_assert().


> +        return;
> +    }
> +
> +    g_free(attr->bitmap);
> +    memory_region_set_ram_discard_manager(attr->ram_block->mr, NULL);
> +    object_unref(OBJECT(attr));
> +}
> +
> +static void ram_block_attributes_init(Object *obj)
> +{
> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(obj);
> +
> +    QLIST_INIT(&attr->rdl_list);
> +}

Not used.

> +
> +static void ram_block_attributes_finalize(Object *obj)

Not used.

Besides these two, feel free to ignore other comments :)

Otherwise,

Tested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>


> +{
> +}
> +
> +static void ram_block_attributes_class_init(ObjectClass *klass,
> +                                            const void *data)
> +{
> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(klass);
> +
> +    rdmc->get_min_granularity = ram_block_attributes_rdm_get_min_granularity;
> +    rdmc->register_listener = ram_block_attributes_rdm_register_listener;
> +    rdmc->unregister_listener = ram_block_attributes_rdm_unregister_listener;
> +    rdmc->is_populated = ram_block_attributes_rdm_is_populated;
> +    rdmc->replay_populated = ram_block_attributes_rdm_replay_populated;
> +    rdmc->replay_discarded = ram_block_attributes_rdm_replay_discarded;
> +}
> diff --git a/system/trace-events b/system/trace-events
> index be12ebfb41..82856e44f2 100644
> --- a/system/trace-events
> +++ b/system/trace-events
> @@ -52,3 +52,6 @@ dirtylimit_state_finalize(void)
>   dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
>   dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
>   dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
> +
> +# ram-block-attributes.c
> +ram_block_attributes_state_change(uint64_t offset, uint64_t size, const char *from, const char *to) "offset 0x%"PRIx64" size 0x%"PRIx64" from '%s' to '%s'"



-- 
Alexey


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 5/5] physmem: Support coordinated discarding of RAM with guest_memfd
  2025-05-30  8:32 ` [PATCH v6 5/5] physmem: Support coordinated discarding of RAM " Chenyi Qiang
@ 2025-06-04 11:04   ` Alexey Kardashevskiy
  2025-06-04 11:29   ` David Hildenbrand
  1 sibling, 0 replies; 30+ messages in thread
From: Alexey Kardashevskiy @ 2025-06-04 11:04 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 30/5/25 18:32, Chenyi Qiang wrote:
> A new field, attributes, was introduced in RAMBlock to link to a
> RamBlockAttributes object, which centralizes all guest_memfd related
> information (such as fd and shared bitmap) within a RAMBlock.
> 
> Create and initialize the RamBlockAttributes object upon ram_block_add().
> Meanwhile, register the object in the target RAMBlock's MemoryRegion.
> After that, a guest_memfd-backed RAMBlock is associated with the
> RamDiscardManager interface, and the users can execute RamDiscardManager
> specific handling. For example, VFIO will register the
> RamDiscardListener and get notifications when the state_change() helper
> is invoked.
> 
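
(Context for the above: the consumer side hooks in the same way as with
virtio-mem today. Roughly, paraphrasing the existing VFIO registration code
rather than anything added by this series, with vrdl and the notify
callbacks being VFIO's own:

RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);

ram_discard_listener_init(&vrdl->listener,
                          vfio_ram_discard_notify_populate,
                          vfio_ram_discard_notify_discard, true);
ram_discard_manager_register_listener(rdm, &vrdl->listener, section);
)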
> As coordinated discarding of RAM with guest_memfd is now supported, only
> block uncoordinated discard.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>


Tested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>


> ---
> Changes in v6:
>      - Squash the unblocking of coordinated discard into this commit.
>      - Remove the checks in migration path.
> 
> Changes in v5:
>      - Revert to use RamDiscardManager interface.
>      - Move the object_new() into the ram_block_attribute_create()
>        helper.
>      - Add some check in migration path.
> 
> Changes in v4:
>      - Remove the replay operations for attribute changes which will be
>        handled in a listener in following patches.
>      - Add some comment in the error path of realize() to remind the
>        future development of the unified error path.
> 
> Changes in v3:
>      - Use ram_discard_manager_replay_populated/discarded() to set the
>        memory attribute and add the undo support if state_change()
>        failed.
>      - Didn't add Reviewed-by from Alexey due to the new changes in this
>        commit.
> ---
>   accel/kvm/kvm-all.c       |  9 +++++++++
>   include/system/ramblock.h |  1 +
>   system/physmem.c          | 18 ++++++++++++++++--
>   3 files changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 51526d301b..3b390bbb09 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -3089,6 +3089,15 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
>       addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
>       rb = qemu_ram_block_from_host(addr, false, &offset);
>   
> +    ret = ram_block_attributes_state_change(RAM_BLOCK_ATTRIBUTES(mr->rdm),
> +                                            offset, size, to_private);
> +    if (ret) {
> +        error_report("Failed to notify the listener the state change of "
> +                     "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
> +                     start, size, to_private ? "private" : "shared");
> +        goto out_unref;
> +    }
> +
>       if (to_private) {
>           if (rb->page_size != qemu_real_host_page_size()) {
>               /*
> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
> index 1bab9e2dac..87e847e184 100644
> --- a/include/system/ramblock.h
> +++ b/include/system/ramblock.h
> @@ -46,6 +46,7 @@ struct RAMBlock {
>       int fd;
>       uint64_t fd_offset;
>       int guest_memfd;
> +    RamBlockAttributes *attributes;
>       size_t page_size;
>       /* dirty bitmap used during migration */
>       unsigned long *bmap;
> diff --git a/system/physmem.c b/system/physmem.c
> index a8a9ca309e..1f1217fa0a 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1916,7 +1916,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>           }
>           assert(new_block->guest_memfd < 0);
>   
> -        ret = ram_block_discard_require(true);
> +        ret = ram_block_coordinated_discard_require(true);
>           if (ret < 0) {
>               error_setg_errno(errp, -ret,
>                                "cannot set up private guest memory: discard currently blocked");
> @@ -1931,6 +1931,19 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>               goto out_free;
>           }
>   
> +        new_block->attributes = ram_block_attributes_create(new_block);
> +        if (!new_block->attributes) {
> +            error_setg(errp, "Failed to create ram block attribute");
> +            /*
> +             * The error path could be unified if the rest of ram_block_add()
> +             * ever develops a need to check for errors.
> +             */
> +            close(new_block->guest_memfd);
> +            ram_block_coordinated_discard_require(false);
> +            qemu_mutex_unlock_ramlist();
> +            goto out_free;
> +        }
> +
>           /*
>            * Add a specific guest_memfd blocker if a generic one would not be
>            * added by ram_block_add_cpr_blocker.
> @@ -2287,8 +2300,9 @@ static void reclaim_ramblock(RAMBlock *block)
>       }
>   
>       if (block->guest_memfd >= 0) {
> +        ram_block_attributes_destroy(block->attributes);
>           close(block->guest_memfd);
> -        ram_block_discard_require(false);
> +        ram_block_coordinated_discard_require(false);
>       }
>   
>       g_free(block);

-- 
Alexey


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 5/5] physmem: Support coordinated discarding of RAM with guest_memfd
  2025-05-30  8:32 ` [PATCH v6 5/5] physmem: Support coordinated discarding of RAM " Chenyi Qiang
  2025-06-04 11:04   ` Alexey Kardashevskiy
@ 2025-06-04 11:29   ` David Hildenbrand
  1 sibling, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2025-06-04 11:29 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson

On 30.05.25 10:32, Chenyi Qiang wrote:
> A new field, attributes, was introduced in RAMBlock to link to a
> RamBlockAttributes object, which centralizes all guest_memfd related
> information (such as fd and shared bitmap) within a RAMBlock.
> 
> Create and initialize the RamBlockAttributes object upon ram_block_add().
> Meanwhile, register the object in the target RAMBlock's MemoryRegion.
> After that, a guest_memfd-backed RAMBlock is associated with the
> RamDiscardManager interface, and the users can execute RamDiscardManager
> specific handling. For example, VFIO will register the
> RamDiscardListener and get notifications when the state_change() helper
> is invoked.
> 
> As coordinated discarding of RAM with guest_memfd is now supported, only
> block uncoordinated discard.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v6:
>      - Squash the unblocking of coordinated discard into this commit.
>      - Remove the checks in migration path.
> 
> Changes in v5:
>      - Revert to use RamDiscardManager interface.
>      - Move the object_new() into the ram_block_attribute_create()
>        helper.
>      - Add some check in migration path.
> 
> Changes in v4:
>      - Remove the replay operations for attribute changes which will be
>        handled in a listener in following patches.
>      - Add some comment in the error path of realize() to remind the
>        future development of the unified error path.
> 
> Changes in v3:
>      - Use ram_discard_manager_replay_populated/discarded() to set the
>        memory attribute and add the undo support if state_change()
>        failed.
>      - Didn't add Reviewed-by from Alexey due to the new changes in this
>        commit.
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result
  2025-05-30  8:32 ` [PATCH v6 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
  2025-06-01  9:21   ` Gupta, Pankaj
  2025-06-01 15:53   ` Xiaoyao Li
@ 2025-06-04 14:37   ` Alexey Kardashevskiy
  2 siblings, 0 replies; 30+ messages in thread
From: Alexey Kardashevskiy @ 2025-06-04 14:37 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 30/5/25 18:32, Chenyi Qiang wrote:
> Modify memory_region_set_ram_discard_manager() to return -EBUSY if a
> RamDiscardManager is already set in the MemoryRegion. The caller must
> handle this failure, such as having virtio-mem undo its actions and fail
> the realize() process. Opportunistically move the call earlier to avoid
> complex error handling.
> 
> This change is beneficial when introducing a new RamDiscardManager
> instance besides virtio-mem. After
> ram_block_coordinated_discard_require(true) unlocks all
> RamDiscardManager instances, only one instance is allowed to be set for
> one MemoryRegion at present.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>

Tested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>


> ---
> Changes in v6:
>      - Add Reviewed-by from David.
> 
> Changes in v5:
>      - Nit in commit message (return false -> -EBUSY)
>      - Add set_ram_discard_manager(NULL) when ram_block_discard_range()
>        fails.
> 
> Changes in v3:
>      - Move set_ram_discard_manager() up to avoid a g_free()
>      - Clean up set_ram_discard_manager() definition
> ---
>   hw/virtio/virtio-mem.c  | 30 +++++++++++++++++-------------
>   include/system/memory.h |  6 +++---
>   system/memory.c         | 10 +++++++---
>   3 files changed, 27 insertions(+), 19 deletions(-)
> 
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index b3c126ea1e..2e491e8c44 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -1047,6 +1047,17 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>           return;
>       }
>   
> +    /*
> +     * Set ourselves as RamDiscardManager before the plug handler maps the
> +     * memory region and exposes it via an address space.
> +     */
> +    if (memory_region_set_ram_discard_manager(&vmem->memdev->mr,
> +                                              RAM_DISCARD_MANAGER(vmem))) {
> +        error_setg(errp, "Failed to set RamDiscardManager");
> +        ram_block_coordinated_discard_require(false);
> +        return;
> +    }
> +
>       /*
>        * We don't know at this point whether shared RAM is migrated using
>        * QEMU or migrated using the file content. "x-ignore-shared" will be
> @@ -1061,6 +1072,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>           ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
>           if (ret) {
>               error_setg_errno(errp, -ret, "Unexpected error discarding RAM");
> +            memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
>               ram_block_coordinated_discard_require(false);
>               return;
>           }
> @@ -1122,13 +1134,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>       vmem->system_reset = VIRTIO_MEM_SYSTEM_RESET(obj);
>       vmem->system_reset->vmem = vmem;
>       qemu_register_resettable(obj);
> -
> -    /*
> -     * Set ourselves as RamDiscardManager before the plug handler maps the
> -     * memory region and exposes it via an address space.
> -     */
> -    memory_region_set_ram_discard_manager(&vmem->memdev->mr,
> -                                          RAM_DISCARD_MANAGER(vmem));
>   }
>   
>   static void virtio_mem_device_unrealize(DeviceState *dev)
> @@ -1136,12 +1141,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
>       VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>       VirtIOMEM *vmem = VIRTIO_MEM(dev);
>   
> -    /*
> -     * The unplug handler unmapped the memory region, it cannot be
> -     * found via an address space anymore. Unset ourselves.
> -     */
> -    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
> -
>       qemu_unregister_resettable(OBJECT(vmem->system_reset));
>       object_unref(OBJECT(vmem->system_reset));
>   
> @@ -1154,6 +1153,11 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
>       virtio_del_queue(vdev, 0);
>       virtio_cleanup(vdev);
>       g_free(vmem->bitmap);
> +    /*
> +     * The unplug handler unmapped the memory region, it cannot be
> +     * found via an address space anymore. Unset ourselves.
> +     */
> +    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
>       ram_block_coordinated_discard_require(false);
>   }
>   
> diff --git a/include/system/memory.h b/include/system/memory.h
> index b961c4076a..896948deb1 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2499,13 +2499,13 @@ static inline bool memory_region_has_ram_discard_manager(MemoryRegion *mr)
>    *
>    * This function must not be called for a mapped #MemoryRegion, a #MemoryRegion
>    * that does not cover RAM, or a #MemoryRegion that already has a
> - * #RamDiscardManager assigned.
> + * #RamDiscardManager assigned. Return 0 if the rdm is set successfully.
>    *
>    * @mr: the #MemoryRegion
>    * @rdm: #RamDiscardManager to set
>    */
> -void memory_region_set_ram_discard_manager(MemoryRegion *mr,
> -                                           RamDiscardManager *rdm);
> +int memory_region_set_ram_discard_manager(MemoryRegion *mr,
> +                                          RamDiscardManager *rdm);
>   
>   /**
>    * memory_region_find: translate an address/size relative to a
> diff --git a/system/memory.c b/system/memory.c
> index 63b983efcd..b45b508dce 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2106,12 +2106,16 @@ RamDiscardManager *memory_region_get_ram_discard_manager(MemoryRegion *mr)
>       return mr->rdm;
>   }
>   
> -void memory_region_set_ram_discard_manager(MemoryRegion *mr,
> -                                           RamDiscardManager *rdm)
> +int memory_region_set_ram_discard_manager(MemoryRegion *mr,
> +                                          RamDiscardManager *rdm)
>   {
>       g_assert(memory_region_is_ram(mr));
> -    g_assert(!rdm || !mr->rdm);
> +    if (mr->rdm && rdm) {
> +        return -EBUSY;
> +    }
> +
>       mr->rdm = rdm;
> +    return 0;
>   }
>   
>   uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,

-- 
Alexey


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-04 11:04   ` Alexey Kardashevskiy
@ 2025-06-05  0:35     ` Alexey Kardashevskiy
  2025-06-05  3:22       ` Chenyi Qiang
  2025-06-05  6:01     ` Chenyi Qiang
  1 sibling, 1 reply; 30+ messages in thread
From: Alexey Kardashevskiy @ 2025-06-05  0:35 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 4/6/25 21:04, Alexey Kardashevskiy wrote:
> 
> 
> On 30/5/25 18:32, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") highlighted that subsystems like VFIO may disable RAM block
>> discard. However, guest_memfd relies on discard operations for page
>> conversion between private and shared memory, potentially leading to
>> the stale IOMMU mapping issue when assigning hardware devices to
>> confidential VMs via shared memory. To address this and allow shared
>> device assignment, it is crucial to ensure the VFIO system refreshes
>> its IOMMU mappings.
>>
>> RamDiscardManager is an existing interface (used by virtio-mem) to
>> adjust VFIO mappings in relation to VM page assignment. Effectively page
>> conversion is similar to hot-removing a page in one mode and adding it
>> back in the other. Therefore, similar actions are required for page
>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>> facilitate this process.
>>
>> Since guest_memfd is not an object, it cannot directly implement the
>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
>> have a memory backend while others do not. Notably, virtual BIOS
>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>> backend.
>>
>> To manage RAMBlocks with guest_memfd, define a new object named
>> RamBlockAttributes to implement the RamDiscardManager interface. This
>> object can store the guest_memfd information such as bitmap for shared
>> memory and the registered listeners for event notification. In the
>> context of RamDiscardManager, shared state is analogous to populated, and
>> private state is signified as discarded. To notify the conversion events,
>> a new state_change() helper is exported for the users to notify the
>> listeners like VFIO, so that VFIO can dynamically DMA map/unmap the
>> shared mapping.
>>
>> Note that the memory state is tracked at the host page size granularity,
>> as the minimum conversion size can be one page per request and VFIO
>> expects the DMA mapping for a specific iova to be mapped and unmapped
>> with the same granularity. Confidential VMs may perform partial
>> conversions, such as conversions on small regions within larger ones.
>> To prevent such invalid cases and until DMA mapping cut operation
>> support is available, all operations are performed with 4K granularity.
>>
>> In addition, memory conversion failures cause QEMU to quit instead of
>> resuming the guest or retrying the operation at present. It would be
>> future work to add more error handling or rollback mechanisms once
>> conversion failures are allowed. For example, in-place conversion of
>> guest_memfd could retry the unmap operation during the conversion from
>> shared to private. For now, keep the complex error handling out of the
>> picture as it is not required.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Changes in v6:
>>      - Change the object type name from RamBlockAttribute to
>>        RamBlockAttributes. (David)
>>      - Save the associated RAMBlock instead MemoryRegion in
>>        RamBlockAttributes. (David)
>>      - Squash the state_change() helper introduction in this commit as
>>        well as the mixture conversion case handling. (David)
>>      - Change the block_size type from int to size_t and some cleanup in
>>        validation check. (Alexey)
>>      - Add a tracepoint to track the state changes. (Alexey)
>>
>> Changes in v5:
>>      - Revert to use RamDiscardManager interface instead of introducing
>>        new hierarchy of class to manage private/shared state, and keep
>>        using the new name of RamBlockAttribute compared with the
>>        MemoryAttributeManager in v3.
>>      - Use *simple* version of object_define and object_declare since the
>>        state_change() function is changed as an exported function instead
>>        of a virtual function in later patch.
>>      - Move the introduction of RamBlockAttribute field to this patch and
>>        rename it to ram_shared. (Alexey)
>>      - call the exit() when register/unregister failed. (Zhao)
>>      - Add the ram-block-attribute.c to Memory API related part in
>>        MAINTAINERS.
>>
>> Changes in v4:
>>      - Change the name from memory-attribute-manager to
>>        ram-block-attribute.
>>      - Implement the newly-introduced PrivateSharedManager instead of
>>        RamDiscardManager and change related commit message.
>>      - Define the new object in ramblock.h instead of adding a new file.
>> ---
>>   MAINTAINERS                   |   1 +
>>   include/system/ramblock.h     |  21 ++
>>   system/meson.build            |   1 +
>>   system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
>>   system/trace-events           |   3 +
>>   5 files changed, 506 insertions(+)
>>   create mode 100644 system/ram-block-attributes.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 6dacd6d004..8ec39aa7f8 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -3149,6 +3149,7 @@ F: system/memory.c
>>   F: system/memory_mapping.c
>>   F: system/physmem.c
>>   F: system/memory-internal.h
>> +F: system/ram-block-attributes.c
>>   F: scripts/coccinelle/memory-region-housekeeping.cocci
>>   Memory devices
>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>> index d8a116ba99..1bab9e2dac 100644
>> --- a/include/system/ramblock.h
>> +++ b/include/system/ramblock.h
>> @@ -22,6 +22,10 @@
>>   #include "exec/cpu-common.h"
>>   #include "qemu/rcu.h"
>>   #include "exec/ramlist.h"
>> +#include "system/hostmem.h"
>> +
>> +#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes"
>> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES)
>>   struct RAMBlock {
>>       struct rcu_head rcu;
>> @@ -91,4 +95,21 @@ struct RAMBlock {
>>       ram_addr_t postcopy_length;
>>   };
>> +struct RamBlockAttributes {
>> +    Object parent;
>> +
>> +    RAMBlock *ram_block;
>> +
>> +    /* 1-setting of the bitmap represents ram is populated (shared) */
>> +    unsigned bitmap_size;
>> +    unsigned long *bitmap;
>> +
>> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
>> +};
>> +
>> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block);
>> +void ram_block_attributes_destroy(RamBlockAttributes *attr);
>> +int ram_block_attributes_state_change(RamBlockAttributes *attr, uint64_t offset,
>> +                                      uint64_t size, bool to_discard);
>> +
>>   #endif
>> diff --git a/system/meson.build b/system/meson.build
>> index c2f0082766..2747dbde80 100644
>> --- a/system/meson.build
>> +++ b/system/meson.build
>> @@ -17,6 +17,7 @@ libsystem_ss.add(files(
>>     'dma-helpers.c',
>>     'globals.c',
>>     'ioport.c',
>> +  'ram-block-attributes.c',
>>     'memory_mapping.c',
>>     'memory.c',
>>     'physmem.c',
>> diff --git a/system/ram-block-attributes.c b/system/ram-block-attributes.c
>> new file mode 100644
>> index 0000000000..514252413f
>> --- /dev/null
>> +++ b/system/ram-block-attributes.c
>> @@ -0,0 +1,480 @@
>> +/*
>> + * QEMU ram block attributes
>> + *
>> + * Copyright Intel
>> + *
>> + * Author:
>> + *      Chenyi Qiang <chenyi.qiang@intel.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory
>> + *
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>> +#include "system/ramblock.h"
>> +#include "trace.h"
>> +
>> +OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(RamBlockAttributes,
>> +                                          ram_block_attributes,
>> +                                          RAM_BLOCK_ATTRIBUTES,
>> +                                          OBJECT,
>> +                                          { TYPE_RAM_DISCARD_MANAGER },
>> +                                          { })
>> +
>> +static size_t
>> +ram_block_attributes_get_block_size(const RamBlockAttributes *attr)
>> +{
>> +    /*
>> +     * Because page conversions are performed at a minimum size of 4K and
>> +     * with 4K alignment, use the host page size as the granularity to
>> +     * track the memory attribute.
>> +     */
>> +    g_assert(attr && attr->ram_block);
>> +    g_assert(attr->ram_block->page_size == qemu_real_host_page_size());
>> +    return attr->ram_block->page_size;
>> +}
>> +
>> +
>> +static bool
>> +ram_block_attributes_rdm_is_populated(const RamDiscardManager *rdm,
>> +                                      const MemoryRegionSection *section)
>> +{
>> +    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    const uint64_t first_bit = section->offset_within_region / block_size;
>> +    const uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
>> +    unsigned long first_discarded_bit;
>> +
>> +    first_discarded_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>> +                                           first_bit);
>> +    return first_discarded_bit > last_bit;
>> +}
>> +
>> +typedef int (*ram_block_attributes_section_cb)(MemoryRegionSection *s,
>> +                                               void *arg);
>> +
>> +static int
>> +ram_block_attributes_notify_populate_cb(MemoryRegionSection *section,
>> +                                        void *arg)
>> +{
>> +    RamDiscardListener *rdl = arg;
>> +
>> +    return rdl->notify_populate(rdl, section);
>> +}
>> +
>> +static int
>> +ram_block_attributes_notify_discard_cb(MemoryRegionSection *section,
>> +                                       void *arg)
>> +{
>> +    RamDiscardListener *rdl = arg;
>> +
>> +    rdl->notify_discard(rdl, section);
>> +    return 0;
>> +}
>> +
>> +static int
>> +ram_block_attributes_for_each_populated_section(const RamBlockAttributes *attr,
>> +                                                MemoryRegionSection *section,
>> +                                                void *arg,
>> +                                                ram_block_attributes_section_cb cb)
>> +{
>> +    unsigned long first_bit, last_bit;
>> +    uint64_t offset, size;
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    int ret = 0;
>> +
>> +    first_bit = section->offset_within_region / block_size;
>> +    first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>> +                              first_bit);
>> +
>> +    while (first_bit < attr->bitmap_size) {
>> +        MemoryRegionSection tmp = *section;
>> +
>> +        offset = first_bit * block_size;
>> +        last_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
>> +                                      first_bit + 1) - 1;
>> +        size = (last_bit - first_bit + 1) * block_size;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>> +            break;
>> +        }
>> +
>> +        ret = cb(&tmp, arg);
>> +        if (ret) {
>> +            error_report("%s: Failed to notify RAM discard listener: %s",
>> +                         __func__, strerror(-ret));
>> +            break;
>> +        }
>> +
>> +        first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>> +                                  last_bit + 2);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static int
>> +ram_block_attributes_for_each_discarded_section(const RamBlockAttributes *attr,
>> +                                                MemoryRegionSection *section,
>> +                                                void *arg,
>> +                                                ram_block_attributes_section_cb cb)
>> +{
>> +    unsigned long first_bit, last_bit;
>> +    uint64_t offset, size;
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    int ret = 0;
>> +
>> +    first_bit = section->offset_within_region / block_size;
>> +    first_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
>> +                                   first_bit);
>> +
>> +    while (first_bit < attr->bitmap_size) {
>> +        MemoryRegionSection tmp = *section;
>> +
>> +        offset = first_bit * block_size;
>> +        last_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>> +                                 first_bit + 1) - 1;
>> +        size = (last_bit - first_bit + 1) * block_size;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>> +            break;
>> +        }
>> +
>> +        ret = cb(&tmp, arg);
>> +        if (ret) {
>> +            error_report("%s: Failed to notify RAM discard listener: %s",
>> +                         __func__, strerror(-ret));
>> +            break;
>> +        }
>> +
>> +        first_bit = find_next_zero_bit(attr->bitmap,
>> +                                       attr->bitmap_size,
>> +                                       last_bit + 2);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static uint64_t
>> +ram_block_attributes_rdm_get_min_granularity(const RamDiscardManager *rdm,
>> +                                             const MemoryRegion *mr)
>> +{
>> +    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>> +
>> +    g_assert(mr == attr->ram_block->mr);
>> +    return ram_block_attributes_get_block_size(attr);
>> +}
>> +
>> +static void
>> +ram_block_attributes_rdm_register_listener(RamDiscardManager *rdm,
>> +                                           RamDiscardListener *rdl,
>> +                                           MemoryRegionSection *section)
>> +{
>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>> +    int ret;
>> +
>> +    g_assert(section->mr == attr->ram_block->mr);
>> +    rdl->section = memory_region_section_new_copy(section);
>> +
>> +    QLIST_INSERT_HEAD(&attr->rdl_list, rdl, next);
>> +
>> +    ret = ram_block_attributes_for_each_populated_section(attr, section, rdl,
>> +                                    ram_block_attributes_notify_populate_cb);
>> +    if (ret) {
>> +        error_report("%s: Failed to register RAM discard listener: %s",
>> +                     __func__, strerror(-ret));
>> +        exit(1);
>> +    }
>> +}
>> +
>> +static void
>> +ram_block_attributes_rdm_unregister_listener(RamDiscardManager *rdm,
>> +                                             RamDiscardListener *rdl)
>> +{
>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>> +    int ret;
>> +
>> +    g_assert(rdl->section);
>> +    g_assert(rdl->section->mr == attr->ram_block->mr);
>> +
>> +    if (rdl->double_discard_supported) {
>> +        rdl->notify_discard(rdl, rdl->section);
>> +    } else {
>> +        ret = ram_block_attributes_for_each_populated_section(attr,
>> +                rdl->section, rdl, ram_block_attributes_notify_discard_cb);
>> +        if (ret) {
>> +            error_report("%s: Failed to unregister RAM discard listener: %s",
>> +                         __func__, strerror(-ret));
>> +            exit(1);
>> +        }
>> +    }
>> +
>> +    memory_region_section_free_copy(rdl->section);
>> +    rdl->section = NULL;
>> +    QLIST_REMOVE(rdl, next);
>> +}
>> +
>> +typedef struct RamBlockAttributesReplayData {
>> +    ReplayRamDiscardState fn;
>> +    void *opaque;
>> +} RamBlockAttributesReplayData;
>> +
>> +static int ram_block_attributes_rdm_replay_cb(MemoryRegionSection *section,
>> +                                              void *arg)
>> +{
>> +    RamBlockAttributesReplayData *data = arg;
>> +
>> +    return data->fn(section, data->opaque);
>> +}
>> +
>> +static int
>> +ram_block_attributes_rdm_replay_populated(const RamDiscardManager *rdm,
>> +                                          MemoryRegionSection *section,
>> +                                          ReplayRamDiscardState replay_fn,
>> +                                          void *opaque)
>> +{
>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>> +    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
>> +
>> +    g_assert(section->mr == attr->ram_block->mr);
>> +    return ram_block_attributes_for_each_populated_section(attr, section, &data,
>> +                                            ram_block_attributes_rdm_replay_cb);
>> +}
>> +
>> +static int
>> +ram_block_attributes_rdm_replay_discarded(const RamDiscardManager *rdm,
>> +                                          MemoryRegionSection *section,
>> +                                          ReplayRamDiscardState replay_fn,
>> +                                          void *opaque)
>> +{
>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>> +    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
>> +
>> +    g_assert(section->mr == attr->ram_block->mr);
>> +    return ram_block_attributes_for_each_discarded_section(attr, section, &data,
>> +                                            ram_block_attributes_rdm_replay_cb);
>> +}
>> +
>> +static bool
>> +ram_block_attributes_is_valid_range(RamBlockAttributes *attr, uint64_t offset,
>> +                                    uint64_t size)
>> +{
>> +    MemoryRegion *mr = attr->ram_block->mr;
>> +
>> +    g_assert(mr);
>> +
>> +    uint64_t region_size = memory_region_size(mr);
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +
>> +    if (!QEMU_IS_ALIGNED(offset, block_size) ||
>> +        !QEMU_IS_ALIGNED(size, block_size)) {
>> +        return false;
>> +    }
>> +    if (offset + size <= offset) {
>> +        return false;
>> +    }
>> +    if (offset + size > region_size) {
>> +        return false;
>> +    }
>> +    return true;
>> +}
>> +
>> +static void ram_block_attributes_notify_discard(RamBlockAttributes *attr,
>> +                                                uint64_t offset,
>> +                                                uint64_t size)
>> +{
>> +    RamDiscardListener *rdl;
>> +
>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> +        MemoryRegionSection tmp = *rdl->section;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>> +            continue;
>> +        }
>> +        rdl->notify_discard(rdl, &tmp);
>> +    }
>> +}
>> +
>> +static int
>> +ram_block_attributes_notify_populate(RamBlockAttributes *attr,
>> +                                     uint64_t offset, uint64_t size)
>> +{
>> +    RamDiscardListener *rdl;
>> +    int ret = 0;
>> +
>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> +        MemoryRegionSection tmp = *rdl->section;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>> +            continue;
>> +        }
>> +        ret = rdl->notify_populate(rdl, &tmp);
>> +        if (ret) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static bool ram_block_attributes_is_range_populated(RamBlockAttributes *attr,
>> +                                                    uint64_t offset,
>> +                                                    uint64_t size)
>> +{
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> +    unsigned long found_bit;
>> +
>> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>> +                                   first_bit);
>> +    return found_bit > last_bit;
>> +}
>> +
>> +static bool
>> +ram_block_attributes_is_range_discarded(RamBlockAttributes *attr,
>> +                                        uint64_t offset, uint64_t size)
>> +{
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> +    unsigned long found_bit;
>> +
>> +    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
>> +    return found_bit > last_bit;
>> +}
>> +
>> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
>> +                                      uint64_t offset, uint64_t size,
>> +                                      bool to_discard)
>> +{
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long nbits = size / block_size;
>> +    bool is_range_discarded, is_range_populated;
> 
> Can be reduced to "discarded" and "populated".
> 
>> +    const uint64_t end = offset + size;
>> +    unsigned long bit;
>> +    uint64_t cur;
>> +    int ret = 0;
>> +
>> +    if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>> +                     __func__, offset, size);
>> +        return -EINVAL;
>> +    }
>> +
>> +    is_range_discarded = ram_block_attributes_is_range_discarded(attr, offset,
>> +                                                                 size);
> 
> See - needlessly long line.
> 
>> +    is_range_populated = ram_block_attributes_is_range_populated(attr, offset,
>> +                                                                 size);
> 
> If ram_block_attributes_is_range_populated() returned (found_bit*block_size), you could tell from a single call if it is populated (found_bit == size) or discarded (found_bit == 0), otherwise it is a mix (and dump just this number in the tracepoint below).
> 
> And then ditch ram_block_attributes_is_range_discarded() which is practically cut-n-paste. And then open code ram_block_attributes_is_range_populated().

Oops, we cannot just drop find_next_bit(); my bad, we need both find_next_bit() and find_next_zero_bit(). My point still stands though: if this is coded right here without helpers, it will look simpler. Thanks,


> 
> These two are not used elsewhere anyway.
> 
>> +
>> +    trace_ram_block_attributes_state_change(offset, size,
>> +                                            is_range_discarded ? "discarded" :
>> +                                            is_range_populated ? "populated" :
>> +                                            "mixture",
>> +                                            to_discard ? "discarded" :
>> +                                            "populated");
> 
> 
> I'd just dump 3 numbers (is_range_discarded, is_range_populated, to_discard) in the tracepoint as:
> 
> ram_block_attributes_state_change(uint64_t offset, uint64_t size, int discarded, int populated, int to_discard) "offset 0x%"PRIx64" size 0x%"PRIx64" discarded=%d populated=%d to_discard=%d"
> 
> 
> 
>> +    if (to_discard) {
>> +        if (is_range_discarded) {
>> +            /* Already private */
>> +        } else if (is_range_populated) {
>> +            /* Completely shared */
>> +            bitmap_clear(attr->bitmap, first_bit, nbits);
>> +            ram_block_attributes_notify_discard(attr, offset, size);
>> +        } else {
>> +            /* Unexpected mixture: process individual blocks */
>> +            for (cur = offset; cur < end; cur += block_size) {
> 
> imho a little bit more accurate to:
> 
> for (bit = first_bit; bit < first_bit + nbits; ++bit) {
> 
> as you already have calculated first_bit, nbits...
> 
>> +                bit = cur / block_size;
> 
> ... and drop this ...
> 
>> +                if (!test_bit(bit, attr->bitmap)) {
>> +                    continue;
>> +                }
>> +                clear_bit(bit, attr->bitmap);
>> +                ram_block_attributes_notify_discard(attr, cur, block_size);
> 
> .. and do: ram_block_attributes_notify_discard(attr, bit * block_size, block_size);
> 
> Then you can drop @cur which is used in one place inside the loop.
> 
> 
>> +            }
>> +        }
>> +    } else {
>> +        if (is_range_populated) {
>> +            /* Already shared */
>> +        } else if (is_range_discarded) {
>> +            /* Complete private */
> 
> s/Complete/Completely/
> 
>> +            bitmap_set(attr->bitmap, first_bit, nbits);
>> +            ret = ram_block_attributes_notify_populate(attr, offset, size);
>> +        } else {
>> +            /* Unexpected mixture: process individual blocks */
>> +            for (cur = offset; cur < end; cur += block_size) {
>> +                bit = cur / block_size;
>> +                if (test_bit(bit, attr->bitmap)) {
>> +                    continue;
>> +                }
>> +                set_bit(bit, attr->bitmap);
>> +                ret = ram_block_attributes_notify_populate(attr, cur,
>> +                                                           block_size);
>> +                if (ret) {
>> +                    break;
>> +                }
>> +            }
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block)
>> +{
>> +    uint64_t bitmap_size;
> 
> Not really needed.
> 
>> +    const int block_size  = qemu_real_host_page_size();
>> +    RamBlockAttributes *attr;
>> +    int ret;
>> +    MemoryRegion *mr = ram_block->mr;
>> +
>> +    attr = RAM_BLOCK_ATTRIBUTES(object_new(TYPE_RAM_BLOCK_ATTRIBUTES));
>> +
>> +    attr->ram_block = ram_block;
>> +    ret = memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(attr));
>> +    if (ret) {
> 
> Could just "if (memory_region_set_ram_discard_manager(...))".
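> 
> i.e. (sketch):
> 
>     if (memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(attr))) {
>         object_unref(OBJECT(attr));
>         return NULL;
>     }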
> 
>> +        object_unref(OBJECT(attr));
>> +        return NULL;
>> +    }
>> +    bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
>> +    attr->bitmap_size = bitmap_size;
>> +    attr->bitmap = bitmap_new(bitmap_size);
>> +
>> +    return attr;
>> +}
>> +
>> +void ram_block_attributes_destroy(RamBlockAttributes *attr)
>> +{
>> +    if (!attr) {
> 
> 
> Rather g_assert().
> 
> 
>> +        return;
>> +    }
>> +
>> +    g_free(attr->bitmap);
>> +    memory_region_set_ram_discard_manager(attr->ram_block->mr, NULL);
>> +    object_unref(OBJECT(attr));
>> +}
>> +
>> +static void ram_block_attributes_init(Object *obj)
>> +{
>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(obj);
>> +
>> +    QLIST_INIT(&attr->rdl_list);
>> +}
> 
> Not used.
> 
>> +
>> +static void ram_block_attributes_finalize(Object *obj)
> 
> Not used.
> 
> Besides these two, feel free to ignore other comments :)
> 
> Otherwise,
> 
> Tested-by: Alexey Kardashevskiy <aik@amd.com>
> Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
> 
> 
>> +{
>> +}
>> +
>> +static void ram_block_attributes_class_init(ObjectClass *klass,
>> +                                            const void *data)
>> +{
>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(klass);
>> +
>> +    rdmc->get_min_granularity = ram_block_attributes_rdm_get_min_granularity;
>> +    rdmc->register_listener = ram_block_attributes_rdm_register_listener;
>> +    rdmc->unregister_listener = ram_block_attributes_rdm_unregister_listener;
>> +    rdmc->is_populated = ram_block_attributes_rdm_is_populated;
>> +    rdmc->replay_populated = ram_block_attributes_rdm_replay_populated;
>> +    rdmc->replay_discarded = ram_block_attributes_rdm_replay_discarded;
>> +}
>> diff --git a/system/trace-events b/system/trace-events
>> index be12ebfb41..82856e44f2 100644
>> --- a/system/trace-events
>> +++ b/system/trace-events
>> @@ -52,3 +52,6 @@ dirtylimit_state_finalize(void)
>>   dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
>>   dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
>>   dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
>> +
>> +# ram-block-attributes.c
>> +ram_block_attributes_state_change(uint64_t offset, uint64_t size, const char *from, const char *to) "offset 0x%"PRIx64" size 0x%"PRIx64" from '%s' to '%s'"
> 
> 
> 

-- 
Alexey


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-05  0:35     ` Alexey Kardashevskiy
@ 2025-06-05  3:22       ` Chenyi Qiang
  0 siblings, 0 replies; 30+ messages in thread
From: Chenyi Qiang @ 2025-06-05  3:22 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 6/5/2025 8:35 AM, Alexey Kardashevskiy wrote:
> 
> 
> On 4/6/25 21:04, Alexey Kardashevskiy wrote:
>>
>>
>> On 30/5/25 18:32, Chenyi Qiang wrote:
>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>> discard") highlighted that subsystems like VFIO may disable RAM block
>>> discard. However, guest_memfd relies on discard operations for page
>>> conversion between private and shared memory, potentially leading to
>>> the stale IOMMU mapping issue when assigning hardware devices to
>>> confidential VMs via shared memory. To address this and allow shared
>>> device assignment, it is crucial to ensure the VFIO system refreshes
>>> its IOMMU mappings.
>>>
>>> RamDiscardManager is an existing interface (used by virtio-mem) to
>>> adjust VFIO mappings in relation to VM page assignment. Effectively page
>>> conversion is similar to hot-removing a page in one mode and adding it
>>> back in the other. Therefore, similar actions are required for page
>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>> facilitate this process.
>>>
>>> Since guest_memfd is not an object, it cannot directly implement the
>>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>>> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
>>> have a memory backend while others do not. Notably, virtual BIOS
>>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>>> backend.
>>>
>>> To manage RAMBlocks with guest_memfd, define a new object named
>>> RamBlockAttributes to implement the RamDiscardManager interface. This
>>> object can store the guest_memfd information such as bitmap for shared
>>> memory and the registered listeners for event notification. In the
>>> context of RamDiscardManager, shared state is analogous to populated, and
>>> private state is signified as discarded. To notify the conversion events,
>>> a new state_change() helper is exported for the users to notify the
>>> listeners like VFIO, so that VFIO can dynamically DMA map/unmap the
>>> shared mapping.
>>>
>>> Note that the memory state is tracked at the host page size granularity,
>>> as the minimum conversion size can be one page per request and VFIO
>>> expects the DMA mapping for a specific iova to be mapped and unmapped
>>> with the same granularity. Confidential VMs may perform partial
>>> conversions, such as conversions on small regions within larger ones.
>>> To prevent such invalid cases and until DMA mapping cut operation
>>> support is available, all operations are performed with 4K granularity.
>>>
>>> In addition, memory conversion failures cause QEMU to quit instead of
>>> resuming the guest or retrying the operation at present. It would be
>>> future work to add more error handling or rollback mechanisms once
>>> conversion failures are allowed. For example, in-place conversion of
>>> guest_memfd could retry the unmap operation during the conversion from
>>> shared to private. For now, keep the complex error handling out of the
>>> picture as it is not required.
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> ---
>>> Changes in v6:
>>>      - Change the object type name from RamBlockAttribute to
>>>        RamBlockAttributes. (David)
>>>      - Save the associated RAMBlock instead MemoryRegion in
>>>        RamBlockAttributes. (David)
>>>      - Squash the state_change() helper introduction in this commit as
>>>        well as the mixture conversion case handling. (David)
>>>      - Change the block_size type from int to size_t and some cleanup in
>>>        validation check. (Alexey)
>>>      - Add a tracepoint to track the state changes. (Alexey)
>>>
>>> Changes in v5:
>>>      - Revert to use RamDiscardManager interface instead of introducing
>>>        new hierarchy of class to manage private/shared state, and keep
>>>        using the new name of RamBlockAttribute compared with the
>>>        MemoryAttributeManager in v3.
>>>      - Use *simple* version of object_define and object_declare since the
>>>        state_change() function is changed as an exported function instead
>>>        of a virtual function in later patch.
>>>      - Move the introduction of RamBlockAttribute field to this patch and
>>>        rename it to ram_shared. (Alexey)
>>>      - Call exit() when register/unregister failed. (Zhao)
>>>      - Add the ram-block-attribute.c to Memory API related part in
>>>        MAINTAINERS.
>>>
>>> Changes in v4:
>>>      - Change the name from memory-attribute-manager to
>>>        ram-block-attribute.
>>>      - Implement the newly-introduced PrivateSharedManager instead of
>>>        RamDiscardManager and change related commit message.
>>>      - Define the new object in ramblock.h instead of adding a new file.
>>> ---
>>>   MAINTAINERS                   |   1 +
>>>   include/system/ramblock.h     |  21 ++
>>>   system/meson.build            |   1 +
>>>   system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
>>>   system/trace-events           |   3 +
>>>   5 files changed, 506 insertions(+)
>>>   create mode 100644 system/ram-block-attributes.c
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 6dacd6d004..8ec39aa7f8 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -3149,6 +3149,7 @@ F: system/memory.c
>>>   F: system/memory_mapping.c
>>>   F: system/physmem.c
>>>   F: system/memory-internal.h
>>> +F: system/ram-block-attributes.c
>>>   F: scripts/coccinelle/memory-region-housekeeping.cocci
>>>   Memory devices
>>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>>> index d8a116ba99..1bab9e2dac 100644
>>> --- a/include/system/ramblock.h
>>> +++ b/include/system/ramblock.h
>>> @@ -22,6 +22,10 @@
>>>   #include "exec/cpu-common.h"
>>>   #include "qemu/rcu.h"
>>>   #include "exec/ramlist.h"
>>> +#include "system/hostmem.h"
>>> +
>>> +#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes"
>>> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES)
>>>   struct RAMBlock {
>>>       struct rcu_head rcu;
>>> @@ -91,4 +95,21 @@ struct RAMBlock {
>>>       ram_addr_t postcopy_length;
>>>   };
>>> +struct RamBlockAttributes {
>>> +    Object parent;
>>> +
>>> +    RAMBlock *ram_block;
>>> +
>>> +    /* 1-setting of the bitmap represents ram is populated (shared) */
>>> +    unsigned bitmap_size;
>>> +    unsigned long *bitmap;
>>> +
>>> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
>>> +};
>>> +
>>> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block);
>>> +void ram_block_attributes_destroy(RamBlockAttributes *attr);
>>> +int ram_block_attributes_state_change(RamBlockAttributes *attr, uint64_t offset,
>>> +                                      uint64_t size, bool to_discard);
>>> +
>>>   #endif
>>> diff --git a/system/meson.build b/system/meson.build
>>> index c2f0082766..2747dbde80 100644
>>> --- a/system/meson.build
>>> +++ b/system/meson.build
>>> @@ -17,6 +17,7 @@ libsystem_ss.add(files(
>>>     'dma-helpers.c',
>>>     'globals.c',
>>>     'ioport.c',
>>> +  'ram-block-attributes.c',
>>>     'memory_mapping.c',
>>>     'memory.c',
>>>     'physmem.c',
>>> diff --git a/system/ram-block-attributes.c b/system/ram-block-attributes.c
>>> new file mode 100644
>>> index 0000000000..514252413f
>>> --- /dev/null
>>> +++ b/system/ram-block-attributes.c
>>> @@ -0,0 +1,480 @@
>>> +/*
>>> + * QEMU ram block attributes
>>> + *
>>> + * Copyright Intel
>>> + *
>>> + * Author:
>>> + *      Chenyi Qiang <chenyi.qiang@intel.com>
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>> + * See the COPYING file in the top-level directory
>>> + *
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "qemu/error-report.h"
>>> +#include "system/ramblock.h"
>>> +#include "trace.h"
>>> +
>>> +OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(RamBlockAttributes,
>>> +                                          ram_block_attributes,
>>> +                                          RAM_BLOCK_ATTRIBUTES,
>>> +                                          OBJECT,
>>> +                                          { TYPE_RAM_DISCARD_MANAGER },
>>> +                                          { })
>>> +
>>> +static size_t
>>> +ram_block_attributes_get_block_size(const RamBlockAttributes *attr)
>>> +{
>>> +    /*
>>> +     * Because page conversion could be manipulated in the size of at least 4K
>>> +     * or 4K aligned, use the host page size as the granularity to track the
>>> +     * memory attribute.
>>> +     */
>>> +    g_assert(attr && attr->ram_block);
>>> +    g_assert(attr->ram_block->page_size == qemu_real_host_page_size());
>>> +    return attr->ram_block->page_size;
>>> +}
>>> +
>>> +
>>> +static bool
>>> +ram_block_attributes_rdm_is_populated(const RamDiscardManager *rdm,
>>> +                                      const MemoryRegionSection *section)
>>> +{
>>> +    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    const uint64_t first_bit = section->offset_within_region / block_size;
>>> +    const uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
>>> +    unsigned long first_discarded_bit;
>>> +
>>> +    first_discarded_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>>> +                                           first_bit);
>>> +    return first_discarded_bit > last_bit;
>>> +}
>>> +
>>> +typedef int (*ram_block_attributes_section_cb)(MemoryRegionSection *s,
>>> +                                               void *arg);
>>> +
>>> +static int
>>> +ram_block_attributes_notify_populate_cb(MemoryRegionSection *section,
>>> +                                        void *arg)
>>> +{
>>> +    RamDiscardListener *rdl = arg;
>>> +
>>> +    return rdl->notify_populate(rdl, section);
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_notify_discard_cb(MemoryRegionSection *section,
>>> +                                       void *arg)
>>> +{
>>> +    RamDiscardListener *rdl = arg;
>>> +
>>> +    rdl->notify_discard(rdl, section);
>>> +    return 0;
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_for_each_populated_section(const RamBlockAttributes *attr,
>>> +                                                MemoryRegionSection *section,
>>> +                                                void *arg,
>>> +                                                ram_block_attributes_section_cb cb)
>>> +{
>>> +    unsigned long first_bit, last_bit;
>>> +    uint64_t offset, size;
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    int ret = 0;
>>> +
>>> +    first_bit = section->offset_within_region / block_size;
>>> +    first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>>> +                              first_bit);
>>> +
>>> +    while (first_bit < attr->bitmap_size) {
>>> +        MemoryRegionSection tmp = *section;
>>> +
>>> +        offset = first_bit * block_size;
>>> +        last_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
>>> +                                      first_bit + 1) - 1;
>>> +        size = (last_bit - first_bit + 1) * block_size;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            break;
>>> +        }
>>> +
>>> +        ret = cb(&tmp, arg);
>>> +        if (ret) {
>>> +            error_report("%s: Failed to notify RAM discard listener: %s",
>>> +                         __func__, strerror(-ret));
>>> +            break;
>>> +        }
>>> +
>>> +        first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>>> +                                  last_bit + 2);
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_for_each_discarded_section(const RamBlockAttributes *attr,
>>> +                                                MemoryRegionSection *section,
>>> +                                                void *arg,
>>> +                                                ram_block_attributes_section_cb cb)
>>> +{
>>> +    unsigned long first_bit, last_bit;
>>> +    uint64_t offset, size;
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    int ret = 0;
>>> +
>>> +    first_bit = section->offset_within_region / block_size;
>>> +    first_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
>>> +                                   first_bit);
>>> +
>>> +    while (first_bit < attr->bitmap_size) {
>>> +        MemoryRegionSection tmp = *section;
>>> +
>>> +        offset = first_bit * block_size;
>>> +        last_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>>> +                                 first_bit + 1) - 1;
>>> +        size = (last_bit - first_bit + 1) * block_size;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            break;
>>> +        }
>>> +
>>> +        ret = cb(&tmp, arg);
>>> +        if (ret) {
>>> +            error_report("%s: Failed to notify RAM discard listener: %s",
>>> +                         __func__, strerror(-ret));
>>> +            break;
>>> +        }
>>> +
>>> +        first_bit = find_next_zero_bit(attr->bitmap,
>>> +                                       attr->bitmap_size,
>>> +                                       last_bit + 2);
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static uint64_t
>>> +ram_block_attributes_rdm_get_min_granularity(const RamDiscardManager *rdm,
>>> +                                             const MemoryRegion *mr)
>>> +{
>>> +    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +
>>> +    g_assert(mr == attr->ram_block->mr);
>>> +    return ram_block_attributes_get_block_size(attr);
>>> +}
>>> +
>>> +static void
>>> +ram_block_attributes_rdm_register_listener(RamDiscardManager *rdm,
>>> +                                           RamDiscardListener *rdl,
>>> +                                           MemoryRegionSection *section)
>>> +{
>>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    int ret;
>>> +
>>> +    g_assert(section->mr == attr->ram_block->mr);
>>> +    rdl->section = memory_region_section_new_copy(section);
>>> +
>>> +    QLIST_INSERT_HEAD(&attr->rdl_list, rdl, next);
>>> +
>>> +    ret = ram_block_attributes_for_each_populated_section(attr, section, rdl,
>>> +                                    ram_block_attributes_notify_populate_cb);
>>> +    if (ret) {
>>> +        error_report("%s: Failed to register RAM discard listener: %s",
>>> +                     __func__, strerror(-ret));
>>> +        exit(1);
>>> +    }
>>> +}
>>> +
>>> +static void
>>> +ram_block_attributes_rdm_unregister_listener(RamDiscardManager *rdm,
>>> +                                             RamDiscardListener *rdl)
>>> +{
>>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    int ret;
>>> +
>>> +    g_assert(rdl->section);
>>> +    g_assert(rdl->section->mr == attr->ram_block->mr);
>>> +
>>> +    if (rdl->double_discard_supported) {
>>> +        rdl->notify_discard(rdl, rdl->section);
>>> +    } else {
>>> +        ret = ram_block_attributes_for_each_populated_section(attr,
>>> +                rdl->section, rdl, ram_block_attributes_notify_discard_cb);
>>> +        if (ret) {
>>> +            error_report("%s: Failed to unregister RAM discard listener: %s",
>>> +                         __func__, strerror(-ret));
>>> +            exit(1);
>>> +        }
>>> +    }
>>> +
>>> +    memory_region_section_free_copy(rdl->section);
>>> +    rdl->section = NULL;
>>> +    QLIST_REMOVE(rdl, next);
>>> +}
>>> +
>>> +typedef struct RamBlockAttributesReplayData {
>>> +    ReplayRamDiscardState fn;
>>> +    void *opaque;
>>> +} RamBlockAttributesReplayData;
>>> +
>>> +static int ram_block_attributes_rdm_replay_cb(MemoryRegionSection *section,
>>> +                                              void *arg)
>>> +{
>>> +    RamBlockAttributesReplayData *data = arg;
>>> +
>>> +    return data->fn(section, data->opaque);
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_rdm_replay_populated(const RamDiscardManager *rdm,
>>> +                                          MemoryRegionSection *section,
>>> +                                          ReplayRamDiscardState replay_fn,
>>> +                                          void *opaque)
>>> +{
>>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
>>> +
>>> +    g_assert(section->mr == attr->ram_block->mr);
>>> +    return ram_block_attributes_for_each_populated_section(attr, section, &data,
>>> +                                            ram_block_attributes_rdm_replay_cb);
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_rdm_replay_discarded(const RamDiscardManager *rdm,
>>> +                                          MemoryRegionSection *section,
>>> +                                          ReplayRamDiscardState replay_fn,
>>> +                                          void *opaque)
>>> +{
>>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
>>> +
>>> +    g_assert(section->mr == attr->ram_block->mr);
>>> +    return ram_block_attributes_for_each_discarded_section(attr, section, &data,
>>> +                                            ram_block_attributes_rdm_replay_cb);
>>> +}
>>> +
>>> +static bool
>>> +ram_block_attributes_is_valid_range(RamBlockAttributes *attr, uint64_t offset,
>>> +                                    uint64_t size)
>>> +{
>>> +    MemoryRegion *mr = attr->ram_block->mr;
>>> +
>>> +    g_assert(mr);
>>> +
>>> +    uint64_t region_size = memory_region_size(mr);
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +
>>> +    if (!QEMU_IS_ALIGNED(offset, block_size) ||
>>> +        !QEMU_IS_ALIGNED(size, block_size)) {
>>> +        return false;
>>> +    }
>>> +    if (offset + size <= offset) {
>>> +        return false;
>>> +    }
>>> +    if (offset + size > region_size) {
>>> +        return false;
>>> +    }
>>> +    return true;
>>> +}
>>> +
>>> +static void ram_block_attributes_notify_discard(RamBlockAttributes *attr,
>>> +                                                uint64_t offset,
>>> +                                                uint64_t size)
>>> +{
>>> +    RamDiscardListener *rdl;
>>> +
>>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>>> +        MemoryRegionSection tmp = *rdl->section;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            continue;
>>> +        }
>>> +        rdl->notify_discard(rdl, &tmp);
>>> +    }
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_notify_populate(RamBlockAttributes *attr,
>>> +                                     uint64_t offset, uint64_t size)
>>> +{
>>> +    RamDiscardListener *rdl;
>>> +    int ret = 0;
>>> +
>>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>>> +        MemoryRegionSection tmp = *rdl->section;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            continue;
>>> +        }
>>> +        ret = rdl->notify_populate(rdl, &tmp);
>>> +        if (ret) {
>>> +            break;
>>> +        }
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static bool ram_block_attributes_is_range_populated(RamBlockAttributes *attr,
>>> +                                                    uint64_t offset,
>>> +                                                    uint64_t size)
>>> +{
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    const unsigned long first_bit = offset / block_size;
>>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>>> +    unsigned long found_bit;
>>> +
>>> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>>> +                                   first_bit);
>>> +    return found_bit > last_bit;
>>> +}
>>> +
>>> +static bool
>>> +ram_block_attributes_is_range_discarded(RamBlockAttributes *attr,
>>> +                                        uint64_t offset, uint64_t size)
>>> +{
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    const unsigned long first_bit = offset / block_size;
>>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>>> +    unsigned long found_bit;
>>> +
>>> +    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
>>> +    return found_bit > last_bit;
>>> +}
>>> +
>>> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
>>> +                                      uint64_t offset, uint64_t size,
>>> +                                      bool to_discard)
>>> +{
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    const unsigned long first_bit = offset / block_size;
>>> +    const unsigned long nbits = size / block_size;
>>> +    bool is_range_discarded, is_range_populated;
>>
>> Can be reduced to "discarded" and "populated".
>>
>>> +    const uint64_t end = offset + size;
>>> +    unsigned long bit;
>>> +    uint64_t cur;
>>> +    int ret = 0;
>>> +
>>> +    if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>> +                     __func__, offset, size);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    is_range_discarded = ram_block_attributes_is_range_discarded(attr, offset,
>>> +                                                                 size);
>>
>> See - needlessly long line.
>>
>>> +    is_range_populated = ram_block_attributes_is_range_populated(attr, offset,
>>> +                                                                 size);
>>
>> If ram_block_attributes_is_range_populated() returned
>> (found_bit*block_size), you could tell from a single call if it is
>> populated (found_bit == size) or discarded (found_bit == 0), otherwise
>> it is a mix (and dump just this number in the tracepoint below).
>>
>> And then ditch ram_block_attributes_is_range_discarded() which is
>> practically cut-n-paste. And then open code
>> ram_block_attributes_is_range_populated().
> 
> oops, cannot just drop find_next_bit(), my bad, need both
> find_next_bit() and find_next_zero_bit(). My point still stands though -

Yes, a single call is insufficient, as found_bit == 0 can also occur in the
mixed case.

> if this is coded right here without helpers - it will look simpler. Thanks,

Do you mean unwrapping the helper functions? Since there are no other
callers, you are right that it can be simpler; something like the sketch
below.
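
A rough, untested sketch of the open-coded check in
ram_block_attributes_state_change() (variable names are illustrative):

    const unsigned long last = first_bit + nbits;
    bool discarded, populated;

    /* no set bit in the range means the whole range is discarded (private) */
    discarded = find_next_bit(attr->bitmap, last, first_bit) == last;
    /* no zero bit in the range means the whole range is populated (shared) */
    populated = find_next_zero_bit(attr->bitmap, last, first_bit) == last;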



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-04 11:04   ` Alexey Kardashevskiy
  2025-06-05  0:35     ` Alexey Kardashevskiy
@ 2025-06-05  6:01     ` Chenyi Qiang
  1 sibling, 0 replies; 30+ messages in thread
From: Chenyi Qiang @ 2025-06-05  6:01 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 6/4/2025 7:04 PM, Alexey Kardashevskiy wrote:
> 
> 
> On 30/5/25 18:32, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") highlighted that subsystems like VFIO may disable RAM block
>> discard. However, guest_memfd relies on discard operations for page
>> conversion between private and shared memory, potentially leading to
>> the stale IOMMU mapping issue when assigning hardware devices to
>> confidential VMs via shared memory. To address this and allow shared
>> device assignment, it is crucial to ensure the VFIO system refreshes
>> its IOMMU mappings.
>>
>> RamDiscardManager is an existing interface (used by virtio-mem) to
>> adjust VFIO mappings in relation to VM page assignment. Effectively page
>> conversion is similar to hot-removing a page in one mode and adding it
>> back in the other. Therefore, similar actions are required for page
>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>> facilitate this process.
>>
>> Since guest_memfd is not an object, it cannot directly implement the
>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
>> have a memory backend while others do not. Notably, virtual BIOS
>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>> backend.
>>
>> To manage RAMBlocks with guest_memfd, define a new object named
>> RamBlockAttributes to implement the RamDiscardManager interface. This
>> object can store the guest_memfd information such as bitmap for shared
>> memory and the registered listeners for event notification. In the
>> context of RamDiscardManager, shared state is analogous to populated, and
>> private state is signified as discarded. To notify the conversion events,
>> a new state_change() helper is exported for the users to notify the
>> listeners like VFIO, so that VFIO can dynamically DMA map/unmap the
>> shared mapping.
>>
>> Note that the memory state is tracked at the host page size granularity,
>> as the minimum conversion size can be one page per request and VFIO
>> expects the DMA mapping for a specific iova to be mapped and unmapped
>> with the same granularity. Confidential VMs may perform partial
>> conversions, such as conversions on small regions within larger ones.
>> To prevent such invalid cases and until DMA mapping cut operation
>> support is available, all operations are performed with 4K granularity.
>>
>> In addition, memory conversion failures cause QEMU to quit instead of
>> resuming the guest or retrying the operation at present. It would be
>> future work to add more error handling or rollback mechanisms once
>> conversion failures are allowed. For example, in-place conversion of
>> guest_memfd could retry the unmap operation during the conversion from
>> shared to private. For now, keep the complex error handling out of the
>> picture as it is not required.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>

[...]

>> +
>> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
>> +                                      uint64_t offset, uint64_t size,
>> +                                      bool to_discard)
>> +{
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long nbits = size / block_size;
>> +    bool is_range_discarded, is_range_populated;
> 
> Can be reduced to "discarded" and "populated".

[...]

> 
>> +    const uint64_t end = offset + size;
>> +    unsigned long bit;
>> +    uint64_t cur;
>> +    int ret = 0;
>> +
>> +    if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>> +                     __func__, offset, size);
>> +        return -EINVAL;
>> +    }
>> +
>> +    is_range_discarded = ram_block_attributes_is_range_discarded(attr, offset,
>> +                                                                 size);
> 
> See - needlessly long line.

Yes, doing the rename can avoid the long line.

> 
>> +    is_range_populated = ram_block_attributes_is_range_populated(attr, offset,
>> +                                                                 size);
> 
> If ram_block_attributes_is_range_populated() returned
> (found_bit*block_size), you could tell from a single call if it is
> populated (found_bit == size) or discarded (found_bit == 0), otherwise
> it is a mix (and dump just this number in the tracepoint below).
> 
> And then ditch ram_block_attributes_is_range_discarded() which is
> practically cut-n-paste. And then open code
> ram_block_attributes_is_range_populated().
> 
> These two are not used elsewhere anyway.
> 
>> +
>> +    trace_ram_block_attributes_state_change(offset, size,
>> +                                            is_range_discarded ? "discarded" :
>> +                                            is_range_populated ? "populated" :
>> +                                            "mixture",
>> +                                            to_discard ? "discarded" :
>> +                                            "populated");
> 
> 
> I'd just dump 3 numbers (is_range_discarded, is_range_populated,
> to_discard) in the tracepoint as:
> 
> ram_block_attributes_state_change(uint64_t offset, uint64_t size, int discarded, int populated, int to_discard) "offset 0x%"PRIx64" size 0x%"PRIx64" discarded=%d populated=%d to_discard=%d"

Maybe a string of "from xxx to xxx" is more straightforward, and it can
also carry your information. Anyway, I don't think there is much difference
between them.
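
For reference, the call site for your numeric variant would simply be
(sketch):

    trace_ram_block_attributes_state_change(offset, size, is_range_discarded,
                                            is_range_populated, to_discard);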

> 
> 
> 
>> +    if (to_discard) {
>> +        if (is_range_discarded) {
>> +            /* Already private */
>> +        } else if (is_range_populated) {
>> +            /* Completely shared */
>> +            bitmap_clear(attr->bitmap, first_bit, nbits);
>> +            ram_block_attributes_notify_discard(attr, offset, size);
>> +        } else {
>> +            /* Unexpected mixture: process individual blocks */
>> +            for (cur = offset; cur < end; cur += block_size) {
> 
> imho a little bit more accurate to:
> 
> for (bit = first_bit; bit < first_bit + nbits; ++bit) {
> 
> as you already have calculated first_bit, nbits...
> 
>> +                bit = cur / block_size;
> 
> ... and drop this ...
> 
>> +                if (!test_bit(bit, attr->bitmap)) {
>> +                    continue;
>> +                }
>> +                clear_bit(bit, attr->bitmap);
>> +                ram_block_attributes_notify_discard(attr, cur, block_size);
> 
> .. and do: ram_block_attributes_notify_discard(attr, bit * block_size, block_size);
> 
> Then you can drop @cur which is used in one place inside the loop.

Yes, it looks like this can reduce some lines; see the sketch below.
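
A rough sketch (untested) of the to_discard mixture case with your
suggestions applied:

    for (bit = first_bit; bit < first_bit + nbits; bit++) {
        if (!test_bit(bit, attr->bitmap)) {
            continue;
        }
        clear_bit(bit, attr->bitmap);
        ram_block_attributes_notify_discard(attr, bit * block_size,
                                            block_size);
    }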

> 
> 
>> +            }
>> +        }
>> +    } else {
>> +        if (is_range_populated) {
>> +            /* Already shared */
>> +        } else if (is_range_discarded) {
>> +            /* Complete private */
> 
> s/Complete/Completely/
> 
>> +            bitmap_set(attr->bitmap, first_bit, nbits);
>> +            ret = ram_block_attributes_notify_populate(attr, offset, size);
>> +        } else {
>> +            /* Unexpected mixture: process individual blocks */
>> +            for (cur = offset; cur < end; cur += block_size) {
>> +                bit = cur / block_size;
>> +                if (test_bit(bit, attr->bitmap)) {
>> +                    continue;
>> +                }
>> +                set_bit(bit, attr->bitmap);
>> +                ret = ram_block_attributes_notify_populate(attr, cur,
>> +                                                           block_size);
>> +                if (ret) {
>> +                    break;
>> +                }
>> +            }
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block)
>> +{
>> +    uint64_t bitmap_size;
> 
> Not really needed.
> 
>> +    const int block_size  = qemu_real_host_page_size();
>> +    RamBlockAttributes *attr;
>> +    int ret;
>> +    MemoryRegion *mr = ram_block->mr;
>> +
>> +    attr = RAM_BLOCK_ATTRIBUTES(object_new(TYPE_RAM_BLOCK_ATTRIBUTES));
>> +
>> +    attr->ram_block = ram_block;
>> +    ret = memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(attr));
>> +    if (ret) {
> 
> Could just "if (memory_region_set_ram_discard_manager(...))".
> 
>> +        object_unref(OBJECT(attr));
>> +        return NULL;
>> +    }
>> +    bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
>> +    attr->bitmap_size = bitmap_size;
>> +    attr->bitmap = bitmap_new(bitmap_size);
>> +
>> +    return attr;
>> +}
>> +
>> +void ram_block_attributes_destroy(RamBlockAttributes *attr)
>> +{
>> +    if (!attr) {
> 
> 
> Rather g_assert().
> 
> 
>> +        return;
>> +    }
>> +
>> +    g_free(attr->bitmap);
>> +    memory_region_set_ram_discard_manager(attr->ram_block->mr, NULL);
>> +    object_unref(OBJECT(attr));
>> +}
>> +
>> +static void ram_block_attributes_init(Object *obj)
>> +{
>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(obj);
>> +
>> +    QLIST_INIT(&attr->rdl_list);
>> +}
> 
> Not used.
> 
>> +
>> +static void ram_block_attributes_finalize(Object *obj)
> 
> Not used.
> 
> Besides these two, feel free to ignore other comments :)

The init() and finalize() functions are referenced by the
OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES() macro; it is a common
template. I guess you forgot this again [1] :)
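
Roughly, the macro wires them up like this (simplified sketch of the
expansion; the real definition lives in qom/object.h):

    static const TypeInfo ram_block_attributes_info = {
        .name = TYPE_RAM_BLOCK_ATTRIBUTES,
        .parent = TYPE_OBJECT,
        .instance_size = sizeof(RamBlockAttributes),
        .instance_init = ram_block_attributes_init,
        .instance_finalize = ram_block_attributes_finalize,
        .class_init = ram_block_attributes_class_init,
        .interfaces = (InterfaceInfo[]) {
            { TYPE_RAM_DISCARD_MANAGER },
            { },
        },
    };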

Other comments look good to me, but I think they are only minor changes.
I'll see if a new version is needed to carry them.


[1]
https://lore.kernel.org/qemu-devel/89e791a5-b71b-4b9d-a8b4-e225bfbd1bc2@amd.com/

> 
> Otherwise,
> 
> Tested-by: Alexey Kardashevskiy <aik@amd.com>
> Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
> 
> 
>> +{
>> +}
>> +
>> +static void ram_block_attributes_class_init(ObjectClass *klass,
>> +                                            const void *data)
>> +{
>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(klass);
>> +
>> +    rdmc->get_min_granularity = ram_block_attributes_rdm_get_min_granularity;
>> +    rdmc->register_listener = ram_block_attributes_rdm_register_listener;
>> +    rdmc->unregister_listener = ram_block_attributes_rdm_unregister_listener;
>> +    rdmc->is_populated = ram_block_attributes_rdm_is_populated;
>> +    rdmc->replay_populated = ram_block_attributes_rdm_replay_populated;
>> +    rdmc->replay_discarded = ram_block_attributes_rdm_replay_discarded;
>> +}
>> diff --git a/system/trace-events b/system/trace-events
>> index be12ebfb41..82856e44f2 100644
>> --- a/system/trace-events
>> +++ b/system/trace-events
>> @@ -52,3 +52,6 @@ dirtylimit_state_finalize(void)
>>   dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
>>   dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
>>   dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
>> +
>> +# ram-block-attributes.c
>> +ram_block_attributes_state_change(uint64_t offset, uint64_t size, const char *from, const char *to) "offset 0x%"PRIx64" size 0x%"PRIx64" from '%s' to '%s'"
> 
> 
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 0/5] Enable shared device assignment
  2025-05-30  8:32 [PATCH v6 0/5] Enable shared device assignment Chenyi Qiang
                   ` (4 preceding siblings ...)
  2025-05-30  8:32 ` [PATCH v6 5/5] physmem: Support coordinated discarding of RAM " Chenyi Qiang
@ 2025-06-09  3:23 ` Chenyi Qiang
  5 siblings, 0 replies; 30+ messages in thread
From: Chenyi Qiang @ 2025-06-09  3:23 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson

Hi Paolo,

This series has received Reviewed-by/Acked-by on all patches. Besides some
coding style comments from Alexey and David's suggestion in patch #4 to
document the bitmap consistency, are there any other comments? Otherwise,
I will send the next version to resolve them.

Thanks
Chenyi

On 5/30/2025 4:32 PM, Chenyi Qiang wrote:
> This is the v6 series of the shared device assignment support.
> 
> Compared with the last version [1], this series retains the basic support
> and removes the additional complex error handling, which can be added
> back when necessary. Meanwhile, the patchset has been re-organized to
> be clearer.
> 
> Overview of this series:
> 
> - Patch 1-3: Preparation patches. These include function exposure and
>   some function prototype changes.
> - Patch 4: Introduce a new object to implement RamDiscardManager
>   interface and a helper to notify the shared/private state change.
> - Patch 5: Enable coordinated discarding of RAM with guest_memfd through
>   the RamDiscardManager interface.
> 
> More small changes or details can be found in the individual patches.
> 
> ---
> 
> Background
> ==========
> Confidential VMs have two classes of memory: shared and private memory.
> Shared memory is accessible from the host/VMM while private memory is
> not. Confidential VMs can decide which memory is shared/private and
> convert memory between shared and private at runtime.
> 
> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
> private memory. In current implementation, shared memory is allocated
> with normal methods (e.g. mmap or fallocate) while private memory is
> allocated from guest_memfd. When a VM performs memory conversions, QEMU
> frees pages via madvise or via PUNCH_HOLE on memfd or guest_memfd from
> one side, and allocates new pages from the other side. This will cause a
> stale IOMMU mapping issue mentioned in [2] when we try to enable shared
> device assignment in confidential VMs.
> 
> Solution
> ========
> The key to enable shared device assignment is to update the IOMMU mappings
> on page conversion. RamDiscardManager, an existing interface currently
> utilized by virtio-mem, offers a means to modify IOMMU mappings in
> accordance with VM page assignment. Page conversions is similar to
> hot-removing a page in one mode and adding it back in the other.
> 
> This series implements a RamDiscardManager for confidential VMs and
> utilizes its infrastructure to notify VFIO of page conversions.
> 
> Limitation and future extension
> ===============================
> This series only supports the basic shared device assignment functionality.
> There are still some limitations and areas that can be extended and
> optimized in the future.
> 
> Relationship with in-place conversion
> -------------------------------------
> In-place page conversion is the ongoing work to allow mmap() of
> guest_memfd to userspace so that both private and shared memory can use
> the same physical memory as the backend. This new design eliminates the
> need to discard pages during shared/private conversions. When it is
> ready, shared device assignment needs to be adjusted to achieve an
> unmap-before-conversion-to-private and map-after-conversion-to-shared
> sequence to be compatible with the change.
> 
> Partial unmap limitation
> ------------------------
> VFIO expects the DMA mapping for a specific IOVA to be mapped and
> unmapped with the same granularity. The guest may perform partial
> conversion, such as converting a small region within a larger one. To
> prevent such invalid cases, current operations are performed with 4K
> granularity. This could be optimized after DMA mapping cut operation
> [3] is introduced in the future. We can always perform a
> split-before-unmap if a partial conversion happens. If the split
> succeeds, the unmap will succeed and be atomic. If the split fails, the
> unmap process fails.
> 
> More attributes management
> --------------------------
> The current RamDiscardManager can only manage a pair of opposite states like
> populated/discarded or shared/private. If more states need to be
> considered, for example, to support virtio-mem in confidential VMs, three
> states would be possible (shared populated/private populated/discarded).
> The current framework cannot handle such a scenario, and some new framework
> needs to be designed at that time [4].
> 
> Memory overhead optimization
> ----------------------------
> A comment from Baolu [5] suggests considering using Maple Tree or a generic
> interval tree to manage private/shared state instead of a bitmap, which
> can reduce memory consumption. This optimization can also be considered
> in other bitmap use cases like dirty bitmaps for guest RAM.
> 
> Testing
> =======
> This patch series is tested based on mainline kernel since TDX base
> support has been merged. The QEMU repo is available at
> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-05-30-v2
> 
> To facilitate shared device assignment with the NIC, employ the legacy
> type1 VFIO with the QEMU command:
> 
> qemu-system-x86_64 [...]
>     -device vfio-pci,host=XX:XX.X
> 
> The dma_entry_limit parameter needs to be adjusted. For example, a
> 16GB guest needs the parameter raised, e.g.
> vfio_iommu_type1.dma_entry_limit=4194304.
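> (That value is just the worst-case number of 4K DMA mappings for such a
> guest: 16 GiB / 4 KiB = 4,194,304 entries.)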
> 
> To use the iommufd-backed VFIO instead, use the QEMU command:
> 
> qemu-system-x86_64 [...]
>     -object iommufd,id=iommufd0 \
>     -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
> 
> 
> Because new features like the cut_mapping operation will only be supported
> in iommufd, the iommufd-backed VFIO is recommended.
> 
> Related link
> ============
> [1] https://lore.kernel.org/qemu-devel/20250520102856.132417-1-chenyi.qiang@intel.com/
> [2] https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonzini@redhat.com/
> [3] https://lore.kernel.org/linux-iommu/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com/
> [4] https://lore.kernel.org/qemu-devel/d1a71e00-243b-4751-ab73-c05a4e090d58@redhat.com/
> [5] https://lore.kernel.org/qemu-devel/013b36a9-9310-4073-b54c-9c511f23decf@linux.intel.com/
> 
> Chenyi Qiang (5):
>   memory: Export a helper to get intersection of a MemoryRegionSection
>     with a given range
>   memory: Change memory_region_set_ram_discard_manager() to return the
>     result
>   memory: Unify the definiton of ReplayRamPopulate() and
>     ReplayRamDiscard()
>   ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock
>     with guest_memfd
>   physmem: Support coordinated discarding of RAM with guest_memfd
> 
>  MAINTAINERS                   |   1 +
>  accel/kvm/kvm-all.c           |   9 +
>  hw/virtio/virtio-mem.c        |  83 +++---
>  include/system/memory.h       | 100 +++++--
>  include/system/ramblock.h     |  22 ++
>  migration/ram.c               |   5 +-
>  system/memory.c               |  22 +-
>  system/meson.build            |   1 +
>  system/physmem.c              |  18 +-
>  system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
>  system/trace-events           |   3 +
>  11 files changed, 660 insertions(+), 84 deletions(-)
>  create mode 100644 system/ram-block-attributes.c
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-06-09  3:23 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-30  8:32 [PATCH v6 0/5] Enable shared device assignment Chenyi Qiang
2025-05-30  8:32 ` [PATCH v6 1/5] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
2025-06-01  9:19   ` Gupta, Pankaj
2025-06-01 15:44   ` Xiaoyao Li
2025-05-30  8:32 ` [PATCH v6 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
2025-06-01  9:21   ` Gupta, Pankaj
2025-06-01 15:53   ` Xiaoyao Li
2025-06-04 14:37   ` Alexey Kardashevskiy
2025-05-30  8:32 ` [PATCH v6 3/5] memory: Unify the definiton of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
2025-06-01 16:12   ` Xiaoyao Li
2025-06-02  9:20   ` Gupta, Pankaj
2025-05-30  8:32 ` [PATCH v6 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd Chenyi Qiang
2025-06-01  9:58   ` Gupta, Pankaj
2025-06-03  1:26     ` Chenyi Qiang
2025-06-03  5:26       ` Gupta, Pankaj
2025-06-03  6:33         ` Chenyi Qiang
2025-06-03  7:17           ` Gupta, Pankaj
2025-06-03  7:41             ` David Hildenbrand
2025-06-03  9:45               ` Gupta, Pankaj
2025-06-03 12:05                 ` Chenyi Qiang
2025-06-02 21:10   ` David Hildenbrand
2025-06-03  1:57     ` Chenyi Qiang
2025-06-04 11:04   ` Alexey Kardashevskiy
2025-06-05  0:35     ` Alexey Kardashevskiy
2025-06-05  3:22       ` Chenyi Qiang
2025-06-05  6:01     ` Chenyi Qiang
2025-05-30  8:32 ` [PATCH v6 5/5] physmem: Support coordinated discarding of RAM " Chenyi Qiang
2025-06-04 11:04   ` Alexey Kardashevskiy
2025-06-04 11:29   ` David Hildenbrand
2025-06-09  3:23 ` [PATCH v6 0/5] Enable shared device assignment Chenyi Qiang
