* [PATCH v2 1/6] memory: Export a helper to get intersection of a MemoryRegionSection with a given range
2025-02-17 8:18 [PATCH v2 0/6] Enable shared device assignment Chenyi Qiang
@ 2025-02-17 8:18 ` Chenyi Qiang
2025-02-17 8:18 ` [PATCH v2 2/6] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
` (4 subsequent siblings)
5 siblings, 0 replies; 18+ messages in thread
From: Chenyi Qiang @ 2025-02-17 8:18 UTC (permalink / raw)
To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun, Li Xiaoyao
Rename the helper to memory_region_section_intersect_range() to make it
more generic. Meanwhile, define the @end as Int128 and replace the
related operations with the int128_* API since the helper is exported as
a wider API.
Suggested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v2:
- Make memory_region_section_intersect_range() an inline function.
- Add Reviewed-by from David
- Define the @end as Int128 and use the related int128_* ops as a wider
API (Alexey)
---
hw/virtio/virtio-mem.c | 32 +++++---------------------------
include/exec/memory.h | 27 +++++++++++++++++++++++++++
2 files changed, 32 insertions(+), 27 deletions(-)
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index b1a003736b..21f16e4912 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -244,28 +244,6 @@ static int virtio_mem_for_each_plugged_range(VirtIOMEM *vmem, void *arg,
return ret;
}
-/*
- * Adjust the memory section to cover the intersection with the given range.
- *
- * Returns false if the intersection is empty, otherwise returns true.
- */
-static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
- uint64_t offset, uint64_t size)
-{
- uint64_t start = MAX(s->offset_within_region, offset);
- uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
- offset + size);
-
- if (end <= start) {
- return false;
- }
-
- s->offset_within_address_space += start - s->offset_within_region;
- s->offset_within_region = start;
- s->size = int128_make64(end - start);
- return true;
-}
-
typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void *arg);
static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
@@ -287,7 +265,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
first_bit + 1) - 1;
size = (last_bit - first_bit + 1) * vmem->block_size;
- if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
break;
}
ret = cb(&tmp, arg);
@@ -319,7 +297,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
first_bit + 1) - 1;
size = (last_bit - first_bit + 1) * vmem->block_size;
- if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
break;
}
ret = cb(&tmp, arg);
@@ -355,7 +333,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
MemoryRegionSection tmp = *rdl->section;
- if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
continue;
}
rdl->notify_discard(rdl, &tmp);
@@ -371,7 +349,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
MemoryRegionSection tmp = *rdl->section;
- if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
continue;
}
ret = rdl->notify_populate(rdl, &tmp);
@@ -388,7 +366,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
if (rdl2 == rdl) {
break;
}
- if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
continue;
}
rdl2->notify_discard(rdl2, &tmp);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 3ee1901b52..3bebc43d59 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1202,6 +1202,33 @@ MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s);
*/
void memory_region_section_free_copy(MemoryRegionSection *s);
+/**
+ * memory_region_section_intersect_range: Adjust the memory section to cover
+ * the intersection with the given range.
+ *
+ * @s: the #MemoryRegionSection to be adjusted
+ * @offset: the offset of the given range in the memory region
+ * @size: the size of the given range
+ *
+ * Returns false if the intersection is empty, otherwise returns true.
+ */
+static inline bool memory_region_section_intersect_range(MemoryRegionSection *s,
+ uint64_t offset, uint64_t size)
+{
+ uint64_t start = MAX(s->offset_within_region, offset);
+ Int128 end = int128_min(int128_add(int128_make64(s->offset_within_region), s->size),
+ int128_add(int128_make64(offset), int128_make64(size)));
+
+ if (int128_le(end, int128_make64(start))) {
+ return false;
+ }
+
+ s->offset_within_address_space += start - s->offset_within_region;
+ s->offset_within_region = start;
+ s->size = int128_sub(end, int128_make64(start));
+ return true;
+}
+
/**
* memory_region_init: Initialize a memory region
*
--
2.43.5
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v2 2/6] memory: Change memory_region_set_ram_discard_manager() to return the result
2025-02-17 8:18 [PATCH v2 0/6] Enable shared device assignment Chenyi Qiang
2025-02-17 8:18 ` [PATCH v2 1/6] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
@ 2025-02-17 8:18 ` Chenyi Qiang
2025-02-18 9:19 ` Alexey Kardashevskiy
2025-02-17 8:18 ` [PATCH v2 3/6] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd Chenyi Qiang
` (3 subsequent siblings)
5 siblings, 1 reply; 18+ messages in thread
From: Chenyi Qiang @ 2025-02-17 8:18 UTC (permalink / raw)
To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun, Li Xiaoyao
Modify memory_region_set_ram_discard_manager() to return an error (-1) if
a RamDiscardManager is already set in the MemoryRegion. The caller must
handle this failure, such as having virtio-mem undo its actions and fail
the realize() process. Opportunistically move the call earlier to avoid
complex error handling.
This change is beneficial when introducing a new RamDiscardManager
instance besides virtio-mem. After
ram_block_coordinated_discard_require(true) unlocks all
RamDiscardManager instances, only one instance is allowed to be set for
a MemoryRegion at present.
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v2:
- newly added.
---
hw/virtio/virtio-mem.c | 30 +++++++++++++++++-------------
include/exec/memory.h | 6 +++---
system/memory.c | 11 ++++++++---
3 files changed, 28 insertions(+), 19 deletions(-)
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 21f16e4912..ef818a2cdf 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1074,6 +1074,18 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
vmem->block_size;
vmem->bitmap = bitmap_new(vmem->bitmap_size);
+ /*
+ * Set ourselves as RamDiscardManager before the plug handler maps the
+ * memory region and exposes it via an address space.
+ */
+ if (memory_region_set_ram_discard_manager(&vmem->memdev->mr,
+ RAM_DISCARD_MANAGER(vmem))) {
+ error_setg(errp, "Failed to set RamDiscardManager");
+ g_free(vmem->bitmap);
+ ram_block_coordinated_discard_require(false);
+ return;
+ }
+
virtio_init(vdev, VIRTIO_ID_MEM, sizeof(struct virtio_mem_config));
vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
@@ -1124,13 +1136,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
vmem->system_reset = VIRTIO_MEM_SYSTEM_RESET(obj);
vmem->system_reset->vmem = vmem;
qemu_register_resettable(obj);
-
- /*
- * Set ourselves as RamDiscardManager before the plug handler maps the
- * memory region and exposes it via an address space.
- */
- memory_region_set_ram_discard_manager(&vmem->memdev->mr,
- RAM_DISCARD_MANAGER(vmem));
}
static void virtio_mem_device_unrealize(DeviceState *dev)
@@ -1138,12 +1143,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
VirtIODevice *vdev = VIRTIO_DEVICE(dev);
VirtIOMEM *vmem = VIRTIO_MEM(dev);
- /*
- * The unplug handler unmapped the memory region, it cannot be
- * found via an address space anymore. Unset ourselves.
- */
- memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
-
qemu_unregister_resettable(OBJECT(vmem->system_reset));
object_unref(OBJECT(vmem->system_reset));
@@ -1155,6 +1154,11 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
host_memory_backend_set_mapped(vmem->memdev, false);
virtio_del_queue(vdev, 0);
virtio_cleanup(vdev);
+ /*
+ * The unplug handler unmapped the memory region, it cannot be
+ * found via an address space anymore. Unset ourselves.
+ */
+ memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
g_free(vmem->bitmap);
ram_block_coordinated_discard_require(false);
}
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 3bebc43d59..390477b588 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2487,13 +2487,13 @@ static inline bool memory_region_has_ram_discard_manager(MemoryRegion *mr)
*
* This function must not be called for a mapped #MemoryRegion, a #MemoryRegion
* that does not cover RAM, or a #MemoryRegion that already has a
- * #RamDiscardManager assigned.
+ * #RamDiscardManager assigned. Return 0 if the rdm is set successfully.
*
* @mr: the #MemoryRegion
* @rdm: #RamDiscardManager to set
*/
-void memory_region_set_ram_discard_manager(MemoryRegion *mr,
- RamDiscardManager *rdm);
+int memory_region_set_ram_discard_manager(MemoryRegion *mr,
+ RamDiscardManager *rdm);
/**
* memory_region_find: translate an address/size relative to a
diff --git a/system/memory.c b/system/memory.c
index b17b5538ff..297a3dbcd4 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2115,12 +2115,17 @@ RamDiscardManager *memory_region_get_ram_discard_manager(MemoryRegion *mr)
return mr->rdm;
}
-void memory_region_set_ram_discard_manager(MemoryRegion *mr,
- RamDiscardManager *rdm)
+int memory_region_set_ram_discard_manager(MemoryRegion *mr,
+ RamDiscardManager *rdm)
{
g_assert(memory_region_is_ram(mr));
- g_assert(!rdm || !mr->rdm);
+ if (mr->rdm && rdm != NULL) {
+ return -1;
+ }
+
+ /* !rdm || !mr->rdm */
mr->rdm = rdm;
+ return 0;
}
uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
--
2.43.5
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/6] memory: Change memory_region_set_ram_discard_manager() to return the result
2025-02-17 8:18 ` [PATCH v2 2/6] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
@ 2025-02-18 9:19 ` Alexey Kardashevskiy
2025-02-18 9:41 ` Chenyi Qiang
2025-02-18 10:46 ` David Hildenbrand
0 siblings, 2 replies; 18+ messages in thread
From: Alexey Kardashevskiy @ 2025-02-18 9:19 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
Li Xiaoyao
On 17/2/25 19:18, Chenyi Qiang wrote:
> Modify memory_region_set_ram_discard_manager() to return false if a
> RamDiscardManager is already set in the MemoryRegion. The caller must
> handle this failure, such as having virtio-mem undo its actions and fail
> the realize() process. Opportunistically move the call earlier to avoid
> complex error handling.
>
> This change is beneficial when introducing a new RamDiscardManager
> instance besides virtio-mem. After
> ram_block_coordinated_discard_require(true) unlocks all
> RamDiscardManager instances, only one instance is allowed to be set for
> a MemoryRegion at present.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v2:
> - newly added.
> ---
> hw/virtio/virtio-mem.c | 30 +++++++++++++++++-------------
> include/exec/memory.h | 6 +++---
> system/memory.c | 11 ++++++++---
> 3 files changed, 28 insertions(+), 19 deletions(-)
>
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 21f16e4912..ef818a2cdf 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -1074,6 +1074,18 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
> vmem->block_size;
> vmem->bitmap = bitmap_new(vmem->bitmap_size);
>
> + /*
> + * Set ourselves as RamDiscardManager before the plug handler maps the
> + * memory region and exposes it via an address space.
> + */
> + if (memory_region_set_ram_discard_manager(&vmem->memdev->mr,
> + RAM_DISCARD_MANAGER(vmem))) {
> + error_setg(errp, "Failed to set RamDiscardManager");
> + g_free(vmem->bitmap);
> + ram_block_coordinated_discard_require(false);
> + return;
> + }
Looks like this can move before vmem->bitmap is allocated (or even
before ram_block_coordinated_discard_require(true)?). Then you can drop
g_free() and avoid having a stale pointer in vmem->bitmap (not that it
matters here though).
> +
> virtio_init(vdev, VIRTIO_ID_MEM, sizeof(struct virtio_mem_config));
> vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
> vmem->bitmap
> @@ -1124,13 +1136,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
> vmem->system_reset = VIRTIO_MEM_SYSTEM_RESET(obj);
> vmem->system_reset->vmem = vmem;
> qemu_register_resettable(obj);
> -
> - /*
> - * Set ourselves as RamDiscardManager before the plug handler maps the
> - * memory region and exposes it via an address space.
> - */
> - memory_region_set_ram_discard_manager(&vmem->memdev->mr,
> - RAM_DISCARD_MANAGER(vmem));
> }
>
> static void virtio_mem_device_unrealize(DeviceState *dev)
> @@ -1138,12 +1143,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
> VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> VirtIOMEM *vmem = VIRTIO_MEM(dev);
>
> - /*
> - * The unplug handler unmapped the memory region, it cannot be
> - * found via an address space anymore. Unset ourselves.
> - */
> - memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
> -
> qemu_unregister_resettable(OBJECT(vmem->system_reset));
> object_unref(OBJECT(vmem->system_reset));
>
> @@ -1155,6 +1154,11 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
> host_memory_backend_set_mapped(vmem->memdev, false);
> virtio_del_queue(vdev, 0);
> virtio_cleanup(vdev);
> + /*
> + * The unplug handler unmapped the memory region, it cannot be
> + * found via an address space anymore. Unset ourselves.
> + */
> + memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
> g_free(vmem->bitmap);
> ram_block_coordinated_discard_require(false);
> }
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 3bebc43d59..390477b588 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -2487,13 +2487,13 @@ static inline bool memory_region_has_ram_discard_manager(MemoryRegion *mr)
> *
> * This function must not be called for a mapped #MemoryRegion, a #MemoryRegion
> * that does not cover RAM, or a #MemoryRegion that already has a
> - * #RamDiscardManager assigned.
> + * #RamDiscardManager assigned. Return 0 if the rdm is set successfully.
> *
> * @mr: the #MemoryRegion
> * @rdm: #RamDiscardManager to set
> */
> -void memory_region_set_ram_discard_manager(MemoryRegion *mr,
> - RamDiscardManager *rdm);
> +int memory_region_set_ram_discard_manager(MemoryRegion *mr,
> + RamDiscardManager *rdm);
>
> /**
> * memory_region_find: translate an address/size relative to a
> diff --git a/system/memory.c b/system/memory.c
> index b17b5538ff..297a3dbcd4 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2115,12 +2115,17 @@ RamDiscardManager *memory_region_get_ram_discard_manager(MemoryRegion *mr)
> return mr->rdm;
> }
>
> -void memory_region_set_ram_discard_manager(MemoryRegion *mr,
> - RamDiscardManager *rdm)
> +int memory_region_set_ram_discard_manager(MemoryRegion *mr,
> + RamDiscardManager *rdm)
> {
> g_assert(memory_region_is_ram(mr));
> - g_assert(!rdm || !mr->rdm);
> + if (mr->rdm && rdm != NULL) {
Drop "!= NULL".
> + return -1;
-EBUSY?
> + }
> +
> + /* !rdm || !mr->rdm */
See, like here - no "!= NULL" :) (and the comment is useless). Thanks,
> mr->rdm = rdm;
> + return 0;
> }
>
> uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
--
Alexey
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/6] memory: Change memory_region_set_ram_discard_manager() to return the result
2025-02-18 9:19 ` Alexey Kardashevskiy
@ 2025-02-18 9:41 ` Chenyi Qiang
2025-02-18 10:46 ` David Hildenbrand
1 sibling, 0 replies; 18+ messages in thread
From: Chenyi Qiang @ 2025-02-18 9:41 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
Li Xiaoyao
On 2/18/2025 5:19 PM, Alexey Kardashevskiy wrote:
>
>
> On 17/2/25 19:18, Chenyi Qiang wrote:
>> Modify memory_region_set_ram_discard_manager() to return false if a
>> RamDiscardManager is already set in the MemoryRegion. The caller must
>> handle this failure, such as having virtio-mem undo its actions and fail
>> the realize() process. Opportunistically move the call earlier to avoid
>> complex error handling.
>>
>> This change is beneficial when introducing a new RamDiscardManager
>> instance besides virtio-mem. After
>> ram_block_coordinated_discard_require(true) unlocks all
>> RamDiscardManager instances, only one instance is allowed to be set for
>> a MemoryRegion at present.
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Changes in v2:
>> - newly added.
>> ---
>> hw/virtio/virtio-mem.c | 30 +++++++++++++++++-------------
>> include/exec/memory.h | 6 +++---
>> system/memory.c | 11 ++++++++---
>> 3 files changed, 28 insertions(+), 19 deletions(-)
>>
>> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
>> index 21f16e4912..ef818a2cdf 100644
>> --- a/hw/virtio/virtio-mem.c
>> +++ b/hw/virtio/virtio-mem.c
>> @@ -1074,6 +1074,18 @@ static void
>> virtio_mem_device_realize(DeviceState *dev, Error **errp)
>> vmem->block_size;
>> vmem->bitmap = bitmap_new(vmem->bitmap_size);
>> + /*
>> + * Set ourselves as RamDiscardManager before the plug handler
>> maps the
>> + * memory region and exposes it via an address space.
>> + */
>> + if (memory_region_set_ram_discard_manager(&vmem->memdev->mr,
>> +
>> RAM_DISCARD_MANAGER(vmem))) {
>> + error_setg(errp, "Failed to set RamDiscardManager");
>> + g_free(vmem->bitmap);
>> + ram_block_coordinated_discard_require(false);
>> + return;
>> + }
>
> Looks like this can move before vmem->bitmap is allocated (or even
> before ram_block_coordinated_discard_require(true)?). Then you can drop
> g_free() and avoid having a stale pointer in vmem->bitmap (not that it
> matters here though).
I'm not sure if moving up the memory_region_set_ram_discard_manager()
call will bring any side effects (there seems to be no requirement on the
operation order here). But maybe it's better to put it after
ram_block_coordinated_discard_require(true), as
ram_block_coordinated_discard_require(true) unlocks RDM users.
>
>> +
>> virtio_init(vdev, VIRTIO_ID_MEM, sizeof(struct virtio_mem_config));
>> vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
>> vmem->bitmap
>> @@ -1124,13 +1136,6 @@ static void
>> virtio_mem_device_realize(DeviceState *dev, Error **errp)
>> vmem->system_reset = VIRTIO_MEM_SYSTEM_RESET(obj);
>> vmem->system_reset->vmem = vmem;
>> qemu_register_resettable(obj);
>> -
>> - /*
>> - * Set ourselves as RamDiscardManager before the plug handler
>> maps the
>> - * memory region and exposes it via an address space.
>> - */
>> - memory_region_set_ram_discard_manager(&vmem->memdev->mr,
>> - RAM_DISCARD_MANAGER(vmem));
>> }
>> static void virtio_mem_device_unrealize(DeviceState *dev)
>> @@ -1138,12 +1143,6 @@ static void
>> virtio_mem_device_unrealize(DeviceState *dev)
>> VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>> VirtIOMEM *vmem = VIRTIO_MEM(dev);
>> - /*
>> - * The unplug handler unmapped the memory region, it cannot be
>> - * found via an address space anymore. Unset ourselves.
>> - */
>> - memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
>> -
>> qemu_unregister_resettable(OBJECT(vmem->system_reset));
>> object_unref(OBJECT(vmem->system_reset));
>> @@ -1155,6 +1154,11 @@ static void
>> virtio_mem_device_unrealize(DeviceState *dev)
>> host_memory_backend_set_mapped(vmem->memdev, false);
>> virtio_del_queue(vdev, 0);
>> virtio_cleanup(vdev);
>> + /*
>> + * The unplug handler unmapped the memory region, it cannot be
>> + * found via an address space anymore. Unset ourselves.
>> + */
>> + memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
>> g_free(vmem->bitmap);
>> ram_block_coordinated_discard_require(false);
>> }
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 3bebc43d59..390477b588 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -2487,13 +2487,13 @@ static inline bool
>> memory_region_has_ram_discard_manager(MemoryRegion *mr)
>> *
>> * This function must not be called for a mapped #MemoryRegion, a
>> #MemoryRegion
>> * that does not cover RAM, or a #MemoryRegion that already has a
>> - * #RamDiscardManager assigned.
>> + * #RamDiscardManager assigned. Return 0 if the rdm is set successfully.
>> *
>> * @mr: the #MemoryRegion
>> * @rdm: #RamDiscardManager to set
>> */
>> -void memory_region_set_ram_discard_manager(MemoryRegion *mr,
>> - RamDiscardManager *rdm);
>> +int memory_region_set_ram_discard_manager(MemoryRegion *mr,
>> + RamDiscardManager *rdm);
>> /**
>> * memory_region_find: translate an address/size relative to a
>> diff --git a/system/memory.c b/system/memory.c
>> index b17b5538ff..297a3dbcd4 100644
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -2115,12 +2115,17 @@ RamDiscardManager
>> *memory_region_get_ram_discard_manager(MemoryRegion *mr)
>> return mr->rdm;
>> }
>> -void memory_region_set_ram_discard_manager(MemoryRegion *mr,
>> - RamDiscardManager *rdm)
>> +int memory_region_set_ram_discard_manager(MemoryRegion *mr,
>> + RamDiscardManager *rdm)
>> {
>> g_assert(memory_region_is_ram(mr));
>> - g_assert(!rdm || !mr->rdm);
>> + if (mr->rdm && rdm != NULL) {
>
> Drop "!= NULL".
>
>> + return -1;
>
> -EBUSY?
[..]
>
>> + }
>> +
>> + /* !rdm || !mr->rdm */
>
> See, like here - no "!= NULL" :) (and the comment is useless). Thanks,
LGTM, will change it. Thanks!
>
>
>> mr->rdm = rdm;
>> + return 0;
>> }
>> uint64_t ram_discard_manager_get_min_granularity(const
>> RamDiscardManager *rdm,
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/6] memory: Change memory_region_set_ram_discard_manager() to return the result
2025-02-18 9:19 ` Alexey Kardashevskiy
2025-02-18 9:41 ` Chenyi Qiang
@ 2025-02-18 10:46 ` David Hildenbrand
1 sibling, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2025-02-18 10:46 UTC (permalink / raw)
To: Alexey Kardashevskiy, Chenyi Qiang, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
Li Xiaoyao
On 18.02.25 10:19, Alexey Kardashevskiy wrote:
>
>
> On 17/2/25 19:18, Chenyi Qiang wrote:
>> Modify memory_region_set_ram_discard_manager() to return false if a
>> RamDiscardManager is already set in the MemoryRegion. The caller must
>> handle this failure, such as having virtio-mem undo its actions and fail
>> the realize() process. Opportunistically move the call earlier to avoid
>> complex error handling.
>>
>> This change is beneficial when introducing a new RamDiscardManager
>> instance besides virtio-mem. After
>> ram_block_coordinated_discard_require(true) unlocks all
>> RamDiscardManager instances, only one instance is allowed to be set for
>> a MemoryRegion at present.
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Changes in v2:
>> - newly added.
>> ---
>> hw/virtio/virtio-mem.c | 30 +++++++++++++++++-------------
>> include/exec/memory.h | 6 +++---
>> system/memory.c | 11 ++++++++---
>> 3 files changed, 28 insertions(+), 19 deletions(-)
>>
>> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
>> index 21f16e4912..ef818a2cdf 100644
>> --- a/hw/virtio/virtio-mem.c
>> +++ b/hw/virtio/virtio-mem.c
>> @@ -1074,6 +1074,18 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>> vmem->block_size;
>> vmem->bitmap = bitmap_new(vmem->bitmap_size);
>>
>> + /*
>> + * Set ourselves as RamDiscardManager before the plug handler maps the
>> + * memory region and exposes it via an address space.
>> + */
>> + if (memory_region_set_ram_discard_manager(&vmem->memdev->mr,
>> + RAM_DISCARD_MANAGER(vmem))) {
>> + error_setg(errp, "Failed to set RamDiscardManager");
>> + g_free(vmem->bitmap);
>> + ram_block_coordinated_discard_require(false);
>> + return;
>> + }
>
> Looks like this can move before vmem->bitmap is allocated (or even
> before ram_block_coordinated_discard_require(true)?). Then you can drop
> g_free() and avoid having a stale pointer in vmem->bitmap (not that it
> matters here though).
>
>> +
>> virtio_init(vdev, VIRTIO_ID_MEM, sizeof(struct virtio_mem_config));
>> vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
>> vmem->bitmap
>> @@ -1124,13 +1136,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>> vmem->system_reset = VIRTIO_MEM_SYSTEM_RESET(obj);
>> vmem->system_reset->vmem = vmem;
>> qemu_register_resettable(obj);
>> -
>> - /*
>> - * Set ourselves as RamDiscardManager before the plug handler maps the
>> - * memory region and exposes it via an address space.
>> - */
>> - memory_region_set_ram_discard_manager(&vmem->memdev->mr,
>> - RAM_DISCARD_MANAGER(vmem));
>> }
>>
>> static void virtio_mem_device_unrealize(DeviceState *dev)
>> @@ -1138,12 +1143,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
>> VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>> VirtIOMEM *vmem = VIRTIO_MEM(dev);
>>
>> - /*
>> - * The unplug handler unmapped the memory region, it cannot be
>> - * found via an address space anymore. Unset ourselves.
>> - */
>> - memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
>> -
>> qemu_unregister_resettable(OBJECT(vmem->system_reset));
>> object_unref(OBJECT(vmem->system_reset));
>>
>> @@ -1155,6 +1154,11 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
>> host_memory_backend_set_mapped(vmem->memdev, false);
>> virtio_del_queue(vdev, 0);
>> virtio_cleanup(vdev);
>> + /*
>> + * The unplug handler unmapped the memory region, it cannot be
>> + * found via an address space anymore. Unset ourselves.
>> + */
>> + memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
>> g_free(vmem->bitmap);
>> ram_block_coordinated_discard_require(false);
>> }
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 3bebc43d59..390477b588 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -2487,13 +2487,13 @@ static inline bool memory_region_has_ram_discard_manager(MemoryRegion *mr)
>> *
>> * This function must not be called for a mapped #MemoryRegion, a #MemoryRegion
>> * that does not cover RAM, or a #MemoryRegion that already has a
>> - * #RamDiscardManager assigned.
>> + * #RamDiscardManager assigned. Return 0 if the rdm is set successfully.
>> *
>> * @mr: the #MemoryRegion
>> * @rdm: #RamDiscardManager to set
>> */
>> -void memory_region_set_ram_discard_manager(MemoryRegion *mr,
>> - RamDiscardManager *rdm);
>> +int memory_region_set_ram_discard_manager(MemoryRegion *mr,
>> + RamDiscardManager *rdm);
>>
>> /**
>> * memory_region_find: translate an address/size relative to a
>> diff --git a/system/memory.c b/system/memory.c
>> index b17b5538ff..297a3dbcd4 100644
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -2115,12 +2115,17 @@ RamDiscardManager *memory_region_get_ram_discard_manager(MemoryRegion *mr)
>> return mr->rdm;
>> }
>>
>> -void memory_region_set_ram_discard_manager(MemoryRegion *mr,
>> - RamDiscardManager *rdm)
>> +int memory_region_set_ram_discard_manager(MemoryRegion *mr,
>> + RamDiscardManager *rdm)
>> {
>> g_assert(memory_region_is_ram(mr));
>> - g_assert(!rdm || !mr->rdm);
>> + if (mr->rdm && rdm != NULL) {
>
> Drop "!= NULL".
>
>> + return -1;
>
> -EBUSY?
>
>> + }
>> +
>> + /* !rdm || !mr->rdm */
>
> See, like here - no "!= NULL" :) (and the comment is useless). Thanks,
>
>
>> mr->rdm = rdm;
>> + return 0;
>> }
>>
>> uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
>
Agreed to all, with that it LGTM, thanks!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 3/6] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd
2025-02-17 8:18 [PATCH v2 0/6] Enable shared device assignment Chenyi Qiang
2025-02-17 8:18 ` [PATCH v2 1/6] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
2025-02-17 8:18 ` [PATCH v2 2/6] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
@ 2025-02-17 8:18 ` Chenyi Qiang
2025-02-18 9:19 ` Alexey Kardashevskiy
2025-02-17 8:18 ` [PATCH v2 4/6] memory-attribute-manager: Introduce a callback to notify the shared/private state change Chenyi Qiang
` (2 subsequent siblings)
5 siblings, 1 reply; 18+ messages in thread
From: Chenyi Qiang @ 2025-02-17 8:18 UTC (permalink / raw)
To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun, Li Xiaoyao
As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
uncoordinated discard") highlighted, some subsystems like VFIO may
disable ram block discard. However, guest_memfd relies on the discard
operation to perform page conversion between private and shared memory.
This can lead to a stale IOMMU mapping issue when assigning a hardware
device to a confidential VM via shared memory. To address this, it is
crucial to ensure that systems like VFIO refresh their IOMMU mappings.
RamDiscardManager is an existing concept (used by virtio-mem) to adjust
VFIO mappings in relation to VM page assignment. Effectively page
conversion is similar to hot-removing a page in one mode and adding it
back in the other. Therefore, similar actions are required for page
conversion events. Introduce the RamDiscardManager to guest_memfd to
facilitate this process.
Since guest_memfd is not an object, it cannot directly implement the
RamDiscardManager interface. One potential attempt is to implement it in
HostMemoryBackend. This is not appropriate because guest_memfd is per
RAMBlock. Some RAMBlocks have a memory backend but others do not. In
particular, the ones like virtual BIOS calling
memory_region_init_ram_guest_memfd() do not.
To manage the RAMBlocks with guest_memfd, define a new object named
MemoryAttributeManager to implement the RamDiscardManager interface. The
object stores guest_memfd information such as shared_bitmap, and handles
page conversion notification. Using the name of MemoryAttributeManager is
aimed to make it more generic. The term "Memory" encompasses not only RAM
but also private MMIO in TEE I/O, which might rely on this
object/interface to handle page conversion events in the future. The
term "Attribute" allows for the management of various attributes beyond
shared and private. For instance, it could support scenarios where
discard vs. populated and shared vs. private states co-exist, such as
supporting virtio-mem or something similar in the future.
In the current context, MemoryAttributeManager signifies discarded state
as private and populated state as shared. Memory state is tracked at the
host page size granularity, as the minimum memory conversion size can be one
page per request. Additionally, VFIO expects the DMA mapping for a
specific iova to be mapped and unmapped with the same granularity.
Confidential VMs may perform partial conversions, e.g. conversion
happens on a small region within a large region. To prevent such invalid
cases and until cut_mapping operation support is introduced, all
operations are performed with 4K granularity.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v2:
- Rename the object name to MemoryAttributeManager
- Rename the bitmap to shared_bitmap to make it more clear.
- Remove block_size field and get it from a helper. In future, we
can get the page_size from RAMBlock if necessary.
- Remove the unnecessary "struct" before GuestMemfdReplayData
- Remove the unnecessary g_free() for the bitmap
- Add error reporting when a callback fails for a
populated/discarded section.
- Move the realize()/unrealize() definition to this patch.
---
include/system/memory-attribute-manager.h | 42 ++++
system/memory-attribute-manager.c | 292 ++++++++++++++++++++++
system/meson.build | 1 +
3 files changed, 335 insertions(+)
create mode 100644 include/system/memory-attribute-manager.h
create mode 100644 system/memory-attribute-manager.c
diff --git a/include/system/memory-attribute-manager.h b/include/system/memory-attribute-manager.h
new file mode 100644
index 0000000000..72adc0028e
--- /dev/null
+++ b/include/system/memory-attribute-manager.h
@@ -0,0 +1,42 @@
+/*
+ * QEMU memory attribute manager
+ *
+ * Copyright Intel
+ *
+ * Author:
+ * Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#ifndef SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
+#define SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
+
+#include "system/hostmem.h"
+
+#define TYPE_MEMORY_ATTRIBUTE_MANAGER "memory-attribute-manager"
+
+OBJECT_DECLARE_TYPE(MemoryAttributeManager, MemoryAttributeManagerClass, MEMORY_ATTRIBUTE_MANAGER)
+
+struct MemoryAttributeManager {
+ Object parent;
+
+ MemoryRegion *mr;
+
+ /* 1-setting of the bit represents the memory is populated (shared) */
+ int32_t bitmap_size;
+ unsigned long *shared_bitmap;
+
+ QLIST_HEAD(, RamDiscardListener) rdl_list;
+};
+
+struct MemoryAttributeManagerClass {
+ ObjectClass parent_class;
+};
+
+int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr);
+void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
+
+#endif
diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
new file mode 100644
index 0000000000..ed97e43dd0
--- /dev/null
+++ b/system/memory-attribute-manager.c
@@ -0,0 +1,292 @@
+/*
+ * QEMU memory attribute manager
+ *
+ * Copyright Intel
+ *
+ * Author:
+ * Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "system/memory-attribute-manager.h"
+
+OBJECT_DEFINE_TYPE_WITH_INTERFACES(MemoryAttributeManager,
+ memory_attribute_manager,
+ MEMORY_ATTRIBUTE_MANAGER,
+ OBJECT,
+ { TYPE_RAM_DISCARD_MANAGER },
+ { })
+
+static int memory_attribute_manager_get_block_size(const MemoryAttributeManager *mgr)
+{
+ /*
+ * Because page conversion could be manipulated in the size of at least 4K or 4K aligned,
+ * Use the host page size as the granularity to track the memory attribute.
+ * TODO: if necessary, switch to get the page_size from RAMBlock.
+ * i.e. mgr->mr->ram_block->page_size.
+ */
+ return qemu_real_host_page_size();
+}
+
+
+static bool memory_attribute_rdm_is_populated(const RamDiscardManager *rdm,
+ const MemoryRegionSection *section)
+{
+ const MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
+ int block_size = memory_attribute_manager_get_block_size(mgr);
+ uint64_t first_bit = section->offset_within_region / block_size;
+ uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
+ unsigned long first_discard_bit;
+
+ first_discard_bit = find_next_zero_bit(mgr->shared_bitmap, last_bit + 1, first_bit);
+ return first_discard_bit > last_bit;
+}
+
+typedef int (*memory_attribute_section_cb)(MemoryRegionSection *s, void *arg);
+
+static int memory_attribute_notify_populate_cb(MemoryRegionSection *section, void *arg)
+{
+ RamDiscardListener *rdl = arg;
+
+ return rdl->notify_populate(rdl, section);
+}
+
+static int memory_attribute_notify_discard_cb(MemoryRegionSection *section, void *arg)
+{
+ RamDiscardListener *rdl = arg;
+
+ rdl->notify_discard(rdl, section);
+
+ return 0;
+}
+
+static int memory_attribute_for_each_populated_section(const MemoryAttributeManager *mgr,
+ MemoryRegionSection *section,
+ void *arg,
+ memory_attribute_section_cb cb)
+{
+ unsigned long first_one_bit, last_one_bit;
+ uint64_t offset, size;
+ int block_size = memory_attribute_manager_get_block_size(mgr);
+ int ret = 0;
+
+ first_one_bit = section->offset_within_region / block_size;
+ first_one_bit = find_next_bit(mgr->shared_bitmap, mgr->bitmap_size, first_one_bit);
+
+ while (first_one_bit < mgr->bitmap_size) {
+ MemoryRegionSection tmp = *section;
+
+ offset = first_one_bit * block_size;
+ last_one_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->bitmap_size,
+ first_one_bit + 1) - 1;
+ size = (last_one_bit - first_one_bit + 1) * block_size;
+
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+ break;
+ }
+
+ ret = cb(&tmp, arg);
+ if (ret) {
+ error_report("%s: Failed to notify RAM discard listener: %s", __func__,
+ strerror(-ret));
+ break;
+ }
+
+ first_one_bit = find_next_bit(mgr->shared_bitmap, mgr->bitmap_size,
+ last_one_bit + 2);
+ }
+
+ return ret;
+}
+
+static int memory_attribute_for_each_discarded_section(const MemoryAttributeManager *mgr,
+ MemoryRegionSection *section,
+ void *arg,
+ memory_attribute_section_cb cb)
+{
+ unsigned long first_zero_bit, last_zero_bit;
+ uint64_t offset, size;
+ int block_size = memory_attribute_manager_get_block_size(mgr);
+ int ret = 0;
+
+ first_zero_bit = section->offset_within_region / block_size;
+ first_zero_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->bitmap_size,
+ first_zero_bit);
+
+ while (first_zero_bit < mgr->bitmap_size) {
+ MemoryRegionSection tmp = *section;
+
+ offset = first_zero_bit * block_size;
+ last_zero_bit = find_next_bit(mgr->shared_bitmap, mgr->bitmap_size,
+ first_zero_bit + 1) - 1;
+ size = (last_zero_bit - first_zero_bit + 1) * block_size;
+
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+ break;
+ }
+
+ ret = cb(&tmp, arg);
+ if (ret) {
+ error_report("%s: Failed to notify RAM discard listener: %s", __func__,
+ strerror(-ret));
+ break;
+ }
+
+ first_zero_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->bitmap_size,
+ last_zero_bit + 2);
+ }
+
+ return ret;
+}
+
+static uint64_t memory_attribute_rdm_get_min_granularity(const RamDiscardManager *rdm,
+ const MemoryRegion *mr)
+{
+ MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
+
+ g_assert(mr == mgr->mr);
+ return memory_attribute_manager_get_block_size(mgr);
+}
+
+static void memory_attribute_rdm_register_listener(RamDiscardManager *rdm,
+ RamDiscardListener *rdl,
+ MemoryRegionSection *section)
+{
+ MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
+ int ret;
+
+ g_assert(section->mr == mgr->mr);
+ rdl->section = memory_region_section_new_copy(section);
+
+ QLIST_INSERT_HEAD(&mgr->rdl_list, rdl, next);
+
+ ret = memory_attribute_for_each_populated_section(mgr, section, rdl,
+ memory_attribute_notify_populate_cb);
+ if (ret) {
+ error_report("%s: Failed to register RAM discard listener: %s", __func__,
+ strerror(-ret));
+ }
+}
+
+static void memory_attribute_rdm_unregister_listener(RamDiscardManager *rdm,
+ RamDiscardListener *rdl)
+{
+ MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
+ int ret;
+
+ g_assert(rdl->section);
+ g_assert(rdl->section->mr == mgr->mr);
+
+ ret = memory_attribute_for_each_populated_section(mgr, rdl->section, rdl,
+ memory_attribute_notify_discard_cb);
+ if (ret) {
+ error_report("%s: Failed to unregister RAM discard listener: %s", __func__,
+ strerror(-ret));
+ }
+
+ memory_region_section_free_copy(rdl->section);
+ rdl->section = NULL;
+ QLIST_REMOVE(rdl, next);
+
+}
+
+typedef struct MemoryAttributeReplayData {
+ void *fn;
+ void *opaque;
+} MemoryAttributeReplayData;
+
+static int memory_attribute_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
+{
+ MemoryAttributeReplayData *data = arg;
+
+ return ((ReplayRamPopulate)data->fn)(section, data->opaque);
+}
+
+static int memory_attribute_rdm_replay_populated(const RamDiscardManager *rdm,
+ MemoryRegionSection *section,
+ ReplayRamPopulate replay_fn,
+ void *opaque)
+{
+ MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
+ MemoryAttributeReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+ g_assert(section->mr == mgr->mr);
+ return memory_attribute_for_each_populated_section(mgr, section, &data,
+ memory_attribute_rdm_replay_populated_cb);
+}
+
+static int memory_attribute_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
+{
+ MemoryAttributeReplayData *data = arg;
+
+ ((ReplayRamDiscard)data->fn)(section, data->opaque);
+ return 0;
+}
+
+static void memory_attribute_rdm_replay_discarded(const RamDiscardManager *rdm,
+ MemoryRegionSection *section,
+ ReplayRamDiscard replay_fn,
+ void *opaque)
+{
+ MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
+ MemoryAttributeReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+ g_assert(section->mr == mgr->mr);
+ memory_attribute_for_each_discarded_section(mgr, section, &data,
+ memory_attribute_rdm_replay_discarded_cb);
+}
+
+int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr)
+{
+ uint64_t bitmap_size;
+ int block_size = memory_attribute_manager_get_block_size(mgr);
+ int ret;
+
+ bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
+
+ mgr->mr = mr;
+ mgr->bitmap_size = bitmap_size;
+ mgr->shared_bitmap = bitmap_new(bitmap_size);
+
+ ret = memory_region_set_ram_discard_manager(mgr->mr, RAM_DISCARD_MANAGER(mgr));
+ if (ret) {
+ g_free(mgr->shared_bitmap);
+ }
+
+ return ret;
+}
+
+void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr)
+{
+ memory_region_set_ram_discard_manager(mgr->mr, NULL);
+
+ g_free(mgr->shared_bitmap);
+}
+
+static void memory_attribute_manager_init(Object *obj)
+{
+ MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(obj);
+
+ QLIST_INIT(&mgr->rdl_list);
+}
+
+static void memory_attribute_manager_finalize(Object *obj)
+{
+}
+
+static void memory_attribute_manager_class_init(ObjectClass *oc, void *data)
+{
+ RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
+
+ rdmc->get_min_granularity = memory_attribute_rdm_get_min_granularity;
+ rdmc->register_listener = memory_attribute_rdm_register_listener;
+ rdmc->unregister_listener = memory_attribute_rdm_unregister_listener;
+ rdmc->is_populated = memory_attribute_rdm_is_populated;
+ rdmc->replay_populated = memory_attribute_rdm_replay_populated;
+ rdmc->replay_discarded = memory_attribute_rdm_replay_discarded;
+}
diff --git a/system/meson.build b/system/meson.build
index 4952f4b2c7..ab07ff1442 100644
--- a/system/meson.build
+++ b/system/meson.build
@@ -15,6 +15,7 @@ system_ss.add(files(
'dirtylimit.c',
'dma-helpers.c',
'globals.c',
+ 'memory-attribute-manager.c',
'memory_mapping.c',
'qdev-monitor.c',
'qtest.c',
--
2.43.5
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2 3/6] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd
2025-02-17 8:18 ` [PATCH v2 3/6] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd Chenyi Qiang
@ 2025-02-18 9:19 ` Alexey Kardashevskiy
2025-02-19 1:20 ` Chenyi Qiang
0 siblings, 1 reply; 18+ messages in thread
From: Alexey Kardashevskiy @ 2025-02-18 9:19 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
Li Xiaoyao
On 17/2/25 19:18, Chenyi Qiang wrote:
> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
> uncoordinated discard") highlighted, some subsystems like VFIO may
> disable ram block discard. However, guest_memfd relies on the discard
> operation to perform page conversion between private and shared memory.
> This can lead to stale IOMMU mapping issue when assigning a hardware
> device to a confidential VM via shared memory. To address this, it is
> crucial to ensure systems like VFIO refresh its IOMMU mappings.
>
> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
> VFIO mappings in relation to VM page assignment. Effectively page
> conversion is similar to hot-removing a page in one mode and adding it
> back in the other. Therefore, similar actions are required for page
> conversion events. Introduce the RamDiscardManager to guest_memfd to
> facilitate this process.
>
> Since guest_memfd is not an object, it cannot directly implement the
> RamDiscardManager interface. One potential attempt is to implement it in
> HostMemoryBackend. This is not appropriate because guest_memfd is per
> RAMBlock. Some RAMBlocks have a memory backend but others do not. In
> particular, the ones like virtual BIOS calling
> memory_region_init_ram_guest_memfd() do not.
>
> To manage the RAMBlocks with guest_memfd, define a new object named
> MemoryAttributeManager to implement the RamDiscardManager interface. The
> object stores guest_memfd information such as shared_bitmap, and handles
> page conversion notification. Using the name of MemoryAttributeManager is
> aimed to make it more generic. The term "Memory" encompasses not only RAM
> but also private MMIO in TEE I/O, which might rely on this
> object/interface to handle page conversion events in the future. The
> term "Attribute" allows for the management of various attributes beyond
> shared and private. For instance, it could support scenarios where
> discard vs. populated and shared vs. private states co-exist, such as
> supporting virtio-mem or something similar in the future.
>
> In the current context, MemoryAttributeManager signifies discarded state
> as private and populated state as shared. Memory state is tracked at the
> host page size granularity, as the minimum memory conversion size can be one
> page per request. Additionally, VFIO expects the DMA mapping for a
> specific iova to be mapped and unmapped with the same granularity.
> Confidential VMs may perform partial conversions, e.g. conversion
> happens on a small region within a large region. To prevent such invalid
> cases and until cut_mapping operation support is introduced, all
> operations are performed with 4K granularity.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v2:
> - Rename the object name to MemoryAttributeManager
> - Rename the bitmap to shared_bitmap to make it more clear.
> - Remove block_size field and get it from a helper. In future, we
> can get the page_size from RAMBlock if necessary.
> - Remove the unnecessary "struct" before GuestMemfdReplayData
> - Remove the unnecessary g_free() for the bitmap
> - Add error reporting when a callback fails for a
> populated/discarded section.
> - Move the realize()/unrealize() definition to this patch.
> ---
> include/system/memory-attribute-manager.h | 42 ++++
> system/memory-attribute-manager.c | 292 ++++++++++++++++++++++
> system/meson.build | 1 +
> 3 files changed, 335 insertions(+)
> create mode 100644 include/system/memory-attribute-manager.h
> create mode 100644 system/memory-attribute-manager.c
>
> diff --git a/include/system/memory-attribute-manager.h b/include/system/memory-attribute-manager.h
> new file mode 100644
> index 0000000000..72adc0028e
> --- /dev/null
> +++ b/include/system/memory-attribute-manager.h
> @@ -0,0 +1,42 @@
> +/*
> + * QEMU memory attribute manager
> + *
> + * Copyright Intel
> + *
> + * Author:
> + * Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#ifndef SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
> +#define SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
> +
> +#include "system/hostmem.h"
> +
> +#define TYPE_MEMORY_ATTRIBUTE_MANAGER "memory-attribute-manager"
> +
> +OBJECT_DECLARE_TYPE(MemoryAttributeManager, MemoryAttributeManagerClass, MEMORY_ATTRIBUTE_MANAGER)
> +
> +struct MemoryAttributeManager {
> + Object parent;
> +
> + MemoryRegion *mr;
> +
> + /* 1-setting of the bit represents the memory is populated (shared) */
> + int32_t bitmap_size;
unsigned.
Also, do either s/bitmap_size/shared_bitmap_size/ or
s/shared_bitmap/bitmap/
> + unsigned long *shared_bitmap;
> +
> + QLIST_HEAD(, RamDiscardListener) rdl_list;
> +};
> +
> +struct MemoryAttributeManagerClass {
> + ObjectClass parent_class;
> +};
> +
> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr);
> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
> +
> +#endif
> diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
> new file mode 100644
> index 0000000000..ed97e43dd0
> --- /dev/null
> +++ b/system/memory-attribute-manager.c
> @@ -0,0 +1,292 @@
> +/*
> + * QEMU memory attribute manager
> + *
> + * Copyright Intel
> + *
> + * Author:
> + * Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "system/memory-attribute-manager.h"
> +
> +OBJECT_DEFINE_TYPE_WITH_INTERFACES(MemoryAttributeManager,
> + memory_attribute_manager,
> + MEMORY_ATTRIBUTE_MANAGER,
> + OBJECT,
> + { TYPE_RAM_DISCARD_MANAGER },
> + { })
> +
> +static int memory_attribute_manager_get_block_size(const MemoryAttributeManager *mgr)
> +{
> + /*
> + * Because page conversion could be manipulated in the size of at least 4K or 4K aligned,
> + * Use the host page size as the granularity to track the memory attribute.
> + * TODO: if necessary, switch to get the page_size from RAMBlock.
> + * i.e. mgr->mr->ram_block->page_size.
I'd assume it is rather necessary already.
> + */
> + return qemu_real_host_page_size();
> +}
> +
> +
> +static bool memory_attribute_rdm_is_populated(const RamDiscardManager *rdm,
> + const MemoryRegionSection *section)
> +{
> + const MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
> + int block_size = memory_attribute_manager_get_block_size(mgr);
> + uint64_t first_bit = section->offset_within_region / block_size;
> + uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
> + unsigned long first_discard_bit;
> +
> + first_discard_bit = find_next_zero_bit(mgr->shared_bitmap, last_bit + 1, first_bit);
> + return first_discard_bit > last_bit;
> +}
> +
> +typedef int (*memory_attribute_section_cb)(MemoryRegionSection *s, void *arg);
> +
> +static int memory_attribute_notify_populate_cb(MemoryRegionSection *section, void *arg)
> +{
> + RamDiscardListener *rdl = arg;
> +
> + return rdl->notify_populate(rdl, section);
> +}
> +
> +static int memory_attribute_notify_discard_cb(MemoryRegionSection *section, void *arg)
> +{
> + RamDiscardListener *rdl = arg;
> +
> + rdl->notify_discard(rdl, section);
> +
> + return 0;
> +}
> +
> +static int memory_attribute_for_each_populated_section(const MemoryAttributeManager *mgr,
> + MemoryRegionSection *section,
> + void *arg,
> + memory_attribute_section_cb cb)
> +{
> + unsigned long first_one_bit, last_one_bit;
> + uint64_t offset, size;
> + int block_size = memory_attribute_manager_get_block_size(mgr);
> + int ret = 0;
> +
> + first_one_bit = section->offset_within_region / block_size;
> + first_one_bit = find_next_bit(mgr->shared_bitmap, mgr->bitmap_size, first_one_bit);
> +
> + while (first_one_bit < mgr->bitmap_size) {
> + MemoryRegionSection tmp = *section;
> +
> + offset = first_one_bit * block_size;
> + last_one_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->bitmap_size,
> + first_one_bit + 1) - 1;
> + size = (last_one_bit - first_one_bit + 1) * block_size;
What all this math is for if we stuck with VFIO doing 1 page at the
time? (I think I commented on this)
> +
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> + break;
> + }
> +
> + ret = cb(&tmp, arg);
> + if (ret) {
> + error_report("%s: Failed to notify RAM discard listener: %s", __func__,
> + strerror(-ret));
> + break;
> + }
> +
> + first_one_bit = find_next_bit(mgr->shared_bitmap, mgr->bitmap_size,
> + last_one_bit + 2);
> + }
> +
> + return ret;
> +}
> +
> +static int memory_attribute_for_each_discarded_section(const MemoryAttributeManager *mgr,
> + MemoryRegionSection *section,
> + void *arg,
> + memory_attribute_section_cb cb)
> +{
> + unsigned long first_zero_bit, last_zero_bit;
> + uint64_t offset, size;
> + int block_size = memory_attribute_manager_get_block_size(mgr);
> + int ret = 0;
> +
> + first_zero_bit = section->offset_within_region / block_size;
> + first_zero_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->bitmap_size,
> + first_zero_bit);
> +
> + while (first_zero_bit < mgr->bitmap_size) {
> + MemoryRegionSection tmp = *section;
> +
> + offset = first_zero_bit * block_size;
> + last_zero_bit = find_next_bit(mgr->shared_bitmap, mgr->bitmap_size,
> + first_zero_bit + 1) - 1;
> + size = (last_zero_bit - first_zero_bit + 1) * block_size;
> +
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> + break;
> + }
> +
> + ret = cb(&tmp, arg);
> + if (ret) {
> + error_report("%s: Failed to notify RAM discard listener: %s", __func__,
> + strerror(-ret));
> + break;
> + }
> +
> + first_zero_bit = find_next_zero_bit(mgr->shared_bitmap, mgr->bitmap_size,
> + last_zero_bit + 2);
> + }
> +
> + return ret;
> +}
> +
> +static uint64_t memory_attribute_rdm_get_min_granularity(const RamDiscardManager *rdm,
> + const MemoryRegion *mr)
> +{
> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
> +
> + g_assert(mr == mgr->mr);
> + return memory_attribute_manager_get_block_size(mgr);
> +}
> +
> +static void memory_attribute_rdm_register_listener(RamDiscardManager *rdm,
> + RamDiscardListener *rdl,
> + MemoryRegionSection *section)
> +{
> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
> + int ret;
> +
> + g_assert(section->mr == mgr->mr);
> + rdl->section = memory_region_section_new_copy(section);
> +
> + QLIST_INSERT_HEAD(&mgr->rdl_list, rdl, next);
> +
> + ret = memory_attribute_for_each_populated_section(mgr, section, rdl,
> + memory_attribute_notify_populate_cb);
> + if (ret) {
> + error_report("%s: Failed to register RAM discard listener: %s", __func__,
> + strerror(-ret));
> + }
> +}
> +
> +static void memory_attribute_rdm_unregister_listener(RamDiscardManager *rdm,
> + RamDiscardListener *rdl)
> +{
> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
> + int ret;
> +
> + g_assert(rdl->section);
> + g_assert(rdl->section->mr == mgr->mr);
> +
> + ret = memory_attribute_for_each_populated_section(mgr, rdl->section, rdl,
> + memory_attribute_notify_discard_cb);
> + if (ret) {
> + error_report("%s: Failed to unregister RAM discard listener: %s", __func__,
> + strerror(-ret));
> + }
> +
> + memory_region_section_free_copy(rdl->section);
> + rdl->section = NULL;
> + QLIST_REMOVE(rdl, next);
> +
> +}
> +
> +typedef struct MemoryAttributeReplayData {
> + void *fn;
ReplayRamDiscard *fn, not void*.
> + void *opaque;
> +} MemoryAttributeReplayData;
> +
> +static int memory_attribute_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
> +{
> + MemoryAttributeReplayData *data = arg;
> +
> + return ((ReplayRamPopulate)data->fn)(section, data->opaque);
> +}
> +
> +static int memory_attribute_rdm_replay_populated(const RamDiscardManager *rdm,
> + MemoryRegionSection *section,
> + ReplayRamPopulate replay_fn,
> + void *opaque)
> +{
> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
> + MemoryAttributeReplayData data = { .fn = replay_fn, .opaque = opaque };
> +
> + g_assert(section->mr == mgr->mr);
> + return memory_attribute_for_each_populated_section(mgr, section, &data,
> + memory_attribute_rdm_replay_populated_cb);
> +}
> +
> +static int memory_attribute_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
> +{
> + MemoryAttributeReplayData *data = arg;
> +
> + ((ReplayRamDiscard)data->fn)(section, data->opaque);
> + return 0;
> +}
> +
> +static void memory_attribute_rdm_replay_discarded(const RamDiscardManager *rdm,
> + MemoryRegionSection *section,
> + ReplayRamDiscard replay_fn,
> + void *opaque)
> +{
> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
> + MemoryAttributeReplayData data = { .fn = replay_fn, .opaque = opaque };
> +
> + g_assert(section->mr == mgr->mr);
> + memory_attribute_for_each_discarded_section(mgr, section, &data,
> + memory_attribute_rdm_replay_discarded_cb);
> +}
> +
> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr)
> +{
> + uint64_t bitmap_size;
> + int block_size = memory_attribute_manager_get_block_size(mgr);
> + int ret;
> +
> + bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
> +
> + mgr->mr = mr;
> + mgr->bitmap_size = bitmap_size;
> + mgr->shared_bitmap = bitmap_new(bitmap_size);
> +
> + ret = memory_region_set_ram_discard_manager(mgr->mr, RAM_DISCARD_MANAGER(mgr));
Move it 3 lines up and avoid stale data in
mgr->mr/bitmap_size/shared_bitmap and avoid g_free below?
> + if (ret) {
> + g_free(mgr->shared_bitmap);
> + }
> +
> + return ret;
> +}
> +
> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr)
> +{
> + memory_region_set_ram_discard_manager(mgr->mr, NULL);
> +
> + g_free(mgr->shared_bitmap);
> +}
> +
> +static void memory_attribute_manager_init(Object *obj)
Not used.
> +{
> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(obj);
> +
> + QLIST_INIT(&mgr->rdl_list);
> +} > +
> +static void memory_attribute_manager_finalize(Object *obj)
Not used either. Thanks,
> +{
> +}
> +
> +static void memory_attribute_manager_class_init(ObjectClass *oc, void *data)
> +{
> + RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
> +
> + rdmc->get_min_granularity = memory_attribute_rdm_get_min_granularity;
> + rdmc->register_listener = memory_attribute_rdm_register_listener;
> + rdmc->unregister_listener = memory_attribute_rdm_unregister_listener;
> + rdmc->is_populated = memory_attribute_rdm_is_populated;
> + rdmc->replay_populated = memory_attribute_rdm_replay_populated;
> + rdmc->replay_discarded = memory_attribute_rdm_replay_discarded;
> +}
> diff --git a/system/meson.build b/system/meson.build
> index 4952f4b2c7..ab07ff1442 100644
> --- a/system/meson.build
> +++ b/system/meson.build
> @@ -15,6 +15,7 @@ system_ss.add(files(
> 'dirtylimit.c',
> 'dma-helpers.c',
> 'globals.c',
> + 'memory-attribute-manager.c',
> 'memory_mapping.c',
> 'qdev-monitor.c',
> 'qtest.c',
--
Alexey
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 3/6] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd
2025-02-18 9:19 ` Alexey Kardashevskiy
@ 2025-02-19 1:20 ` Chenyi Qiang
2025-02-19 3:49 ` Alexey Kardashevskiy
0 siblings, 1 reply; 18+ messages in thread
From: Chenyi Qiang @ 2025-02-19 1:20 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
Li Xiaoyao
On 2/18/2025 5:19 PM, Alexey Kardashevskiy wrote:
>
>
[..]
>> diff --git a/include/system/memory-attribute-manager.h b/include/
>> system/memory-attribute-manager.h
>> new file mode 100644
>> index 0000000000..72adc0028e
>> --- /dev/null
>> +++ b/include/system/memory-attribute-manager.h
>> @@ -0,0 +1,42 @@
>> +/*
>> + * QEMU memory attribute manager
>> + *
>> + * Copyright Intel
>> + *
>> + * Author:
>> + * Chenyi Qiang <chenyi.qiang@intel.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> later.
>> + * See the COPYING file in the top-level directory
>> + *
>> + */
>> +
>> +#ifndef SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
>> +#define SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
>> +
>> +#include "system/hostmem.h"
>> +
>> +#define TYPE_MEMORY_ATTRIBUTE_MANAGER "memory-attribute-manager"
>> +
>> +OBJECT_DECLARE_TYPE(MemoryAttributeManager,
>> MemoryAttributeManagerClass, MEMORY_ATTRIBUTE_MANAGER)
>> +
>> +struct MemoryAttributeManager {
>> + Object parent;
>> +
>> + MemoryRegion *mr;
>> +
>> + /* 1-setting of the bit represents the memory is populated
>> (shared) */
>> + int32_t bitmap_size;
>
> unsigned.
>
> Also, do either s/bitmap_size/shared_bitmap_size/ or
> s/shared_bitmap/bitmap/
Will change it. Thanks.
>
>
>
>> + unsigned long *shared_bitmap;
>> +
>> + QLIST_HEAD(, RamDiscardListener) rdl_list;
>> +};
>> +
>> +struct MemoryAttributeManagerClass {
>> + ObjectClass parent_class;
>> +};
>> +
>> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr,
>> MemoryRegion *mr);
>> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
>> +
>> +#endif
>> diff --git a/system/memory-attribute-manager.c b/system/memory-
>> attribute-manager.c
>> new file mode 100644
>> index 0000000000..ed97e43dd0
>> --- /dev/null
>> +++ b/system/memory-attribute-manager.c
>> @@ -0,0 +1,292 @@
>> +/*
>> + * QEMU memory attribute manager
>> + *
>> + * Copyright Intel
>> + *
>> + * Author:
>> + * Chenyi Qiang <chenyi.qiang@intel.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> later.
>> + * See the COPYING file in the top-level directory
>> + *
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>> +#include "system/memory-attribute-manager.h"
>> +
>> +OBJECT_DEFINE_TYPE_WITH_INTERFACES(MemoryAttributeManager,
>> + memory_attribute_manager,
>> + MEMORY_ATTRIBUTE_MANAGER,
>> + OBJECT,
>> + { TYPE_RAM_DISCARD_MANAGER },
>> + { })
>> +
>> +static int memory_attribute_manager_get_block_size(const
>> MemoryAttributeManager *mgr)
>> +{
>> + /*
>> + * Because page conversion could be manipulated in the size of at
>> least 4K or 4K aligned,
>> + * Use the host page size as the granularity to track the memory
>> attribute.
>> + * TODO: if necessary, switch to get the page_size from RAMBlock.
>> + * i.e. mgr->mr->ram_block->page_size.
>
> I'd assume it is rather necessary already.
OK, Will return the page_size of RAMBlock directly.
>
>> + */
>> + return qemu_real_host_page_size();
>> +}
>> +
>> +
>> +static bool memory_attribute_rdm_is_populated(const RamDiscardManager
>> *rdm,
>> + const
>> MemoryRegionSection *section)
>> +{
>> + const MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>> + uint64_t first_bit = section->offset_within_region / block_size;
>> + uint64_t last_bit = first_bit + int128_get64(section->size) /
>> block_size - 1;
>> + unsigned long first_discard_bit;
>> +
>> + first_discard_bit = find_next_zero_bit(mgr->shared_bitmap,
>> last_bit + 1, first_bit);
>> + return first_discard_bit > last_bit;
>> +}
>> +
>> +typedef int (*memory_attribute_section_cb)(MemoryRegionSection *s,
>> void *arg);
>> +
>> +static int memory_attribute_notify_populate_cb(MemoryRegionSection
>> *section, void *arg)
>> +{
>> + RamDiscardListener *rdl = arg;
>> +
>> + return rdl->notify_populate(rdl, section);
>> +}
>> +
>> +static int memory_attribute_notify_discard_cb(MemoryRegionSection
>> *section, void *arg)
>> +{
>> + RamDiscardListener *rdl = arg;
>> +
>> + rdl->notify_discard(rdl, section);
>> +
>> + return 0;
>> +}
>> +
>> +static int memory_attribute_for_each_populated_section(const
>> MemoryAttributeManager *mgr,
>> +
>> MemoryRegionSection *section,
>> + void *arg,
>> +
>> memory_attribute_section_cb cb)
>> +{
>> + unsigned long first_one_bit, last_one_bit;
>> + uint64_t offset, size;
>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>> + int ret = 0;
>> +
>> + first_one_bit = section->offset_within_region / block_size;
>> + first_one_bit = find_next_bit(mgr->shared_bitmap, mgr-
>> >bitmap_size, first_one_bit);
>> +
>> + while (first_one_bit < mgr->bitmap_size) {
>> + MemoryRegionSection tmp = *section;
>> +
>> + offset = first_one_bit * block_size;
>> + last_one_bit = find_next_zero_bit(mgr->shared_bitmap, mgr-
>> >bitmap_size,
>> + first_one_bit + 1) - 1;
>> + size = (last_one_bit - first_one_bit + 1) * block_size;
>
>
> What all this math is for if we stuck with VFIO doing 1 page at the
> time? (I think I commented on this)
Sorry, I missed your previous comment. IMHO, since we track the status in
the bitmap and we want to call the cb() on the shared part within the
MemoryRegionSection, we do the calculation here to find the expected
sub-range.
>
>> +
>> + if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> + break;
>> + }
>> +
>> + ret = cb(&tmp, arg);
>> + if (ret) {
>> + error_report("%s: Failed to notify RAM discard listener:
>> %s", __func__,
>> + strerror(-ret));
>> + break;
>> + }
>> +
>> + first_one_bit = find_next_bit(mgr->shared_bitmap, mgr-
>> >bitmap_size,
>> + last_one_bit + 2);
>> + }
>> +
>> + return ret;
>> +}
>> +
[..]
>> +
>> +static void
>> memory_attribute_rdm_unregister_listener(RamDiscardManager *rdm,
>> +
>> RamDiscardListener *rdl)
>> +{
>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>> + int ret;
>> +
>> + g_assert(rdl->section);
>> + g_assert(rdl->section->mr == mgr->mr);
>> +
>> + ret = memory_attribute_for_each_populated_section(mgr, rdl-
>> >section, rdl,
>> +
>> memory_attribute_notify_discard_cb);
>> + if (ret) {
>> + error_report("%s: Failed to unregister RAM discard listener:
>> %s", __func__,
>> + strerror(-ret));
>> + }
>> +
>> + memory_region_section_free_copy(rdl->section);
>> + rdl->section = NULL;
>> + QLIST_REMOVE(rdl, next);
>> +
>> +}
>> +
>> +typedef struct MemoryAttributeReplayData {
>> + void *fn;
>
> ReplayRamDiscard *fn, not void*.
We could cast the void *fn either to ReplayRamPopulate or
ReplayRamDiscard (see below).
>
>> + void *opaque;
>> +} MemoryAttributeReplayData;
>> +
>> +static int
>> memory_attribute_rdm_replay_populated_cb(MemoryRegionSection *section,
>> void *arg)
>> +{
>> + MemoryAttributeReplayData *data = arg;
>> +
>> + return ((ReplayRamPopulate)data->fn)(section, data->opaque);
>> +}
>> +
>> +static int memory_attribute_rdm_replay_populated(const
>> RamDiscardManager *rdm,
>> + MemoryRegionSection
>> *section,
>> + ReplayRamPopulate
>> replay_fn,
>> + void *opaque)
>> +{
>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>> + MemoryAttributeReplayData data = { .fn = replay_fn, .opaque =
>> opaque };
>> +
>> + g_assert(section->mr == mgr->mr);
>> + return memory_attribute_for_each_populated_section(mgr, section,
>> &data,
>> +
>> memory_attribute_rdm_replay_populated_cb);
>> +}
>> +
>> +static int
>> memory_attribute_rdm_replay_discarded_cb(MemoryRegionSection *section,
>> void *arg)
>> +{
>> + MemoryAttributeReplayData *data = arg;
>> +
>> + ((ReplayRamDiscard)data->fn)(section, data->opaque);
>> + return 0;
>> +}
>> +
>> +static void memory_attribute_rdm_replay_discarded(const
>> RamDiscardManager *rdm,
>> + MemoryRegionSection
>> *section,
>> + ReplayRamDiscard
>> replay_fn,
>> + void *opaque)
>> +{
>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>> + MemoryAttributeReplayData data = { .fn = replay_fn, .opaque =
>> opaque };
>> +
>> + g_assert(section->mr == mgr->mr);
>> + memory_attribute_for_each_discarded_section(mgr, section, &data,
>> +
>> memory_attribute_rdm_replay_discarded_cb);
>> +}
>> +
>> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr,
>> MemoryRegion *mr)
>> +{
>> + uint64_t bitmap_size;
>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>> + int ret;
>> +
>> + bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
>> +
>> + mgr->mr = mr;
>> + mgr->bitmap_size = bitmap_size;
>> + mgr->shared_bitmap = bitmap_new(bitmap_size);
>> +
>> + ret = memory_region_set_ram_discard_manager(mgr->mr,
>> RAM_DISCARD_MANAGER(mgr));
>
> Move it 3 lines up and avoid stale data in mgr->mr/bitmap_size/
> shared_bitmap and avoid g_free below?
Makes sense. I will move it up before bitmap_new(), the same as in patch 02.
>
>> + if (ret) {
>> + g_free(mgr->shared_bitmap);
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr)
>> +{
>> + memory_region_set_ram_discard_manager(mgr->mr, NULL);
>> +
>> + g_free(mgr->shared_bitmap);
>> +}
>> +
>> +static void memory_attribute_manager_init(Object *obj)
>
> Not used.
>
>> +{
>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(obj);
>> +
>> + QLIST_INIT(&mgr->rdl_list);
>> +} > +
>> +static void memory_attribute_manager_finalize(Object *obj)
>
> Not used either. Thanks,
I think it is OK to define it as a placeholder? Just some preference.
>
>> +{
>> +}
>> +
>> +static void memory_attribute_manager_class_init(ObjectClass *oc, void
>> *data)
>> +{
>> + RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>> +
>> + rdmc->get_min_granularity =
>> memory_attribute_rdm_get_min_granularity;
>> + rdmc->register_listener = memory_attribute_rdm_register_listener;
>> + rdmc->unregister_listener =
>> memory_attribute_rdm_unregister_listener;
>> + rdmc->is_populated = memory_attribute_rdm_is_populated;
>> + rdmc->replay_populated = memory_attribute_rdm_replay_populated;
>> + rdmc->replay_discarded = memory_attribute_rdm_replay_discarded;
>> +}
>> diff --git a/system/meson.build b/system/meson.build
>> index 4952f4b2c7..ab07ff1442 100644
>> --- a/system/meson.build
>> +++ b/system/meson.build
>> @@ -15,6 +15,7 @@ system_ss.add(files(
>> 'dirtylimit.c',
>> 'dma-helpers.c',
>> 'globals.c',
>> + 'memory-attribute-manager.c',
>> 'memory_mapping.c',
>> 'qdev-monitor.c',
>> 'qtest.c',
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 3/6] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd
2025-02-19 1:20 ` Chenyi Qiang
@ 2025-02-19 3:49 ` Alexey Kardashevskiy
2025-02-19 6:33 ` Chenyi Qiang
0 siblings, 1 reply; 18+ messages in thread
From: Alexey Kardashevskiy @ 2025-02-19 3:49 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
Li Xiaoyao
On 19/2/25 12:20, Chenyi Qiang wrote:
>
>
> On 2/18/2025 5:19 PM, Alexey Kardashevskiy wrote:
>>
>>
>
> [..]
>
>>> diff --git a/include/system/memory-attribute-manager.h b/include/
>>> system/memory-attribute-manager.h
>>> new file mode 100644
>>> index 0000000000..72adc0028e
>>> --- /dev/null
>>> +++ b/include/system/memory-attribute-manager.h
>>> @@ -0,0 +1,42 @@
>>> +/*
>>> + * QEMU memory attribute manager
>>> + *
>>> + * Copyright Intel
>>> + *
>>> + * Author:
>>> + * Chenyi Qiang <chenyi.qiang@intel.com>
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>> later.
>>> + * See the COPYING file in the top-level directory
>>> + *
>>> + */
>>> +
>>> +#ifndef SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
>>> +#define SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
>>> +
>>> +#include "system/hostmem.h"
>>> +
>>> +#define TYPE_MEMORY_ATTRIBUTE_MANAGER "memory-attribute-manager"
>>> +
>>> +OBJECT_DECLARE_TYPE(MemoryAttributeManager,
>>> MemoryAttributeManagerClass, MEMORY_ATTRIBUTE_MANAGER)
>>> +
>>> +struct MemoryAttributeManager {
>>> + Object parent;
>>> +
>>> + MemoryRegion *mr;
>>> +
>>> + /* 1-setting of the bit represents the memory is populated
>>> (shared) */
>>> + int32_t bitmap_size;
>>
>> unsigned.
>>
>> Also, do either s/bitmap_size/shared_bitmap_size/ or
>> s/shared_bitmap/bitmap/
>
> Will change it. Thanks.
>
>>
>>
>>
>>> + unsigned long *shared_bitmap;
>>> +
>>> + QLIST_HEAD(, RamDiscardListener) rdl_list;
>>> +};
>>> +
>>> +struct MemoryAttributeManagerClass {
>>> + ObjectClass parent_class;
>>> +};
>>> +
>>> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr,
>>> MemoryRegion *mr);
>>> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
>>> +
>>> +#endif
>>> diff --git a/system/memory-attribute-manager.c b/system/memory-
>>> attribute-manager.c
>>> new file mode 100644
>>> index 0000000000..ed97e43dd0
>>> --- /dev/null
>>> +++ b/system/memory-attribute-manager.c
>>> @@ -0,0 +1,292 @@
>>> +/*
>>> + * QEMU memory attribute manager
>>> + *
>>> + * Copyright Intel
>>> + *
>>> + * Author:
>>> + * Chenyi Qiang <chenyi.qiang@intel.com>
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>> later.
>>> + * See the COPYING file in the top-level directory
>>> + *
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "qemu/error-report.h"
>>> +#include "system/memory-attribute-manager.h"
>>> +
>>> +OBJECT_DEFINE_TYPE_WITH_INTERFACES(MemoryAttributeManager,
>>> + memory_attribute_manager,
>>> + MEMORY_ATTRIBUTE_MANAGER,
>>> + OBJECT,
>>> + { TYPE_RAM_DISCARD_MANAGER },
>>> + { })
>>> +
>>> +static int memory_attribute_manager_get_block_size(const
>>> MemoryAttributeManager *mgr)
>>> +{
>>> + /*
>>> + * Because page conversion could be manipulated in the size of at
>>> least 4K or 4K aligned,
>>> + * Use the host page size as the granularity to track the memory
>>> attribute.
>>> + * TODO: if necessary, switch to get the page_size from RAMBlock.
>>> + * i.e. mgr->mr->ram_block->page_size.
>>
>> I'd assume it is rather necessary already.
>
> OK, Will return the page_size of RAMBlock directly.
>
>>
>>> + */
>>> + return qemu_real_host_page_size();
>>> +}
>>> +
>>> +
>>> +static bool memory_attribute_rdm_is_populated(const RamDiscardManager
>>> *rdm,
>>> + const
>>> MemoryRegionSection *section)
>>> +{
>>> + const MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>>> + uint64_t first_bit = section->offset_within_region / block_size;
>>> + uint64_t last_bit = first_bit + int128_get64(section->size) /
>>> block_size - 1;
>>> + unsigned long first_discard_bit;
>>> +
>>> + first_discard_bit = find_next_zero_bit(mgr->shared_bitmap,
>>> last_bit + 1, first_bit);
>>> + return first_discard_bit > last_bit;
>>> +}
>>> +
>>> +typedef int (*memory_attribute_section_cb)(MemoryRegionSection *s,
>>> void *arg);
>>> +
>>> +static int memory_attribute_notify_populate_cb(MemoryRegionSection
>>> *section, void *arg)
>>> +{
>>> + RamDiscardListener *rdl = arg;
>>> +
>>> + return rdl->notify_populate(rdl, section);
>>> +}
>>> +
>>> +static int memory_attribute_notify_discard_cb(MemoryRegionSection
>>> *section, void *arg)
>>> +{
>>> + RamDiscardListener *rdl = arg;
>>> +
>>> + rdl->notify_discard(rdl, section);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int memory_attribute_for_each_populated_section(const
>>> MemoryAttributeManager *mgr,
>>> +
>>> MemoryRegionSection *section,
>>> + void *arg,
>>> +
>>> memory_attribute_section_cb cb)
>>> +{
>>> + unsigned long first_one_bit, last_one_bit;
>>> + uint64_t offset, size;
>>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>>> + int ret = 0;
>>> +
>>> + first_one_bit = section->offset_within_region / block_size;
>>> + first_one_bit = find_next_bit(mgr->shared_bitmap, mgr-
>>>> bitmap_size, first_one_bit);
>>> +
>>> + while (first_one_bit < mgr->bitmap_size) {
>>> + MemoryRegionSection tmp = *section;
>>> +
>>> + offset = first_one_bit * block_size;
>>> + last_one_bit = find_next_zero_bit(mgr->shared_bitmap, mgr-
>>>> bitmap_size,
>>> + first_one_bit + 1) - 1;
>>> + size = (last_one_bit - first_one_bit + 1) * block_size;
>>
>>
>> What all this math is for if we stuck with VFIO doing 1 page at the
>> time? (I think I commented on this)
>
> Sorry, I missed your previous comment. IMHO, as we track the status in
> bitmap and we want to call the cb() on the shared part within
> MemoryRegionSection. Here we do the calculation to find the expected
> sub-range.
You find the largest intersection here and call cb() on it, which will call
VFIO with one page at a time. So you could just call cb() for every page
from here, which would make the code simpler.
>>
>>> +
>>> + if (!memory_region_section_intersect_range(&tmp, offset,
>>> size)) {
>>> + break;
>>> + }
>>> +
>>> + ret = cb(&tmp, arg);
>>> + if (ret) {
>>> + error_report("%s: Failed to notify RAM discard listener:
>>> %s", __func__,
>>> + strerror(-ret));
>>> + break;
>>> + }
>>> +
>>> + first_one_bit = find_next_bit(mgr->shared_bitmap, mgr-
>>>> bitmap_size,
>>> + last_one_bit + 2);
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +
>
> [..]
>
>>> +
>>> +static void
>>> memory_attribute_rdm_unregister_listener(RamDiscardManager *rdm,
>>> +
>>> RamDiscardListener *rdl)
>>> +{
>>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>>> + int ret;
>>> +
>>> + g_assert(rdl->section);
>>> + g_assert(rdl->section->mr == mgr->mr);
>>> +
>>> + ret = memory_attribute_for_each_populated_section(mgr, rdl-
>>>> section, rdl,
>>> +
>>> memory_attribute_notify_discard_cb);
>>> + if (ret) {
>>> + error_report("%s: Failed to unregister RAM discard listener:
>>> %s", __func__,
>>> + strerror(-ret));
>>> + }
>>> +
>>> + memory_region_section_free_copy(rdl->section);
>>> + rdl->section = NULL;
>>> + QLIST_REMOVE(rdl, next);
>>> +
>>> +}
>>> +
>>> +typedef struct MemoryAttributeReplayData {
>>> + void *fn;
>>
>> ReplayRamDiscard *fn, not void*.
>
> We could cast the void *fn either to ReplayRamPopulate or
> ReplayRamDiscard (see below).
Hard to read, hard to maintain, and they take the same parameters; only the
return value is different (int/void) - if this is really important, have
2 fn pointers in MemoryAttributeReplayData. It is already hard to follow
this chain of callbacks.
>>> + void *opaque;
>>> +} MemoryAttributeReplayData;
>>> +
>>> +static int
>>> memory_attribute_rdm_replay_populated_cb(MemoryRegionSection *section,
>>> void *arg)
>>> +{
>>> + MemoryAttributeReplayData *data = arg;
>>> +
>>> + return ((ReplayRamPopulate)data->fn)(section, data->opaque);
>>> +}
>>> +
>>> +static int memory_attribute_rdm_replay_populated(const
>>> RamDiscardManager *rdm,
>>> + MemoryRegionSection
>>> *section,
>>> + ReplayRamPopulate
>>> replay_fn,
>>> + void *opaque)
>>> +{
>>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>>> + MemoryAttributeReplayData data = { .fn = replay_fn, .opaque =
>>> opaque };
>>> +
>>> + g_assert(section->mr == mgr->mr);
>>> + return memory_attribute_for_each_populated_section(mgr, section,
>>> &data,
>>> +
>>> memory_attribute_rdm_replay_populated_cb);
>>> +}
>>> +
>>> +static int
>>> memory_attribute_rdm_replay_discarded_cb(MemoryRegionSection *section,
>>> void *arg)
>>> +{
>>> + MemoryAttributeReplayData *data = arg;
>>> +
>>> + ((ReplayRamDiscard)data->fn)(section, data->opaque);
>>> + return 0;
>>> +}
>>> +
>>> +static void memory_attribute_rdm_replay_discarded(const
>>> RamDiscardManager *rdm,
>>> + MemoryRegionSection
>>> *section,
>>> + ReplayRamDiscard
>>> replay_fn,
>>> + void *opaque)
>>> +{
>>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>>> + MemoryAttributeReplayData data = { .fn = replay_fn, .opaque =
>>> opaque };
>>> +
>>> + g_assert(section->mr == mgr->mr);
>>> + memory_attribute_for_each_discarded_section(mgr, section, &data,
>>> +
>>> memory_attribute_rdm_replay_discarded_cb);
>>> +}
>>> +
>>> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr,
>>> MemoryRegion *mr)
>>> +{
>>> + uint64_t bitmap_size;
>>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>>> + int ret;
>>> +
>>> + bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
>>> +
>>> + mgr->mr = mr;
>>> + mgr->bitmap_size = bitmap_size;
>>> + mgr->shared_bitmap = bitmap_new(bitmap_size);
>>> +
>>> + ret = memory_region_set_ram_discard_manager(mgr->mr,
>>> RAM_DISCARD_MANAGER(mgr));
>>
>> Move it 3 lines up and avoid stale data in mgr->mr/bitmap_size/
>> shared_bitmap and avoid g_free below?
>
> Make sense. I will move it up the same as patch 02 before bitmap_new().
>
>>
>>> + if (ret) {
>>> + g_free(mgr->shared_bitmap);
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr)
>>> +{
>>> + memory_region_set_ram_discard_manager(mgr->mr, NULL);
>>> +
>>> + g_free(mgr->shared_bitmap);
>>> +}
>>> +
>>> +static void memory_attribute_manager_init(Object *obj)
>>
>> Not used.
>>
>>> +{
>>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(obj);
>>> +
>>> + QLIST_INIT(&mgr->rdl_list);
>>> +} > +
>>> +static void memory_attribute_manager_finalize(Object *obj)
>>
>> Not used either. Thanks,
>
> I think it is OK to define it as a placeholder? Just some preference.
At the very least, gcc should warn on these (I am surprised it did not), and
nobody likes this. Thanks,
>>
>>> +{
>>> +}
>>> +
>>> +static void memory_attribute_manager_class_init(ObjectClass *oc, void
>>> *data)
>>> +{
>>> + RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>> +
>>> + rdmc->get_min_granularity =
>>> memory_attribute_rdm_get_min_granularity;
>>> + rdmc->register_listener = memory_attribute_rdm_register_listener;
>>> + rdmc->unregister_listener =
>>> memory_attribute_rdm_unregister_listener;
>>> + rdmc->is_populated = memory_attribute_rdm_is_populated;
>>> + rdmc->replay_populated = memory_attribute_rdm_replay_populated;
>>> + rdmc->replay_discarded = memory_attribute_rdm_replay_discarded;
>>> +}
>>> diff --git a/system/meson.build b/system/meson.build
>>> index 4952f4b2c7..ab07ff1442 100644
>>> --- a/system/meson.build
>>> +++ b/system/meson.build
>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>> 'dirtylimit.c',
>>> 'dma-helpers.c',
>>> 'globals.c',
>>> + 'memory-attribute-manager.c',
>>> 'memory_mapping.c',
>>> 'qdev-monitor.c',
>>> 'qtest.c',
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 3/6] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd
2025-02-19 3:49 ` Alexey Kardashevskiy
@ 2025-02-19 6:33 ` Chenyi Qiang
2025-02-20 3:02 ` Alexey Kardashevskiy
0 siblings, 1 reply; 18+ messages in thread
From: Chenyi Qiang @ 2025-02-19 6:33 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
Li Xiaoyao
On 2/19/2025 11:49 AM, Alexey Kardashevskiy wrote:
>
>
> On 19/2/25 12:20, Chenyi Qiang wrote:
>>
>>
>> On 2/18/2025 5:19 PM, Alexey Kardashevskiy wrote:
>>>
>>>
>>
>> [..]
>>
>>>> diff --git a/include/system/memory-attribute-manager.h b/include/
>>>> system/memory-attribute-manager.h
>>>> new file mode 100644
>>>> index 0000000000..72adc0028e
>>>> --- /dev/null
>>>> +++ b/include/system/memory-attribute-manager.h
>>>> @@ -0,0 +1,42 @@
>>>> +/*
>>>> + * QEMU memory attribute manager
>>>> + *
>>>> + * Copyright Intel
>>>> + *
>>>> + * Author:
>>>> + * Chenyi Qiang <chenyi.qiang@intel.com>
>>>> + *
>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>>> later.
>>>> + * See the COPYING file in the top-level directory
>>>> + *
>>>> + */
>>>> +
>>>> +#ifndef SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
>>>> +#define SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
>>>> +
>>>> +#include "system/hostmem.h"
>>>> +
>>>> +#define TYPE_MEMORY_ATTRIBUTE_MANAGER "memory-attribute-manager"
>>>> +
>>>> +OBJECT_DECLARE_TYPE(MemoryAttributeManager,
>>>> MemoryAttributeManagerClass, MEMORY_ATTRIBUTE_MANAGER)
>>>> +
>>>> +struct MemoryAttributeManager {
>>>> + Object parent;
>>>> +
>>>> + MemoryRegion *mr;
>>>> +
>>>> + /* 1-setting of the bit represents the memory is populated
>>>> (shared) */
>>>> + int32_t bitmap_size;
>>>
>>> unsigned.
>>>
>>> Also, do either s/bitmap_size/shared_bitmap_size/ or
>>> s/shared_bitmap/bitmap/
>>
>> Will change it. Thanks.
>>
>>>
>>>
>>>
>>>> + unsigned long *shared_bitmap;
>>>> +
>>>> + QLIST_HEAD(, RamDiscardListener) rdl_list;
>>>> +};
>>>> +
>>>> +struct MemoryAttributeManagerClass {
>>>> + ObjectClass parent_class;
>>>> +};
>>>> +
>>>> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr,
>>>> MemoryRegion *mr);
>>>> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
>>>> +
>>>> +#endif
>>>> diff --git a/system/memory-attribute-manager.c b/system/memory-
>>>> attribute-manager.c
>>>> new file mode 100644
>>>> index 0000000000..ed97e43dd0
>>>> --- /dev/null
>>>> +++ b/system/memory-attribute-manager.c
>>>> @@ -0,0 +1,292 @@
>>>> +/*
>>>> + * QEMU memory attribute manager
>>>> + *
>>>> + * Copyright Intel
>>>> + *
>>>> + * Author:
>>>> + * Chenyi Qiang <chenyi.qiang@intel.com>
>>>> + *
>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>>> later.
>>>> + * See the COPYING file in the top-level directory
>>>> + *
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu/error-report.h"
>>>> +#include "system/memory-attribute-manager.h"
>>>> +
>>>> +OBJECT_DEFINE_TYPE_WITH_INTERFACES(MemoryAttributeManager,
>>>> + memory_attribute_manager,
>>>> + MEMORY_ATTRIBUTE_MANAGER,
>>>> + OBJECT,
>>>> + { TYPE_RAM_DISCARD_MANAGER },
>>>> + { })
>>>> +
>>>> +static int memory_attribute_manager_get_block_size(const
>>>> MemoryAttributeManager *mgr)
>>>> +{
>>>> + /*
>>>> + * Because page conversion could be manipulated in the size of at
>>>> least 4K or 4K aligned,
>>>> + * Use the host page size as the granularity to track the memory
>>>> attribute.
>>>> + * TODO: if necessary, switch to get the page_size from RAMBlock.
>>>> + * i.e. mgr->mr->ram_block->page_size.
>>>
>>> I'd assume it is rather necessary already.
>>
>> OK, Will return the page_size of RAMBlock directly.
>>
>>>
>>>> + */
>>>> + return qemu_real_host_page_size();
>>>> +}
>>>> +
>>>> +
>>>> +static bool memory_attribute_rdm_is_populated(const RamDiscardManager
>>>> *rdm,
>>>> + const
>>>> MemoryRegionSection *section)
>>>> +{
>>>> + const MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>>>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>>>> + uint64_t first_bit = section->offset_within_region / block_size;
>>>> + uint64_t last_bit = first_bit + int128_get64(section->size) /
>>>> block_size - 1;
>>>> + unsigned long first_discard_bit;
>>>> +
>>>> + first_discard_bit = find_next_zero_bit(mgr->shared_bitmap,
>>>> last_bit + 1, first_bit);
>>>> + return first_discard_bit > last_bit;
>>>> +}
>>>> +
>>>> +typedef int (*memory_attribute_section_cb)(MemoryRegionSection *s,
>>>> void *arg);
>>>> +
>>>> +static int memory_attribute_notify_populate_cb(MemoryRegionSection
>>>> *section, void *arg)
>>>> +{
>>>> + RamDiscardListener *rdl = arg;
>>>> +
>>>> + return rdl->notify_populate(rdl, section);
>>>> +}
>>>> +
>>>> +static int memory_attribute_notify_discard_cb(MemoryRegionSection
>>>> *section, void *arg)
>>>> +{
>>>> + RamDiscardListener *rdl = arg;
>>>> +
>>>> + rdl->notify_discard(rdl, section);
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static int memory_attribute_for_each_populated_section(const
>>>> MemoryAttributeManager *mgr,
>>>> +
>>>> MemoryRegionSection *section,
>>>> + void *arg,
>>>> +
>>>> memory_attribute_section_cb cb)
>>>> +{
>>>> + unsigned long first_one_bit, last_one_bit;
>>>> + uint64_t offset, size;
>>>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>>>> + int ret = 0;
>>>> +
>>>> + first_one_bit = section->offset_within_region / block_size;
>>>> + first_one_bit = find_next_bit(mgr->shared_bitmap, mgr-
>>>>> bitmap_size, first_one_bit);
>>>> +
>>>> + while (first_one_bit < mgr->bitmap_size) {
>>>> + MemoryRegionSection tmp = *section;
>>>> +
>>>> + offset = first_one_bit * block_size;
>>>> + last_one_bit = find_next_zero_bit(mgr->shared_bitmap, mgr-
>>>>> bitmap_size,
>>>> + first_one_bit + 1) - 1;
>>>> + size = (last_one_bit - first_one_bit + 1) * block_size;
>>>
>>>
>>> What all this math is for if we stuck with VFIO doing 1 page at the
>>> time? (I think I commented on this)
>>
>> Sorry, I missed your previous comment. IMHO, as we track the status in
>> bitmap and we want to call the cb() on the shared part within
>> MemoryRegionSection. Here we do the calculation to find the expected
>> sub-range.
>
>
> You find a largest intersection here and call cb() on it which will call
> VFIO with 1 page at the time. So you could just call cb() for every page
> from here which will make the code simpler.
I prefer to keep calling cb() on a large intersection. I think in the
future, after cut_mapping is supported, we won't need to make VFIO call one
page at a time; VFIO can operate on the large range directly.
In addition, calling cb() for every page seems specific to the VFIO usage.
It is more generic to call it on a large intersection. If more RDM listeners
are added in the future (although VFIO is the only user currently), doing
the split in the caller would be inefficient.
>
>
>>>
>>>> +
>>>> + if (!memory_region_section_intersect_range(&tmp, offset,
>>>> size)) {
>>>> + break;
>>>> + }
>>>> +
>>>> + ret = cb(&tmp, arg);
>>>> + if (ret) {
>>>> + error_report("%s: Failed to notify RAM discard listener:
>>>> %s", __func__,
>>>> + strerror(-ret));
>>>> + break;
>>>> + }
>>>> +
>>>> + first_one_bit = find_next_bit(mgr->shared_bitmap, mgr-
>>>>> bitmap_size,
>>>> + last_one_bit + 2);
>>>> + }
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>
>> [..]
>>
>>>> +
>>>> +static void
>>>> memory_attribute_rdm_unregister_listener(RamDiscardManager *rdm,
>>>> +
>>>> RamDiscardListener *rdl)
>>>> +{
>>>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>>>> + int ret;
>>>> +
>>>> + g_assert(rdl->section);
>>>> + g_assert(rdl->section->mr == mgr->mr);
>>>> +
>>>> + ret = memory_attribute_for_each_populated_section(mgr, rdl-
>>>>> section, rdl,
>>>> +
>>>> memory_attribute_notify_discard_cb);
>>>> + if (ret) {
>>>> + error_report("%s: Failed to unregister RAM discard listener:
>>>> %s", __func__,
>>>> + strerror(-ret));
>>>> + }
>>>> +
>>>> + memory_region_section_free_copy(rdl->section);
>>>> + rdl->section = NULL;
>>>> + QLIST_REMOVE(rdl, next);
>>>> +
>>>> +}
>>>> +
>>>> +typedef struct MemoryAttributeReplayData {
>>>> + void *fn;
>>>
>>> ReplayRamDiscard *fn, not void*.
>>
>> We could cast the void *fn either to ReplayRamPopulate or
>> ReplayRamDiscard (see below).
>
>
> Hard to read, hard to maintain, and they take same parameters, only the
> return value is different (int/void) - if this is really important, have
> 2 fn pointers in MemoryAttributeReplayData. It is already hard to follow
> this train on callbacks.
Actually, I prefer to unify ReplayRamDiscard and ReplayRamPopulate by
making ReplayRamDiscard() also return int. Then we only need to
define one function type like:
typedef int (*ReplayMemoryAttributeChange)(MemoryRegionSection *section,
void *opaque);
Maybe David can share his opinions.
>
>
>>>> + void *opaque;
>>>> +} MemoryAttributeReplayData;
>>>> +
>>>> +static int
>>>> memory_attribute_rdm_replay_populated_cb(MemoryRegionSection *section,
>>>> void *arg)
>>>> +{
>>>> + MemoryAttributeReplayData *data = arg;
>>>> +
>>>> + return ((ReplayRamPopulate)data->fn)(section, data->opaque);
>>>> +}
>>>> +
>>>> +static int memory_attribute_rdm_replay_populated(const
>>>> RamDiscardManager *rdm,
>>>> + MemoryRegionSection
>>>> *section,
>>>> + ReplayRamPopulate
>>>> replay_fn,
>>>> + void *opaque)
>>>> +{
>>>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>>>> + MemoryAttributeReplayData data = { .fn = replay_fn, .opaque =
>>>> opaque };
>>>> +
>>>> + g_assert(section->mr == mgr->mr);
>>>> + return memory_attribute_for_each_populated_section(mgr, section,
>>>> &data,
>>>> +
>>>> memory_attribute_rdm_replay_populated_cb);
>>>> +}
>>>> +
>>>> +static int
>>>> memory_attribute_rdm_replay_discarded_cb(MemoryRegionSection *section,
>>>> void *arg)
>>>> +{
>>>> + MemoryAttributeReplayData *data = arg;
>>>> +
>>>> + ((ReplayRamDiscard)data->fn)(section, data->opaque);
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static void memory_attribute_rdm_replay_discarded(const
>>>> RamDiscardManager *rdm,
>>>> + MemoryRegionSection
>>>> *section,
>>>> + ReplayRamDiscard
>>>> replay_fn,
>>>> + void *opaque)
>>>> +{
>>>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>>>> + MemoryAttributeReplayData data = { .fn = replay_fn, .opaque =
>>>> opaque };
>>>> +
>>>> + g_assert(section->mr == mgr->mr);
>>>> + memory_attribute_for_each_discarded_section(mgr, section, &data,
>>>> +
>>>> memory_attribute_rdm_replay_discarded_cb);
>>>> +}
>>>> +
>>>> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr,
>>>> MemoryRegion *mr)
>>>> +{
>>>> + uint64_t bitmap_size;
>>>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>>>> + int ret;
>>>> +
>>>> + bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
>>>> +
>>>> + mgr->mr = mr;
>>>> + mgr->bitmap_size = bitmap_size;
>>>> + mgr->shared_bitmap = bitmap_new(bitmap_size);
>>>> +
>>>> + ret = memory_region_set_ram_discard_manager(mgr->mr,
>>>> RAM_DISCARD_MANAGER(mgr));
>>>
>>> Move it 3 lines up and avoid stale data in mgr->mr/bitmap_size/
>>> shared_bitmap and avoid g_free below?
>>
>> Make sense. I will move it up the same as patch 02 before bitmap_new().
>>
>>>
>>>> + if (ret) {
>>>> + g_free(mgr->shared_bitmap);
>>>> + }
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr)
>>>> +{
>>>> + memory_region_set_ram_discard_manager(mgr->mr, NULL);
>>>> +
>>>> + g_free(mgr->shared_bitmap);
>>>> +}
>>>> +
>>>> +static void memory_attribute_manager_init(Object *obj)
>>>
>>> Not used.
>>>
>>>> +{
>>>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(obj);
>>>> +
>>>> + QLIST_INIT(&mgr->rdl_list);
>>>> +}
>>>> +static void memory_attribute_manager_finalize(Object *obj)
>>>
>>> Not used either. Thanks,
>>
>> I think it is OK to define it as a placeholder? Just some preference.
>
> At the very least gcc should warn on these (I am surprised it did not) and
> nobody likes this. Thanks,
I tried a little. They must be defined. The init() and finalize() calls
are used in the OBJECT_DEFINE_TYPE_WITH_INTERFACES() macro. I think it
is a common template to define in this way.
>
>
>>>
>>>> +{
>>>> +}
>>>> +
>>>> +static void memory_attribute_manager_class_init(ObjectClass *oc, void
>>>> *data)
>>>> +{
>>>> + RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>> +
>>>> + rdmc->get_min_granularity =
>>>> memory_attribute_rdm_get_min_granularity;
>>>> + rdmc->register_listener = memory_attribute_rdm_register_listener;
>>>> + rdmc->unregister_listener =
>>>> memory_attribute_rdm_unregister_listener;
>>>> + rdmc->is_populated = memory_attribute_rdm_is_populated;
>>>> + rdmc->replay_populated = memory_attribute_rdm_replay_populated;
>>>> + rdmc->replay_discarded = memory_attribute_rdm_replay_discarded;
>>>> +}
>>>> diff --git a/system/meson.build b/system/meson.build
>>>> index 4952f4b2c7..ab07ff1442 100644
>>>> --- a/system/meson.build
>>>> +++ b/system/meson.build
>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>> 'dirtylimit.c',
>>>> 'dma-helpers.c',
>>>> 'globals.c',
>>>> + 'memory-attribute-manager.c',
>>>> 'memory_mapping.c',
>>>> 'qdev-monitor.c',
>>>> 'qtest.c',
>>>
>>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 3/6] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd
2025-02-19 6:33 ` Chenyi Qiang
@ 2025-02-20 3:02 ` Alexey Kardashevskiy
0 siblings, 0 replies; 18+ messages in thread
From: Alexey Kardashevskiy @ 2025-02-20 3:02 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
Li Xiaoyao
On 19/2/25 17:33, Chenyi Qiang wrote:
>
>
> On 2/19/2025 11:49 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 19/2/25 12:20, Chenyi Qiang wrote:
>>>
>>>
>>> On 2/18/2025 5:19 PM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>
>>> [..]
>>>
>>>>> diff --git a/include/system/memory-attribute-manager.h b/include/
>>>>> system/memory-attribute-manager.h
>>>>> new file mode 100644
>>>>> index 0000000000..72adc0028e
>>>>> --- /dev/null
>>>>> +++ b/include/system/memory-attribute-manager.h
>>>>> @@ -0,0 +1,42 @@
>>>>> +/*
>>>>> + * QEMU memory attribute manager
>>>>> + *
>>>>> + * Copyright Intel
>>>>> + *
>>>>> + * Author:
>>>>> + * Chenyi Qiang <chenyi.qiang@intel.com>
>>>>> + *
>>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>>>> later.
>>>>> + * See the COPYING file in the top-level directory
>>>>> + *
>>>>> + */
>>>>> +
>>>>> +#ifndef SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
>>>>> +#define SYSTEM_MEMORY_ATTRIBUTE_MANAGER_H
>>>>> +
>>>>> +#include "system/hostmem.h"
>>>>> +
>>>>> +#define TYPE_MEMORY_ATTRIBUTE_MANAGER "memory-attribute-manager"
>>>>> +
>>>>> +OBJECT_DECLARE_TYPE(MemoryAttributeManager,
>>>>> MemoryAttributeManagerClass, MEMORY_ATTRIBUTE_MANAGER)
>>>>> +
>>>>> +struct MemoryAttributeManager {
>>>>> + Object parent;
>>>>> +
>>>>> + MemoryRegion *mr;
>>>>> +
>>>>> + /* 1-setting of the bit represents the memory is populated
>>>>> (shared) */
>>>>> + int32_t bitmap_size;
>>>>
>>>> unsigned.
>>>>
>>>> Also, do either s/bitmap_size/shared_bitmap_size/ or
>>>> s/shared_bitmap/bitmap/
>>>
>>> Will change it. Thanks.
>>>
>>>>
>>>>
>>>>
>>>>> + unsigned long *shared_bitmap;
>>>>> +
>>>>> + QLIST_HEAD(, RamDiscardListener) rdl_list;
>>>>> +};
>>>>> +
>>>>> +struct MemoryAttributeManagerClass {
>>>>> + ObjectClass parent_class;
>>>>> +};
>>>>> +
>>>>> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr,
>>>>> MemoryRegion *mr);
>>>>> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
>>>>> +
>>>>> +#endif
>>>>> diff --git a/system/memory-attribute-manager.c b/system/memory-
>>>>> attribute-manager.c
>>>>> new file mode 100644
>>>>> index 0000000000..ed97e43dd0
>>>>> --- /dev/null
>>>>> +++ b/system/memory-attribute-manager.c
>>>>> @@ -0,0 +1,292 @@
>>>>> +/*
>>>>> + * QEMU memory attribute manager
>>>>> + *
>>>>> + * Copyright Intel
>>>>> + *
>>>>> + * Author:
>>>>> + * Chenyi Qiang <chenyi.qiang@intel.com>
>>>>> + *
>>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>>>> later.
>>>>> + * See the COPYING file in the top-level directory
>>>>> + *
>>>>> + */
>>>>> +
>>>>> +#include "qemu/osdep.h"
>>>>> +#include "qemu/error-report.h"
>>>>> +#include "system/memory-attribute-manager.h"
>>>>> +
>>>>> +OBJECT_DEFINE_TYPE_WITH_INTERFACES(MemoryAttributeManager,
>>>>> + memory_attribute_manager,
>>>>> + MEMORY_ATTRIBUTE_MANAGER,
>>>>> + OBJECT,
>>>>> + { TYPE_RAM_DISCARD_MANAGER },
>>>>> + { })
>>>>> +
>>>>> +static int memory_attribute_manager_get_block_size(const
>>>>> MemoryAttributeManager *mgr)
>>>>> +{
>>>>> + /*
>>>>> + * Because page conversion could be manipulated in the size of at
>>>>> least 4K or 4K aligned,
>>>>> + * Use the host page size as the granularity to track the memory
>>>>> attribute.
>>>>> + * TODO: if necessary, switch to get the page_size from RAMBlock.
>>>>> + * i.e. mgr->mr->ram_block->page_size.
>>>>
>>>> I'd assume it is rather necessary already.
>>>
>>> OK, Will return the page_size of RAMBlock directly.
>>>
>>>>
>>>>> + */
>>>>> + return qemu_real_host_page_size();
>>>>> +}
>>>>> +
>>>>> +
>>>>> +static bool memory_attribute_rdm_is_populated(const RamDiscardManager
>>>>> *rdm,
>>>>> + const
>>>>> MemoryRegionSection *section)
>>>>> +{
>>>>> + const MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>>>>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>>>>> + uint64_t first_bit = section->offset_within_region / block_size;
>>>>> + uint64_t last_bit = first_bit + int128_get64(section->size) /
>>>>> block_size - 1;
>>>>> + unsigned long first_discard_bit;
>>>>> +
>>>>> + first_discard_bit = find_next_zero_bit(mgr->shared_bitmap,
>>>>> last_bit + 1, first_bit);
>>>>> + return first_discard_bit > last_bit;
>>>>> +}
>>>>> +
>>>>> +typedef int (*memory_attribute_section_cb)(MemoryRegionSection *s,
>>>>> void *arg);
>>>>> +
>>>>> +static int memory_attribute_notify_populate_cb(MemoryRegionSection
>>>>> *section, void *arg)
>>>>> +{
>>>>> + RamDiscardListener *rdl = arg;
>>>>> +
>>>>> + return rdl->notify_populate(rdl, section);
>>>>> +}
>>>>> +
>>>>> +static int memory_attribute_notify_discard_cb(MemoryRegionSection
>>>>> *section, void *arg)
>>>>> +{
>>>>> + RamDiscardListener *rdl = arg;
>>>>> +
>>>>> + rdl->notify_discard(rdl, section);
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static int memory_attribute_for_each_populated_section(const
>>>>> MemoryAttributeManager *mgr,
>>>>> +
>>>>> MemoryRegionSection *section,
>>>>> + void *arg,
>>>>> +
>>>>> memory_attribute_section_cb cb)
>>>>> +{
>>>>> + unsigned long first_one_bit, last_one_bit;
btw s/first_one_bit/first/ and s/last_one_bit/last/ as it is quite
obvious from the code what these are.
>>>>> + uint64_t offset, size;
>>>>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>>>>> + int ret = 0;
>>>>> +
>>>>> + first_one_bit = section->offset_within_region / block_size;
>>>>> + first_one_bit = find_next_bit(mgr->shared_bitmap, mgr-
>>>>>> bitmap_size, first_one_bit);
>>>>> +
>>>>> + while (first_one_bit < mgr->bitmap_size) {
>>>>> + MemoryRegionSection tmp = *section;
>>>>> +
>>>>> + offset = first_one_bit * block_size;
>>>>> + last_one_bit = find_next_zero_bit(mgr->shared_bitmap, mgr-
>>>>>> bitmap_size,
>>>>> + first_one_bit + 1) - 1;
>>>>> + size = (last_one_bit - first_one_bit + 1) * block_size;
>>>>
>>>>
>>>> What all this math is for if we stuck with VFIO doing 1 page at the
>>>> time? (I think I commented on this)
>>>
>>> Sorry, I missed your previous comment. IMHO, as we track the status in
>>> bitmap and we want to call the cb() on the shared part within
>>> MemoryRegionSection. Here we do the calculation to find the expected
>>> sub-range.
>>
>>
>> You find a largest intersection here and call cb() on it which will call
>> VFIO with 1 page at the time. So you could just call cb() for every page
>> from here which will make the code simpler.
>
> I prefer to keep calling cb() on a large intersection. I think in
> future after cut_mapping is supported, we don't need to make VFIO call 1
> page at a time. VFIO can call on the large range directly.
> In addition, calling cb() for every page seems specific to VFIO usage.
> It is more generic to call on a large intersection. If more RDM listeners
> are added in the future (although VFIO is the only user currently), doing
> the split in the caller is inefficient.
It is a hardly measurable optimization though. Could be a separate
patch. I do not insist, just do not see the point. If others are fine, I
am fine too :)
>
>>
>>
>>>>
>>>>> +
>>>>> + if (!memory_region_section_intersect_range(&tmp, offset,
>>>>> size)) {
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> + ret = cb(&tmp, arg);
>>>>> + if (ret) {
>>>>> + error_report("%s: Failed to notify RAM discard listener:
>>>>> %s", __func__,
>>>>> + strerror(-ret));
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> + first_one_bit = find_next_bit(mgr->shared_bitmap, mgr-
>>>>>> bitmap_size,
>>>>> + last_one_bit + 2);
>>>>> + }
>>>>> +
>>>>> + return ret;
>>>>> +}
>>>>> +
>>>
>>> [..]
>>>
>>>>> +
>>>>> +static void
>>>>> memory_attribute_rdm_unregister_listener(RamDiscardManager *rdm,
>>>>> +
>>>>> RamDiscardListener *rdl)
>>>>> +{
>>>>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>>>>> + int ret;
>>>>> +
>>>>> + g_assert(rdl->section);
>>>>> + g_assert(rdl->section->mr == mgr->mr);
>>>>> +
>>>>> + ret = memory_attribute_for_each_populated_section(mgr, rdl-
>>>>>> section, rdl,
>>>>> +
>>>>> memory_attribute_notify_discard_cb);
>>>>> + if (ret) {
>>>>> + error_report("%s: Failed to unregister RAM discard listener:
>>>>> %s", __func__,
>>>>> + strerror(-ret));
>>>>> + }
>>>>> +
>>>>> + memory_region_section_free_copy(rdl->section);
>>>>> + rdl->section = NULL;
>>>>> + QLIST_REMOVE(rdl, next);
>>>>> +
>>>>> +}
>>>>> +
>>>>> +typedef struct MemoryAttributeReplayData {
>>>>> + void *fn;
>>>>
>>>> ReplayRamDiscard *fn, not void*.
>>>
>>> We could cast the void *fn either to ReplayRamPopulate or
>>> ReplayRamDiscard (see below).
>>
>>
>> Hard to read, hard to maintain, and they take the same parameters, only the
>> return value is different (int/void) - if this is really important, have
>> 2 fn pointers in MemoryAttributeReplayData. It is already hard to follow
>> this train of callbacks.
>
> Actually, I prefer to make ReplayRamDiscard and ReplayRamPopulate
> unified. Make ReplayRamDiscard() also return int. Then we only need to
> define one function like:
>
> typedef int (*ReplayMemoryAttributeChange)(MemoryRegionSection *section,
> void *opaque);
This should work.
>
> Maybe David can share his opinions.
>>
>>
>>>>> + void *opaque;
>>>>> +} MemoryAttributeReplayData;
>>>>> +
>>>>> +static int
>>>>> memory_attribute_rdm_replay_populated_cb(MemoryRegionSection *section,
>>>>> void *arg)
>>>>> +{
>>>>> + MemoryAttributeReplayData *data = arg;
>>>>> +
>>>>> + return ((ReplayRamPopulate)data->fn)(section, data->opaque);
>>>>> +}
>>>>> +
>>>>> +static int memory_attribute_rdm_replay_populated(const
>>>>> RamDiscardManager *rdm,
>>>>> + MemoryRegionSection
>>>>> *section,
>>>>> + ReplayRamPopulate
>>>>> replay_fn,
>>>>> + void *opaque)
>>>>> +{
>>>>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>>>>> + MemoryAttributeReplayData data = { .fn = replay_fn, .opaque =
>>>>> opaque };
>>>>> +
>>>>> + g_assert(section->mr == mgr->mr);
>>>>> + return memory_attribute_for_each_populated_section(mgr, section,
>>>>> &data,
>>>>> +
>>>>> memory_attribute_rdm_replay_populated_cb);
>>>>> +}
>>>>> +
>>>>> +static int
>>>>> memory_attribute_rdm_replay_discarded_cb(MemoryRegionSection *section,
>>>>> void *arg)
>>>>> +{
>>>>> + MemoryAttributeReplayData *data = arg;
>>>>> +
>>>>> + ((ReplayRamDiscard)data->fn)(section, data->opaque);
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static void memory_attribute_rdm_replay_discarded(const
>>>>> RamDiscardManager *rdm,
>>>>> + MemoryRegionSection
>>>>> *section,
>>>>> + ReplayRamDiscard
>>>>> replay_fn,
>>>>> + void *opaque)
>>>>> +{
>>>>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(rdm);
>>>>> + MemoryAttributeReplayData data = { .fn = replay_fn, .opaque =
>>>>> opaque };
>>>>> +
>>>>> + g_assert(section->mr == mgr->mr);
>>>>> + memory_attribute_for_each_discarded_section(mgr, section, &data,
>>>>> +
>>>>> memory_attribute_rdm_replay_discarded_cb);
>>>>> +}
>>>>> +
>>>>> +int memory_attribute_manager_realize(MemoryAttributeManager *mgr,
>>>>> MemoryRegion *mr)
>>>>> +{
>>>>> + uint64_t bitmap_size;
>>>>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>>>>> + int ret;
>>>>> +
>>>>> + bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
>>>>> +
>>>>> + mgr->mr = mr;
>>>>> + mgr->bitmap_size = bitmap_size;
>>>>> + mgr->shared_bitmap = bitmap_new(bitmap_size);
>>>>> +
>>>>> + ret = memory_region_set_ram_discard_manager(mgr->mr,
>>>>> RAM_DISCARD_MANAGER(mgr));
>>>>
>>>> Move it 3 lines up and avoid stale data in mgr->mr/bitmap_size/
>>>> shared_bitmap and avoid g_free below?
>>>
>>> Make sense. I will move it up the same as patch 02 before bitmap_new().
>>>
>>>>
>>>>> + if (ret) {
>>>>> + g_free(mgr->shared_bitmap);
>>>>> + }
>>>>> +
>>>>> + return ret;
>>>>> +}
>>>>> +
>>>>> +void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr)
>>>>> +{
>>>>> + memory_region_set_ram_discard_manager(mgr->mr, NULL);
>>>>> +
>>>>> + g_free(mgr->shared_bitmap);
>>>>> +}
>>>>> +
>>>>> +static void memory_attribute_manager_init(Object *obj)
>>>>
>>>> Not used.
>>>>
>>>>> +{
>>>>> + MemoryAttributeManager *mgr = MEMORY_ATTRIBUTE_MANAGER(obj);
>>>>> +
>>>>> + QLIST_INIT(&mgr->rdl_list);
>>>>> +}
>>>>> +static void memory_attribute_manager_finalize(Object *obj)
>>>>
>>>> Not used either. Thanks,
>>>
>>> I think it is OK to define it as a placeholder? Just some preference.
>>
>> At the very least gcc should warn on these (I am surprised it did not) and
>> nobody likes this. Thanks,
>
> I tried a little. They must be defined. The init() and finalize() calls
> are used in the OBJECT_DEFINE_TYPE_WITH_INTERFACES() macro. I think it
> is a common template to define in this way.
ah, I missed that. OBJECT_DEFINE_TYPE_WITH_INTERFACES means they have to
be defined, never mind that. Thanks,
>>
>>
>>>>
>>>>> +{
>>>>> +}
>>>>> +
>>>>> +static void memory_attribute_manager_class_init(ObjectClass *oc, void
>>>>> *data)
>>>>> +{
>>>>> + RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>>> +
>>>>> + rdmc->get_min_granularity =
>>>>> memory_attribute_rdm_get_min_granularity;
>>>>> + rdmc->register_listener = memory_attribute_rdm_register_listener;
>>>>> + rdmc->unregister_listener =
>>>>> memory_attribute_rdm_unregister_listener;
>>>>> + rdmc->is_populated = memory_attribute_rdm_is_populated;
>>>>> + rdmc->replay_populated = memory_attribute_rdm_replay_populated;
>>>>> + rdmc->replay_discarded = memory_attribute_rdm_replay_discarded;
>>>>> +}
>>>>> diff --git a/system/meson.build b/system/meson.build
>>>>> index 4952f4b2c7..ab07ff1442 100644
>>>>> --- a/system/meson.build
>>>>> +++ b/system/meson.build
>>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>>> 'dirtylimit.c',
>>>>> 'dma-helpers.c',
>>>>> 'globals.c',
>>>>> + 'memory-attribute-manager.c',
>>>>> 'memory_mapping.c',
>>>>> 'qdev-monitor.c',
>>>>> 'qtest.c',
>>>>
>>>
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 4/6] memory-attribute-manager: Introduce a callback to notify the shared/private state change
2025-02-17 8:18 [PATCH v2 0/6] Enable shared device assignment Chenyi Qiang
` (2 preceding siblings ...)
2025-02-17 8:18 ` [PATCH v2 3/6] memory-attribute-manager: Introduce MemoryAttributeManager to manage RAMBLock with guest_memfd Chenyi Qiang
@ 2025-02-17 8:18 ` Chenyi Qiang
2025-02-18 9:19 ` Alexey Kardashevskiy
2025-02-17 8:18 ` [PATCH v2 5/6] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks Chenyi Qiang
2025-02-17 8:18 ` [PATCH v2 6/6] RAMBlock: Make guest_memfd require coordinate discard Chenyi Qiang
5 siblings, 1 reply; 18+ messages in thread
From: Chenyi Qiang @ 2025-02-17 8:18 UTC (permalink / raw)
To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun, Li Xiaoyao
Introduce a new state_change() callback in MemoryAttributeManagerClass to
efficiently notify all registered RamDiscardListeners, including VFIO
listeners about the memory conversion events in guest_memfd. The
existing VFIO listener can dynamically DMA map/unmap the shared pages
based on conversion types:
- For conversions from shared to private, the VFIO system ensures the
discarding of shared mapping from the IOMMU.
- For conversions from private to shared, it triggers the population of
the shared mapping into the IOMMU.
Additionally, there could be some special conversion requests:
- When a conversion request is made for a page already in the desired
state, the helper simply returns success.
- For requests involving a range partially in the desired state, only
the necessary segments are converted, ensuring the entire range
complies with the request efficiently.
- In scenarios where a conversion request is declined by other systems,
such as a failure from VFIO during notify_populate(), the helper will
roll back the request, maintaining consistency.
Opportunistically introduce a helper to trigger the state_change()
callback of the class.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v2:
- Do the alignment changes due to the rename to MemoryAttributeManager
- Move the state_change() helper definition in this patch.
---
include/system/memory-attribute-manager.h | 20 +++
system/memory-attribute-manager.c | 148 ++++++++++++++++++++++
2 files changed, 168 insertions(+)
diff --git a/include/system/memory-attribute-manager.h b/include/system/memory-attribute-manager.h
index 72adc0028e..c3dab4e47b 100644
--- a/include/system/memory-attribute-manager.h
+++ b/include/system/memory-attribute-manager.h
@@ -34,8 +34,28 @@ struct MemoryAttributeManager {
struct MemoryAttributeManagerClass {
ObjectClass parent_class;
+
+ int (*state_change)(MemoryAttributeManager *mgr, uint64_t offset, uint64_t size,
+ bool shared_to_private);
};
+static inline int memory_attribute_manager_state_change(MemoryAttributeManager *mgr, uint64_t offset,
+ uint64_t size, bool shared_to_private)
+{
+ MemoryAttributeManagerClass *klass;
+
+ if (mgr == NULL) {
+ return 0;
+ }
+
+ klass = MEMORY_ATTRIBUTE_MANAGER_GET_CLASS(mgr);
+ if (klass->state_change) {
+ return klass->state_change(mgr, offset, size, shared_to_private);
+ }
+
+ return 0;
+}
+
int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr);
void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
index ed97e43dd0..17c70cf677 100644
--- a/system/memory-attribute-manager.c
+++ b/system/memory-attribute-manager.c
@@ -241,6 +241,151 @@ static void memory_attribute_rdm_replay_discarded(const RamDiscardManager *rdm,
memory_attribute_rdm_replay_discarded_cb);
}
+static bool memory_attribute_is_valid_range(MemoryAttributeManager *mgr,
+ uint64_t offset, uint64_t size)
+{
+ MemoryRegion *mr = mgr->mr;
+
+ g_assert(mr);
+
+ uint64_t region_size = memory_region_size(mr);
+ int block_size = memory_attribute_manager_get_block_size(mgr);
+
+ if (!QEMU_IS_ALIGNED(offset, block_size)) {
+ return false;
+ }
+ if (offset + size < offset || !size) {
+ return false;
+ }
+ if (offset >= region_size || offset + size > region_size) {
+ return false;
+ }
+ return true;
+}
+
+static void memory_attribute_notify_discard(MemoryAttributeManager *mgr,
+ uint64_t offset, uint64_t size)
+{
+ RamDiscardListener *rdl;
+
+ QLIST_FOREACH(rdl, &mgr->rdl_list, next) {
+ MemoryRegionSection tmp = *rdl->section;
+
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+ continue;
+ }
+
+ memory_attribute_for_each_populated_section(mgr, &tmp, rdl,
+ memory_attribute_notify_discard_cb);
+ }
+}
+
+static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
+ uint64_t offset, uint64_t size)
+{
+ RamDiscardListener *rdl, *rdl2;
+ int ret = 0;
+
+ QLIST_FOREACH(rdl, &mgr->rdl_list, next) {
+ MemoryRegionSection tmp = *rdl->section;
+
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+ continue;
+ }
+
+ ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl,
+ memory_attribute_notify_populate_cb);
+ if (ret) {
+ break;
+ }
+ }
+
+ if (ret) {
+ /* Notify all already-notified listeners. */
+ QLIST_FOREACH(rdl2, &mgr->rdl_list, next) {
+ MemoryRegionSection tmp = *rdl2->section;
+
+ if (rdl2 == rdl) {
+ break;
+ }
+ if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+ continue;
+ }
+
+ memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2,
+ memory_attribute_notify_discard_cb);
+ }
+ }
+ return ret;
+}
+
+static bool memory_attribute_is_range_populated(MemoryAttributeManager *mgr,
+ uint64_t offset, uint64_t size)
+{
+ int block_size = memory_attribute_manager_get_block_size(mgr);
+ const unsigned long first_bit = offset / block_size;
+ const unsigned long last_bit = first_bit + (size / block_size) - 1;
+ unsigned long found_bit;
+
+ /* We fake a shorter bitmap to avoid searching too far. */
+ found_bit = find_next_zero_bit(mgr->shared_bitmap, last_bit + 1, first_bit);
+ return found_bit > last_bit;
+}
+
+static bool memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
+ uint64_t offset, uint64_t size)
+{
+ int block_size = memory_attribute_manager_get_block_size(mgr);
+ const unsigned long first_bit = offset / block_size;
+ const unsigned long last_bit = first_bit + (size / block_size) - 1;
+ unsigned long found_bit;
+
+ /* We fake a shorter bitmap to avoid searching too far. */
+ found_bit = find_next_bit(mgr->shared_bitmap, last_bit + 1, first_bit);
+ return found_bit > last_bit;
+}
+
+static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
+ uint64_t size, bool shared_to_private)
+{
+ int block_size = memory_attribute_manager_get_block_size(mgr);
+ int ret = 0;
+
+ if (!memory_attribute_is_valid_range(mgr, offset, size)) {
+ error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
+ __func__, offset, size);
+ return -1;
+ }
+
+ if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
+ (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
+ return 0;
+ }
+
+ if (shared_to_private) {
+ memory_attribute_notify_discard(mgr, offset, size);
+ } else {
+ ret = memory_attribute_notify_populate(mgr, offset, size);
+ }
+
+ if (!ret) {
+ unsigned long first_bit = offset / block_size;
+ unsigned long nbits = size / block_size;
+
+ g_assert((first_bit + nbits) <= mgr->bitmap_size);
+
+ if (shared_to_private) {
+ bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+ } else {
+ bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+ }
+
+ return 0;
+ }
+
+ return ret;
+}
+
int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr)
{
uint64_t bitmap_size;
@@ -281,8 +426,11 @@ static void memory_attribute_manager_finalize(Object *obj)
static void memory_attribute_manager_class_init(ObjectClass *oc, void *data)
{
+ MemoryAttributeManagerClass *mamc = MEMORY_ATTRIBUTE_MANAGER_CLASS(oc);
RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
+ mamc->state_change = memory_attribute_state_change;
+
rdmc->get_min_granularity = memory_attribute_rdm_get_min_granularity;
rdmc->register_listener = memory_attribute_rdm_register_listener;
rdmc->unregister_listener = memory_attribute_rdm_unregister_listener;
--
2.43.5
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2 4/6] memory-attribute-manager: Introduce a callback to notify the shared/private state change
2025-02-17 8:18 ` [PATCH v2 4/6] memory-attribute-manager: Introduce a callback to notify the shared/private state change Chenyi Qiang
@ 2025-02-18 9:19 ` Alexey Kardashevskiy
2025-02-19 1:50 ` Chenyi Qiang
0 siblings, 1 reply; 18+ messages in thread
From: Alexey Kardashevskiy @ 2025-02-18 9:19 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
Li Xiaoyao
On 17/2/25 19:18, Chenyi Qiang wrote:
> Introduce a new state_change() callback in MemoryAttributeManagerClass to
> efficiently notify all registered RamDiscardListeners, including VFIO
> listeners about the memory conversion events in guest_memfd. The
> existing VFIO listener can dynamically DMA map/unmap the shared pages
> based on conversion types:
> - For conversions from shared to private, the VFIO system ensures the
> discarding of shared mapping from the IOMMU.
> - For conversions from private to shared, it triggers the population of
> the shared mapping into the IOMMU.
>
> Additionally, there could be some special conversion requests:
> - When a conversion request is made for a page already in the desired
> state, the helper simply returns success.
> - For requests involving a range partially in the desired state, only
> the necessary segments are converted, ensuring the entire range
> complies with the request efficiently.
> - In scenarios where a conversion request is declined by other systems,
> such as a failure from VFIO during notify_populate(), the helper will
> roll back the request, maintaining consistency.
>
> Opportunistically introduce a helper to trigger the state_change()
> callback of the class.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v2:
> - Do the alignment changes due to the rename to MemoryAttributeManager
> - Move the state_change() helper definition in this patch.
> ---
> include/system/memory-attribute-manager.h | 20 +++
> system/memory-attribute-manager.c | 148 ++++++++++++++++++++++
> 2 files changed, 168 insertions(+)
>
> diff --git a/include/system/memory-attribute-manager.h b/include/system/memory-attribute-manager.h
> index 72adc0028e..c3dab4e47b 100644
> --- a/include/system/memory-attribute-manager.h
> +++ b/include/system/memory-attribute-manager.h
> @@ -34,8 +34,28 @@ struct MemoryAttributeManager {
>
> struct MemoryAttributeManagerClass {
> ObjectClass parent_class;
> +
> + int (*state_change)(MemoryAttributeManager *mgr, uint64_t offset, uint64_t size,
> + bool shared_to_private);
> };
>
> +static inline int memory_attribute_manager_state_change(MemoryAttributeManager *mgr, uint64_t offset,
> + uint64_t size, bool shared_to_private)
> +{
> + MemoryAttributeManagerClass *klass;
> +
> + if (mgr == NULL) {
> + return 0;
> + }
> +
> + klass = MEMORY_ATTRIBUTE_MANAGER_GET_CLASS(mgr);
> + if (klass->state_change) {
> + return klass->state_change(mgr, offset, size, shared_to_private);
> + }
> +
> + return 0;
nit: MemoryAttributeManagerClass without this only callback defined
should produce some error imho. Or assert.
> +}
> +
> int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr);
> void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
>
> diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
> index ed97e43dd0..17c70cf677 100644
> --- a/system/memory-attribute-manager.c
> +++ b/system/memory-attribute-manager.c
> @@ -241,6 +241,151 @@ static void memory_attribute_rdm_replay_discarded(const RamDiscardManager *rdm,
> memory_attribute_rdm_replay_discarded_cb);
> }
>
> +static bool memory_attribute_is_valid_range(MemoryAttributeManager *mgr,
> + uint64_t offset, uint64_t size)
> +{
> + MemoryRegion *mr = mgr->mr;
> +
> + g_assert(mr);
> +
> + uint64_t region_size = memory_region_size(mr);
> + int block_size = memory_attribute_manager_get_block_size(mgr);
> +
> + if (!QEMU_IS_ALIGNED(offset, block_size)) {
> + return false;
> + }
> + if (offset + size < offset || !size) {
> + return false;
> + }
> + if (offset >= region_size || offset + size > region_size) {
> + return false;
> + }
> + return true;
> +}
> +
> +static void memory_attribute_notify_discard(MemoryAttributeManager *mgr,
> + uint64_t offset, uint64_t size)
> +{
> + RamDiscardListener *rdl;
> +
> + QLIST_FOREACH(rdl, &mgr->rdl_list, next) {
> + MemoryRegionSection tmp = *rdl->section;
> +
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> + continue;
> + }
> +
> + memory_attribute_for_each_populated_section(mgr, &tmp, rdl,
> + memory_attribute_notify_discard_cb);
> + }
> +}
> +
> +static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
> + uint64_t offset, uint64_t size)
> +{
> + RamDiscardListener *rdl, *rdl2;
> + int ret = 0;
> +
> + QLIST_FOREACH(rdl, &mgr->rdl_list, next) {
> + MemoryRegionSection tmp = *rdl->section;
> +
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> + continue;
> + }
> +
> + ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl,
> + memory_attribute_notify_populate_cb);
> + if (ret) {
> + break;
> + }
> + }
> +
> + if (ret) {
> + /* Notify all already-notified listeners. */
> + QLIST_FOREACH(rdl2, &mgr->rdl_list, next) {
> + MemoryRegionSection tmp = *rdl2->section;
> +
> + if (rdl2 == rdl) {
> + break;
> + }
> + if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> + continue;
> + }
> +
> + memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2,
> + memory_attribute_notify_discard_cb);
> + }
> + }
> + return ret;
> +}
> +
> +static bool memory_attribute_is_range_populated(MemoryAttributeManager *mgr,
> + uint64_t offset, uint64_t size)
> +{
> + int block_size = memory_attribute_manager_get_block_size(mgr);
> + const unsigned long first_bit = offset / block_size;
> + const unsigned long last_bit = first_bit + (size / block_size) - 1;
> + unsigned long found_bit;
> +
> + /* We fake a shorter bitmap to avoid searching too far. */
> + found_bit = find_next_zero_bit(mgr->shared_bitmap, last_bit + 1, first_bit);
> + return found_bit > last_bit;
> +}
> +
> +static bool memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
> + uint64_t offset, uint64_t size)
> +{
> + int block_size = memory_attribute_manager_get_block_size(mgr);
> + const unsigned long first_bit = offset / block_size;
> + const unsigned long last_bit = first_bit + (size / block_size) - 1;
> + unsigned long found_bit;
> +
> + /* We fake a shorter bitmap to avoid searching too far. */
Weird comment imho, why is it a "fake"? You check if all pages within
[offset, offset+size) are discarded. You do not want to search beyond
the end of this range anyway, right?
> + found_bit = find_next_bit(mgr->shared_bitmap, last_bit + 1, first_bit);
> + return found_bit > last_bit;
> +}
> +
> +static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
> + uint64_t size, bool shared_to_private)
Elsewhere it is just "to_private".
> +{
> + int block_size = memory_attribute_manager_get_block_size(mgr);
> + int ret = 0;
> +
> + if (!memory_attribute_is_valid_range(mgr, offset, size)) {
> + error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
> + __func__, offset, size);
> + return -1;
> + }
> +
> + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
> + (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
> + return 0;
> + }
> +
> + if (shared_to_private) {
> + memory_attribute_notify_discard(mgr, offset, size);
> + } else {
> + ret = memory_attribute_notify_populate(mgr, offset, size);
> + }
> +
> + if (!ret) {
> + unsigned long first_bit = offset / block_size;
> + unsigned long nbits = size / block_size;
> +
> + g_assert((first_bit + nbits) <= mgr->bitmap_size);
> +
> + if (shared_to_private) {
> + bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
> + } else {
> + bitmap_set(mgr->shared_bitmap, first_bit, nbits);
> + }
> +
> + return 0;
Do not need this return. Thanks,
> + }
> +
> + return ret;
> +}
> +
> int memory_attribute_manager_realize(MemoryAttributeManager *mgr, MemoryRegion *mr)
> {
> uint64_t bitmap_size;
> @@ -281,8 +426,11 @@ static void memory_attribute_manager_finalize(Object *obj)
>
> static void memory_attribute_manager_class_init(ObjectClass *oc, void *data)
> {
> + MemoryAttributeManagerClass *mamc = MEMORY_ATTRIBUTE_MANAGER_CLASS(oc);
> RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>
> + mamc->state_change = memory_attribute_state_change;
> +
> rdmc->get_min_granularity = memory_attribute_rdm_get_min_granularity;
> rdmc->register_listener = memory_attribute_rdm_register_listener;
> rdmc->unregister_listener = memory_attribute_rdm_unregister_listener;
--
Alexey
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 4/6] memory-attribute-manager: Introduce a callback to notify the shared/private state change
2025-02-18 9:19 ` Alexey Kardashevskiy
@ 2025-02-19 1:50 ` Chenyi Qiang
0 siblings, 0 replies; 18+ messages in thread
From: Chenyi Qiang @ 2025-02-19 1:50 UTC (permalink / raw)
To: Alexey Kardashevskiy, David Hildenbrand, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
Li Xiaoyao
On 2/18/2025 5:19 PM, Alexey Kardashevskiy wrote:
>
>
> On 17/2/25 19:18, Chenyi Qiang wrote:
>> Introduce a new state_change() callback in MemoryAttributeManagerClass to
>> efficiently notify all registered RamDiscardListeners, including VFIO
>> listeners about the memory conversion events in guest_memfd. The
>> existing VFIO listener can dynamically DMA map/unmap the shared pages
>> based on conversion types:
>> - For conversions from shared to private, the VFIO system ensures the
>> discarding of shared mapping from the IOMMU.
>> - For conversions from private to shared, it triggers the population of
>> the shared mapping into the IOMMU.
>>
>> Additionally, there could be some special conversion requests:
>> - When a conversion request is made for a page already in the desired
>> state, the helper simply returns success.
>> - For requests involving a range partially in the desired state, only
>> the necessary segments are converted, ensuring the entire range
>> complies with the request efficiently.
>> - In scenarios where a conversion request is declined by other systems,
>> such as a failure from VFIO during notify_populate(), the helper will
>> roll back the request, maintaining consistency.
>>
>> Opportunistically introduce a helper to trigger the state_change()
>> callback of the class.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Changes in v2:
>> - Do the alignment changes due to the rename to
>> MemoryAttributeManager
>> - Move the state_change() helper definition in this patch.
>> ---
>> include/system/memory-attribute-manager.h | 20 +++
>> system/memory-attribute-manager.c | 148 ++++++++++++++++++++++
>> 2 files changed, 168 insertions(+)
>>
>> diff --git a/include/system/memory-attribute-manager.h b/include/
>> system/memory-attribute-manager.h
>> index 72adc0028e..c3dab4e47b 100644
>> --- a/include/system/memory-attribute-manager.h
>> +++ b/include/system/memory-attribute-manager.h
>> @@ -34,8 +34,28 @@ struct MemoryAttributeManager {
>> struct MemoryAttributeManagerClass {
>> ObjectClass parent_class;
>> +
>> + int (*state_change)(MemoryAttributeManager *mgr, uint64_t offset,
>> uint64_t size,
>> + bool shared_to_private);
>> };
>> +static inline int
>> memory_attribute_manager_state_change(MemoryAttributeManager *mgr,
>> uint64_t offset,
>> + uint64_t
>> size, bool shared_to_private)
>> +{
>> + MemoryAttributeManagerClass *klass;
>> +
>> + if (mgr == NULL) {
>> + return 0;
>> + }
>> +
>> + klass = MEMORY_ATTRIBUTE_MANAGER_GET_CLASS(mgr);
>> + if (klass->state_change) {
>> + return klass->state_change(mgr, offset, size,
>> shared_to_private);
>> + }
>> +
>> + return 0;
>
>
> nit: MemoryAttributeManagerClass without this only callback defined
> should produce some error imho. Or assert.
Nice catch. Will return error if !klass->state_change.
>
>> +}
>> +
>> int memory_attribute_manager_realize(MemoryAttributeManager *mgr,
>> MemoryRegion *mr);
>> void memory_attribute_manager_unrealize(MemoryAttributeManager *mgr);
[..]
>> +
>> +static bool
>> memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
>> + uint64_t offset,
>> uint64_t size)
>> +{
>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>> + const unsigned long first_bit = offset / block_size;
>> + const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> + unsigned long found_bit;
>> +
>> + /* We fake a shorter bitmap to avoid searching too far. */
>
> Weird comment imho, why is it a "fake"? You check if all pages within
> [offset, offset+size) are discarded. You do not want to search beyond
> the end of this range anyway, right?
Yes. And I think the "fake" is aimed to describe the inconsistency with
the definition of find_next_bit(). find_next_bit() defines the second
argument as "The bitmap size in bits" but "last_bit + 1" is not the size
of shared_bitmap.
>
>> + found_bit = find_next_bit(mgr->shared_bitmap, last_bit + 1,
>> first_bit);
>> + return found_bit > last_bit;
>> +}
>> +
>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr,
>> uint64_t offset,
>> + uint64_t size, bool
>> shared_to_private)
>
> Elsewhere it is just "to_private".
I'm OK to change it to "to_private" to keep alignment.
>
>> +{
>> + int block_size = memory_attribute_manager_get_block_size(mgr);
>> + int ret = 0;
>> +
>> + if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>> + error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>> + __func__, offset, size);
>> + return -1;
>> + }
>> +
>> + if ((shared_to_private &&
>> memory_attribute_is_range_discarded(mgr, offset, size)) ||
>> + (!shared_to_private &&
>> memory_attribute_is_range_populated(mgr, offset, size))) {
>> + return 0;
>> + }
>> +
>> + if (shared_to_private) {
>> + memory_attribute_notify_discard(mgr, offset, size);
>> + } else {
>> + ret = memory_attribute_notify_populate(mgr, offset, size);
>> + }
>> +
>> + if (!ret) {
>> + unsigned long first_bit = offset / block_size;
>> + unsigned long nbits = size / block_size;
>> +
>> + g_assert((first_bit + nbits) <= mgr->bitmap_size);
>> +
>> + if (shared_to_private) {
>> + bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>> + } else {
>> + bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>> + }
>> +
>> + return 0;
>
> Do not need this return. Thanks,
Removed. Thanks!
>
>> + }
>> +
>> + return ret;
>> +}
>> +
>> int memory_attribute_manager_realize(MemoryAttributeManager *mgr,
>> MemoryRegion *mr)
>> {
>> uint64_t bitmap_size;
>> @@ -281,8 +426,11 @@ static void
>> memory_attribute_manager_finalize(Object *obj)
>> static void memory_attribute_manager_class_init(ObjectClass *oc,
>> void *data)
>> {
>> + MemoryAttributeManagerClass *mamc =
>> MEMORY_ATTRIBUTE_MANAGER_CLASS(oc);
>> RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>> + mamc->state_change = memory_attribute_state_change;
>> +
>> rdmc->get_min_granularity =
>> memory_attribute_rdm_get_min_granularity;
>> rdmc->register_listener = memory_attribute_rdm_register_listener;
>> rdmc->unregister_listener =
>> memory_attribute_rdm_unregister_listener;
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 5/6] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
2025-02-17 8:18 [PATCH v2 0/6] Enable shared device assignment Chenyi Qiang
` (3 preceding siblings ...)
2025-02-17 8:18 ` [PATCH v2 4/6] memory-attribute-manager: Introduce a callback to notify the shared/private state change Chenyi Qiang
@ 2025-02-17 8:18 ` Chenyi Qiang
2025-02-18 9:19 ` Alexey Kardashevskiy
2025-02-17 8:18 ` [PATCH v2 6/6] RAMBlock: Make guest_memfd require coordinate discard Chenyi Qiang
5 siblings, 1 reply; 18+ messages in thread
From: Chenyi Qiang @ 2025-02-17 8:18 UTC (permalink / raw)
To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun, Li Xiaoyao
Introduce a new field, memory_attribute_manager, in RAMBlock to link to
an MemoryAttributeManager object. This change centralizes all
guest_memfd state information (like fd and shared_bitmap) within a
RAMBlock, making it easier to manage.
Use the realize()/unrealize() helpers to initialize/uninitialize the
MemoryAttributeManager object. Register/unregister the object in the
target RAMBlock's MemoryRegion when creating guest_memfd. Upon memory
state changes in kvm_convert_memory(), invoke the
memory_attribute_manager_state_change() helper to notify the registered
RamDiscardListener.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v2:
- Introduce a new field memory_attribute_manager in RAMBlock.
- Move the state_change() handling during page conversion in this patch.
- Undo what we did if it fails to set.
- Change the order of close(guest_memfd) and memory_attribute_manager cleanup.
---
accel/kvm/kvm-all.c | 9 +++++++++
include/exec/ramblock.h | 2 ++
system/physmem.c | 13 +++++++++++++
3 files changed, 24 insertions(+)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c1fea69d58..c0d15c48ad 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -48,6 +48,7 @@
#include "kvm-cpus.h"
#include "system/dirtylimit.h"
#include "qemu/range.h"
+#include "system/memory-attribute-manager.h"
#include "hw/boards.h"
#include "system/stats.h"
@@ -3088,6 +3089,14 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
rb = qemu_ram_block_from_host(addr, false, &offset);
+ ret = memory_attribute_manager_state_change(MEMORY_ATTRIBUTE_MANAGER(mr->rdm),
+ offset, size, to_private);
+ if (ret) {
+ warn_report("Failed to notify the listener the state change of "
+ "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
+ start, size, to_private ? "private" : "shared");
+ }
+
if (to_private) {
if (rb->page_size != qemu_real_host_page_size()) {
/*
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 0babd105c0..06fd365326 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -23,6 +23,7 @@
#include "cpu-common.h"
#include "qemu/rcu.h"
#include "exec/ramlist.h"
+#include "system/memory-attribute-manager.h"
struct RAMBlock {
struct rcu_head rcu;
@@ -42,6 +43,7 @@ struct RAMBlock {
int fd;
uint64_t fd_offset;
int guest_memfd;
+ MemoryAttributeManager *memory_attribute_manager;
size_t page_size;
/* dirty bitmap used during migration */
unsigned long *bmap;
diff --git a/system/physmem.c b/system/physmem.c
index c76503aea8..0ed394c5d2 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -54,6 +54,7 @@
#include "system/hostmem.h"
#include "system/hw_accel.h"
#include "system/xen-mapcache.h"
+#include "system/memory-attribute-manager.h"
#include "trace.h"
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -1885,6 +1886,16 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
qemu_mutex_unlock_ramlist();
goto out_free;
}
+
+ new_block->memory_attribute_manager = MEMORY_ATTRIBUTE_MANAGER(object_new(TYPE_MEMORY_ATTRIBUTE_MANAGER));
+ if (memory_attribute_manager_realize(new_block->memory_attribute_manager, new_block->mr)) {
+ error_setg(errp, "Failed to realize memory attribute manager");
+ object_unref(OBJECT(new_block->memory_attribute_manager));
+ close(new_block->guest_memfd);
+ ram_block_discard_require(false);
+ qemu_mutex_unlock_ramlist();
+ goto out_free;
+ }
}
ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
@@ -2138,6 +2149,8 @@ static void reclaim_ramblock(RAMBlock *block)
}
if (block->guest_memfd >= 0) {
+ memory_attribute_manager_unrealize(block->memory_attribute_manager);
+ object_unref(OBJECT(block->memory_attribute_manager));
close(block->guest_memfd);
ram_block_discard_require(false);
}
--
2.43.5
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2 5/6] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks
2025-02-17 8:18 ` [PATCH v2 5/6] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks Chenyi Qiang
@ 2025-02-18 9:19 ` Alexey Kardashevskiy
0 siblings, 0 replies; 18+ messages in thread
From: Alexey Kardashevskiy @ 2025-02-18 9:19 UTC (permalink / raw)
To: Chenyi Qiang, David Hildenbrand, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: qemu-devel, kvm, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun,
Li Xiaoyao
On 17/2/25 19:18, Chenyi Qiang wrote:
> Introduce a new field, memory_attribute_manager, in RAMBlock to link to
> an MemoryAttributeManager object. This change centralizes all
> guest_memfd state information (like fd and shared_bitmap) within a
> RAMBlock, making it easier to manage.
>
> Use the realize()/unrealize() helpers to initialize/uninitialize the
> MemoryAttributeManager object. Register/unregister the object in the
> target RAMBlock's MemoryRegion when creating guest_memfd. Upon memory
> state changes in kvm_convert_memory(), invoke the
> memory_attribute_manager_state_change() helper to notify the registered
> RamDiscardListener.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
> ---
> Changes in v2:
> - Introduce a new field memory_attribute_manager in RAMBlock.
> - Move the state_change() handling during page conversion in this patch.
> - Undo what we did if it fails to set.
> - Change the order of close(guest_memfd) and memory_attribute_manager cleanup.
> ---
> accel/kvm/kvm-all.c | 9 +++++++++
> include/exec/ramblock.h | 2 ++
> system/physmem.c | 13 +++++++++++++
> 3 files changed, 24 insertions(+)
>
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index c1fea69d58..c0d15c48ad 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -48,6 +48,7 @@
> #include "kvm-cpus.h"
> #include "system/dirtylimit.h"
> #include "qemu/range.h"
> +#include "system/memory-attribute-manager.h"
>
> #include "hw/boards.h"
> #include "system/stats.h"
> @@ -3088,6 +3089,14 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
> rb = qemu_ram_block_from_host(addr, false, &offset);
>
> + ret = memory_attribute_manager_state_change(MEMORY_ATTRIBUTE_MANAGER(mr->rdm),
> + offset, size, to_private);
> + if (ret) {
> + warn_report("Failed to notify the listener the state change of "
> + "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
> + start, size, to_private ? "private" : "shared");
> + }
> +
> if (to_private) {
> if (rb->page_size != qemu_real_host_page_size()) {
> /*
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index 0babd105c0..06fd365326 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -23,6 +23,7 @@
> #include "cpu-common.h"
> #include "qemu/rcu.h"
> #include "exec/ramlist.h"
> +#include "system/memory-attribute-manager.h"
>
> struct RAMBlock {
> struct rcu_head rcu;
> @@ -42,6 +43,7 @@ struct RAMBlock {
> int fd;
> uint64_t fd_offset;
> int guest_memfd;
> + MemoryAttributeManager *memory_attribute_manager;
> size_t page_size;
> /* dirty bitmap used during migration */
> unsigned long *bmap;
> diff --git a/system/physmem.c b/system/physmem.c
> index c76503aea8..0ed394c5d2 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -54,6 +54,7 @@
> #include "system/hostmem.h"
> #include "system/hw_accel.h"
> #include "system/xen-mapcache.h"
> +#include "system/memory-attribute-manager.h"
> #include "trace.h"
>
> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> @@ -1885,6 +1886,16 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> qemu_mutex_unlock_ramlist();
> goto out_free;
> }
> +
> + new_block->memory_attribute_manager = MEMORY_ATTRIBUTE_MANAGER(object_new(TYPE_MEMORY_ATTRIBUTE_MANAGER));
> + if (memory_attribute_manager_realize(new_block->memory_attribute_manager, new_block->mr)) {
> + error_setg(errp, "Failed to realize memory attribute manager");
> + object_unref(OBJECT(new_block->memory_attribute_manager));
> + close(new_block->guest_memfd);
> + ram_block_discard_require(false);
> + qemu_mutex_unlock_ramlist();
> + goto out_free;
> + } > }
>
> ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
> @@ -2138,6 +2149,8 @@ static void reclaim_ramblock(RAMBlock *block)
> }
>
> if (block->guest_memfd >= 0) {
> + memory_attribute_manager_unrealize(block->memory_attribute_manager);
> + object_unref(OBJECT(block->memory_attribute_manager));
> close(block->guest_memfd);
> ram_block_discard_require(false);
> }
--
Alexey
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 6/6] RAMBlock: Make guest_memfd require coordinate discard
2025-02-17 8:18 [PATCH v2 0/6] Enable shared device assignment Chenyi Qiang
` (4 preceding siblings ...)
2025-02-17 8:18 ` [PATCH v2 5/6] memory: Attach MemoryAttributeManager to guest_memfd-backed RAMBlocks Chenyi Qiang
@ 2025-02-17 8:18 ` Chenyi Qiang
5 siblings, 0 replies; 18+ messages in thread
From: Chenyi Qiang @ 2025-02-17 8:18 UTC (permalink / raw)
To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Peng Chao P,
Gao Chao, Xu Yilun, Li Xiaoyao
As guest_memfd is now managed by memory_attribute_manager with
RamDiscardManager, only block uncoordinated discard.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v2:
- Change the ram_block_discard_require(false) to
ram_block_coordinated_discard_require(false).
---
system/physmem.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/system/physmem.c b/system/physmem.c
index 0ed394c5d2..a30cdd43ee 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1872,7 +1872,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
assert(kvm_enabled());
assert(new_block->guest_memfd < 0);
- ret = ram_block_discard_require(true);
+ ret = ram_block_coordinated_discard_require(true);
if (ret < 0) {
error_setg_errno(errp, -ret,
"cannot set up private guest memory: discard currently blocked");
@@ -1892,7 +1892,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
error_setg(errp, "Failed to realize memory attribute manager");
object_unref(OBJECT(new_block->memory_attribute_manager));
close(new_block->guest_memfd);
- ram_block_discard_require(false);
+ ram_block_coordinated_discard_require(false);
qemu_mutex_unlock_ramlist();
goto out_free;
}
@@ -2152,7 +2152,7 @@ static void reclaim_ramblock(RAMBlock *block)
memory_attribute_manager_unrealize(block->memory_attribute_manager);
object_unref(OBJECT(block->memory_attribute_manager));
close(block->guest_memfd);
- ram_block_discard_require(false);
+ ram_block_coordinated_discard_require(false);
}
g_free(block);
--
2.43.5
^ permalink raw reply related [flat|nested] 18+ messages in thread