* [PATCH v7 0/5] Enable shared device assignment
@ 2025-06-12  8:27 Chenyi Qiang
  2025-06-12  8:27 ` [PATCH v7 1/5] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Chenyi Qiang @ 2025-06-12  8:27 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao, Cédric Le Goater,
	Alex Williamson

This is the v7 series of the shared device assignment support.

There are not many differences compared to the previous version [1]. The
primary changes include some code cleanup in patch #4 and the addition
of a new comment in patch #5 to describe the brief period of status
consistency.

Overview of this series:

- Patch 1-3: Preparation patches. These include function exposure and
  some function prototype changes.
- Patch 4: Introduce a new object to implement RamDiscardManager
  interface and a helper to notify the shared/private state change.
- Patch 5: Enable coordinated discarding of RAM with guest_memfd through
  the RamDiscardManager interface.

---

Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared and private at runtime.

"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. In the current implementation, shared memory is allocated
with normal methods (e.g. mmap or fallocate) while private memory is
allocated from guest_memfd. When a VM performs memory conversions, QEMU
frees pages via madvise or PUNCH_HOLE on the memfd or guest_memfd on one
side, and allocates new pages from the other side. This causes the stale
IOMMU mapping issue mentioned in [2] when we try to enable shared device
assignment in confidential VMs.

Solution
========
The key to enable shared device assignment is to update the IOMMU mappings
on page conversion. RamDiscardManager, an existing interface currently
utilized by virtio-mem, offers a means to modify IOMMU mappings in
accordance with VM page assignment. A page conversion is similar to
hot-removing a page in one mode and adding it back in the other.

This series implements a RamDiscardManager for confidential VMs and
utilizes its infrastructure to notify VFIO of page conversions.
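
As a rough illustration of the flow (the listener callbacks below are
simplified placeholders with made-up names; VFIO already implements such a
listener for virtio-mem and reuses it unchanged here):

/* Listener side: VFIO registers a RamDiscardListener per section. */
static int listener_notify_populate(RamDiscardListener *rdl,
                                    MemoryRegionSection *section)
{
    /* Range became shared: establish the IOMMU mapping (DMA map). */
    return 0;
}

static void listener_notify_discard(RamDiscardListener *rdl,
                                    MemoryRegionSection *section)
{
    /* Range became private: tear down the IOMMU mapping (DMA unmap). */
}

On a conversion request, the RamDiscardManager implementation walks the
registered listeners, invoking notify_populate() for private-to-shared
conversions and notify_discard() for shared-to-private conversions, so the
IOMMU mappings always follow the shared (DMA-able) memory.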

Limitation and future extension
===============================
This series only supports the basic shared device assignment functionality.
There are still some limitations and areas that can be extended or
optimized in the future.

Relationship with in-place conversion
-------------------------------------
In-place page conversion is ongoing work to allow mmap() of guest_memfd
into userspace so that both private and shared memory can use the same
physical memory as the backend. This new design eliminates the need to
discard pages during shared/private conversions. When it is ready, shared
device assignment will need to be adjusted to follow an
unmap-before-conversion-to-private and map-after-conversion-to-shared
sequence to remain compatible with the change.

Partial unmap limitation
------------------------
VFIO expects the DMA mapping for a specific IOVA to be mapped and
unmapped with the same granularity. The guest may perform partial
conversion, such as converting a small region within a larger one. To
prevent such invalid cases, current operations are performed with 4K
granularity. This could be optimized once the DMA mapping cut operation
[3] is introduced in the future: we could always perform a
split-before-unmap when a partial conversion happens. If the split
succeeds, the unmap will succeed and be atomic; if the split fails, the
unmap fails.

More attributes management
--------------------------
The current RamDiscardManager can only manage a pair of opposite states,
such as populated/discarded or shared/private. If more states need to be
considered, for example to support virtio-mem in confidential VMs, three
states would be possible (shared populated/private populated/discarded).
The current framework cannot handle such a scenario, and a new framework
would need to be designed at that point [4].

Memory overhead optimization
----------------------------
A comment from Baolu [5] suggests considering a Maple Tree or a generic
interval tree to manage the private/shared state instead of a bitmap,
which can reduce memory consumption. This optimization could also be
considered for other bitmap use cases, such as dirty bitmaps for guest
RAM.
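
For a rough sense of the current overhead (simple arithmetic, assuming the
4K tracking granularity used by this series): the bitmap uses one bit per
host page, so

    16 GiB of guest RAM / 4 KiB per bit = 4,194,304 bits ~= 512 KiB

of bitmap per guest_memfd-backed RAMBlock, which an interval tree could
shrink considerably when the shared/private state is not fragmented.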

Testing
=======
This patch series is tested against kernel v6.16-rc1, since TDX base
support has been merged. The QEMU repo is available at
QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-06-12-v2

To exercise shared device assignment with a NIC, use the legacy type1
VFIO backend with the QEMU command:

qemu-system-x86_64 [...]
    -device vfio-pci,host=XX:XX.X

The dma_entry_limit parameter needs to be raised, because mappings are
created at 4K granularity (one DMA entry per 4K page). For example, a
16GB guest needs 16 GiB / 4 KiB = 4194304 entries, i.e.
vfio_iommu_type1.dma_entry_limit=4194304.

To use the iommufd-backed VFIO instead, use the QEMU command:

qemu-system-x86_64 [...]
    -object iommufd,id=iommufd0 \
    -device vfio-pci,host=XX:XX.X,iommufd=iommufd0

Related link
============
[1] https://lore.kernel.org/qemu-devel/20250530083256.105186-1-chenyi.qiang@intel.com/
[2] https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonzini@redhat.com/
[3] https://lore.kernel.org/linux-iommu/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com/
[4] https://lore.kernel.org/qemu-devel/d1a71e00-243b-4751-ab73-c05a4e090d58@redhat.com/
[5] https://lore.kernel.org/qemu-devel/013b36a9-9310-4073-b54c-9c511f23decf@linux.intel.com/

Chenyi Qiang (5):
  memory: Export a helper to get intersection of a MemoryRegionSection
    with a given range
  memory: Change memory_region_set_ram_discard_manager() to return the
    result
  memory: Unify the definition of ReplayRamPopulate() and
    ReplayRamDiscard()
  ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock
    with guest_memfd
  physmem: Support coordinated discarding of RAM with guest_memfd

 MAINTAINERS                   |   1 +
 accel/kvm/kvm-all.c           |   9 +
 hw/virtio/virtio-mem.c        |  83 +++----
 include/system/memory.h       | 100 ++++++--
 include/system/ramblock.h     |  22 ++
 migration/ram.c               |   5 +-
 system/memory.c               |  22 +-
 system/meson.build            |   1 +
 system/physmem.c              |  23 +-
 system/ram-block-attributes.c | 442 ++++++++++++++++++++++++++++++++++
 system/trace-events           |   3 +
 11 files changed, 627 insertions(+), 84 deletions(-)
 create mode 100644 system/ram-block-attributes.c

-- 
2.43.5



* [PATCH v7 1/5] memory: Export a helper to get intersection of a MemoryRegionSection with a given range
  2025-06-12  8:27 [PATCH v7 0/5] Enable shared device assignment Chenyi Qiang
@ 2025-06-12  8:27 ` Chenyi Qiang
  2025-06-12  8:27 ` [PATCH v7 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Chenyi Qiang @ 2025-06-12  8:27 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao, Cédric Le Goater,
	Alex Williamson

Rename the helper to memory_region_section_intersect_range() to make it
more generic. Meanwhile, define @end as an Int128 and replace the related
operations with the int128_*() API since the helper is now exported as a
wider API.
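
For illustration, a minimal sketch of how a RamDiscardManager
implementation can use the now-exported helper to clip a listener's
section to a given range before notifying it (this mirrors the virtio-mem
call sites updated below; notify_one_range() is a made-up name for the
example):

static int notify_one_range(RamDiscardListener *rdl, uint64_t offset,
                            uint64_t size)
{
    MemoryRegionSection tmp = *rdl->section;

    if (!memory_region_section_intersect_range(&tmp, offset, size)) {
        /* No overlap with this listener's section. */
        return 0;
    }
    return rdl->notify_populate(rdl, &tmp);
}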

Suggested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v7:
    - Add Reviewed-by from Xiaoyao and Pankaj

Changes in v6:
    - No change.

Changes in v5:
    - Indent change for int128 ops to avoid lines over 80 characters
    - Add two Reviewed-by from Alexey and Zhao
---
 hw/virtio/virtio-mem.c  | 32 +++++---------------------------
 include/system/memory.h | 30 ++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index a3d1a676e7..b3c126ea1e 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -244,28 +244,6 @@ static int virtio_mem_for_each_plugged_range(VirtIOMEM *vmem, void *arg,
     return ret;
 }
 
-/*
- * Adjust the memory section to cover the intersection with the given range.
- *
- * Returns false if the intersection is empty, otherwise returns true.
- */
-static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
-                                                uint64_t offset, uint64_t size)
-{
-    uint64_t start = MAX(s->offset_within_region, offset);
-    uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
-                       offset + size);
-
-    if (end <= start) {
-        return false;
-    }
-
-    s->offset_within_address_space += start - s->offset_within_region;
-    s->offset_within_region = start;
-    s->size = int128_make64(end - start);
-    return true;
-}
-
 typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void *arg);
 
 static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
@@ -287,7 +265,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
                                       first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -319,7 +297,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
                                  first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -355,7 +333,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
         rdl->notify_discard(rdl, &tmp);
@@ -371,7 +349,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
         ret = rdl->notify_populate(rdl, &tmp);
@@ -388,7 +366,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
             if (rdl2 == rdl) {
                 break;
             }
-            if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                 continue;
             }
             rdl2->notify_discard(rdl2, &tmp);
diff --git a/include/system/memory.h b/include/system/memory.h
index 0848690ea4..da97753e28 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -1211,6 +1211,36 @@ MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s);
  */
 void memory_region_section_free_copy(MemoryRegionSection *s);
 
+/**
+ * memory_region_section_intersect_range: Adjust the memory section to cover
+ * the intersection with the given range.
+ *
+ * @s: the #MemoryRegionSection to be adjusted
+ * @offset: the offset of the given range in the memory region
+ * @size: the size of the given range
+ *
+ * Returns false if the intersection is empty, otherwise returns true.
+ */
+static inline bool memory_region_section_intersect_range(MemoryRegionSection *s,
+                                                         uint64_t offset,
+                                                         uint64_t size)
+{
+    uint64_t start = MAX(s->offset_within_region, offset);
+    Int128 end = int128_min(int128_add(int128_make64(s->offset_within_region),
+                                       s->size),
+                            int128_add(int128_make64(offset),
+                                       int128_make64(size)));
+
+    if (int128_le(end, int128_make64(start))) {
+        return false;
+    }
+
+    s->offset_within_address_space += start - s->offset_within_region;
+    s->offset_within_region = start;
+    s->size = int128_sub(end, int128_make64(start));
+    return true;
+}
+
 /**
  * memory_region_init: Initialize a memory region
  *
-- 
2.43.5



* [PATCH v7 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result
  2025-06-12  8:27 [PATCH v7 0/5] Enable shared device assignment Chenyi Qiang
  2025-06-12  8:27 ` [PATCH v7 1/5] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
@ 2025-06-12  8:27 ` Chenyi Qiang
  2025-06-12  8:27 ` [PATCH v7 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Chenyi Qiang @ 2025-06-12  8:27 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao, Cédric Le Goater,
	Alex Williamson

Modify memory_region_set_ram_discard_manager() to return -EBUSY if a
RamDiscardManager is already set in the MemoryRegion. The caller must
handle this failure, such as having virtio-mem undo its actions and fail
the realize() process. Opportunistically move the call earlier to avoid
complex error handling.

This change is beneficial when introducing a new RamDiscardManager
instance besides virtio-mem. After
ram_block_coordinated_discard_require(true) unlocks all
RamDiscardManager instances, only one instance is allowed to be set for
one MemoryRegion at present.
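
For illustration, the caller contract becomes roughly the following (a
simplified sketch; the virtio-mem realize() hunk below shows the real
error handling and undo path):

    if (memory_region_set_ram_discard_manager(mr, rdm)) {
        error_setg(errp, "Failed to set RamDiscardManager");
        /* Undo any earlier setup and fail realize(). */
        return;
    }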

Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Tested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v7:
    - Add Reviewed-by from Pankaj, Alexey and Xiaoyao

Changes in v6:
    - Add Reviewed-by from David.

Changes in v5:
    - Nit in commit message (return false -> -EBUSY)
    - Add set_ram_discard_manager(NULL) when ram_block_discard_range()
      fails.

Changes in v3:
    - Move set_ram_discard_manager() up to avoid a g_free()
    - Clean up set_ram_discard_manager() definition
---
 hw/virtio/virtio-mem.c  | 30 +++++++++++++++++-------------
 include/system/memory.h |  6 +++---
 system/memory.c         | 10 +++++++---
 3 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index b3c126ea1e..2e491e8c44 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1047,6 +1047,17 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    /*
+     * Set ourselves as RamDiscardManager before the plug handler maps the
+     * memory region and exposes it via an address space.
+     */
+    if (memory_region_set_ram_discard_manager(&vmem->memdev->mr,
+                                              RAM_DISCARD_MANAGER(vmem))) {
+        error_setg(errp, "Failed to set RamDiscardManager");
+        ram_block_coordinated_discard_require(false);
+        return;
+    }
+
     /*
      * We don't know at this point whether shared RAM is migrated using
      * QEMU or migrated using the file content. "x-ignore-shared" will be
@@ -1061,6 +1072,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
         if (ret) {
             error_setg_errno(errp, -ret, "Unexpected error discarding RAM");
+            memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
             ram_block_coordinated_discard_require(false);
             return;
         }
@@ -1122,13 +1134,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
     vmem->system_reset = VIRTIO_MEM_SYSTEM_RESET(obj);
     vmem->system_reset->vmem = vmem;
     qemu_register_resettable(obj);
-
-    /*
-     * Set ourselves as RamDiscardManager before the plug handler maps the
-     * memory region and exposes it via an address space.
-     */
-    memory_region_set_ram_discard_manager(&vmem->memdev->mr,
-                                          RAM_DISCARD_MANAGER(vmem));
 }
 
 static void virtio_mem_device_unrealize(DeviceState *dev)
@@ -1136,12 +1141,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOMEM *vmem = VIRTIO_MEM(dev);
 
-    /*
-     * The unplug handler unmapped the memory region, it cannot be
-     * found via an address space anymore. Unset ourselves.
-     */
-    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
-
     qemu_unregister_resettable(OBJECT(vmem->system_reset));
     object_unref(OBJECT(vmem->system_reset));
 
@@ -1154,6 +1153,11 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     virtio_del_queue(vdev, 0);
     virtio_cleanup(vdev);
     g_free(vmem->bitmap);
+    /*
+     * The unplug handler unmapped the memory region, it cannot be
+     * found via an address space anymore. Unset ourselves.
+     */
+    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
     ram_block_coordinated_discard_require(false);
 }
 
diff --git a/include/system/memory.h b/include/system/memory.h
index da97753e28..60983d4977 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2499,13 +2499,13 @@ static inline bool memory_region_has_ram_discard_manager(MemoryRegion *mr)
  *
  * This function must not be called for a mapped #MemoryRegion, a #MemoryRegion
  * that does not cover RAM, or a #MemoryRegion that already has a
- * #RamDiscardManager assigned.
+ * #RamDiscardManager assigned. Return 0 if the rdm is set successfully.
  *
  * @mr: the #MemoryRegion
  * @rdm: #RamDiscardManager to set
  */
-void memory_region_set_ram_discard_manager(MemoryRegion *mr,
-                                           RamDiscardManager *rdm);
+int memory_region_set_ram_discard_manager(MemoryRegion *mr,
+                                          RamDiscardManager *rdm);
 
 /**
  * memory_region_find: translate an address/size relative to a
diff --git a/system/memory.c b/system/memory.c
index 306e9ff9eb..d0c186e9f6 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2106,12 +2106,16 @@ RamDiscardManager *memory_region_get_ram_discard_manager(MemoryRegion *mr)
     return mr->rdm;
 }
 
-void memory_region_set_ram_discard_manager(MemoryRegion *mr,
-                                           RamDiscardManager *rdm)
+int memory_region_set_ram_discard_manager(MemoryRegion *mr,
+                                          RamDiscardManager *rdm)
 {
     g_assert(memory_region_is_ram(mr));
-    g_assert(!rdm || !mr->rdm);
+    if (mr->rdm && rdm) {
+        return -EBUSY;
+    }
+
     mr->rdm = rdm;
+    return 0;
 }
 
 uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
-- 
2.43.5



* [PATCH v7 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-06-12  8:27 [PATCH v7 0/5] Enable shared device assignment Chenyi Qiang
  2025-06-12  8:27 ` [PATCH v7 1/5] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
  2025-06-12  8:27 ` [PATCH v7 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
@ 2025-06-12  8:27 ` Chenyi Qiang
  2025-06-19  3:06   ` Chenyi Qiang
  2025-06-12  8:27 ` [PATCH v7 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd Chenyi Qiang
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Chenyi Qiang @ 2025-06-12  8:27 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao, Cédric Le Goater,
	Alex Williamson

Update the ReplayRamDiscard() function to return a result, and unify
ReplayRamPopulate() and ReplayRamDiscard() into ReplayRamDiscardState()
at the same time, since their definitions become identical. This
unification simplifies related structures, such as VirtIOMEMReplayData,
making them cleaner.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v7:
    - Add Reviewed-by from Xiaoyao and Pankaj.

Changes in v6:
    - Add Reviewed-by from David
    - Add a documentation comment for the prototype change

Changes in v5:
    - Rename ReplayRamStateChange to ReplayRamDiscardState (David)
    - return data->fn(s, data->opaque) instead of 0 in
      virtio_mem_rdm_replay_discarded_cb(). (Alexey)

Changes in v4:
    - Modify the commit message. We won't use Replay() operation when
      doing the attribute change like v3.
---
 hw/virtio/virtio-mem.c  | 21 +++++++-------
 include/system/memory.h | 64 ++++++++++++++++++++++++++++++-----------
 migration/ram.c         |  5 ++--
 system/memory.c         | 12 ++++----
 4 files changed, 66 insertions(+), 36 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 2e491e8c44..c46f6f9c3e 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1732,7 +1732,7 @@ static bool virtio_mem_rdm_is_populated(const RamDiscardManager *rdm,
 }
 
 struct VirtIOMEMReplayData {
-    void *fn;
+    ReplayRamDiscardState fn;
     void *opaque;
 };
 
@@ -1740,12 +1740,12 @@ static int virtio_mem_rdm_replay_populated_cb(MemoryRegionSection *s, void *arg)
 {
     struct VirtIOMEMReplayData *data = arg;
 
-    return ((ReplayRamPopulate)data->fn)(s, data->opaque);
+    return data->fn(s, data->opaque);
 }
 
 static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm,
                                            MemoryRegionSection *s,
-                                           ReplayRamPopulate replay_fn,
+                                           ReplayRamDiscardState replay_fn,
                                            void *opaque)
 {
     const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
@@ -1764,14 +1764,13 @@ static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s,
 {
     struct VirtIOMEMReplayData *data = arg;
 
-    ((ReplayRamDiscard)data->fn)(s, data->opaque);
-    return 0;
+    return data->fn(s, data->opaque);
 }
 
-static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
-                                            MemoryRegionSection *s,
-                                            ReplayRamDiscard replay_fn,
-                                            void *opaque)
+static int virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
+                                           MemoryRegionSection *s,
+                                           ReplayRamDiscardState replay_fn,
+                                           void *opaque)
 {
     const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
     struct VirtIOMEMReplayData data = {
@@ -1780,8 +1779,8 @@ static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
     };
 
     g_assert(s->mr == &vmem->memdev->mr);
-    virtio_mem_for_each_unplugged_section(vmem, s, &data,
-                                          virtio_mem_rdm_replay_discarded_cb);
+    return virtio_mem_for_each_unplugged_section(vmem, s, &data,
+                                                 virtio_mem_rdm_replay_discarded_cb);
 }
 
 static void virtio_mem_rdm_register_listener(RamDiscardManager *rdm,
diff --git a/include/system/memory.h b/include/system/memory.h
index 60983d4977..eb2618e1b4 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -576,8 +576,20 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
     rdl->double_discard_supported = double_discard_supported;
 }
 
-typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
-typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);
+/**
+ * ReplayRamDiscardState:
+ *
+ * The callback handler for #RamDiscardManagerClass.replay_populated/
+ * #RamDiscardManagerClass.replay_discarded to invoke on populated/discarded
+ * parts.
+ *
+ * @section: the #MemoryRegionSection of populated/discarded part
+ * @opaque: pointer to forward to the callback
+ *
+ * Returns 0 on success, or a negative error if failed.
+ */
+typedef int (*ReplayRamDiscardState)(MemoryRegionSection *section,
+                                     void *opaque);
 
 /*
  * RamDiscardManagerClass:
@@ -651,36 +663,38 @@ struct RamDiscardManagerClass {
     /**
      * @replay_populated:
      *
-     * Call the #ReplayRamPopulate callback for all populated parts within the
-     * #MemoryRegionSection via the #RamDiscardManager.
+     * Call the #ReplayRamDiscardState callback for all populated parts within
+     * the #MemoryRegionSection via the #RamDiscardManager.
      *
      * In case any call fails, no further calls are made.
      *
      * @rdm: the #RamDiscardManager
      * @section: the #MemoryRegionSection
-     * @replay_fn: the #ReplayRamPopulate callback
+     * @replay_fn: the #ReplayRamDiscardState callback
      * @opaque: pointer to forward to the callback
      *
      * Returns 0 on success, or a negative error if any notification failed.
      */
     int (*replay_populated)(const RamDiscardManager *rdm,
                             MemoryRegionSection *section,
-                            ReplayRamPopulate replay_fn, void *opaque);
+                            ReplayRamDiscardState replay_fn, void *opaque);
 
     /**
      * @replay_discarded:
      *
-     * Call the #ReplayRamDiscard callback for all discarded parts within the
-     * #MemoryRegionSection via the #RamDiscardManager.
+     * Call the #ReplayRamDiscardState callback for all discarded parts within
+     * the #MemoryRegionSection via the #RamDiscardManager.
      *
      * @rdm: the #RamDiscardManager
      * @section: the #MemoryRegionSection
-     * @replay_fn: the #ReplayRamDiscard callback
+     * @replay_fn: the #ReplayRamDiscardState callback
      * @opaque: pointer to forward to the callback
+     *
+     * Returns 0 on success, or a negative error if any notification failed.
      */
-    void (*replay_discarded)(const RamDiscardManager *rdm,
-                             MemoryRegionSection *section,
-                             ReplayRamDiscard replay_fn, void *opaque);
+    int (*replay_discarded)(const RamDiscardManager *rdm,
+                            MemoryRegionSection *section,
+                            ReplayRamDiscardState replay_fn, void *opaque);
 
     /**
      * @register_listener:
@@ -721,15 +735,31 @@ uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
 bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
                                       const MemoryRegionSection *section);
 
+/**
+ * ram_discard_manager_replay_populated:
+ *
+ * A wrapper to call the #RamDiscardManagerClass.replay_populated callback
+ * of the #RamDiscardManager.
+ *
+ * Returns 0 on success, or a negative error if any notification failed.
+ */
 int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
                                          MemoryRegionSection *section,
-                                         ReplayRamPopulate replay_fn,
+                                         ReplayRamDiscardState replay_fn,
                                          void *opaque);
 
-void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
-                                          MemoryRegionSection *section,
-                                          ReplayRamDiscard replay_fn,
-                                          void *opaque);
+/**
+ * ram_discard_manager_replay_discarded:
+ *
+ * A wrapper to call the #RamDiscardManagerClass.replay_discarded callback
+ * of the #RamDiscardManager.
+ *
+ * Returns 0 on success, or a negative error if any notification failed.
+ */
+int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+                                         MemoryRegionSection *section,
+                                         ReplayRamDiscardState replay_fn,
+                                         void *opaque);
 
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
                                            RamDiscardListener *rdl,
diff --git a/migration/ram.c b/migration/ram.c
index d26dbd37c4..2d4af497b5 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -848,8 +848,8 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
     return ret;
 }
 
-static void dirty_bitmap_clear_section(MemoryRegionSection *section,
-                                       void *opaque)
+static int dirty_bitmap_clear_section(MemoryRegionSection *section,
+                                      void *opaque)
 {
     const hwaddr offset = section->offset_within_region;
     const hwaddr size = int128_get64(section->size);
@@ -868,6 +868,7 @@ static void dirty_bitmap_clear_section(MemoryRegionSection *section,
     }
     *cleared_bits += bitmap_count_one_with_offset(rb->bmap, start, npages);
     bitmap_clear(rb->bmap, start, npages);
+    return 0;
 }
 
 /*
diff --git a/system/memory.c b/system/memory.c
index d0c186e9f6..76b44b8220 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2138,7 +2138,7 @@ bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
 
 int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
                                          MemoryRegionSection *section,
-                                         ReplayRamPopulate replay_fn,
+                                         ReplayRamDiscardState replay_fn,
                                          void *opaque)
 {
     RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
@@ -2147,15 +2147,15 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
     return rdmc->replay_populated(rdm, section, replay_fn, opaque);
 }
 
-void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
-                                          MemoryRegionSection *section,
-                                          ReplayRamDiscard replay_fn,
-                                          void *opaque)
+int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+                                         MemoryRegionSection *section,
+                                         ReplayRamDiscardState replay_fn,
+                                         void *opaque)
 {
     RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
 
     g_assert(rdmc->replay_discarded);
-    rdmc->replay_discarded(rdm, section, replay_fn, opaque);
+    return rdmc->replay_discarded(rdm, section, replay_fn, opaque);
 }
 
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
-- 
2.43.5



* [PATCH v7 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-12  8:27 [PATCH v7 0/5] Enable shared device assignment Chenyi Qiang
                   ` (2 preceding siblings ...)
  2025-06-12  8:27 ` [PATCH v7 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
@ 2025-06-12  8:27 ` Chenyi Qiang
  2025-06-20  2:53   ` Chenyi Qiang
  2025-06-12  8:27 ` [PATCH v7 5/5] physmem: Support coordinated discarding of RAM " Chenyi Qiang
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Chenyi Qiang @ 2025-06-12  8:27 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao, Cédric Le Goater,
	Alex Williamson

Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
discard") highlighted that subsystems like VFIO may disable RAM block
discard. However, guest_memfd relies on discard operations for page
conversion between private and shared memory, potentially leading to
the stale IOMMU mapping issue when assigning hardware devices to
confidential VMs via shared memory. To address this and allow shared
device assignment, it is crucial to ensure the VFIO system refreshes
its IOMMU mappings.

RamDiscardManager is an existing interface (used by virtio-mem) to
adjust VFIO mappings in relation to VM page assignment. Effectively page
conversion is similar to hot-removing a page in one mode and adding it
back in the other. Therefore, similar actions are required for page
conversion events. Introduce the RamDiscardManager to guest_memfd to
facilitate this process.

Since guest_memfd is not an object, it cannot directly implement the
RamDiscardManager interface. Implementing it in HostMemoryBackend is
not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
have a memory backend while others do not. Notably, virtual BIOS
RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
backend.

To manage RAMBlocks with guest_memfd, define a new object named
RamBlockAttributes to implement the RamDiscardManager interface. This
object can store the guest_memfd information such as the bitmap for
shared memory and the registered listeners for event notifications. A
new state_change() helper function is provided to notify listeners, such
as VFIO, allowing VFIO to dynamically DMA map and unmap the shared
memory according to conversion events. Note that in the current context
of RamDiscardManager for guest_memfd, the shared state is analogous to
being populated, while the private state can be considered discarded for
simplicity. In the future, it would become more complicated if more
states, such as private/shared/discarded, had to be considered at the
same time.

In the current implementation, memory state tracking is performed at the
host page size granularity, as the minimum conversion size can be one
page per request. Additionally, VFIO expects the DMA mapping for a
specific IOVA to be mapped and unmapped with the same granularity.
Confidential VMs may perform partial conversions, such as conversions on
small regions within a larger one. To prevent such invalid cases and
until support for DMA mapping cut operations is available, all
operations are performed with 4K granularity.

In addition, memory conversion failures currently cause QEMU to quit
rather than resume the guest or retry the operation. Adding more error
handling or rollback mechanisms once conversion failures are allowed is
future work. For example, in-place conversion of guest_memfd could retry
the unmap operation during the conversion from shared to private. For
now, the complex error handling is kept out of the picture as it is not
required.
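
For illustration, a caller is expected to drive the new helper roughly as
follows (a simplified sketch; patch 5 wires this up in the KVM
page-conversion path, and 'attr' comes from ram_block_attributes_create()):

/* Guest converts [offset, offset + size) to shared: populate + DMA map */
ret = ram_block_attributes_state_change(attr, offset, size, false);

/* Guest converts the range back to private: discard + DMA unmap */
ret = ram_block_attributes_state_change(attr, offset, size, true);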

Tested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v7:
    - Unwrap the two helpers (is_range_populated() and
      is_range_discarded()). (Alexey)
    - Use bit to do the iteration instead of offset without additional
      variables of "cur" and "end" (Alexey)
    - Add Reviewed-by from Pankaj and Alexey.

Changes in v6:
    - Change the object type name from RamBlockAttribute to
      RamBlockAttributes. (David)
    - Save the associated RAMBlock instead of the MemoryRegion in
      RamBlockAttributes. (David)
    - Squash the state_change() helper introduction in this commit as
      well as the mixture conversion case handling. (David)
    - Change the block_size type from int to size_t and some cleanup in
      validation check. (Alexey)
    - Add a tracepoint to track the state changes. (Alexey)

Changes in v5:
    - Revert to use RamDiscardManager interface instead of introducing
      new hierarchy of class to manage private/shared state, and keep
      using the new name of RamBlockAttribute compared with the
      MemoryAttributeManager in v3.
    - Use *simple* version of object_define and object_declare since the
      state_change() function is changed as an exported function instead
      of a virtual function in later patch.
    - Move the introduction of RamBlockAttribute field to this patch and
      rename it to ram_shared. (Alexey)
    - call the exit() when register/unregister failed. (Zhao)
    - Add the ram-block-attribute.c to Memory API related part in
      MAINTAINERS.

Changes in v4:
    - Change the name from memory-attribute-manager to
      ram-block-attribute.
    - Implement the newly-introduced PrivateSharedManager instead of
      RamDiscardManager and change related commit message.
    - Define the new object in ramblock.h instead of adding a new file.
---
 MAINTAINERS                   |   1 +
 include/system/ramblock.h     |  21 ++
 system/meson.build            |   1 +
 system/ram-block-attributes.c | 442 ++++++++++++++++++++++++++++++++++
 system/trace-events           |   3 +
 5 files changed, 468 insertions(+)
 create mode 100644 system/ram-block-attributes.c

diff --git a/MAINTAINERS b/MAINTAINERS
index aa6763077e..6a86cee73a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3172,6 +3172,7 @@ F: system/memory.c
 F: system/memory_mapping.c
 F: system/physmem.c
 F: system/memory-internal.h
+F: system/ram-block-attributes.c
 F: scripts/coccinelle/memory-region-housekeeping.cocci
 
 Memory devices
diff --git a/include/system/ramblock.h b/include/system/ramblock.h
index d8a116ba99..1bab9e2dac 100644
--- a/include/system/ramblock.h
+++ b/include/system/ramblock.h
@@ -22,6 +22,10 @@
 #include "exec/cpu-common.h"
 #include "qemu/rcu.h"
 #include "exec/ramlist.h"
+#include "system/hostmem.h"
+
+#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes"
+OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES)
 
 struct RAMBlock {
     struct rcu_head rcu;
@@ -91,4 +95,21 @@ struct RAMBlock {
     ram_addr_t postcopy_length;
 };
 
+struct RamBlockAttributes {
+    Object parent;
+
+    RAMBlock *ram_block;
+
+    /* 1-setting of the bitmap represents ram is populated (shared) */
+    unsigned bitmap_size;
+    unsigned long *bitmap;
+
+    QLIST_HEAD(, RamDiscardListener) rdl_list;
+};
+
+RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block);
+void ram_block_attributes_destroy(RamBlockAttributes *attr);
+int ram_block_attributes_state_change(RamBlockAttributes *attr, uint64_t offset,
+                                      uint64_t size, bool to_discard);
+
 #endif
diff --git a/system/meson.build b/system/meson.build
index 7514bf3455..6d21ff9faa 100644
--- a/system/meson.build
+++ b/system/meson.build
@@ -17,6 +17,7 @@ system_ss.add(files(
   'dma-helpers.c',
   'globals.c',
   'ioport.c',
+  'ram-block-attributes.c',
   'memory_mapping.c',
   'memory.c',
   'physmem.c',
diff --git a/system/ram-block-attributes.c b/system/ram-block-attributes.c
new file mode 100644
index 0000000000..dbb8c9675b
--- /dev/null
+++ b/system/ram-block-attributes.c
@@ -0,0 +1,442 @@
+/*
+ * QEMU ram block attributes
+ *
+ * Copyright Intel
+ *
+ * Author:
+ *      Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "system/ramblock.h"
+#include "trace.h"
+
+OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(RamBlockAttributes,
+                                          ram_block_attributes,
+                                          RAM_BLOCK_ATTRIBUTES,
+                                          OBJECT,
+                                          { TYPE_RAM_DISCARD_MANAGER },
+                                          { })
+
+static size_t
+ram_block_attributes_get_block_size(const RamBlockAttributes *attr)
+{
+    /*
+     * Because page conversions are performed at a minimum size of 4K and
+     * are 4K aligned, use the host page size as the granularity to track
+     * the memory attributes.
+     */
+    g_assert(attr && attr->ram_block);
+    g_assert(attr->ram_block->page_size == qemu_real_host_page_size());
+    return attr->ram_block->page_size;
+}
+
+
+static bool
+ram_block_attributes_rdm_is_populated(const RamDiscardManager *rdm,
+                                      const MemoryRegionSection *section)
+{
+    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
+    const size_t block_size = ram_block_attributes_get_block_size(attr);
+    const uint64_t first_bit = section->offset_within_region / block_size;
+    const uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
+    unsigned long first_discarded_bit;
+
+    first_discarded_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
+                                           first_bit);
+    return first_discarded_bit > last_bit;
+}
+
+typedef int (*ram_block_attributes_section_cb)(MemoryRegionSection *s,
+                                               void *arg);
+
+static int
+ram_block_attributes_notify_populate_cb(MemoryRegionSection *section,
+                                        void *arg)
+{
+    RamDiscardListener *rdl = arg;
+
+    return rdl->notify_populate(rdl, section);
+}
+
+static int
+ram_block_attributes_notify_discard_cb(MemoryRegionSection *section,
+                                       void *arg)
+{
+    RamDiscardListener *rdl = arg;
+
+    rdl->notify_discard(rdl, section);
+    return 0;
+}
+
+static int
+ram_block_attributes_for_each_populated_section(const RamBlockAttributes *attr,
+                                                MemoryRegionSection *section,
+                                                void *arg,
+                                                ram_block_attributes_section_cb cb)
+{
+    unsigned long first_bit, last_bit;
+    uint64_t offset, size;
+    const size_t block_size = ram_block_attributes_get_block_size(attr);
+    int ret = 0;
+
+    first_bit = section->offset_within_region / block_size;
+    first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
+                              first_bit);
+
+    while (first_bit < attr->bitmap_size) {
+        MemoryRegionSection tmp = *section;
+
+        offset = first_bit * block_size;
+        last_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
+                                      first_bit + 1) - 1;
+        size = (last_bit - first_bit + 1) * block_size;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            break;
+        }
+
+        ret = cb(&tmp, arg);
+        if (ret) {
+            error_report("%s: Failed to notify RAM discard listener: %s",
+                         __func__, strerror(-ret));
+            break;
+        }
+
+        first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
+                                  last_bit + 2);
+    }
+
+    return ret;
+}
+
+static int
+ram_block_attributes_for_each_discarded_section(const RamBlockAttributes *attr,
+                                                MemoryRegionSection *section,
+                                                void *arg,
+                                                ram_block_attributes_section_cb cb)
+{
+    unsigned long first_bit, last_bit;
+    uint64_t offset, size;
+    const size_t block_size = ram_block_attributes_get_block_size(attr);
+    int ret = 0;
+
+    first_bit = section->offset_within_region / block_size;
+    first_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
+                                   first_bit);
+
+    while (first_bit < attr->bitmap_size) {
+        MemoryRegionSection tmp = *section;
+
+        offset = first_bit * block_size;
+        last_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
+                                 first_bit + 1) - 1;
+        size = (last_bit - first_bit + 1) * block_size;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            break;
+        }
+
+        ret = cb(&tmp, arg);
+        if (ret) {
+            error_report("%s: Failed to notify RAM discard listener: %s",
+                         __func__, strerror(-ret));
+            break;
+        }
+
+        first_bit = find_next_zero_bit(attr->bitmap,
+                                       attr->bitmap_size,
+                                       last_bit + 2);
+    }
+
+    return ret;
+}
+
+static uint64_t
+ram_block_attributes_rdm_get_min_granularity(const RamDiscardManager *rdm,
+                                             const MemoryRegion *mr)
+{
+    const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
+
+    g_assert(mr == attr->ram_block->mr);
+    return ram_block_attributes_get_block_size(attr);
+}
+
+static void
+ram_block_attributes_rdm_register_listener(RamDiscardManager *rdm,
+                                           RamDiscardListener *rdl,
+                                           MemoryRegionSection *section)
+{
+    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
+    int ret;
+
+    g_assert(section->mr == attr->ram_block->mr);
+    rdl->section = memory_region_section_new_copy(section);
+
+    QLIST_INSERT_HEAD(&attr->rdl_list, rdl, next);
+
+    ret = ram_block_attributes_for_each_populated_section(attr, section, rdl,
+                                    ram_block_attributes_notify_populate_cb);
+    if (ret) {
+        error_report("%s: Failed to register RAM discard listener: %s",
+                     __func__, strerror(-ret));
+        exit(1);
+    }
+}
+
+static void
+ram_block_attributes_rdm_unregister_listener(RamDiscardManager *rdm,
+                                             RamDiscardListener *rdl)
+{
+    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
+    int ret;
+
+    g_assert(rdl->section);
+    g_assert(rdl->section->mr == attr->ram_block->mr);
+
+    if (rdl->double_discard_supported) {
+        rdl->notify_discard(rdl, rdl->section);
+    } else {
+        ret = ram_block_attributes_for_each_populated_section(attr,
+                rdl->section, rdl, ram_block_attributes_notify_discard_cb);
+        if (ret) {
+            error_report("%s: Failed to unregister RAM discard listener: %s",
+                         __func__, strerror(-ret));
+            exit(1);
+        }
+    }
+
+    memory_region_section_free_copy(rdl->section);
+    rdl->section = NULL;
+    QLIST_REMOVE(rdl, next);
+}
+
+typedef struct RamBlockAttributesReplayData {
+    ReplayRamDiscardState fn;
+    void *opaque;
+} RamBlockAttributesReplayData;
+
+static int ram_block_attributes_rdm_replay_cb(MemoryRegionSection *section,
+                                              void *arg)
+{
+    RamBlockAttributesReplayData *data = arg;
+
+    return data->fn(section, data->opaque);
+}
+
+static int
+ram_block_attributes_rdm_replay_populated(const RamDiscardManager *rdm,
+                                          MemoryRegionSection *section,
+                                          ReplayRamDiscardState replay_fn,
+                                          void *opaque)
+{
+    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
+    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == attr->ram_block->mr);
+    return ram_block_attributes_for_each_populated_section(attr, section, &data,
+                                            ram_block_attributes_rdm_replay_cb);
+}
+
+static int
+ram_block_attributes_rdm_replay_discarded(const RamDiscardManager *rdm,
+                                          MemoryRegionSection *section,
+                                          ReplayRamDiscardState replay_fn,
+                                          void *opaque)
+{
+    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
+    RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == attr->ram_block->mr);
+    return ram_block_attributes_for_each_discarded_section(attr, section, &data,
+                                            ram_block_attributes_rdm_replay_cb);
+}
+
+static bool
+ram_block_attributes_is_valid_range(RamBlockAttributes *attr, uint64_t offset,
+                                    uint64_t size)
+{
+    MemoryRegion *mr = attr->ram_block->mr;
+
+    g_assert(mr);
+
+    uint64_t region_size = memory_region_size(mr);
+    const size_t block_size = ram_block_attributes_get_block_size(attr);
+
+    if (!QEMU_IS_ALIGNED(offset, block_size) ||
+        !QEMU_IS_ALIGNED(size, block_size)) {
+        return false;
+    }
+    if (offset + size <= offset) {
+        return false;
+    }
+    if (offset + size > region_size) {
+        return false;
+    }
+    return true;
+}
+
+static void ram_block_attributes_notify_discard(RamBlockAttributes *attr,
+                                                uint64_t offset,
+                                                uint64_t size)
+{
+    RamDiscardListener *rdl;
+
+    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
+        MemoryRegionSection tmp = *rdl->section;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            continue;
+        }
+        rdl->notify_discard(rdl, &tmp);
+    }
+}
+
+static int
+ram_block_attributes_notify_populate(RamBlockAttributes *attr,
+                                     uint64_t offset, uint64_t size)
+{
+    RamDiscardListener *rdl;
+    int ret = 0;
+
+    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
+        MemoryRegionSection tmp = *rdl->section;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            continue;
+        }
+        ret = rdl->notify_populate(rdl, &tmp);
+        if (ret) {
+            break;
+        }
+    }
+
+    return ret;
+}
+
+int ram_block_attributes_state_change(RamBlockAttributes *attr,
+                                      uint64_t offset, uint64_t size,
+                                      bool to_discard)
+{
+    const size_t block_size = ram_block_attributes_get_block_size(attr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long nbits = size / block_size;
+    const unsigned long last_bit = first_bit + nbits - 1;
+    const bool is_discarded = find_next_bit(attr->bitmap, attr->bitmap_size,
+                                            first_bit) > last_bit;
+    const bool is_populated = find_next_zero_bit(attr->bitmap,
+                                attr->bitmap_size, first_bit) > last_bit;
+    unsigned long bit;
+    int ret = 0;
+
+    if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
+        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
+                     __func__, offset, size);
+        return -EINVAL;
+    }
+
+    trace_ram_block_attributes_state_change(offset, size,
+                                            is_discarded ? "discarded" :
+                                            is_populated ? "populated" :
+                                            "mixture",
+                                            to_discard ? "discarded" :
+                                            "populated");
+    if (to_discard) {
+        if (is_discarded) {
+            /* Already private */
+        } else if (is_populated) {
+            /* Completely shared */
+            bitmap_clear(attr->bitmap, first_bit, nbits);
+            ram_block_attributes_notify_discard(attr, offset, size);
+        } else {
+            /* Unexpected mixture: process individual blocks */
+            for (bit = first_bit; bit < first_bit + nbits; bit++) {
+                if (!test_bit(bit, attr->bitmap)) {
+                    continue;
+                }
+                clear_bit(bit, attr->bitmap);
+                ram_block_attributes_notify_discard(attr, bit * block_size,
+                                                    block_size);
+            }
+        }
+    } else {
+        if (is_populated) {
+            /* Already shared */
+        } else if (is_discarded) {
+            /* Completely private */
+            bitmap_set(attr->bitmap, first_bit, nbits);
+            ret = ram_block_attributes_notify_populate(attr, offset, size);
+        } else {
+            /* Unexpected mixture: process individual blocks */
+            for (bit = first_bit; bit < first_bit + nbits; bit++) {
+                if (test_bit(bit, attr->bitmap)) {
+                    continue;
+                }
+                set_bit(bit, attr->bitmap);
+                ret = ram_block_attributes_notify_populate(attr,
+                                                           bit * block_size,
+                                                           block_size);
+                if (ret) {
+                    break;
+                }
+            }
+        }
+    }
+
+    return ret;
+}
+
+RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block)
+{
+    const int block_size  = qemu_real_host_page_size();
+    RamBlockAttributes *attr;
+    MemoryRegion *mr = ram_block->mr;
+
+    attr = RAM_BLOCK_ATTRIBUTES(object_new(TYPE_RAM_BLOCK_ATTRIBUTES));
+
+    attr->ram_block = ram_block;
+    if (memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(attr))) {
+        object_unref(OBJECT(attr));
+        return NULL;
+    }
+    attr->bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
+    attr->bitmap = bitmap_new(attr->bitmap_size);
+
+    return attr;
+}
+
+void ram_block_attributes_destroy(RamBlockAttributes *attr)
+{
+    g_assert(attr);
+
+    g_free(attr->bitmap);
+    memory_region_set_ram_discard_manager(attr->ram_block->mr, NULL);
+    object_unref(OBJECT(attr));
+}
+
+static void ram_block_attributes_init(Object *obj)
+{
+    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(obj);
+
+    QLIST_INIT(&attr->rdl_list);
+}
+
+static void ram_block_attributes_finalize(Object *obj)
+{
+}
+
+static void ram_block_attributes_class_init(ObjectClass *klass,
+                                            const void *data)
+{
+    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(klass);
+
+    rdmc->get_min_granularity = ram_block_attributes_rdm_get_min_granularity;
+    rdmc->register_listener = ram_block_attributes_rdm_register_listener;
+    rdmc->unregister_listener = ram_block_attributes_rdm_unregister_listener;
+    rdmc->is_populated = ram_block_attributes_rdm_is_populated;
+    rdmc->replay_populated = ram_block_attributes_rdm_replay_populated;
+    rdmc->replay_discarded = ram_block_attributes_rdm_replay_discarded;
+}
diff --git a/system/trace-events b/system/trace-events
index be12ebfb41..82856e44f2 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -52,3 +52,6 @@ dirtylimit_state_finalize(void)
 dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
 dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
 dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
+
+# ram-block-attributes.c
+ram_block_attributes_state_change(uint64_t offset, uint64_t size, const char *from, const char *to) "offset 0x%"PRIx64" size 0x%"PRIx64" from '%s' to '%s'"
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v7 5/5] physmem: Support coordinated discarding of RAM with guest_memfd
  2025-06-12  8:27 [PATCH v7 0/5] Enable shared device assignment Chenyi Qiang
                   ` (3 preceding siblings ...)
  2025-06-12  8:27 ` [PATCH v7 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd Chenyi Qiang
@ 2025-06-12  8:27 ` Chenyi Qiang
  2025-06-18 21:58 ` [PATCH v7 0/5] Enable shared device assignment Peter Xu
  2025-06-18 22:22 ` Peter Xu
  6 siblings, 0 replies; 14+ messages in thread
From: Chenyi Qiang @ 2025-06-12  8:27 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao, Cédric Le Goater,
	Alex Williamson

A new field, attributes, is introduced in RAMBlock to link to a
RamBlockAttributes object, which centralizes all guest_memfd related
information (such as the fd and the status bitmap) within a RAMBlock.

Create and initialize the RamBlockAttributes object upon ram_block_add(),
and register the object in the target RAMBlock's MemoryRegion. After
that, a guest_memfd-backed RAMBlock is associated with the
RamDiscardManager interface, and users can perform RamDiscardManager
specific handling. For example, VFIO can register a RamDiscardListener
and receive notifications when the state_change() helper is invoked.

As coordinated discarding of RAM with guest_memfd is now supported, only
block uncoordinated discard.
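
For illustration only (not part of this patch): a rough sketch of how a
listener such as VFIO consumes the RamDiscardManager once the MemoryRegion
carries the RamBlockAttributes object. The function and callback names below
are placeholders; the real VFIO listener lives under hw/vfio/.

    #include "qemu/osdep.h"
    #include "system/memory.h"

    /* shared/populated range: (re)create the DMA mapping for it */
    static int demo_notify_populate(RamDiscardListener *rdl,
                                    MemoryRegionSection *section)
    {
        return 0;
    }

    /* private/discarded range: tear down the DMA mapping for it */
    static void demo_notify_discard(RamDiscardListener *rdl,
                                    MemoryRegionSection *section)
    {
    }

    static void demo_register(MemoryRegionSection *section)
    {
        RamDiscardManager *rdm =
            memory_region_get_ram_discard_manager(section->mr);
        static RamDiscardListener rdl;

        ram_discard_listener_init(&rdl, demo_notify_populate,
                                  demo_notify_discard,
                                  true /* double_discard_supported */);
        /* replays the currently populated parts, then keeps notifying */
        ram_discard_manager_register_listener(rdm, &rdl, section);
    }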

Tested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v7:
    - Add some documentation about the attribute and status consistency
      (David).
    - Add Reviewed-by and Acked-by from Alexey and David.

Changes in v6:
    - Squash the unblocking of coordinated discard into this commit.
    - Remove the checks in migration path.

Changes in v5:
    - Revert to use RamDiscardManager interface.
    - Move the object_new() into the ram_block_attribute_create()
      helper.
    - Add some check in migration path.

Changes in v4:
    - Remove the replay operations for attribute changes which will be
      handled in a listener in following patches.
    - Add a comment in the error path of realize() as a reminder for the
      future development of a unified error path.

Changes in v3:
    - Use ram_discard_manager_replay_populated/discarded() to set the
      memory attribute and add undo support if state_change()
      fails.
    - Didn't add Reviewed-by from Alexey due to the new changes in this
      commit.
---
 accel/kvm/kvm-all.c       |  9 +++++++++
 include/system/ramblock.h |  1 +
 system/physmem.c          | 23 +++++++++++++++++++++--
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 51526d301b..3b390bbb09 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3089,6 +3089,15 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
     addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
     rb = qemu_ram_block_from_host(addr, false, &offset);
 
+    ret = ram_block_attributes_state_change(RAM_BLOCK_ATTRIBUTES(mr->rdm),
+                                            offset, size, to_private);
+    if (ret) {
+        error_report("Failed to notify the listener the state change of "
+                     "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
+                     start, size, to_private ? "private" : "shared");
+        goto out_unref;
+    }
+
     if (to_private) {
         if (rb->page_size != qemu_real_host_page_size()) {
             /*
diff --git a/include/system/ramblock.h b/include/system/ramblock.h
index 1bab9e2dac..87e847e184 100644
--- a/include/system/ramblock.h
+++ b/include/system/ramblock.h
@@ -46,6 +46,7 @@ struct RAMBlock {
     int fd;
     uint64_t fd_offset;
     int guest_memfd;
+    RamBlockAttributes *attributes;
     size_t page_size;
     /* dirty bitmap used during migration */
     unsigned long *bmap;
diff --git a/system/physmem.c b/system/physmem.c
index a8a9ca309e..ff0ca40222 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1916,7 +1916,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         }
         assert(new_block->guest_memfd < 0);
 
-        ret = ram_block_discard_require(true);
+        ret = ram_block_coordinated_discard_require(true);
         if (ret < 0) {
             error_setg_errno(errp, -ret,
                              "cannot set up private guest memory: discard currently blocked");
@@ -1931,6 +1931,24 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
             goto out_free;
         }
 
+        /*
+         * The attribute bitmap of the RamBlockAttributes defaults to
+         * discarded, which mimics the behavior of kvm_set_phys_mem() when it
+         * calls kvm_set_memory_attributes_private(). This leads to a brief
+         * period of inconsistency between the creation of the RAMBlock and its
+         * mapping into the physical address space. However, this is not
+         * problematic, as no users rely on the attribute status to perform
+         * any actions during this interval.
+         */
+        new_block->attributes = ram_block_attributes_create(new_block);
+        if (!new_block->attributes) {
+            error_setg(errp, "Failed to create ram block attribute");
+            close(new_block->guest_memfd);
+            ram_block_coordinated_discard_require(false);
+            qemu_mutex_unlock_ramlist();
+            goto out_free;
+        }
+
         /*
          * Add a specific guest_memfd blocker if a generic one would not be
          * added by ram_block_add_cpr_blocker.
@@ -2287,8 +2305,9 @@ static void reclaim_ramblock(RAMBlock *block)
     }
 
     if (block->guest_memfd >= 0) {
+        ram_block_attributes_destroy(block->attributes);
         close(block->guest_memfd);
-        ram_block_discard_require(false);
+        ram_block_coordinated_discard_require(false);
     }
 
     g_free(block);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v7 0/5] Enable shared device assignment
  2025-06-12  8:27 [PATCH v7 0/5] Enable shared device assignment Chenyi Qiang
                   ` (4 preceding siblings ...)
  2025-06-12  8:27 ` [PATCH v7 5/5] physmem: Support coordinated discarding of RAM " Chenyi Qiang
@ 2025-06-18 21:58 ` Peter Xu
  2025-06-24  9:51   ` David Hildenbrand
  2025-06-18 22:22 ` Peter Xu
  6 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2025-06-18 21:58 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: David Hildenbrand, Alexey Kardashevskiy, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
	qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson

Hi, Chenyi,

On Thu, Jun 12, 2025 at 04:27:41PM +0800, Chenyi Qiang wrote:
> Relationship with in-place conversion
> -------------------------------------
> In-place page conversion is the ongoing work to allow mmap() of
> guest_memfd to userspace so that both private and shared memory can use
> the same physical memory as the backend. This new design eliminates the
> need to discard pages during shared/private conversions. When it is
> ready, shared device assignment needs be adjusted to achieve an
> unmap-before-conversion-to-private and map-after-conversion-to-shared
> sequence to be compatible with the change.

Is it intentional to be prepared for this?

The question more or less comes from reading patch 5, where I see a
bunch of code that is very similar to virtio-mem, like:

        ram_block_attributes_for_each_populated_section
        ram_block_attributes_for_each_discarded_section
        ram_block_attributes_rdm_register_listener
        ram_block_attributes_rdm_unregister_listener

Fundamentally, IIUC it's because of the similar bitmap structure used,
and the listeners.  IOW, I wonder if it's possible to move the shared
elements into RamDiscardManager for:

    unsigned bitmap_size;
    unsigned long *bitmap;
    QLIST_HEAD(, RamDiscardListener) rdl_list;

But if we know it'll become a tri-state some day, maybe that means it won't
apply anymore.  However, the rdl_list could still be merged if we want;
it's just that it would be a smaller portion to share.
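
(For illustration only, and not something this series does: roughly what that
shared piece could look like if it were factored into a common helper that
both virtio-mem and RamBlockAttributes embed; the struct name is made up.)

    typedef struct RamDiscardState {
        unsigned bitmap_size;                       /* number of tracked blocks */
        unsigned long *bitmap;                      /* 1 = populated/shared     */
        QLIST_HEAD(, RamDiscardListener) rdl_list;  /* registered listeners     */
    } RamDiscardState;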

Other than that, even though I don't know how to test this, I read the patches
today and they all look good.  The duplication is purely a question I have
above, and even if we pursue it, it can also be done on top.

I do plan to pick this up. Paolo/David, any comments before I do?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v7 0/5] Enable shared device assignment
  2025-06-12  8:27 [PATCH v7 0/5] Enable shared device assignment Chenyi Qiang
                   ` (5 preceding siblings ...)
  2025-06-18 21:58 ` [PATCH v7 0/5] Enable shared device assignment Peter Xu
@ 2025-06-18 22:22 ` Peter Xu
  2025-06-19  3:09   ` Chenyi Qiang
  6 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2025-06-18 22:22 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: David Hildenbrand, Alexey Kardashevskiy, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
	qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson

On Thu, Jun 12, 2025 at 04:27:41PM +0800, Chenyi Qiang wrote:
> This is the v7 series of the shared device assignment support.

Building doc fails, see:

  https://gitlab.com/peterx/qemu/-/jobs/10396029551

You should be able to reproduce it with --enable-docs.  I think you need to
make the rest of the comments follow the kernel-doc format.
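
(For reference, the comment shape the docs build expects; the names below are
placeholders only.)

    /**
     * my_helper:
     *
     * One-line summary of what the helper does.
     *
     * @arg: what the argument means
     * @opaque: pointer to forward to the callback
     *
     * Returns 0 on success, or a negative error on failure.
     */
    int my_helper(int arg, void *opaque);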

If you want, you can provide "git commit --fixup" patches appended to the reply
(one fixup for each patch that needs fixing) to avoid a full repost.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v7 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-06-12  8:27 ` [PATCH v7 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
@ 2025-06-19  3:06   ` Chenyi Qiang
  2025-06-19 14:29     ` Peter Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Chenyi Qiang @ 2025-06-19  3:06 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 6/12/2025 4:27 PM, Chenyi Qiang wrote:
> Update the ReplayRamDiscard() function to return the result, and unify
> ReplayRamPopulate() and ReplayRamDiscard() into ReplayRamDiscardState()
> at the same time since their definitions are identical. This unification
> simplifies related structures, such as VirtIOMEMReplayData, making
> them cleaner.
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v7:
>     - Add Reviewed-by from Xiaoyao and Pankaj.
> 
> Changes in v6:
>     - Add Reviewed-by from David
>     - Add a documentation comment for the prototype change
> 
> Changes in v5:
>     - Rename ReplayRamStateChange to ReplayRamDiscardState (David)
>     - return data->fn(s, data->opaque) instead of 0 in
>       virtio_mem_rdm_replay_discarded_cb(). (Alexey)
> 
> Changes in v4:
>     - Modify the commit message. We won't use Replay() operation when
>       doing the attribute change like v3.
> ---
>  hw/virtio/virtio-mem.c  | 21 +++++++-------
>  include/system/memory.h | 64 ++++++++++++++++++++++++++++++-----------
>  migration/ram.c         |  5 ++--
>  system/memory.c         | 12 ++++----
>  4 files changed, 66 insertions(+), 36 deletions(-)
> 
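
(Purely illustrative of the unified shape from the quoted patch: one callback
type now serves both replay directions; the names below are placeholders.)

    static int demo_replay(MemoryRegionSection *section, void *opaque)
    {
        /* act on one populated (or discarded) piece of the section */
        return 0;
    }

    /* both wrappers now take the same ReplayRamDiscardState callback */
    static void demo_replay_both(RamDiscardManager *rdm,
                                 MemoryRegionSection *section)
    {
        ram_discard_manager_replay_populated(rdm, section, demo_replay, NULL);
        ram_discard_manager_replay_discarded(rdm, section, demo_replay, NULL);
    }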

To fix the build error with the --enable-docs configuration, add the fixup below:

======

From 41bf404651b180f886bd174d36ae62be2b0da61f Mon Sep 17 00:00:00 2001
From: Chenyi Qiang <chenyi.qiang@intel.com>
Date: Thu, 19 Jun 2025 10:49:46 +0800
Subject: [PATCH] fixup! memory: Unify the definition of ReplayRamPopulate() and
 ReplayRamDiscard()

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 include/system/memory.h | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/system/memory.h b/include/system/memory.h
index eb2618e1b4..46248d4a52 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -577,7 +577,7 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
 }
 
 /**
- * ReplayRamDiscardState:
+ * typedef ReplayRamDiscardState:
  *
  * The callback handler for #RamDiscardManagerClass.replay_populated/
  * #RamDiscardManagerClass.replay_discarded to invoke on populated/discarded
@@ -741,6 +741,11 @@ bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
  * A wrapper to call the #RamDiscardManagerClass.replay_populated callback
  * of the #RamDiscardManager.
  *
+ * @rdm: the #RamDiscardManager
+ * @section: the #MemoryRegionSection
+ * @replay_fn: the #ReplayRamDiscardState callback
+ * @opaque: pointer to forward to the callback
+ *
  * Returns 0 on success, or a negative error if any notification failed.
  */
 int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
@@ -754,6 +759,11 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
  * A wrapper to call the #RamDiscardManagerClass.replay_discarded callback
  * of the #RamDiscardManager.
  *
+ * @rdm: the #RamDiscardManager
+ * @section: the #MemoryRegionSection
+ * @replay_fn: the #ReplayRamDiscardState callback
+ * @opaque: pointer to forward to the callback
+ *
  * Returns 0 on success, or a negative error if any notification failed.
  */
 int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
-- 
2.43.5




^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v7 0/5] Enable shared device assignment
  2025-06-18 22:22 ` Peter Xu
@ 2025-06-19  3:09   ` Chenyi Qiang
  0 siblings, 0 replies; 14+ messages in thread
From: Chenyi Qiang @ 2025-06-19  3:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Alexey Kardashevskiy, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
	qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 6/19/2025 6:22 AM, Peter Xu wrote:
> On Thu, Jun 12, 2025 at 04:27:41PM +0800, Chenyi Qiang wrote:
>> This is the v7 series of the shared device assignment support.
> 
> Building doc fails, see:
> 
>   https://gitlab.com/peterx/qemu/-/jobs/10396029551
> 
> You should be able to reproduce it with --enable-docs.  I think you need to
> make the rest of the comments follow the kernel-doc format.
> 
> If you want, you can provide "git commit --fixup" patches appended to the reply
> (one fixup for each patch that needs fixing) to avoid a full repost.

Thanks for pointing it out. I missed building with --enable-docs. I have replied
with the related fixup in patch #3.

> 
> Thanks,
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v7 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-06-19  3:06   ` Chenyi Qiang
@ 2025-06-19 14:29     ` Peter Xu
  2025-06-20  4:04       ` Chenyi Qiang
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2025-06-19 14:29 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: David Hildenbrand, Alexey Kardashevskiy, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
	qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson

On Thu, Jun 19, 2025 at 11:06:46AM +0800, Chenyi Qiang wrote:
> To fix the build error with the --enable-docs configuration, add the fixup below:

Thanks, this works for me.

Though I just noticed there is more than the doc issue; please see:

https://gitlab.com/peterx/qemu/-/jobs/10403528070

So 6 failures (and you can ignore the rust warnings):

https://gitlab.com/peterx/qemu/-/pipelines/1878653850

WASM complains a bit more than the others, but it is worth a double check from
your side too.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v7 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd
  2025-06-12  8:27 ` [PATCH v7 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd Chenyi Qiang
@ 2025-06-20  2:53   ` Chenyi Qiang
  0 siblings, 0 replies; 14+ messages in thread
From: Chenyi Qiang @ 2025-06-20  2:53 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 6/12/2025 4:27 PM, Chenyi Qiang wrote:
> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> discard") highlighted that subsystems like VFIO may disable RAM block
> discard. However, guest_memfd relies on discard operations for page
> conversion between private and shared memory, potentially leading to
> the stale IOMMU mapping issue when assigning hardware devices to
> confidential VMs via shared memory. To address this and allow shared
> device assignment, it is crucial to ensure the VFIO system refreshes
> its IOMMU mappings.
> 
> RamDiscardManager is an existing interface (used by virtio-mem) to
> adjust VFIO mappings in relation to VM page assignment. Effectively page
> conversion is similar to hot-removing a page in one mode and adding it
> back in the other. Therefore, similar actions are required for page
> conversion events. Introduce the RamDiscardManager to guest_memfd to
> facilitate this process.
> 
> Since guest_memfd is not an object, it cannot directly implement the
> RamDiscardManager interface. Implementing it in HostMemoryBackend is
> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
> have a memory backend while others do not. Notably, virtual BIOS
> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
> backend.
> 
> To manage RAMBlocks with guest_memfd, define a new object named
> RamBlockAttributes to implement the RamDiscardManager interface. This
> object can store the guest_memfd information such as the bitmap for
> shared memory and the registered listeners for event notifications. A
> new state_change() helper function is provided to notify listeners, such
> as VFIO, allowing VFIO to dynamically DMA map and unmap the shared
> memory according to conversion events. Note that in the current context
> of RamDiscardManager for guest_memfd, the shared state is analogous to
> being populated, while the private state can be considered discarded for
> simplicity. In the future, it would become more complicated if more
> states like private/shared/discarded are considered at the same time.
> 
> In the current implementation, memory state tracking is performed at the
> host page size granularity, as the minimum conversion size can be one
> page per request. Additionally, VFIO expects the DMA mapping for a
> specific IOVA to be mapped and unmapped with the same granularity.
> Confidential VMs may perform partial conversions, such as conversions on
> small regions within a larger one. To prevent such invalid cases and
> until support for DMA mapping cut operations is available, all
> operations are performed with 4K granularity.
> 
> In addition, memory conversion failures currently cause QEMU to quit
> rather than resume the guest or retry the operation. It would be
> future work to add more error handling or rollback mechanisms once
> conversion failures are allowed. For example, in-place conversion of
> guest_memfd could retry the unmap operation during the conversion from
> shared to private. For now, keep the complex error handling out of the
> picture as it is not required.
> 
> Tested-by: Alexey Kardashevskiy <aik@amd.com>
> Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---

Fix the build issue seen with cross-platform compilation. Opportunistically
resolve a "line over 80 characters" warning from checkpatch.

===

From 66d6edfb78a6059362a1de3d5028c4159782554b Mon Sep 17 00:00:00 2001
From: Chenyi Qiang <chenyi.qiang@intel.com>
Date: Fri, 20 Jun 2025 10:29:14 +0800
Subject: [PATCH] fixup! ram-block-attributes: Introduce RamBlockAttributes to
 manage RAMBlock with guest_memfd

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 system/ram-block-attributes.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/system/ram-block-attributes.c b/system/ram-block-attributes.c
index dbb8c9675b..68e8a02703 100644
--- a/system/ram-block-attributes.c
+++ b/system/ram-block-attributes.c
@@ -42,7 +42,8 @@ ram_block_attributes_rdm_is_populated(const RamDiscardManager *rdm,
     const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
     const size_t block_size = ram_block_attributes_get_block_size(attr);
     const uint64_t first_bit = section->offset_within_region / block_size;
-    const uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
+    const uint64_t last_bit =
+        first_bit + int128_get64(section->size) / block_size - 1;
     unsigned long first_discarded_bit;
 
     first_discarded_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
@@ -333,8 +334,8 @@ int ram_block_attributes_state_change(RamBlockAttributes *attr,
     int ret = 0;
 
     if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
-        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
-                     __func__, offset, size);
+        error_report("%s, invalid range: offset 0x%" PRIx64 ", size "
+                     "0x%" PRIx64, __func__, offset, size);
         return -EINVAL;
     }
 
@@ -402,7 +403,8 @@ RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block)
         object_unref(OBJECT(attr));
         return NULL;
     }
-    attr->bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
+    attr->bitmap_size =
+        ROUND_UP(int128_get64(mr->size), block_size) / block_size;
     attr->bitmap = bitmap_new(attr->bitmap_size);
 
     return attr;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v7 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-06-19 14:29     ` Peter Xu
@ 2025-06-20  4:04       ` Chenyi Qiang
  0 siblings, 0 replies; 14+ messages in thread
From: Chenyi Qiang @ 2025-06-20  4:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Alexey Kardashevskiy, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth,
	qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, Cédric Le Goater, Alex Williamson



On 6/19/2025 10:29 PM, Peter Xu wrote:
> On Thu, Jun 19, 2025 at 11:06:46AM +0800, Chenyi Qiang wrote:
>> To fix the build error with the --enable-docs configuration, add the fixup below:
> 
> Thanks, this works for me.
> 
> Though I just noticed there is more than the doc issue; please see:
> 
> https://gitlab.com/peterx/qemu/-/jobs/10403528070

Thanks for reporting. I didn't try the different configure combinations.

> 
> So 6 failures (and you can ignore the rust warnings):
> 
> https://gitlab.com/peterx/qemu/-/pipelines/1878653850
> 
> WASM complains a bit more than the others, but it is worth a double check from
> your side too.

Most of the warnings are not from this series. I think they can be ignored.
I have sent a fixup for patch #4; I think it resolves those 6 failures,
which all report the same error.

> 
> Thanks,
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v7 0/5] Enable shared device assignment
  2025-06-18 21:58 ` [PATCH v7 0/5] Enable shared device assignment Peter Xu
@ 2025-06-24  9:51   ` David Hildenbrand
  0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2025-06-24  9:51 UTC (permalink / raw)
  To: Peter Xu, Chenyi Qiang
  Cc: Alexey Kardashevskiy, Gupta Pankaj, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth, qemu-devel, kvm,
	Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao, Xu Yilun,
	Li Xiaoyao, Cédric Le Goater, Alex Williamson

On 18.06.25 23:58, Peter Xu wrote:
> Hi, Chenyi,
> 
> On Thu, Jun 12, 2025 at 04:27:41PM +0800, Chenyi Qiang wrote:
>> Relationship with in-place conversion
>> -------------------------------------
>> In-place page conversion is the ongoing work to allow mmap() of
>> guest_memfd to userspace so that both private and shared memory can use
>> the same physical memory as the backend. This new design eliminates the
>> need to discard pages during shared/private conversions. When it is
>> ready, shared device assignment needs be adjusted to achieve an
>> unmap-before-conversion-to-private and map-after-conversion-to-shared
>> sequence to be compatible with the change.
> 
> Is it intentional to be prepared for this?
> 
> The question more or less comes from reading patch 5, where I see a
> bunch of code that is very similar to virtio-mem, like:
> 
>          ram_block_attributes_for_each_populated_section
>          ram_block_attributes_for_each_discarded_section
>          ram_block_attributes_rdm_register_listener
>          ram_block_attributes_rdm_unregister_listener
> 
> Fundamentally, IIUC it's because of the similar bitmap structure used,
> and the listeners.  IOW, I wonder if it's possible to move the shared
> elements into RamDiscardManager for:
> 
>      unsigned bitmap_size;
>      unsigned long *bitmap;
>      QLIST_HEAD(, RamDiscardListener) rdl_list;
> 
> But if we know it'll become a tri-state some day, maybe that means it won't
> apply anymore.

We discussed some of that in one of the revisions, and it's not clear
yet how virtio-mem will interact with that in confidential VMs (we will
have to have both active at the same time). I shared some ideas, but
they are all material for the future.

So how the bitmap(s) will look in the future and how multiple
RamDiscardManagers will interact remains to be seen.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-06-24  9:51 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-12  8:27 [PATCH v7 0/5] Enable shared device assignment Chenyi Qiang
2025-06-12  8:27 ` [PATCH v7 1/5] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
2025-06-12  8:27 ` [PATCH v7 2/5] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
2025-06-12  8:27 ` [PATCH v7 3/5] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
2025-06-19  3:06   ` Chenyi Qiang
2025-06-19 14:29     ` Peter Xu
2025-06-20  4:04       ` Chenyi Qiang
2025-06-12  8:27 ` [PATCH v7 4/5] ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock with guest_memfd Chenyi Qiang
2025-06-20  2:53   ` Chenyi Qiang
2025-06-12  8:27 ` [PATCH v7 5/5] physmem: Support coordinated discarding of RAM " Chenyi Qiang
2025-06-18 21:58 ` [PATCH v7 0/5] Enable shared device assignment Peter Xu
2025-06-24  9:51   ` David Hildenbrand
2025-06-18 22:22 ` Peter Xu
2025-06-19  3:09   ` Chenyi Qiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).